[LU-8805] Failover: recovery-mds-scale test_failover_mds: test_failover_mds returned 4 Created: 07/Nov/16  Updated: 24/May/17  Resolved: 21/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

Failover: EL7 Server/Client
master, build# 3468


Issue Links:
Related
is related to LU-8226 t-f check_catastrophe() defect Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2.

The sub-test test_failover_mds failed with the following error:

test_failover_mds returned 4

test_log:

== recovery-mds-scale test failover_mds: failover MDS ================================================ 17:03:39 (1478131419)
Started client load: dd on onyx-40vm5
CMD: onyx-40vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_dd.sh
Started client load: tar on onyx-40vm6
CMD: onyx-40vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_tar.sh
client loads pids:
CMD: onyx-40vm5,onyx-40vm6 cat /tmp/client-load.pid
onyx-40vm6: 7449
onyx-40vm5: 7479
==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
Client load failed on node onyx-40vm5, rc=1
2016-11-02 17:03:46 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=4
CMD: onyx-40vm5,onyx-40vm6 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22083 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22277 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
Dumping lctl log to /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.*.1478131427.log
CMD: onyx-40vm3,onyx-40vm4,onyx-40vm7,onyx-40vm8 /usr/sbin/lctl dk > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.debug_log.\$(hostname -s).1478131427.log;
         dmesg > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.dmesg.\$(hostname -s).1478131427.log
onyx-40vm3: invalid parameter 'dump_kernel'
onyx-40vm3: open(dump_kernel) failed: No such file or directory
onyx-40vm4: invalid parameter 'dump_kernel'
onyx-40vm4: open(dump_kernel) failed: No such file or directory

Could not find another useful information.
Can be related to LU-5483



 Comments   
Comment by Peter Jones [ 10/Nov/16 ]

Hongchao

Could you please advise on this issue?

Thanks

Peter

Comment by Andreas Dilger [ 10/Nov/16 ]

These tests are failing in under 20s, so I'd suspect there is something broken in the test scripts, even before it is doing anything in the test. From the "dump_kernel" messages, it appears maybe even the Lustre modules are not loaded.

Comment by Gerrit Updater [ 11/Nov/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23717
Subject: LU-8805 test: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 312ab76e4049a443fc45ac8593825a722240003b

Comment by Hongchao Zhang [ 11/Nov/16 ]

According to the logs ( https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2), the failure should be caused
by the following script in "test-framework.sh"

check_client_load () {
        local client=$1
        local var=$(node_var_name $client)_load
        local testload=run_${!var}.sh
        
        ps -C $testload | grep $client || return 1      <--- this check failed.
    
        # bug 18914: try to connect several times not only when
        # check ps, but  while check_node_health also
        ...
}

the load has been started successfully at onyx-40vm5 and onyx-40vm6.
the debug patch http://review.whamcloud.com/23717 is created to collect more logs.

Comment by Elena Gryaznova [ 18/Nov/16 ]

regression is caused by http://review.whamcloud.com/#/c/20539/ :

commit 35119a60678b970b76dc13d8932f5a59a9d53996
Author:     Parinay Kondekar <parinay.kondekar@seagate.com>
AuthorDate: Thu Sep 29 12:50:28 2016 +0530
Commit:     Vitaly Fertman <vitaly.fertman@seagate.com>
CommitDate: Fri Oct 28 23:13:58 2016 +0300

    LU-8226 tests: Change check_catastrophe() to check_node_health()

by the proposed modification :

+       ps -C $testload | grep $client || return 1

testload is started on remote node, ps -C does not show it :

[root@fre813 ~]# ps aux | grep run_dd
root     13143  0.0  0.1 103952  1352 ?        Sl   17:54   0:00 /usr/bin/pdsh -R ssh -S -w fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
root     13152  0.0  0.2  58016  3316 ?        Ss   17:54   0:00 ssh -oConnectTimeout=10 -2 -a -x -lroot fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
root     15383  0.0  0.0 103236   840 pts/0    R+   17:55   0:00 grep run_dd


[root@fre813 ~]# ps -C run_dd.sh
  PID TTY          TIME CMD

Comment by Gerrit Updater [ 18/Nov/16 ]

Elena Gryaznova (elena.gryaznova@seagate.com) uploaded a new patch: http://review.whamcloud.com/23861
Subject: LU-8805 tests: fix defect introduced by LU-8226
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ef177978fb0028083e7452ee8ffe6d1cd5ac719b

Comment by Gerrit Updater [ 21/Nov/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23861/
Subject: LU-8805 tests: fix defect introduced by LU-8226
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 300739c76abdaec738c63237249d88097c595cc8

Comment by Peter Jones [ 21/Nov/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:20:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.