[LU-8805] Failover: recovery-mds-scale test_failover_mds: test_failover_mds returned 4 Created: 07/Nov/16 Updated: 24/May/17 Resolved: 21/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Failover: EL7 Server/Client |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2.

The sub-test test_failover_mds failed with the following error:

test_failover_mds returned 4

test_log:
== recovery-mds-scale test failover_mds: failover MDS ================================================ 17:03:39 (1478131419)
Started client load: dd on onyx-40vm5
CMD: onyx-40vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_dd.sh
Started client load: tar on onyx-40vm6
CMD: onyx-40vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_tar.sh
client loads pids:
CMD: onyx-40vm5,onyx-40vm6 cat /tmp/client-load.pid
onyx-40vm6: 7449
onyx-40vm5: 7479
==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
Client load failed on node onyx-40vm5, rc=1
2016-11-02 17:03:46 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=4
CMD: onyx-40vm5,onyx-40vm6 test -f /tmp/client-load.pid &&
{ kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22083 Killed do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22277 Killed do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
Dumping lctl log to /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.*.1478131427.log
CMD: onyx-40vm3,onyx-40vm4,onyx-40vm7,onyx-40vm8 /usr/sbin/lctl dk > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.debug_log.\$(hostname -s).1478131427.log;
dmesg > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.dmesg.\$(hostname -s).1478131427.log
onyx-40vm3: invalid parameter 'dump_kernel'
onyx-40vm3: open(dump_kernel) failed: No such file or directory
onyx-40vm4: invalid parameter 'dump_kernel'
onyx-40vm4: open(dump_kernel) failed: No such file or directory
Could not find any other useful information. |
| Comments |
| Comment by Peter Jones [ 10/Nov/16 ] |
|
Hongchao, could you please advise on this issue? Thanks, Peter |
| Comment by Andreas Dilger [ 10/Nov/16 ] |
|
These tests are failing in under 20s, so I'd suspect there is something broken in the test scripts, even before it is doing anything in the test. From the "dump_kernel" messages, it appears maybe even the Lustre modules are not loaded. |
| Comment by Gerrit Updater [ 11/Nov/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23717 |
| Comment by Hongchao Zhang [ 11/Nov/16 ] |
|
According to the logs (https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2), the failure appears to be caused by the following check in check_client_load():
check_client_load () {
local client=$1
local var=$(node_var_name $client)_load
local testload=run_${!var}.sh
ps -C $testload | grep $client || return 1 <--- this check failed.
# bug 18914: try to connect several times not only when
# check ps, but while check_node_health also
...
}
Yet the load had been started successfully on onyx-40vm5 and onyx-40vm6. |
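The log in the description shows that each client records the PID of its load in /tmp/client-load.pid. As a minimal sketch only, and not the fix that was landed for this ticket, a liveness check could be based on that PID file instead of on process names. The helper name load_is_alive is hypothetical and does not exist in the Lustre test framework.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Lustre test framework):
# return 0 if the PID stored in the given pid file refers to a live process.
load_is_alive() {
    local pidfile=$1
    local pid

    # No pid file means the load was never started or has already been cleaned up.
    [ -f "$pidfile" ] || return 1

    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1

    # kill -0 sends no signal; it only reports whether the process exists
    # and whether the caller is allowed to signal it.
    kill -0 "$pid" 2>/dev/null
}
```

Such a helper would have to run on the client node that owns the PID file, for example through the same do_node mechanism the script already uses to start the load.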
| Comment by Elena Gryaznova [ 18/Nov/16 ] |
|
The regression is caused by http://review.whamcloud.com/#/c/20539/ :
commit 35119a60678b970b76dc13d8932f5a59a9d53996
Author: Parinay Kondekar <parinay.kondekar@seagate.com>
AuthorDate: Thu Sep 29 12:50:28 2016 +0530
Commit: Vitaly Fertman <vitaly.fertman@seagate.com>
CommitDate: Fri Oct 28 23:13:58 2016 +0300
LU-8226 tests: Change check_catastrophe() to check_node_health()
specifically by this line, which that change added:
+ ps -C $testload | grep $client || return 1
The test load is started on a remote node, so ps -C does not show it:
[root@fre813 ~]# ps aux | grep run_dd
root 13143 0.0 0.1 103952 1352 ? Sl 17:54 0:00 /usr/bin/pdsh -R ssh -S -w fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
root 13152 0.0 0.2 58016 3316 ? Ss 17:54 0:00 ssh -oConnectTimeout=10 -2 -a -x -lroot fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
root 15383 0.0 0.0 103236 840 pts/0 R+ 17:55 0:00 grep run_dd
[root@fre813 ~]# ps -C run_dd.sh
PID TTY TIME CMD |
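The empty `ps -C run_dd.sh` listing above can be reproduced without Lustre, because `ps -C` selects processes by command (executable) name and ignores their arguments. The following is a minimal sketch, assuming a Linux host with the procps `ps` and `pgrep` tools. The `sh -c` wrapper is only a stand-in for the local pdsh/ssh processes, and the variable names are illustrative.

```shell
#!/bin/bash
# Stand-in for the local pdsh/ssh wrapper: a process whose command name is
# "sh" but whose argument list mentions run_dd.sh. (Illustrative only.)
sh -c 'sleep 5; true' run_dd.sh >/dev/null 2>&1 &
wrapper_pid=$!
sleep 1   # give the child time to exec

# ps -C selects by command (executable) name, so the wrapper is not selected.
if ps -C run_dd.sh >/dev/null 2>&1; then
    comm_match=yes
else
    comm_match=no
fi

# pgrep -f matches the pattern against the full command line instead.
if pgrep -f 'run_dd\.sh' >/dev/null 2>&1; then
    args_match=yes
else
    args_match=no
fi

echo "ps -C run_dd.sh matched: $comm_match"
echo "pgrep -f run_dd.sh matched: $args_match"

# Clean up the stand-in process.
kill "$wrapper_pid" 2>/dev/null
wait "$wrapper_pid" 2>/dev/null
```

Matching on the full command line would find the local wrapper, but a live wrapper does not by itself prove that the remote load is still running, so this sketch only illustrates why the name-based check comes back empty.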
| Comment by Gerrit Updater [ 18/Nov/16 ] |
|
Elena Gryaznova (elena.gryaznova@seagate.com) uploaded a new patch: http://review.whamcloud.com/23861 |
| Comment by Gerrit Updater [ 21/Nov/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23861/ |
| Comment by Peter Jones [ 21/Nov/16 ] |
|
Landed for 2.9 |