[LU-2415] recovery-mds-scale test_failover_mds: lustre:MDT0000/recovery_status found no match Created: 30/Nov/12 Updated: 08/Jul/16 Resolved: 19/Apr/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5730 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8106fea4-3a9d-11e2-b2e6-52540035b04c. The sub-test test_failover_mds failed with the following error:
test log shows == recovery-mds-scale test failover_mds: failover MDS == 13:08:23 (1354136903)
Started client load: dd on client-28vm5
CMD: client-28vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/home/autotest/.autotest/shared_dir/2012-11-28/112808-70113261691540/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2012-11-28/lustre-master-el6-x86_64-fo__1065__-70113261691540-112807/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= run_dd.sh
Started client load: tar on client-28vm6
CMD: client-28vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/home/autotest/.autotest/shared_dir/2012-11-28/112808-70113261691540/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2012-11-28/lustre-master-el6-x86_64-fo__1065__-70113261691540-112807/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= run_tar.sh
client loads pids:
CMD: client-28vm5,client-28vm6 cat /tmp/client-load.pid
client-28vm6: 4127
client-28vm5: 4080
==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=900
CMD: client-28vm5 rc=\$([ -f /proc/sys/lnet/catastrophe ] && echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
exit \$rc;
CMD: client-28vm5 ps auxwww | grep -v grep | grep -q run_dd.sh
CMD: client-28vm6 rc=\$([ -f /proc/sys/lnet/catastrophe ] && echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
exit \$rc;
CMD: client-28vm6 ps auxwww | grep -v grep | grep -q run_tar.sh
Wait mds1 recovery complete before doing next failover...
CMD: client-28vm1.lab.whamcloud.com lctl get_param -n at_max
affected facets: mds1
CMD: client-28vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 662
client-28vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/*/lustre:MDT0000/recovery_status: Found no match
mds1 recovery is not completed!
2012-11-28 13:08:32 Terminating clients loads ...
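For context, the `_wait_recovery_complete` call in the log above polls the target's `recovery_status` parameter until recovery reports complete or the deadline passes. The sketch below is illustrative only, not the actual test-framework.sh code: the function name `wait_recovery_complete` and the stub `lctl` (standing in for the real binary so the sketch is runnable) are assumptions. Note the parameter uses '-' between fsname and target (`lustre-MDT0000`), whereas the failing run used ':'.

```shell
#!/bin/sh
# Hypothetical sketch of the recovery-wait poll loop. The stub lctl
# pretends recovery is already complete so the sketch runs standalone.
lctl() {
    printf 'status: COMPLETE\n'
}

wait_recovery_complete() {
    param=$1 timeout=$2 elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Real lctl prints a multi-line status block; grab the status field.
        status=$(lctl get_param -n "$param" | awk '/^status:/ {print $2}')
        if [ "$status" = COMPLETE ]; then
            echo "recovery complete after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "recovery not completed in ${timeout}s" >&2
    return 1
}

# '-' separator: 'lustre-MDT0000', not the 'lustre:MDT0000' seen in the log.
wait_recovery_complete "mdt.lustre-MDT0000.recovery_status" 662
```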
|
| Comments |
| Comment by Andreas Dilger [ 30/Nov/12 ] |
|
It looks like the root of the problem is that recovery is being checked with "lctl get_param *.lustre:MDT0000.recovery_status" and "lctl get_param *.lustre:OST0000.recovery_status" (note the ':' instead of '-' in the device name). Somewhere, test-framework.sh is either finding or caching the wrong device name, or the device name was not updated on disk, or it actually has the wrong name in /proc. |
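The mismatch Andreas describes can be simulated with plain shell globbing: `lctl get_param` expands its pattern against entries under `/proc/{fs,sys}/{lnet,lustre}`, so a pattern built with ':' never matches an entry the kernel created with '-'. The directory layout below is a stand-in for the real proc tree, purely for illustration.

```shell
#!/bin/sh
# Illustrative only: simulate the get_param glob against a fake proc tree.
dir=$(mktemp -d)
mkdir -p "$dir/mdt/lustre-MDT0000"              # what the kernel creates
touch "$dir/mdt/lustre-MDT0000/recovery_status"

# Correct pattern ('-' in the device name): the glob matches.
set -- "$dir"/*/lustre-MDT0000/recovery_status
[ -e "$1" ] && echo "match: $1"

# Broken pattern (':' in the device name): the glob matches nothing,
# which is the "Found no match" error seen in the test log.
set -- "$dir"/*/lustre:MDT0000/recovery_status
[ -e "$1" ] || echo "lustre:MDT0000: Found no match"

rm -r "$dir"
```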
| Comment by Alex Zhuravlev [ 04/Dec/12 ] |
|
Just tried with llmount.sh: Setup mgs, mdt, osts ...
...
Going through test-framework.sh now... |
| Comment by Jian Yu [ 10/Apr/13 ] |
|
The issue was fixed by patch http://review.whamcloud.com/#change,5867 in |