[LU-2415] recovery-mds-scale test_failover_mds: lustre:MDT0000/recovery_status found no match Created: 30/Nov/12  Updated: 08/Jul/16  Resolved: 19/Apr/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-2008 After hardware reboot (using pm) the ... Resolved
Related
is related to LU-6992 recovery-random-scale test_fail_clien... Resolved
Severity: 3
Rank (Obsolete): 5730

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8106fea4-3a9d-11e2-b2e6-52540035b04c.

The sub-test test_failover_mds failed with the following error:

test_failover_mds returned 7

The test log shows:

== recovery-mds-scale test failover_mds: failover MDS == 13:08:23 (1354136903)
Started client load: dd on client-28vm5
CMD: client-28vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/home/autotest/.autotest/shared_dir/2012-11-28/112808-70113261691540/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2012-11-28/lustre-master-el6-x86_64-fo__1065__-70113261691540-112807/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= run_dd.sh
Started client load: tar on client-28vm6
CMD: client-28vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/home/autotest/.autotest/shared_dir/2012-11-28/112808-70113261691540/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2012-11-28/lustre-master-el6-x86_64-fo__1065__-70113261691540-112807/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= run_tar.sh
client loads pids:
CMD: client-28vm5,client-28vm6 cat /tmp/client-load.pid
client-28vm6: 4127
client-28vm5: 4080
==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=900
CMD: client-28vm5 rc=\$([ -f /proc/sys/lnet/catastrophe ] && echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
		if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
		exit \$rc;
CMD: client-28vm5 ps auxwww | grep -v grep | grep -q run_dd.sh
CMD: client-28vm6 rc=\$([ -f /proc/sys/lnet/catastrophe ] && echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
		if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
		exit \$rc;
CMD: client-28vm6 ps auxwww | grep -v grep | grep -q run_tar.sh
Wait mds1 recovery complete before doing next failover...
CMD: client-28vm1.lab.whamcloud.com lctl get_param -n at_max
affected facets: mds1
CMD: client-28vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 662 
client-28vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/*/lustre:MDT0000/recovery_status: Found no match
mds1 recovery is not completed!
2012-11-28 13:08:32 Terminating clients loads ...


 Comments   
Comment by Andreas Dilger [ 30/Nov/12 ]

It looks like the root of the problem is that the test is running "lctl get_param *.lustre:MDT0000.recovery_status" and "lctl get_param *.lustre:OST0000.recovery_status" (note the ':' instead of '-' in the device name).

Somewhere, test-framework.sh is either finding or caching the wrong device name, or the device name was not updated on disk, or the device actually has the wrong name in /proc.
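
For illustration, a minimal sketch (run on the MDS, assuming the standard lustre-MDT0000 target name seen in the comment below) of the two lookups; the ':' form matches no procfs entry, while the '-' form does:

  # Pattern used by the test (':' in the device name); this is the one that fails:
  lctl get_param -n "*.lustre:MDT0000.recovery_status"
  # error: get_param: /proc/{fs,sys}/{lnet,lustre}/*/lustre:MDT0000/recovery_status: Found no match

  # Pattern matching the actual procfs entry ('-' in the device name):
  lctl get_param -n "*.lustre-MDT0000.recovery_status"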

Comment by Alex Zhuravlev [ 04/Dec/12 ]

Just tried with llmount.sh:

Setup mgs, mdt, osts
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000

...

  # debugfs -R stats /tmp/lustre-mdt1 |grep 'volume name'
  debugfs 1.42.3.wc3 (15-Aug-2012)
  Filesystem volume name: lustre-MDT0000

...

  # ls /proc/fs/lustre/mdt/
  lustre-MDT0000 num_refs

Going through test-framework.sh now...
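
For reference, a simplified sketch of the kind of wait loop that _wait_recovery_complete performs; this is not the actual test-framework.sh code, and the parameter name, timeout handling, and "status:" parsing are assumptions based on the log above:

  # Hypothetical simplified polling loop; the real helper lives in test-framework.sh
  param="mdt.lustre-MDT0000.recovery_status"   # must match the procfs entry name ('-', not ':')
  timeout=662                                  # seconds, as passed to _wait_recovery_complete in the log
  while [ "$timeout" -gt 0 ]; do
      # recovery_status begins with a "status:" line, e.g. "status: COMPLETE"
      status=$(lctl get_param -n "$param" 2>/dev/null | awk '/^status:/ {print $2}')
      [ "$status" = "COMPLETE" ] && break
      sleep 5
      timeout=$((timeout - 5))
  done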

Comment by Jian Yu [ 10/Apr/13 ]

The issue was fixed by patch http://review.whamcloud.com/#change,5867 in LU-2008.
