  Lustre / LU-2415

recovery-mds-scale test_failover_mds: lustre:MDT0000/recovery_status found no match

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank: 5730

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8106fea4-3a9d-11e2-b2e6-52540035b04c.

      The sub-test test_failover_mds failed with the following error:

      test_failover_mds returned 7

      The test log shows:

      == recovery-mds-scale test failover_mds: failover MDS == 13:08:23 (1354136903)
      Started client load: dd on client-28vm5
      CMD: client-28vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/home/autotest/.autotest/shared_dir/2012-11-28/112808-70113261691540/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2012-11-28/lustre-master-el6-x86_64-fo__1065__-70113261691540-112807/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= run_dd.sh
      Started client load: tar on client-28vm6
      CMD: client-28vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/home/autotest/.autotest/shared_dir/2012-11-28/112808-70113261691540/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2012-11-28/lustre-master-el6-x86_64-fo__1065__-70113261691540-112807/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= run_tar.sh
      client loads pids:
      CMD: client-28vm5,client-28vm6 cat /tmp/client-load.pid
      client-28vm6: 4127
      client-28vm5: 4080
      ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=900
      CMD: client-28vm5 rc=\$([ -f /proc/sys/lnet/catastrophe ] && echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
      		if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
      		exit \$rc;
      CMD: client-28vm5 ps auxwww | grep -v grep | grep -q run_dd.sh
      CMD: client-28vm6 rc=\$([ -f /proc/sys/lnet/catastrophe ] && echo \$(< /proc/sys/lnet/catastrophe) || echo 0);
      		if [ \$rc -ne 0 ]; then echo \$(hostname): \$rc; fi
      		exit \$rc;
      CMD: client-28vm6 ps auxwww | grep -v grep | grep -q run_tar.sh
      Wait mds1 recovery complete before doing next failover...
      CMD: client-28vm1.lab.whamcloud.com lctl get_param -n at_max
      affected facets: mds1
      CMD: client-28vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 662 
      client-28vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/*/lustre:MDT0000/recovery_status: Found no match
      mds1 recovery is not completed!
      2012-11-28 13:08:32 Terminating clients loads ...
      
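      For context, the failing step is the recovery wait: rpc.sh invokes _wait_recovery_complete on client-28vm3, which polls the given recovery_status parameter until it reports COMPLETE or the timeout (662 seconds here) expires. A simplified sketch of such a poll (paraphrased for illustration; not the actual test-framework.sh code):

      #!/bin/bash
      # Simplified stand-in for test-framework.sh's _wait_recovery_complete.
      # param:   e.g. "*.lustre-MDT0000.recovery_status"
      # maxtime: seconds to wait for recovery to finish
      wait_recovery_complete() {
          local param=$1 maxtime=${2:-662} elapsed=0 status
          while [ $elapsed -lt $maxtime ]; do
              # With the ':' device name from this run, get_param fails
              # outright with "Found no match", so we bail out here --
              # which matches the immediate failure seen in the log.
              status=$(lctl get_param -n "$param") || return 1
              echo "$status" | grep -q '^status: COMPLETE' && return 0
              sleep 5; elapsed=$((elapsed + 5))
          done
          return 1   # caller then reports "mds1 recovery is not completed!"
      }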


        Issue Links

          duplicates LU-2008

          Activity

            Jian Yu added a comment -

            The issue was fixed by patch http://review.whamcloud.com/#change,5867 in LU-2008.


            Alex Zhuravlev added a comment -

            Just tried with llmount.sh:

            Setup mgs, mdt, osts
            Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
            Started lustre-MDT0000

            ...

            # debugfs -R stats /tmp/lustre-mdt1 | grep 'volume name'
            debugfs 1.42.3.wc3 (15-Aug-2012)
            Filesystem volume name: lustre-MDT0000

            ...

            # ls /proc/fs/lustre/mdt/
            lustre-MDT0000  num_refs

            Going through test-framework.sh now...
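
            For comparison, here are the two query forms side by side (a sketch assuming the default fsname "lustre" from the llmount.sh run above):

            # The name the kernel registered uses '-', so this query succeeds:
            lctl get_param -n mdt.lustre-MDT0000.recovery_status

            # The form from the failing run uses ':' and matches nothing:
            lctl get_param -n '*.lustre:MDT0000.recovery_status'
            # error: get_param: /proc/{fs,sys}/{lnet,lustre}/*/lustre:MDT0000/recovery_status: Found no match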


            Andreas Dilger added a comment -

            It looks like the root of the problem is that the test is looking for "lctl get_param *.lustre:MDT0000.recovery_status" and "lctl get_param *.lustre:OST0000.recovery_status" (note the ':' instead of '-' in the device name).

            Somewhere, test-framework.sh is either finding or caching the wrong device name, or the device name was not updated on disk, or it actually has the wrong name in /proc.
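
            A quick way to check which of those three it is on an affected node (a sketch: the debugfs target is the loop-device path from the llmount.sh comment above, and the ':' normalization is only illustrative, not the actual fix):

            #!/bin/bash
            # 1. The name the kernel actually registered (field 4 of 'lctl dl'):
            lctl dl | awk '$4 ~ /MDT0000/ { print $4 }'

            # 2. The name recorded in the on-disk ldiskfs label:
            debugfs -R stats /tmp/lustre-mdt1 2>/dev/null | grep 'volume name'

            # 3. If a cached value came back with ':', normalizing it in bash:
            svc="lustre:MDT0000"          # hypothetical bad value
            svc=${svc//:/-}               # -> lustre-MDT0000
            lctl get_param -n "mdt.$svc.recovery_status"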


            People

              Assignee: Alex Zhuravlev (bzzz)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 4
