
Failover: recovery-mds-scale test_failover_mds: test_failover_mds returned 4

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.9.0
    • Fix Version: Lustre 2.9.0
    • Component: None
    • Environment: Failover: EL7 Server/Client
      master, build# 3468
    • Severity: 3

    Description

      This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2.

      The sub-test test_failover_mds failed with the following error:

      test_failover_mds returned 4
      

      test_log:

      == recovery-mds-scale test failover_mds: failover MDS ================================================ 17:03:39 (1478131419)
      Started client load: dd on onyx-40vm5
      CMD: onyx-40vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_dd.sh
      Started client load: tar on onyx-40vm6
      CMD: onyx-40vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_tar.sh
      client loads pids:
      CMD: onyx-40vm5,onyx-40vm6 cat /tmp/client-load.pid
      onyx-40vm6: 7449
      onyx-40vm5: 7479
      ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
      Client load failed on node onyx-40vm5, rc=1
      2016-11-02 17:03:46 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           0 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      ost7: 0 times
      Status: FAIL: rc=4
      CMD: onyx-40vm5,onyx-40vm6 test -f /tmp/client-load.pid &&
              { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
      /usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22083 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
      /usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22277 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
      Dumping lctl log to /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.*.1478131427.log
      CMD: onyx-40vm3,onyx-40vm4,onyx-40vm7,onyx-40vm8 /usr/sbin/lctl dk > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.debug_log.\$(hostname -s).1478131427.log;
               dmesg > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.dmesg.\$(hostname -s).1478131427.log
      onyx-40vm3: invalid parameter 'dump_kernel'
      onyx-40vm3: open(dump_kernel) failed: No such file or directory
      onyx-40vm4: invalid parameter 'dump_kernel'
      onyx-40vm4: open(dump_kernel) failed: No such file or directory
      

      No other useful information was found.
      This may be related to LU-5483.


          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23861/
            Subject: LU-8805 tests: fix defect introduced by LU-8226
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 300739c76abdaec738c63237249d88097c595cc8


            gerrit Gerrit Updater added a comment -

            Elena Gryaznova (elena.gryaznova@seagate.com) uploaded a new patch: http://review.whamcloud.com/23861
            Subject: LU-8805 tests: fix defect introduced by LU-8226
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ef177978fb0028083e7452ee8ffe6d1cd5ac719b


            egryaznova Elena Gryaznova added a comment -

            The regression is caused by http://review.whamcloud.com/#/c/20539/ :

            commit 35119a60678b970b76dc13d8932f5a59a9d53996
            Author:     Parinay Kondekar <parinay.kondekar@seagate.com>
            AuthorDate: Thu Sep 29 12:50:28 2016 +0530
            Commit:     Vitaly Fertman <vitaly.fertman@seagate.com>
            CommitDate: Fri Oct 28 23:13:58 2016 +0300
            
                LU-8226 tests: Change check_catastrophe() to check_node_health()
            
            

            by the proposed modification:

            +       ps -C $testload | grep $client || return 1
            

            The test load is started on a remote node, so a local ps -C does not show it:

            [root@fre813 ~]# ps aux | grep run_dd
            root     13143  0.0  0.1 103952  1352 ?        Sl   17:54   0:00 /usr/bin/pdsh -R ssh -S -w fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
            root     13152  0.0  0.2  58016  3316 ?        Ss   17:54   0:00 ssh -oConnectTimeout=10 -2 -a -x -lroot fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
            root     15383  0.0  0.0 103236   840 pts/0    R+   17:55   0:00 grep run_dd
            
            
            [root@fre813 ~]# ps -C run_dd.sh
              PID TTY          TIME CMD
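
            The behaviour above can be reproduced without Lustre at all: ps -C matches processes by exact command name, and only on the local host, so a load script launched on another node through pdsh/ssh is invisible to it. A minimal standalone demonstration (run_dd.sh is used here only as a command that is never started locally):

```shell
#!/bin/bash
# `ps -C NAME` matches by exact command name, local host only (procps ps).
sleep 5 &
pid=$!

# A process running locally under that name is found:
ps -C sleep > /dev/null && echo "local process: found"

# A command that only runs on a remote node (simulated here by never
# starting run_dd.sh locally) is not found by the local ps:
ps -C run_dd.sh > /dev/null || echo "remote-only process: not found"

kill "$pid"
```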
            
            

            hongchao.zhang Hongchao Zhang added a comment -

            According to the logs (https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2), the failure appears to be caused
            by the following check in "test-framework.sh":

            check_client_load () {
                    local client=$1
                    local var=$(node_var_name $client)_load
                    local testload=run_${!var}.sh
                    
                    ps -C $testload | grep $client || return 1      <--- this check failed.
                
                    # bug 18914: try to connect several times not only when
                    # check ps, but  while check_node_health also
                    ...
            }
            

            The load was started successfully on onyx-40vm5 and onyx-40vm6.
            The debug patch http://review.whamcloud.com/23717 was created to collect more logs.
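
            The fix that eventually landed (http://review.whamcloud.com/23861) is not quoted in this ticket, but the shape of a remote-aware check can be sketched. This is a hypothetical sketch only: do_node is stubbed to run its command locally so the logic can be exercised standalone, whereas the real test-framework.sh helper executes it on the named node over pdsh/ssh:

```shell
#!/bin/bash
# Hypothetical sketch -- the actual change is in review.whamcloud.com/23861.
# Stub of test-framework.sh's do_node: the real helper runs the command on
# the node named by $1; this stub just runs it locally for demonstration.
do_node() {
    local node=$1; shift
    bash -c "$*"
}

# Ask the client itself whether the load script is running, instead of
# checking with a local `ps -C` that can only see local processes.
check_client_load() {
    local client=$1
    local testload=$2
    do_node "$client" "ps -C $testload" > /dev/null || return 1
}

sleep 5 &
pid=$!
check_client_load onyx-40vm5 sleep     && echo "load running"
check_client_load onyx-40vm5 run_dd.sh || echo "load missing"
kill "$pid"
```

            Running it prints "load running" followed by "load missing": the first check succeeds because the load (here, a background sleep) is visible on the queried node, while the second fails because run_dd.sh is not running there.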


            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23717
            Subject: LU-8805 test: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 312ab76e4049a443fc45ac8593825a722240003b


            adilger Andreas Dilger added a comment -

            These tests are failing in under 20 seconds, so I'd suspect something is broken in the test scripts even before the test does anything. From the "dump_kernel" messages, it appears the Lustre modules may not even be loaded.

            pjones Peter Jones added a comment -

            Hongchao

            Could you please advise on this issue?

            Thanks

            Peter


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: maloo Maloo
