Lustre / LU-8805

Failover: recovery-mds-scale test_failover_mds: test_failover_mds returned 4

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: Failover: EL7 Server/Client
      master, build# 3468
    • Severity: 3

    Description

      This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2.

      The sub-test test_failover_mds failed with the following error:

      test_failover_mds returned 4
      

      test_log:

      == recovery-mds-scale test failover_mds: failover MDS ================================================ 17:03:39 (1478131419)
      Started client load: dd on onyx-40vm5
      CMD: onyx-40vm5 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_dd.sh
      Started client load: tar on onyx-40vm6
      CMD: onyx-40vm6 PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared_test/autotest/2016-11-02/153732-70163256913820/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB=/usr/share/doc/dbench/loadfiles DBENCH_SRC= CLIENT_COUNT=2 LFS=/usr/bin/lfs run_tar.sh
      client loads pids:
      CMD: onyx-40vm5,onyx-40vm6 cat /tmp/client-load.pid
      onyx-40vm6: 7449
      onyx-40vm5: 7479
      ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
      Client load failed on node onyx-40vm5, rc=1
      2016-11-02 17:03:46 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           0 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      ost7: 0 times
      Status: FAIL: rc=4
      CMD: onyx-40vm5,onyx-40vm6 test -f /tmp/client-load.pid &&
              { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
      /usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22083 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
      /usr/lib64/lustre/tests/recovery-mds-scale.sh: line 103: 22277 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS run_${load}.sh"
      Dumping lctl log to /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.*.1478131427.log
      CMD: onyx-40vm3,onyx-40vm4,onyx-40vm7,onyx-40vm8 /usr/sbin/lctl dk > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.debug_log.\$(hostname -s).1478131427.log;
               dmesg > /logdir/test_logs/2016-11-02/lustre-master-el7-x86_64--failover--1_15_1__3468__-70163256913820-153732/recovery-mds-scale.test_failover_mds.dmesg.\$(hostname -s).1478131427.log
      onyx-40vm3: invalid parameter 'dump_kernel'
      onyx-40vm3: open(dump_kernel) failed: No such file or directory
      onyx-40vm4: invalid parameter 'dump_kernel'
      onyx-40vm4: open(dump_kernel) failed: No such file or directory
      

      Could not find any other useful information.
      This may be related to LU-5483.

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23861/
            Subject: LU-8805 tests: fix defect introduced by LU-8226
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 300739c76abdaec738c63237249d88097c595cc8


            gerrit Gerrit Updater added a comment -

            Elena Gryaznova (elena.gryaznova@seagate.com) uploaded a new patch: http://review.whamcloud.com/23861
            Subject: LU-8805 tests: fix defect introduced by LU-8226
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ef177978fb0028083e7452ee8ffe6d1cd5ac719b


            egryaznova Elena Gryaznova added a comment -

            The regression is caused by http://review.whamcloud.com/#/c/20539/ :

            commit 35119a60678b970b76dc13d8932f5a59a9d53996
            Author:     Parinay Kondekar <parinay.kondekar@seagate.com>
            AuthorDate: Thu Sep 29 12:50:28 2016 +0530
            Commit:     Vitaly Fertman <vitaly.fertman@seagate.com>
            CommitDate: Fri Oct 28 23:13:58 2016 +0300
            
                LU-8226 tests: Change check_catastrophe() to check_node_health()
            
            

            specifically, by this added line:

            +       ps -C $testload | grep $client || return 1
            

            The test load is started on the remote client node, so ps -C on the local node does not show it:

            [root@fre813 ~]# ps aux | grep run_dd
            root     13143  0.0  0.1 103952  1352 ?        Sl   17:54   0:00 /usr/bin/pdsh -R ssh -S -w fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
            root     13152  0.0  0.2  58016  3316 ?        Ss   17:54   0:00 ssh -oConnectTimeout=10 -2 -a -x -lroot fre814 (PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /root; LUSTRE="/usr/lib64/lustre" sh -c "PATH=/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: MOUNT=/mnt/lustre ERRORS_OK= BREAK_ON_ERROR= END_RUN_FILE=/shared/fremont/test-results/xperior-custom/3114//kvm8-octet-2/shared-dir//recovery-mds-scale/end_run_file LOAD_PID_FILE=/tmp/client-load.pid TESTLOG_PREFIX=/tmp/test_logs/1479491651/recovery-mds-scale TESTNAME=test_failover_mds DBENCH_LIB= DBENCH_SRC= CLIENT_COUNT=3 LFS=/usr/bin/lfs run_dd.sh")
            root     15383  0.0  0.0 103236   840 pts/0    R+   17:55   0:00 grep run_dd
            
            
            [root@fre813 ~]# ps -C run_dd.sh
              PID TTY          TIME CMD
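
            For illustration only, a minimal sketch of checking the load on the client node
            itself rather than on the test driver. This is not necessarily what the landed
            fix does; it assumes passwordless ssh and reuses the hostname fre814 from the
            ps output above:

            # Local check on the driver node: run_dd.sh only runs on the remote client,
            # so this always finds nothing and the load is wrongly reported as failed.
            ps -C run_dd.sh

            # Remote check: ask the client node whether the load script is still running.
            if ssh fre814 "ps -C run_dd.sh" > /dev/null; then
                    echo "client load is running on fre814"
            else
                    echo "client load is NOT running on fre814"
            fi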
            
            

            hongchao.zhang Hongchao Zhang added a comment -

            According to the logs (https://testing.hpdd.intel.com/test_sets/be9e4ae0-a1c0-11e6-8ed2-5254006e85c2), the failure appears to be caused
            by the following check in check_client_load() in "test-framework.sh":

            check_client_load () {
                    local client=$1
                    local var=$(node_var_name $client)_load
                    local testload=run_${!var}.sh
                    
                    ps -C $testload | grep $client || return 1      <--- this check failed.
                
                    # bug 18914: try to connect several times not only when
                    # check ps, but  while check_node_health also
                    ...
            }
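
            For reference, a hedged illustration of the ${!var} indirection used in the
            snippet above; the variable name and the hostname-to-variable mapping are
            assumptions made for this sketch, not code taken from test-framework.sh:

            # The per-node load variable is set when the load is started (the test_log
            # shows "Started client load: dd on onyx-40vm5"), and the check resolves it
            # indirectly to pick the run_*.sh wrapper name.
            onyx_40vm5_load=dd
            client=onyx-40vm5
            var=${client//-/_}_load      # stand-in for $(node_var_name $client)_load
            testload=run_${!var}.sh      # indirect expansion: ${!var} -> "dd"
            echo "$testload"             # prints: run_dd.sh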
            

            The load had been started successfully on onyx-40vm5 and onyx-40vm6.
            The debug patch http://review.whamcloud.com/23717 was created to collect more logs.


            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/23717
            Subject: LU-8805 test: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 312ab76e4049a443fc45ac8593825a722240003b


            People

              Assignee: Hongchao Zhang
              Reporter: Maloo
              Votes: 0
              Watchers: 7
