LU-9602: recovery-random-scale test_fail_client_mds: PASS but marked as FAIL

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Versions: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5
    • Labels: onyx, failover
    • Environment:
        clients: EL7, master branch, v2.9.58, b3591
        servers: EL7, zfs, master branch, v2.9.58, b3591
    • Severity: 3

    Description

      https://testing.hpdd.intel.com/test_sessions/2df73ff0-8703-4bf1-9f76-2ea9f98949fb

      From suite_log:

      PASS fail_client_mds (85783s)
      Starting client: onyx-36vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
      CMD: onyx-36vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
      CMD: onyx-36vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
      == recovery-random-scale test complete, duration 85822 sec =========================================== 17:10:21 (1495991421)
      rm: cannot remove '/mnt/lustre/d0.tar-onyx-36vm6.onyx.hpdd.intel.com/etc': Directory not empty
       recovery-random-scale : @@@@@@ FAIL: remove sub-test dirs failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4952:error()
        = /usr/lib64/lustre/tests/test-framework.sh:4471:check_and_cleanup_lustre()
        = /usr/lib64/lustre/tests/recovery-random-scale.sh:267:main()
      

      and

      Status: PASS: rc=0
      CMD: onyx-36vm5,onyx-36vm6 test -f /tmp/client-load.pid &&
              { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
      SKIP pairwise_fail (3867s)
      Starting client: onyx-36vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
      CMD: onyx-36vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
      CMD: onyx-36vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
      == recovery-double-scale test complete, duration 3904 sec ============================================ 18:20:55 (1495995655)
      rm: cannot remove '/mnt/lustre/d0.tar-onyx-36vm6.onyx.hpdd.intel.com/etc/fonts/conf.d': Directory not empty
       recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4952:error()
        = /usr/lib64/lustre/tests/test-framework.sh:4471:check_and_cleanup_lustre()
        = /usr/lib64/lustre/tests/recovery-double-scale.sh:309:main()
      

    Activity

            One more instance of "remove sub-test dirs failed" with the test marked as PASS:
            https://testing.whamcloud.com/test_logs/966cb14d-a670-47ba-8141-f5f54088c23c/show_text

            == recovery-double-scale test complete, duration 10761 sec ========================================================== 07:46:23 (1637307983)
            rm: cannot remove '/mnt/lustre/d0.tar-onyx-78vm3.onyx.whamcloud.com': Directory not empty
            rm: cannot remove '/mnt/lustre/d0.tar-onyx-78vm7.onyx.whamcloud.com/etc': Directory not empty
             recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed 
            
            egryaznova Elena Gryaznova added a comment

            Looking at the results at https://testing.whamcloud.com/test_sets/3bb67f2c-156c-42e3-8b23-9050d1d799a4, the test passes, but we can see in the suite_log that not all processes were killed

            Stopping client onyx-42vm3.onyx.whamcloud.com /mnt/lustre opts:
            COMMAND     PID USER   FD      TYPE      DEVICE    SIZE/OFF               NODE NAME
            run_dd.sh 16846 root  cwd   unknown 1273,181606                                /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com
            dd        16909 root  cwd   unknown 1273,181606                                /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com
            dd        16909 root    1w      REG 1273,181606 26873016320 144115205272502274 /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com/dd-file
            /mnt/lustre is still busy, wait one second
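
            As a side note (not from the ticket), the listing above is lsof-style output; which processes are still pinning a client mount point can be confirmed with standard tools:

            # Generic diagnostics (not part of the test framework): list the
            # processes that still have open files or a working directory on
            # the client mount, which is what keeps umount reporting "busy".
            lsof /mnt/lustre
            fuser -vm /mnt/lustre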
            

            Looking at the run_dd_debug log on onyx-42vm3, we can see that the framework tried to kill the process, but the process group lookup came back empty, so the kill commands failed (a guarded variant is sketched after the log):

            2021-05-14 00:35:09: client load was signaled to terminate
            ++++ ps -eo '%c %p %r'
            ++++ awk '/ 16845 / {print $3}'
            +++ local PGID=
            +++ kill -TERM -
            /usr/lib64/lustre/tests/functions.sh: line 166: kill: -: arguments must be process or job IDs
            +++ sleep 5
            +++ kill -KILL -
            /usr/lib64/lustre/tests/functions.sh: line 168: kill: -: arguments must be process or job IDs
            ++ sleep 5
            ++ kill -KILL -16804
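
            For illustration only (this is not the framework code), the same group lookup can be guarded so that an empty result does not turn into the malformed "kill -TERM -" seen above; the function name and the example PID are assumptions taken from this log:

            # Minimal sketch: signal the whole process group of a load script,
            # but only if the group ID lookup actually returns something.
            kill_load_group() {
                    local pid=$1        # e.g. 16845 from the log above
                    local pgid

                    pgid=$(ps -o pgid= -p "$pid" 2>/dev/null | tr -d ' ')
                    [ -n "$pgid" ] || { echo "pid $pid already gone"; return 0; }
                    kill -s TERM -- "-$pgid" 2>/dev/null
                    sleep 5
                    kill -s KILL -- "-$pgid" 2>/dev/null || true
            }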
            
            jamesanunez James Nunez (Inactive) added a comment
            jamesanunez James Nunez (Inactive) added a comment - edited

            We see several of the recovery-*-scale tests fail in this way; we are not able to clean up the Lustre directory.

            For example, a recent failure in recovery-mds-scale test_failover_mds provides some new information: https://testing.hpdd.intel.com/test_sets/4929a310-fded-11e7-bd00-52540065bddc. Looking at the suite_log, we see the test fail because rm cannot remove the sub-test directories:

            rm: cannot remove '/mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com/etc': Directory not empty
            recovery-mds-scale : @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5335:error()
              = /usr/lib64/lustre/tests/test-framework.sh:4829:check_and_cleanup_lustre()
            

            Later in the log, we can see that run_tar.sh is still running:

            Stopping client onyx-41vm1.onyx.hpdd.intel.com /mnt/lustre opts:
            Stopping client onyx-41vm4.onyx.hpdd.intel.com /mnt/lustre opts:
            COMMAND     PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
            sleep      3581 root  cwd    DIR 1273,181606     4096 144115205289292362 /mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com
            run_tar.s 13671 root  cwd    DIR 1273,181606     4096 144115205289292362 /mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            

            The run_tar, run_dd, etc. processes are started and the PID of each is stored in a file called client-load.pid in /tmp on the client node (a sketch of how a load script records its PID follows the log below). In this case, vm4 started run_tar, as seen here:

            Started client load: tar on onyx-41vm4
            CMD: onyx-41vm4 PATH=/ […]	LCTL=/usr/sbin/lctl 			FSNAME=lustre 			run_tar.sh
            client loads pids:
            CMD: onyx-41vm3,onyx-41vm4 cat /tmp/client-load.pid
            onyx-41vm3: 12666
            onyx-41vm4: 13671
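
            Roughly, a load script records its PID as in the sketch below. This is a simplified illustration, not the actual run_tar.sh source; LOAD_PID_FILE and END_RUN_FILE are the variables passed in via do_node, and run_one_iteration is hypothetical:

            # Inside a run_*.sh load script: remember our own PID so the test
            # framework can signal us later, then loop until told to stop.
            echo $$ > "$LOAD_PID_FILE"        # e.g. /tmp/client-load.pid

            while [ ! -e "$END_RUN_FILE" ]; do
                    run_one_iteration         # hypothetical: one tar/dd pass
            done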
            

            In stop_process(), we kill the PID saved in /tmp/client-load.pid, and only that PID (a process-group variant is sketched at the end of this comment):

            # Stop the process which had its PID saved in a file.
            stop_process() {
                local nodes=$1
                local pid_file=$2

                [ -z "$nodes" -o -z "$pid_file" ] && return 0

                do_nodes $nodes "test -f $pid_file &&
                    { kill -s TERM \\\$(cat $pid_file); rm -f $pid_file; }" || true
            }
            

            We see that vm3 tries to clean up its run_dd PID and fails. What about vm4:

            CMD: onyx-41vm3,onyx-41vm4 test -f /tmp/client-load.pid &&
                    { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
            onyx-41vm3: sh: line 1: kill: (12666) - No such process
            /usr/lib64/lustre/tests/recovery-mds-scale.sh: line 104:  9503 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK 			BREAK_ON_ERROR=$BREAK_ON_ERROR 			END_RUN_FILE=$END_RUN_FILE 			LOAD_PID_FILE=$LOAD_PID_FILE 			TESTLOG_PREFIX=$TESTLOG_PREFIX 			TESTNAME=$TESTNAME 			DBENCH_LIB=$DBENCH_LIB 			DBENCH_SRC=$DBENCH_SRC 			CLIENT_COUNT=$((CLIENTCOUNT - 1)) 			LFS=$LFS 			LCTL=$LCTL 			FSNAME=$FSNAME 			run_${load}.sh"
            /usr/lib64/lustre/tests/recovery-mds-scale.sh: line 104:  9697 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK 			BREAK_ON_ERROR=$BREAK_ON_ERROR 			END_RUN_FILE=$END_RUN_FILE 			LOAD_PID_FILE=$LOAD_PID_FILE 			TESTLOG_PREFIX=$TESTLOG_PREFIX 			TESTNAME=$TESTNAME 			DBENCH_LIB=$DBENCH_LIB 			DBENCH_SRC=$DBENCH_SRC 			CLIENT_COUNT=$((CLIENTCOUNT - 1)) 			LFS=$LFS 			LCTL=$LCTL 			FSNAME=$FSNAME 			run_${load}.sh"
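
            A process-group variant of the cleanup (per client node, without the do_nodes quoting) would look roughly like the sketch below. It only illustrates the idea discussed here, killing the children of run_*.sh as well, and is not the actual fix:

            # Sketch: kill the whole process group of the PID saved in the
            # load pid file, so children such as tar, dd and sleep die too.
            stop_load_group() {
                    local pid_file=${1:-/tmp/client-load.pid}
                    local pid pgid

                    [ -f "$pid_file" ] || return 0
                    pid=$(cat "$pid_file")
                    pgid=$(ps -o pgid= -p "$pid" 2>/dev/null | tr -d ' ')
                    if [ -n "$pgid" ]; then
                            kill -s TERM -- "-$pgid"
                    elif [ -n "$pid" ]; then
                            kill -s TERM "$pid" 2>/dev/null
                    fi
                    rm -f "$pid_file"
            }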
            

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28264
            Subject: LU-9602 test: kill all tar and dd processes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 51a95be4f89237a7e2dd16d8af99793b0126c5b9
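
            The patch title suggests sweeping up the load processes by name; purely as an illustration of that approach (this is not the content of the actual change, and $nodes stands for the client node list used elsewhere in the framework):

            # Illustration only: after signalling the saved PIDs, remove any
            # surviving tar/dd workers by name on the client nodes.
            do_nodes $nodes "killall -q -s TERM tar dd run_tar.sh run_dd.sh; true"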

            gerrit Gerrit Updater added a comment

            Also seen in this config with recovery-double-scale: test pairwise_fail: SKIP but marked as FAIL

            jcasper James Casper (Inactive) added a comment

            People

              Assignee: Alex Deiter
              Reporter: James Casper (Inactive)