[LU-9602] recovery-random-scale test_fail_client_mds: PASS but marked as FAIL
| Created: | 05/Jun/17 | Updated: | 14/Apr/23 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Casper | Assignee: | Alex Deiter |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | onyx, failover |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.hpdd.intel.com/test_sessions/2df73ff0-8703-4bf1-9f76-2ea9f98949fb

From suite_log:

PASS fail_client_mds (85783s)
Starting client: onyx-36vm1.onyx.hpdd.intel.com: -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
== recovery-random-scale test complete, duration 85822 sec =========================================== 17:10:21 (1495991421)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-36vm6.onyx.hpdd.intel.com/etc': Directory not empty
recovery-random-scale : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4952:error()
= /usr/lib64/lustre/tests/test-framework.sh:4471:check_and_cleanup_lustre()
= /usr/lib64/lustre/tests/recovery-random-scale.sh:267:main()

and

Status: PASS: rc=0
CMD: onyx-36vm5,onyx-36vm6 test -f /tmp/client-load.pid &&
{ kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
SKIP pairwise_fail (3867s)
Starting client: onyx-36vm1.onyx.hpdd.intel.com: -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
== recovery-double-scale test complete, duration 3904 sec ============================================ 18:20:55 (1495995655)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-36vm6.onyx.hpdd.intel.com/etc/fonts/conf.d': Directory not empty
recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4952:error()
= /usr/lib64/lustre/tests/test-framework.sh:4471:check_and_cleanup_lustre()
= /usr/lib64/lustre/tests/recovery-double-scale.sh:309:main()
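
For context, the cleanup failures above look like a race: check_and_cleanup_lustre() tries to remove the d0.* sub-test directories while a client load is still writing into them. A minimal sketch of a pre-cleanup guard, assuming the usual CLIENTS list and do_nodes() helper from test-framework.sh (the wait_client_loads_gone name is hypothetical, not an existing helper):

# Hypothetical guard: do not remove the sub-test dirs until no client
# still has a /tmp/client-load.pid, i.e. no run_*.sh load is alive.
wait_client_loads_gone() {
    local clients=$1
    local pid_file=${2:-/tmp/client-load.pid}
    local end=$((SECONDS + 60))

    while (( SECONDS < end )); do
        # succeeds only once the pid file is gone on every client
        do_nodes $clients "! test -f $pid_file" && return 0
        sleep 1
    done
    return 1
}

# Usage sketch: wait_client_loads_gone $CLIENTS || error "client loads still running"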
|
| Comments |
| Comment by James Casper [ 05/Jun/17 ] |
|
Also seen in this config with recovery-double-scale: test pairwise_fail: SKIP but marked as FAIL |
| Comment by Gerrit Updater [ 27/Jul/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28264 |
| Comment by James Nunez (Inactive) [ 08/Feb/18 ] |
|
We see several of the recovery-*-scale tests fail in this way; we are not able to clean up the Lustre directory.

For example, a recent failure in recovery-mds-scale test_failover_mds provides some new information: https://testing.hpdd.intel.com/test_sets/4929a310-fded-11e7-bd00-52540065bddc

Looking at the suite_log, we see the test fail due to rm failing:

rm: cannot remove '/mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com/etc': Directory not empty
recovery-mds-scale : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5335:error()
= /usr/lib64/lustre/tests/test-framework.sh:4829:check_and_cleanup_lustre()

Later in the log, we can see that run_tar.sh is still running:

Stopping client onyx-41vm1.onyx.hpdd.intel.com /mnt/lustre opts:
Stopping client onyx-41vm4.onyx.hpdd.intel.com /mnt/lustre opts:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sleep 3581 root cwd DIR 1273,181606 4096 144115205289292362 /mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com
run_tar.s 13671 root cwd DIR 1273,181606 4096 144115205289292362 /mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second

The run_tar, run_dd, etc. processes are started and the PID of each is stored in a file called client-load.pid in /tmp on the client node. In this case, vm4 started run_tar, as seen here:

Started client load: tar on onyx-41vm4
CMD: onyx-41vm4 PATH=/ […] LCTL=/usr/sbin/lctl FSNAME=lustre run_tar.sh

client loads pids:
CMD: onyx-41vm3,onyx-41vm4 cat /tmp/client-load.pid
onyx-41vm3: 12666
onyx-41vm4: 13671

In stop_process(), we kill the PID saved in /tmp/client-load.pid:

# Stop the process which had its PID saved in a file.
stop_process() {
    local nodes=$1
    local pid_file=$2

    [ -z "$nodes" -o -z "$pid_file" ] && return 0

    do_nodes $nodes "test -f $pid_file &&
        { kill -s TERM \\\$(cat $pid_file); rm -f $pid_file; }" || true
}
We see that vm3 tries to clean up its run_dd PID and fails. What about vm4?

CMD: onyx-41vm3,onyx-41vm4 test -f /tmp/client-load.pid &&
{ kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
onyx-41vm3: sh: line 1: kill: (12666) - No such process
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 104: 9503 Killed do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS LCTL=$LCTL FSNAME=$FSNAME run_${load}.sh"
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 104: 9697 Killed do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK BREAK_ON_ERROR=$BREAK_ON_ERROR END_RUN_FILE=$END_RUN_FILE LOAD_PID_FILE=$LOAD_PID_FILE TESTLOG_PREFIX=$TESTLOG_PREFIX TESTNAME=$TESTNAME DBENCH_LIB=$DBENCH_LIB DBENCH_SRC=$DBENCH_SRC CLIENT_COUNT=$((CLIENTCOUNT - 1)) LFS=$LFS LCTL=$LCTL FSNAME=$FSNAME run_${load}.sh"
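
The open-file listing above shows run_tar.sh's children (the sleep, etc.) still holding their cwd under /mnt/lustre even after the saved PID was signalled. One possible direction, shown here only as an illustration (stop_process_group is not an existing test-framework.sh function, and it assumes the load script was started as a process-group leader, e.g. via setsid), is to signal the whole process group recorded in the pid file rather than the single PID:

# Illustration only: kill the entire process group behind the pid file,
# so run_tar.sh and its tar/sleep children terminate together.
stop_process_group() {
    local nodes=$1
    local pid_file=$2

    [ -z "$nodes" -o -z "$pid_file" ] && return 0

    # "kill -- -PID" targets a process group; this only works if the load
    # was launched with setsid so that its PID equals its PGID.
    do_nodes $nodes "test -f $pid_file &&
        { kill -s TERM -- -\\\$(cat $pid_file); rm -f $pid_file; }" || true
}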
|
| Comment by James Nunez (Inactive) [ 28/May/21 ] |
|
Looking at the results at https://testing.whamcloud.com/test_sets/3bb67f2c-156c-42e3-8b23-9050d1d799a4, the test passes, but we can see in the suite_log that not all processes were killed:

Stopping client onyx-42vm3.onyx.whamcloud.com /mnt/lustre opts:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
run_dd.sh 16846 root cwd unknown 1273,181606 /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com
dd 16909 root cwd unknown 1273,181606 /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com
dd 16909 root 1w REG 1273,181606 26873016320 144115205272502274 /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com/dd-file
/mnt/lustre is still busy, wait one second

Looking at the run_dd_debug log on onyx-42vm3, we tried to kill the process but:

2021-05-14 00:35:09: client load was signaled to terminate
++++ ps -eo '%c %p %r'
++++ awk '/ 16845 / {print $3}'
+++ local PGID=
+++ kill -TERM -
/usr/lib64/lustre/tests/functions.sh: line 166: kill: -: arguments must be process or job IDs
+++ sleep 5
+++ kill -KILL -
/usr/lib64/lustre/tests/functions.sh: line 168: kill: -: arguments must be process or job IDs
++ sleep 5
++ kill -KILL -16804
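
The "arguments must be process or job IDs" errors happen because the PGID lookup returns an empty string once PID 16845 has already exited, so kill is invoked with a bare "-". A guarded variant of that lookup, sketched under the assumption that the PID of interest is passed in (the kill_load_group name is illustrative, not the actual functions.sh code):

# Illustrative guard: only attempt the group kill when the target process
# still exists and therefore has a PGID we can look up.
kill_load_group() {
    local pid=$1
    local pgid

    pgid=$(ps -o pgid= -p "$pid" 2>/dev/null | tr -d ' ')
    if [ -n "$pgid" ]; then
        kill -TERM -$pgid
        sleep 5
        kill -KILL -$pgid 2>/dev/null
    else
        echo "process $pid already gone, nothing left to signal"
    fi
}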
|
| Comment by Elena Gryaznova [ 19/Nov/21 ] |
|
One more instance of "remove sub-test dirs failed" marked as PASS:

== recovery-double-scale test complete, duration 10761 sec ========================================================== 07:46:23 (1637307983)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-78vm3.onyx.whamcloud.com': Directory not empty
rm: cannot remove '/mnt/lustre/d0.tar-onyx-78vm7.onyx.whamcloud.com/etc': Directory not empty
recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed |