[LU-9602] recovery-random-scale test_fail_client_mds: PASS but marked as FAIL Created: 05/Jun/17  Updated: 14/Apr/23

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Casper Assignee: Alex Deiter
Resolution: Unresolved Votes: 0
Labels: None
Environment:

onyx, failover
clients: EL7, master branch, v2.9.58, b3591
servers: EL7, zfs, master branch, v2.9.58, b3591


Issue Links:
Related
is related to LU-15140 recovery-random-scale: No sub tests f... Resolved
is related to LU-14738 recovery-small: FAIL: remove sub-test... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/2df73ff0-8703-4bf1-9f76-2ea9f98949fb

From suite_log:

PASS fail_client_mds (85783s)
Starting client: onyx-36vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
== recovery-random-scale test complete, duration 85822 sec =========================================== 17:10:21 (1495991421)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-36vm6.onyx.hpdd.intel.com/etc': Directory not empty
 recovery-random-scale : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4952:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4471:check_and_cleanup_lustre()
  = /usr/lib64/lustre/tests/recovery-random-scale.sh:267:main()

and

Status: PASS: rc=0
CMD: onyx-36vm5,onyx-36vm6 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
SKIP pairwise_fail (3867s)
Starting client: onyx-36vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-36vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-36vm7:onyx-36vm3:/lustre /mnt/lustre
== recovery-double-scale test complete, duration 3904 sec ============================================ 18:20:55 (1495995655)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-36vm6.onyx.hpdd.intel.com/etc/fonts/conf.d': Directory not empty
 recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4952:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4471:check_and_cleanup_lustre()
  = /usr/lib64/lustre/tests/recovery-double-scale.sh:309:main()


 Comments   
Comment by James Casper [ 05/Jun/17 ]

Also seen in this config with recovery-double-scale: test pairwise_fail: SKIP but marked as FAIL

Comment by Gerrit Updater [ 27/Jul/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28264
Subject: LU-9602 test: kill all tar and dd processes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 51a95be4f89237a7e2dd16d8af99793b0126c5b9

Comment by James Nunez (Inactive) [ 08/Feb/18 ]

We see several of the recovery-*-scale tests fail in this way: we are not able to clean up the Lustre test directory.

For example, a recent failure in recovery-mds-scale test_failover_mds provides some new information: https://testing.hpdd.intel.com/test_sets/4929a310-fded-11e7-bd00-52540065bddc. Looking at the suite_log, we see the test fail because rm fails:

rm: cannot remove '/mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com/etc': Directory not empty
recovery-mds-scale : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5335:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4829:check_and_cleanup_lustre()

Later in the log, we can see that run_tar.sh is still running:

Stopping client onyx-41vm1.onyx.hpdd.intel.com /mnt/lustre opts:
Stopping client onyx-41vm4.onyx.hpdd.intel.com /mnt/lustre opts:
COMMAND     PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
sleep      3581 root  cwd    DIR 1273,181606     4096 144115205289292362 /mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com
run_tar.s 13671 root  cwd    DIR 1273,181606     4096 144115205289292362 /mnt/lustre/d0.tar-onyx-41vm4.onyx.hpdd.intel.com
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second

The run_tar, run_dd, etc. processes are started, and the PID of each is stored in a file named client-load.pid under /tmp on the client node (a sketch of this pattern follows the log below). In this case, vm4 started run_tar, as seen here:

Started client load: tar on onyx-41vm4
CMD: onyx-41vm4 PATH=/ […]	LCTL=/usr/sbin/lctl 			FSNAME=lustre 			run_tar.sh
client loads pids:
CMD: onyx-41vm3,onyx-41vm4 cat /tmp/client-load.pid
onyx-41vm3: 12666
onyx-41vm4: 13671
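
For reference, a minimal sketch of that PID-file pattern (the wrapper below is illustrative, not the actual run_tar.sh; only the /tmp/client-load.pid convention and the d0.tar-<hostname> working directory are taken from the logs above):

#!/bin/bash
# Illustrative client-load wrapper: record our own PID so the test
# framework can signal us later, then loop the workload until told
# to stop.
LOAD_PID_FILE=${LOAD_PID_FILE:-/tmp/client-load.pid}
TESTDIR=$MOUNT/d0.tar-$(hostname)

echo $$ > "$LOAD_PID_FILE"
mkdir -p "$TESTDIR" && cd "$TESTDIR" || exit 1

while true; do
    # The same shape of load as run_tar.sh: repeatedly untar /etc.
    tar cf - /etc 2>/dev/null | tar xf - || break
    sleep 1
done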

In stop_process(), we kill the PID saved in /tmp/client-load.pid:

# Stop the process which had its PID saved in a file.
stop_process() {
    local nodes=$1
    local pid_file=$2

    [ -z "$nodes" -o -z "$pid_file" ] && return 0

    do_nodes $nodes "test -f $pid_file &&
        { kill -s TERM \\\$(cat $pid_file); rm -f $pid_file; }" || true
}
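
Note that stop_process() signals only the single PID recorded in the file; cleaning up the children (tar, dd, sleep) depends on the wrapper forwarding that signal, and when the forwarding fails, processes like the sleep and run_tar.sh shown above keep /mnt/lustre busy. One alternative would be to signal the whole process group directly; a single-node sketch, with stop_load_group as a hypothetical helper rather than the actual fix:

# Sketch: terminate the saved PID's entire process group so that
# children such as tar, dd and sleep die together with the wrapper.
stop_load_group() {
    local pid_file=$1
    [ -s "$pid_file" ] || return 0

    local pid pgid
    pid=$(cat "$pid_file")
    pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')

    # kill(1) interprets a negative PID as "signal the whole group".
    [ -n "$pgid" ] && kill -TERM -- "-$pgid"
    rm -f "$pid_file"
}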

We see that vm3 tries to clean up its run_dd PID and fails. What about vm4:

CMD: onyx-41vm3,onyx-41vm4 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
onyx-41vm3: sh: line 1: kill: (12666) - No such process
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 104:  9503 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK 			BREAK_ON_ERROR=$BREAK_ON_ERROR 			END_RUN_FILE=$END_RUN_FILE 			LOAD_PID_FILE=$LOAD_PID_FILE 			TESTLOG_PREFIX=$TESTLOG_PREFIX 			TESTNAME=$TESTNAME 			DBENCH_LIB=$DBENCH_LIB 			DBENCH_SRC=$DBENCH_SRC 			CLIENT_COUNT=$((CLIENTCOUNT - 1)) 			LFS=$LFS 			LCTL=$LCTL 			FSNAME=$FSNAME 			run_${load}.sh"
/usr/lib64/lustre/tests/recovery-mds-scale.sh: line 104:  9697 Killed                  do_node $client "PATH=$PATH MOUNT=$MOUNT ERRORS_OK=$ERRORS_OK 			BREAK_ON_ERROR=$BREAK_ON_ERROR 			END_RUN_FILE=$END_RUN_FILE 			LOAD_PID_FILE=$LOAD_PID_FILE 			TESTLOG_PREFIX=$TESTLOG_PREFIX 			TESTNAME=$TESTNAME 			DBENCH_LIB=$DBENCH_LIB 			DBENCH_SRC=$DBENCH_SRC 			CLIENT_COUNT=$((CLIENTCOUNT - 1)) 			LFS=$LFS 			LCTL=$LCTL 			FSNAME=$FSNAME 			run_${load}.sh"
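
In the spirit of the patch above ("kill all tar and dd processes"), a blunter cleanup would sweep any leftover load processes by name on each client. This is only a sketch of that idea, not the contents of change 28264; cleanup_client_loads is a hypothetical helper:

# Sketch: after the PID-file kill, sweep leftover load processes by
# name so nothing keeps /mnt/lustre busy. The names come from the
# logs above; matching bare "tar"/"dd" is deliberately heavy-handed
# and only defensible on a dedicated test client.
cleanup_client_loads() {
    local nodes=$1
    do_nodes $nodes "pkill -TERM -f run_tar.sh;
                     pkill -TERM -f run_dd.sh;
                     pkill -TERM -x tar; pkill -TERM -x dd" || true
}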
Comment by James Nunez (Inactive) [ 28/May/21 ]

Looking at the results at https://testing.whamcloud.com/test_sets/3bb67f2c-156c-42e3-8b23-9050d1d799a4, the test passes, but we can see in the suite_log that not all processes were killed:

Stopping client onyx-42vm3.onyx.whamcloud.com /mnt/lustre opts:
COMMAND     PID USER   FD      TYPE      DEVICE    SIZE/OFF               NODE NAME
run_dd.sh 16846 root  cwd   unknown 1273,181606                                /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com
dd        16909 root  cwd   unknown 1273,181606                                /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com
dd        16909 root    1w      REG 1273,181606 26873016320 144115205272502274 /mnt/lustre/d0.dd-onyx-42vm3.onyx.whamcloud.com/dd-file
/mnt/lustre is still busy, wait one second

Looking at the run_dd_debug log on onyx-42vm3, we see that the framework tried to kill the process, but the process-group lookup came back empty:

2021-05-14 00:35:09: client load was signaled to terminate
++++ ps -eo '%c %p %r'
++++ awk '/ 16845 / {print $3}'
+++ local PGID=
+++ kill -TERM -
/usr/lib64/lustre/tests/functions.sh: line 166: kill: -: arguments must be process or job IDs
+++ sleep 5
+++ kill -KILL -
/usr/lib64/lustre/tests/functions.sh: line 168: kill: -: arguments must be process or job IDs
++ sleep 5
++ kill -KILL -16804
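
The bare "kill -TERM -" above happens because the ps/awk lookup for the parent PID (16845) matched nothing, so PGID expanded to the empty string. Guarding the lookup would avoid signalling a non-existent group; a sketch of the idea, not the actual functions.sh code, with 16845 standing in for the parent PID:

# Sketch: resolve the load's process group and only signal it when
# the lookup actually succeeded.
pgid=$(ps -o pgid= -p 16845 2>/dev/null | tr -d ' ')

if [ -n "$pgid" ]; then
    kill -TERM -- "-$pgid"
    sleep 5
    # Escalate to KILL only if the group still exists.
    kill -0 -- "-$pgid" 2>/dev/null && kill -KILL -- "-$pgid"
fi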
Comment by Elena Gryaznova [ 19/Nov/21 ]

One more instance of "remove sub-test dirs failed" being marked as PASS:
https://testing.whamcloud.com/test_logs/966cb14d-a670-47ba-8141-f5f54088c23c/show_text

== recovery-double-scale test complete, duration 10761 sec ========================================================== 07:46:23 (1637307983)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-78vm3.onyx.whamcloud.com': Directory not empty
rm: cannot remove '/mnt/lustre/d0.tar-onyx-78vm7.onyx.whamcloud.com/etc': Directory not empty
 recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed 