[LU-10543] racer: /mnt/lustre2 is still busy, wait one second Created: 21/Jan/18 Updated: 30/Oct/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | | |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/9df61cce-fe48-11e7-bd00-52540065bddc

This failure occurs after racer is complete, while the test script is exiting. |
| Comments |
| Comment by James Nunez (Inactive) [ 02/Feb/18 ] |
|
Recent racer hangs: racer test 1 looks like it completes successfully, but we aren't able to unmount the second Lustre mount:
while umount /mnt/lustre2 2>&1 | grep -q busy; do
echo /mnt/lustre2 is still busy, wait one second && sleep 1;
done;
fi
Stopping client trevis-33vm6.trevis.whamcloud.com /mnt/lustre2 opts:
Stopping client trevis-33vm7.trevis.whamcloud.com /mnt/lustre2 opts:
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
It looks like there are no client processes still running:
there should be NO racer processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
Filesystem 1K-blocks Used Available Use% Mounted on
10.9.4.180@tcp:/lustre 13532932 202516 12463840 2% /mnt/lustre
We survived /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds.
there should be NO racer processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
Filesystem 1K-blocks Used Available Use% Mounted on
10.9.4.180@tcp:/lustre 13532932 198284 12471324 2% /mnt/lustre
We survived /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds.
pid=17959 rc=0
pid=17960 rc=0 |
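As an editorial aside (not part of the test logs): when the mount stays busy like this, the processes still pinning it and any tasks stuck in uninterruptible sleep can be listed directly on the client; /mnt/lustre2 is the mount point used in these runs.
# Show which processes still hold files or a working directory on the second mount.
fuser -vm /mnt/lustre2
lsof /mnt/lustre2
# Show tasks in D state (uninterruptible sleep); such tasks keep the
# filesystem busy but ignore signals until the blocked operation finishes.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'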
| Comment by Andreas Dilger [ 09/Aug/18 ] |
|
I checked the stack traces, and unmount does not appear in them. That is not surprising, given that the unmount command returns an error instead of hanging during cleanup. This implies some kind of leak in the vfsmnt or inode refcount, likely due to a change in the dcache or inode handling in llite. |
| Comment by Gerrit Updater [ 13/Aug/18 ] |
|
|
| Comment by James Nunez (Inactive) [ 13/Aug/18 ] |
|
In some cases, we see that racer completes and is marked as PASS, but the next test suite is not able to start and times out because mount2 is busy. See https://testing.whamcloud.com/test_sessions/91c9307e-4e4c-421f-a197-5e9827f00438; racer is marked as PASS, but replay-single hangs. Looking at the suite_log:
-----============= acceptance-small: replay-single ============----- Fri Aug 10 23:26:13 UTC 2018
Running: bash /usr/lib64/lustre/tests/replay-single.sh
== replay-single test complete, duration -o sec ====================================================== 23:26:13 (1533943573)
excepting tests:
Stopping clients: trevis-6vm10,trevis-6vm9.trevis.whamcloud.com /mnt/lustre2 (opts:)
CMD: trevis-6vm10,trevis-6vm9.trevis.whamcloud.com running=\$(grep -c /mnt/lustre2' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre2 opts:;
lsof /mnt/lustre2 || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
pids=\$(lsof -t /mnt/lustre2 | sort -u);
if [ -n \"\$pids\" ]; then
kill -9 \$pids;
fi
fi;
while umount /mnt/lustre2 2>&1 | grep -q busy; do
echo /mnt/lustre2 is still busy, wait one second && sleep 1;
done;
fi
Stopping client trevis-6vm9.trevis.whamcloud.com /mnt/lustre2 opts:
Stopping client trevis-6vm10.trevis.whamcloud.com /mnt/lustre2 opts:
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
...
|
| Comment by Andreas Dilger [ 02/Feb/21 ] |
|
Looking at the stack traces from several recent racer timeouts (run with env=DURATION=3600) shows that there is a lingering "cp" process running on the stuck client:
[ 4688.804205] cp S ffff9ca9258aa0e0 0 6454 1 0x00000080
[ 4688.805787] Call Trace:
[ 4688.806328] [<ffffffff83585da9>] schedule+0x29/0x70
[ 4688.807351] [<ffffffffc0e0f867>] ? vvp_io_init+0x347/0x480 [lustre]
[ 4688.808579] [<ffffffffc08ec15b>] ? cl_io_init0.isra.13+0x8b/0x160 [obdclass]
[ 4688.809924] [<ffffffffc0e04fd1>] ? cl_glimpse_lock+0x311/0x370 [lustre]
[ 4688.811325] [<ffffffffc0e0537d>] ? cl_glimpse_size0+0x22d/0x260 [lustre]
[ 4688.812576] [<ffffffffc0dc607e>] ? ll_getattr_dentry+0x53e/0x8e0 [lustre]
[ 4688.813933] [<ffffffff83060392>] ? user_path_at_empty+0x72/0xc0
[ 4688.815083] [<ffffffffc0dc643b>] ? ll_getattr+0x1b/0x20 [lustre]
[ 4688.816266] [<ffffffff83052b49>] ? vfs_getattr+0x49/0x80
[ 4688.817200] [<ffffffff83052c75>] ? vfs_fstatat+0x75/0xc0
[ 4688.818213] [<ffffffff8305301e>] ? SYSC_newstat+0x2e/0x60
[ 4688.819180] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 4688.820398] [<ffffffff83592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 4688.821538] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 4688.822755] [<ffffffff83592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 4688.823886] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 4688.825089] [<ffffffff83592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 4688.826220] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 4688.827464] [<ffffffff83592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 4688.828684] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 4688.830021] [<ffffffff83592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 4688.831226] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 4688.832570] [<ffffffff830534de>] ? SyS_newstat+0xe/0x10
[ 4688.833591] [<ffffffff83592ed2>] ? system_call_fastpath+0x25/0x2a
[ 4688.834768] [<ffffffff83592e15>] ? system_call_after_swapgs+0xa2/0x13a
:
[ 4748.927655] Lustre: 6454:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1612251797/real 1612251797] req@ffff9ca9157cd680 x1690564340618304/t0(0) o49->lustre-MDT0000-mdc-ffff9ca93876b800@10.9.6.224@tcp:12/10 lens 472/1040 e 0 to 1 dl 1612251804 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'cp.0'

A second better-formed stack is on the other client:
[ 4684.466221] cp S ffffa0ea19d48000 0 16037 1 0x00000080
[ 4684.467522] Call Trace:
[ 4684.467975] [<ffffffff9b785da9>] schedule+0x29/0x70
[ 4684.468962] [<ffffffffc0bc9c85>] ldlm_completion_ast+0x1e5/0x9d0 [ptlrpc]
[ 4684.471257] [<ffffffffc0bb85ed>] ldlm_lock_match_with_skip+0x1ad/0x830 [ptlrpc]
[ 4684.474583] [<ffffffffc0dbbd70>] osc_enqueue_base+0x100/0x530 [osc]
[ 4684.477063] [<ffffffffc0dc64b9>] osc_lock_enqueue+0x359/0x830 [osc]
[ 4684.478339] [<ffffffffc09755a5>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 4684.479542] [<ffffffffc0e21a45>] lov_lock_enqueue+0x95/0x150 [lov]
[ 4684.480727] [<ffffffffc09755a5>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 4684.481887] [<ffffffffc0975b37>] cl_lock_request+0x67/0x1f0 [obdclass]
[ 4684.483127] [<ffffffffc097961b>] cl_io_lock+0xfb/0x280 [obdclass]
[ 4684.484222] [<ffffffffc097a06d>] cl_io_loop+0x8d/0x200 [obdclass]
[ 4684.485523] [<ffffffffc0ec563f>] cl_setattr_ost+0x28f/0x3e0 [lustre]
[ 4684.486645] [<ffffffffc0e9da5a>] ll_setattr_raw+0x102a/0x1140 [lustre]
[ 4684.487880] [<ffffffffc0e9dbd3>] ll_setattr+0x63/0xc0 [lustre]
[ 4684.488931] [<ffffffff9b26cefc>] notify_change+0x30c/0x4d0
[ 4684.489978] [<ffffffff9b24af35>] do_truncate+0x75/0xc0
[ 4684.490898] [<ffffffff9b25d487>] do_last+0x627/0x1340
[ 4684.493054] [<ffffffff9b25e26d>] path_openat+0xcd/0x5a0
[ 4684.496260] [<ffffffff9b2604bd>] do_filp_open+0x4d/0xb0
[ 4684.498210] [<ffffffff9b24c094>] do_sys_open+0x124/0x220 |
| Comment by Andreas Dilger [ 02/Feb/21 ] |
|
A different test run also showed the same stack:
[ 4022.294274] cp S ffff97d81fe99900 0 22278 1 0x00000080
[ 4022.295585] Call Trace:
[ 4022.296011] [<ffffffff9dd85da9>] schedule+0x29/0x70
[ 4022.296868] [<ffffffffc0a9dc85>] ldlm_completion_ast+0x1e5/0x9d0 [ptlrpc]
[ 4022.299128] [<ffffffffc0a8c5ed>] ldlm_lock_match_with_skip+0x1ad/0x830 [ptlrpc]
[ 4022.302394] [<ffffffffc0c8fd70>] osc_enqueue_base+0x100/0x530 [osc]
[ 4022.304799] [<ffffffffc0c9a4b9>] osc_lock_enqueue+0x359/0x830 [osc]
[ 4022.305932] [<ffffffffc08495a5>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 4022.307161] [<ffffffffc0cf5a45>] lov_lock_enqueue+0x95/0x150 [lov]
[ 4022.308241] [<ffffffffc08495a5>] cl_lock_enqueue+0x65/0x120 [obdclass]
[ 4022.309428] [<ffffffffc0849b37>] cl_lock_request+0x67/0x1f0 [obdclass]
[ 4022.310562] [<ffffffffc084d61b>] cl_io_lock+0xfb/0x280 [obdclass]
[ 4022.311687] [<ffffffffc084e06d>] cl_io_loop+0x8d/0x200 [obdclass]
[ 4022.312868] [<ffffffffc0d9963f>] cl_setattr_ost+0x28f/0x3e0 [lustre]
[ 4022.314033] [<ffffffffc0d71a5a>] ll_setattr_raw+0x102a/0x1140 [lustre]
[ 4022.315162] [<ffffffffc0d71bd3>] ll_setattr+0x63/0xc0 [lustre]
[ 4022.316231] [<ffffffff9d86cefc>] notify_change+0x30c/0x4d0
[ 4022.317174] [<ffffffff9d84af35>] do_truncate+0x75/0xc0
[ 4022.318122] [<ffffffff9d85d487>] do_last+0x627/0x1340
[ 4022.320212] [<ffffffff9d85e26d>] path_openat+0xcd/0x5a0
[ 4022.322229] [<ffffffff9d8604bd>] do_filp_open+0x4d/0xb0
[ 4022.324109] [<ffffffff9d84c094>] do_sys_open+0x124/0x220
[ 4022.326209] [<ffffffff9d84c1ae>] SyS_open+0x1e/0x20 |
| Comment by Xinliang Liu [ 14/Sep/23 ] |
|
This issue can easily be hit on the Arm cluster by running the racer test suite several times, with a similar cp stack:
[jenkins@lustre-rec5k7yz-01 ~]$ ps -efH|grep lustre2/
jenkins 873969 873837 0 09:36 pts/1 00:00:00 grep --color=auto lustre2/
root 426463 1 0 Sep13 ? 00:00:00 cp -p /bin/sleep /mnt/lustre2/racer/4
[jenkins@lustre-rec5k7yz-01 ~]$ sudo cat /proc/426463/stack
[<0>] __switch_to+0x7c/0xbc
[<0>] ldlm_completion_ast+0x7a4/0xdbc [ptlrpc]
[<0>] ldlm_lock_match_with_skip+0x1c8/0x854 [ptlrpc]
[<0>] osc_enqueue_base+0x100/0x550 [osc]
[<0>] osc_lock_enqueue+0x434/0xc20 [osc]
[<0>] cl_lock_enqueue+0x98/0x1e0 [obdclass]
[<0>] lov_lock_enqueue+0x90/0x22c [lov]
[<0>] cl_lock_enqueue+0x98/0x1e0 [obdclass]
[<0>] cl_lock_request+0xa0/0x270 [obdclass]
[<0>] cl_lockset_lock+0x12c/0x240 [obdclass]
[<0>] cl_io_lock+0xd0/0x204 [obdclass]
[<0>] cl_io_loop+0xb8/0x280 [obdclass]
[<0>] cl_setattr_ost+0x290/0x364 [lustre]
[<0>] ll_setattr_raw+0x908/0x1160 [lustre]
[<0>] ll_setattr+0x8c/0x130 [lustre]
[<0>] notify_change+0x25c/0x43c
[<0>] do_truncate+0x84/0xe0
[<0>] handle_truncate+0xd4/0x124
[<0>] do_open+0xc4/0x30c
[<0>] path_openat+0x10c/0x1d0
[<0>] do_filp_open+0x84/0x134
[<0>] do_sys_openat2+0x208/0x2f0
[<0>] __arm64_sys_openat+0x6c/0xb4
[<0>] invoke_syscall+0x50/0x11c
[<0>] el0_svc_common.constprop.0+0x158/0x164
[<0>] do_el0_svc+0x2c/0x9c
[<0>] el0_svc+0x20/0x30
[<0>] el0_sync_handler+0xb0/0xb4
[<0>] el0_sync+0x160/0x180
[jenkins@lustre-rec5k7yz-01 ~]$

The related test script is:
lustre-release$ cat lustre/tests/racer/file_exec.sh
#!/bin/bash
trap 'kill $(jobs -p)' EXIT

org_LANG=$LANG
export LANG=C

DIR=$1
MAX=$2
PROG=/bin/sleep

while /bin/true ; do
	file=$((RANDOM % MAX))
	cp -p $PROG $DIR/$file > /dev/null 2>&1
	$DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
	sleep $((RANDOM % 3))
done 2>&1 | egrep -v "Segmentation fault|Bus error"

export LANG=$org_LANG
|
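As an editorial sketch (not from this ticket), the racing pattern that file_exec.sh and file_truncate.sh create between them can be reproduced standalone roughly as below; cp opens an existing destination with O_TRUNC, which leads to the setattr/truncate path seen in the stacks above. The mount point and file name are only examples.
#!/bin/bash
trap 'kill $(jobs -p)' EXIT
DIR=${1:-/mnt/lustre2/racer}
FILE=$DIR/0
mkdir -p $DIR
# Loop 1: repeatedly overwrite the file, truncating it each time.
while true; do cp -p /bin/sleep $FILE > /dev/null 2>&1; done &
# Loop 2: execute and truncate the same file concurrently.
while true; do $FILE 0.1 2> /dev/null; truncate -s 0 $FILE 2> /dev/null; done &
sleep ${DURATION:-60}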
| Comment by Xinliang Liu [ 14/Sep/23 ] |
|
It seems that the problem is here:
cat lustre/tests/racer/racer.sh
...
RACER_PROGS=${RACER_PROGS:-"file_create dir_create file_rm file_rename \
	file_link file_symlink file_list file_concat file_exec file_chown \
	file_chmod file_mknod file_truncate file_delxattr file_getxattr \
	file_setxattr"}
...
racer_cleanup()
{
	echo "racer cleanup"
	for P in $RACER_PROGS; do
		killall -q $P.sh
	done
...
racer_cleanup() should also kill all the child processes of these sh scripts. If the file test scripts are forcibly killed, their child processes cannot be expected to be in a normal state, so they should be killed and cleaned up as well (see the sketch below). |
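A minimal sketch of such a cleanup (editorial, not a patch from this ticket): after killing the racer scripts, also reap whatever they spawned that is still using the racer files. $RACER_DIR is an assumed name for the directory racer ran in, not necessarily the variable racer.sh uses.
racer_cleanup()
{
	echo "racer cleanup"
	for P in $RACER_PROGS; do
		killall -q $P.sh
	done

	# Kill anything the scripts left behind (cp, sleep, ...). With -m,
	# fuser acts on the whole file system that $RACER_DIR lives on.
	# Processes already stuck in uninterruptible sleep (D state) will
	# ignore even SIGKILL until the operation they are blocked on completes.
	local retry
	for retry in $(seq 10); do
		fuser -s -m "$RACER_DIR" || break
		fuser -k -m "$RACER_DIR" 2> /dev/null
		sleep 1
	done
}
This is essentially what patch 46384 attempts; as the following comments show, it cannot help once a process is already stuck in D state inside a lock enqueue.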
| Comment by Andreas Dilger [ 15/Sep/23 ] |
|
xinliang please try https://review.whamcloud.com/46384 "LU-15248 tests: kill all fs users at the end of racer" to see if this fixes the problem for you. |
| Comment by Xinliang Liu [ 15/Sep/23 ] |
|
Hi adilger, with patch 46384 it still gets stuck. These cp processes cannot be killed by the fuser command.

Stuck logs:
...
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
/mnt/lustre2 is still busy, wait one second
...
[root@lustre-pdbirbyf-02 ~]# ps -efH|grep "cp -p"
root 3362672 3360694 0 08:01 pts/0 00:00:00 grep --color=auto cp -p
root 2975599 1 0 05:54 ? 00:00:00 cp -p /bin/sleep /mnt/lustre/racer/1
root 2982999 1 0 05:54 ? 00:00:00 cp -p /bin/sleep /mnt/lustre/racer/1
root 1553902 1 0 06:37 ? 00:00:00 cp -p /bin/sleep /mnt/lustre2/racer3/11
[root@lustre-pdbirbyf-02 ~]# fuser -k -m /mnt/lustre2/racer3/
[root@lustre-pdbirbyf-02 ~]# ps -efH|grep "cp -p"
root 3362795 3360694 0 08:01 pts/0 00:00:00 grep --color=auto cp -p
root 2975599 1 0 05:54 ? 00:00:00 cp -p /bin/sleep /mnt/lustre/racer/1
root 2982999 1 0 05:54 ? 00:00:00 cp -p /bin/sleep /mnt/lustre/racer/1
root 1553902 1 0 06:37 ? 00:00:00 cp -p /bin/sleep /mnt/lustre2/racer3/11

And some cp processes in D (disk sleep) state are unkillable:
[root@lustre-pdbirbyf-02 ~]# kill -9 1553902
[root@lustre-pdbirbyf-02 ~]# ps --ppid 1 -fH|grep cp
root 1553902 1 0 06:37 ? 00:00:00 cp -p /bin/sleep /mnt/lustre2/racer3/11
root 2982999 1 0 05:54 ? 00:00:00 cp -p /bin/sleep /mnt/lustre/racer/1
root 2975599 1 0 05:54 ? 00:00:00 cp -p /bin/sleep /mnt/lustre/racer/1
[root@lustre-pdbirbyf-02 ~]# killall -v cp
Killed cp(1553902) with signal 15
Killed cp(2975599) with signal 15
Killed cp(2982999) with signal 15
[root@lustre-pdbirbyf-02 ~]# ps --ppid 1 -fH|grep cp
root 1553902 1 0 06:37 ? 00:00:00 cp -p /bin/sleep /mnt/lustre2/racer3/11 |
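As an editorial aside, a quick way to confirm that such a process is in uninterruptible sleep and to see where it is blocked (the PID is the one from the listing above):
# STAT containing "D" means uninterruptible sleep; WCHAN shows the kernel
# function the task is waiting in, and /proc/<pid>/stack gives the full trace.
ps -o pid,stat,wchan:32,cmd -p 1553902
sudo cat /proc/1553902/stack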
| Comment by Andreas Dilger [ 15/Sep/23 ] |
|
There is a second patch https://review.whamcloud.com/45605 "LU-15248 tests: improve racer cleanup" for that same ticket. However, it isn't clear to me whether that patch is just hiding the test failure or whether it could be close to a real fix for the problem. It might be enough to add the hunks from that patch in obd_get_request_slot():
	if (rc == -EINTR)
		rc = -ERESTARTSYS;
and then add a check in ptlrpc_interrupted_set() to see if the request is a lock enqueue and don't interrupt it in this case. Something like the following (which I haven't tested at all and is just a guess at what might work):
	list_for_each_entry(req, &set->set_requests, rq_set_chain) {
		if (req->rq_intr)
			continue;

		if (req->rq_phase != RQ_PHASE_RPC &&
		    req->rq_phase != RQ_PHASE_UNREG_RPC &&
		    !req->rq_allow_intr)
			continue;

		/* new check: skip LDLM_ENQUEUE RPCs unless the signal is fatal */
		if (req->rq_reqmsg &&
		    lustre_msg_get_opc(req->rq_reqmsg) == LDLM_ENQUEUE &&
		    !__fatal_signal_pending(current))
			continue;

		/* existing tail of the loop: mark the request interrupted */
		ptlrpc_mark_interrupted(req);
	}
This would have the drawback that LDLM_ENQUEUE RPCs could not normally be interrupted, but at least the threads wouldn't be stuck permanently... The alternative might be a check in ldlm_completion_ast() for whether the task can be interrupted. It looks like this function is already using l_wait_event_abortable(), but there have been a number of changes in this area, and it may be that the interrupt handling is not working properly (or it is interrupted but restarts immediately); the Lustre client debug logs should show one way or the other. |
| Comment by Xinliang Liu [ 25/Sep/23 ] |
|
Thanks for the explanation, Andreas. I'm trying to understand the problem and will try patch 45605 later. It looks like the problem is a little complicated: the racer test suite runs racing test scripts for DURATION seconds and then kills all of them. It seems the Lustre threads should be killed gracefully by a signal like SIGTERM, but that doesn't work yet, right? Lustre should handle such a signal and exit properly. |