[LU-14749] runtests test 1 hangs on MDS unmount Created: 09/Jun/21 Updated: 26/Mar/23 Resolved: 26/Mar/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
runtests test_1 is hanging on MDS umount only on b2_12 with DNE. We've seen this hang six times starting on 04 June 2021.

Looking at the hang at https://testing.whamcloud.com/test_sets/e9a05afa-2c52-4e7a-9528-e85f89d9d571, the last thing we see in the suite_log is

CMD: trevis-45vm4 grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Stopping /mnt/lustre-mds3 (opts:-f) on trevis-45vm4
CMD: trevis-45vm4 umount -d -f /mnt/lustre-mds3

Looking at the console log for MDS1/3 (vm4), we see the MDS hanging on unmount (many umount processes hung with):

[ 4271.014683] Lustre: DEBUG MARKER: ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
[ 4271.014683] grep -q ^lustre-mdt1/ /proc/mounts ||
[ 4271.014683] zpool export lustre-mdt1
[ 4271.481809] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
[ 4271.483819] LustreError: Skipped 2 previous similar messages
[ 4271.484875] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 4271.487858] Lustre: Skipped 23 previous similar messages
[ 4271.489169] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 4271.492149] LustreError: Skipped 1108 previous similar messages
[ 4278.514239] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1623099860/real 1623099860] req@ffff9e71cdd2fa80 x1701939306318336/t0(0) o400->MGC10.9.6.22@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1623099867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 4278.518985] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
[ 4279.969273] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
[ 4280.342472] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
[ 4440.193242] INFO: task umount:25803 blocked for more than 120 seconds.
[ 4440.194850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4440.196316] umount D ffff9e71ce9fe300 0 25803 25802 0x00000080
[ 4440.197835] Call Trace:
[ 4440.198401] [<ffffffffb6b89e69>] schedule_preempt_disabled+0x29/0x70
[ 4440.199694] [<ffffffffb6b87dc7>] __mutex_lock_slowpath+0xc7/0x1d0
[ 4440.200883] [<ffffffffb6b8719f>] mutex_lock+0x1f/0x2f
[ 4440.201932] [<ffffffffc0f904d7>] mgc_process_config+0x207/0x13f0 [mgc]
[ 4440.203416] [<ffffffffc0d186d5>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
[ 4440.205020] [<ffffffffc0baa177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4440.206335] [<ffffffffc0d04cd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 4440.207675] [<ffffffffc0d1994f>] lustre_end_log+0x1ff/0x550 [obdclass]
[ 4440.209047] [<ffffffffc0d46bbe>] server_put_super+0x82e/0xd00 [obdclass]
[ 4440.210386] [<ffffffffb6697359>] ? fsnotify_unmount_inodes+0x119/0x1d0
[ 4440.211643] [<ffffffffb66507cd>] generic_shutdown_super+0x6d/0x100
[ 4440.212885] [<ffffffffb6650bd2>] kill_anon_super+0x12/0x20
[ 4440.213994] [<ffffffffc0d17d72>] lustre_kill_super+0x32/0x50 [obdclass]
[ 4440.215280] [<ffffffffb6650fae>] deactivate_locked_super+0x4e/0x70
[ 4440.216493] [<ffffffffb6651736>] deactivate_super+0x46/0x60
[ 4440.217607] [<ffffffffb6670dcf>] cleanup_mnt+0x3f/0x80
[ 4440.218648] [<ffffffffb6670e62>] __cleanup_mnt+0x12/0x20
[ 4440.219731] [<ffffffffb64c28db>] task_work_run+0xbb/0xe0
[ 4440.220786] [<ffffffffb642cc65>] do_notify_resume+0xa5/0xc0
[ 4440.221907] [<ffffffffb6b962ef>] int_signal+0x12/0x17

On the client (vm1), we see

[ 4178.949972] echo Stopping client $(hostname) /mnt/lustre opts:;
[ 4178.949972] lsof /mnt/lustre || need_kill=no;
[ 4178.949972] if [ x != x -a x$need_kill != xno ]; then
[ 4178.949972] pids=$(lsof -t /mnt/lustre | sort -u);
[ 4178.949972] if
[ 4320.175524] INFO: task umount:18671 blocked for more than 120 seconds.
[ 4320.176904] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4320.178305] umount D ffff92277fc1acc0 0 18671 18662 0x00000080
[ 4320.179747] Call Trace:
[ 4320.180288] [<ffffffff91589e69>] schedule_preempt_disabled+0x29/0x70
[ 4320.181455] [<ffffffff91587dc7>] __mutex_lock_slowpath+0xc7/0x1d0
[ 4320.182589] [<ffffffff9158719f>] mutex_lock+0x1f/0x2f
[ 4320.183545] [<ffffffffc0a73297>] mgc_process_config+0x207/0x13f0 [mgc]
[ 4320.185118] [<ffffffffc0773315>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
[ 4320.186590] [<ffffffffc075fb99>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 4320.187966] [<ffffffffc077458f>] lustre_end_log+0x1ff/0x550 [obdclass]
[ 4320.189271] [<ffffffffc0bf49ee>] ll_put_super+0x8e/0x9b0 [lustre]
[ 4320.190454] [<ffffffff90f598ad>] ? call_rcu_sched+0x1d/0x20
[ 4320.191554] [<ffffffffc0c1c7cc>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[ 4320.192801] [<ffffffff9106c31b>] ? destroy_inode+0x3b/0x60
[ 4320.193851] [<ffffffff9106c455>] ? evict+0x115/0x180
[ 4320.194822] [<ffffffff9106c503>] ? dispose_list+0x43/0x60
[ 4320.195865] [<ffffffff91097279>] ? fsnotify_unmount_inodes+0x119/0x1d0
[ 4320.197108] [<ffffffff910507cd>] generic_shutdown_super+0x6d/0x100
[ 4320.198280] [<ffffffff91050bd2>] kill_anon_super+0x12/0x20
[ 4320.199352] [<ffffffffc07729c5>] lustre_kill_super+0x45/0x50 [obdclass]
[ 4320.200597] [<ffffffff91050fae>] deactivate_locked_super+0x4e/0x70
[ 4320.201784] [<ffffffff91051736>] deactivate_super+0x46/0x60
[ 4320.202858] [<ffffffff91070dbf>] cleanup_mnt+0x3f/0x80
[ 4320.203851] [<ffffffff91070e52>] __cleanup_mnt+0x12/0x20
[ 4320.204889] [<ffffffff90ec28db>] task_work_run+0xbb/0xe0
[ 4320.205935] [<ffffffff90e2cc65>] do_notify_resume+0xa5/0xc0
[ 4320.207018] [<ffffffff915962ef>] int_signal+0x12/0x17
[ 4367.764390] LustreError: 166-1: MGC10.9.6.22@tcp: Connection to MGS (at 10.9.6.22@tcp) was lost; in progress operations using this service will fail
[ 4367.766831] LustreError: Skipped 8 previous similar messages
[ 4367.767935] LustreError: 375:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623099556, 300s ago), entering recovery for MGS@10.9.6.22@tcp ns: MGC10.9.6.22@tcp lock: ffff9227613cf200/0x605efa0a5bb89566 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x8475cec67ff59212 expref: -99 pid: 375 timeout: 0 lvb_type: 0
[ 4367.776090] LustreError: 18684:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.6.22@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9227638ad600) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 4367.961924] Lustre: Unmounted lustre-client

Logs for other test sessions with this hang are at |
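For anyone looking at a live occurrence of this hang, the following is a minimal diagnostic sketch, not part of the Lustre test framework. It assumes a root shell on the hung MDS node (trevis-45vm4 in the session above), a kernel that exposes /proc/<pid>/stack, and sysrq enabled.

    # Find umount processes stuck in uninterruptible sleep (D state) and their wait channel.
    ps axo pid,stat,wchan:32,cmd | grep '[u]mount'

    # Dump the kernel stack of the first stuck umount; if it matches the console log above,
    # it shows mgc_process_config() blocked in mutex_lock() under lustre_end_log().
    cat /proc/$(pgrep -x umount | head -n1)/stack

    # Alternatively, ask the kernel to log all blocked (D state) tasks to the console.
    echo w > /proc/sysrq-trigger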
| Comments |
| Comment by Andreas Dilger [ 11/Jun/21 ] |
|
The test has failed 12 times since 2021-06-04, but only for patch review test sessions (where it runs after replay-dual), not for full test sessions.

In the console logs I see an MGS lock timeout which predates the start of runtests in the logs:

[ 4191.078567] LustreError: 14293:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319496, 300s ago), entering recovery for MGS@10.9.4.167@tcp ns: MGC10.9.4.167@tcp lock: ffff9920c80ca900/0x3c8702c41b0f74bc lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3c8702c41b0f74c3 expref: -99 pid: 14293 timeout: 0 lvb_type: 0
[ 4223.498558] Lustre: DEBUG MARKER: == runtests test 1: All Runtests =================================== 10:10:28 (1623319828)
[ 4491.085906] LustreError: 14293:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319796, 300s ago), entering recovery for MGS@10.9.4.167@tcp ns: MGC10.9.4.167@tcp lock: ffff9920c1c8d440/0x3c8702c41b0f77a9 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3c8702c41b0f77b0 expref: -99 pid: 14293 timeout: 0 lvb_type: 0
[ 4491.092849] LustreError: 5423:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.4.167@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9920c6afce40) refcount nonzero (1) after lock cleanup; forcing cleanup.

and on the other MDS:

[ 4181.557888] LustreError: 24494:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319487, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff9920dcdbb440/0x3c8702c41b0f7461 lrc: 3/0,1 mode: --/EX res: [0x65727473756c:0x2:0x0].0x0 rrc: 16 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 24494 timeout: 0 lvb_type: 0
[ 4181.564239] LustreError: dumping log to /tmp/lustre-log.1623319787.24494

The '[0x65727473756c:0x2:0x0]' resource is "lustre" in ASCII, which I guess is the filesystem name enqueued for the configuration llog on the MGS?

One possibility is that replay-dual may be exiting with the MGC unmounted, and this is somehow causing problems for runtests to unmount or remount the filesystem? The passing full-part-1 session has runtests immediately before replay-dual and runs with a single MDT.

I suspect an "easy out" would be to reverse the order of these tests, but it may be that the timeout will move to the next test down the line... |
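For reference, the resource name can be decoded without any Lustre tooling; a quick sketch, assuming the common xxd and rev utilities are available, that reverses the little-endian byte order of the first field:

    # Interpret the first field of [0x65727473756c:0x2:0x0] as little-endian ASCII bytes:
    # 0x6c='l', 0x75='u', 0x73='s', 0x74='t', 0x72='r', 0x65='e'.
    echo 65727473756c | xxd -r -p | rev; echo
    # prints: lustre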
| Comment by Andreas Dilger [ 11/Jun/21 ] |
|
This looks like a duplicate of

If you follow the breadcrumbs of the MGS timeout messages from runtests, "enqueued at nnn, 300s ago", it lands in replay-dual test_25 at:

[ 3574.966184] LustreError: 6741:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.7.66@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff97001a4b5300) refcount nonzero (1) after lock cleanup; forcing cleanup.

which implies it or test_24 is the cause of test_26 timing out, since test_23d is itself unmounting mds1. It now seems any test doing an MDS unmount afterward will also time out, since test_25 is the second-last subtest in replay-dual, and test_28 doesn't do any unmounting of its own. |
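The breadcrumb-following can be done mechanically by converting the "enqueued at" epoch seconds to wall-clock time and matching them against the subtest start markers in the console log. A sketch using the timestamps quoted in the previous comment; it assumes GNU date and that the node clocks are UTC (the runtests marker "10:10:28 (1623319828)" matches epoch 1623319828 under UTC):

    date -ud @1623319496   # MGS lock enqueue time -> Thu Jun 10 10:04:56 UTC 2021
    date -ud @1623319828   # runtests test 1 start -> Thu Jun 10 10:10:28 UTC 2021
    # The enqueue is 332s before runtests starts, i.e. while replay-dual was still running.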
| Comment by Andreas Dilger [ 26/Mar/23 ] |
|
Haven't seen this recently. |