LU-14749: runtests test 1 hangs on MDS unmount

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Severity: 3

    Description

      runtests test_1 is hanging on MDS unmount, but only on b2_12 with DNE. We have seen this hang six times starting on 4 June 2021.

      Looking at the hang at https://testing.whamcloud.com/test_sets/e9a05afa-2c52-4e7a-9528-e85f89d9d571, the last thing we see in the suite_log is

      CMD: trevis-45vm4 grep -c /mnt/lustre-mds3' ' /proc/mounts || true
      Stopping /mnt/lustre-mds3 (opts:-f) on trevis-45vm4
      CMD: trevis-45vm4 umount -d -f /mnt/lustre-mds3
      
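      For reference, here is a minimal sketch of what that stop sequence amounts to on the MDS node (the actual test-framework helper may differ; the mount point is taken from the log above):

      # Hedged sketch of the stop sequence shown in the suite_log; the names
      # come from the log above, not from the test-framework source.
      mnt=/mnt/lustre-mds3
      # Count how many times the target appears in /proc/mounts; '|| true'
      # keeps the step from failing when the target is already unmounted.
      grep -c "$mnt " /proc/mounts || true
      # Force-unmount the MDT; this is the command the umount task hangs in.
      umount -d -f "$mnt"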

      Looking at the console log for MDS1/3 (vm4), we see the MDS hanging on unmount (many umount processes are hung with the following trace):

      [ 4271.014683] Lustre: DEBUG MARKER: ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
      [ 4271.014683] 			grep -q ^lustre-mdt1/ /proc/mounts ||
      [ 4271.014683] 			zpool export  lustre-mdt1
      [ 4271.481809] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
      [ 4271.483819] LustreError: Skipped 2 previous similar messages
      [ 4271.484875] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [ 4271.487858] Lustre: Skipped 23 previous similar messages
      [ 4271.489169] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 4271.492149] LustreError: Skipped 1108 previous similar messages
      [ 4278.514239] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1623099860/real 1623099860]  req@ffff9e71cdd2fa80 x1701939306318336/t0(0) o400->MGC10.9.6.22@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1623099867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 4278.518985] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
      [ 4279.969273] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
      [ 4280.342472] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
      [ 4440.193242] INFO: task umount:25803 blocked for more than 120 seconds.
      [ 4440.194850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4440.196316] umount          D ffff9e71ce9fe300     0 25803  25802 0x00000080
      [ 4440.197835] Call Trace:
      [ 4440.198401]  [<ffffffffb6b89e69>] schedule_preempt_disabled+0x29/0x70
      [ 4440.199694]  [<ffffffffb6b87dc7>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 4440.200883]  [<ffffffffb6b8719f>] mutex_lock+0x1f/0x2f
      [ 4440.201932]  [<ffffffffc0f904d7>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 4440.203416]  [<ffffffffc0d186d5>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
      [ 4440.205020]  [<ffffffffc0baa177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 4440.206335]  [<ffffffffc0d04cd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 4440.207675]  [<ffffffffc0d1994f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 4440.209047]  [<ffffffffc0d46bbe>] server_put_super+0x82e/0xd00 [obdclass]
      [ 4440.210386]  [<ffffffffb6697359>] ? fsnotify_unmount_inodes+0x119/0x1d0
      [ 4440.211643]  [<ffffffffb66507cd>] generic_shutdown_super+0x6d/0x100
      [ 4440.212885]  [<ffffffffb6650bd2>] kill_anon_super+0x12/0x20
      [ 4440.213994]  [<ffffffffc0d17d72>] lustre_kill_super+0x32/0x50 [obdclass]
      [ 4440.215280]  [<ffffffffb6650fae>] deactivate_locked_super+0x4e/0x70
      [ 4440.216493]  [<ffffffffb6651736>] deactivate_super+0x46/0x60
      [ 4440.217607]  [<ffffffffb6670dcf>] cleanup_mnt+0x3f/0x80
      [ 4440.218648]  [<ffffffffb6670e62>] __cleanup_mnt+0x12/0x20
      [ 4440.219731]  [<ffffffffb64c28db>] task_work_run+0xbb/0xe0
      [ 4440.220786]  [<ffffffffb642cc65>] do_notify_resume+0xa5/0xc0
      [ 4440.221907]  [<ffffffffb6b962ef>] int_signal+0x12/0x17
      

      On the client (vm1), we see

      [ 4178.949972] echo Stopping client $(hostname) /mnt/lustre opts:;
      [ 4178.949972] lsof /mnt/lustre || need_kill=no;
      [ 4178.949972] if [ x != x -a x$need_kill != xno ]; then
      [ 4178.949972]     pids=$(lsof -t /mnt/lustre | sort -u);
      [ 4178.949972]     if 
      [ 4320.175524] INFO: task umount:18671 blocked for more than 120 seconds.
      [ 4320.176904] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4320.178305] umount          D ffff92277fc1acc0     0 18671  18662 0x00000080
      [ 4320.179747] Call Trace:
      [ 4320.180288]  [<ffffffff91589e69>] schedule_preempt_disabled+0x29/0x70
      [ 4320.181455]  [<ffffffff91587dc7>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 4320.182589]  [<ffffffff9158719f>] mutex_lock+0x1f/0x2f
      [ 4320.183545]  [<ffffffffc0a73297>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 4320.185118]  [<ffffffffc0773315>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
      [ 4320.186590]  [<ffffffffc075fb99>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 4320.187966]  [<ffffffffc077458f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 4320.189271]  [<ffffffffc0bf49ee>] ll_put_super+0x8e/0x9b0 [lustre]
      [ 4320.190454]  [<ffffffff90f598ad>] ? call_rcu_sched+0x1d/0x20
      [ 4320.191554]  [<ffffffffc0c1c7cc>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 4320.192801]  [<ffffffff9106c31b>] ? destroy_inode+0x3b/0x60
      [ 4320.193851]  [<ffffffff9106c455>] ? evict+0x115/0x180
      [ 4320.194822]  [<ffffffff9106c503>] ? dispose_list+0x43/0x60
      [ 4320.195865]  [<ffffffff91097279>] ? fsnotify_unmount_inodes+0x119/0x1d0
      [ 4320.197108]  [<ffffffff910507cd>] generic_shutdown_super+0x6d/0x100
      [ 4320.198280]  [<ffffffff91050bd2>] kill_anon_super+0x12/0x20
      [ 4320.199352]  [<ffffffffc07729c5>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 4320.200597]  [<ffffffff91050fae>] deactivate_locked_super+0x4e/0x70
      [ 4320.201784]  [<ffffffff91051736>] deactivate_super+0x46/0x60
      [ 4320.202858]  [<ffffffff91070dbf>] cleanup_mnt+0x3f/0x80
      [ 4320.203851]  [<ffffffff91070e52>] __cleanup_mnt+0x12/0x20
      [ 4320.204889]  [<ffffffff90ec28db>] task_work_run+0xbb/0xe0
      [ 4320.205935]  [<ffffffff90e2cc65>] do_notify_resume+0xa5/0xc0
      [ 4320.207018]  [<ffffffff915962ef>] int_signal+0x12/0x17
      [ 4367.764390] LustreError: 166-1: MGC10.9.6.22@tcp: Connection to MGS (at 10.9.6.22@tcp) was lost; in progress operations using this service will fail
      [ 4367.766831] LustreError: Skipped 8 previous similar messages
      [ 4367.767935] LustreError: 375:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623099556, 300s ago), entering recovery for MGS@10.9.6.22@tcp ns: MGC10.9.6.22@tcp lock: ffff9227613cf200/0x605efa0a5bb89566 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x8475cec67ff59212 expref: -99 pid: 375 timeout: 0 lvb_type: 0
      [ 4367.776090] LustreError: 18684:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.6.22@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9227638ad600) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 4367.961924] Lustre: Unmounted lustre-client
      

      Logs for other test sessions with this hang are at:
      https://testing.whamcloud.com/test_sets/6cbb5be8-e342-41f1-99ac-0681b5db24a9
      https://testing.whamcloud.com/test_sets/1649ef8e-387d-47f5-aff8-d1449a7c9d7c
      https://testing.whamcloud.com/test_sets/6440455d-7572-44d3-ad4a-9a55e6bcaa4c

    Activity


            adilger Andreas Dilger added a comment -

            Haven't seen this recently.

            adilger Andreas Dilger added a comment -

            This looks like a duplicate of LU-7372, which is listed as the reason why replay-dual test_26 is in the ALWAYS_EXCEPT list due to constant test failures.

            If you follow the breadcrumbs of the MGS timeout messages from runtests, "enqueued at nnn, 300s ago", it lands in replay-dual test_25 at:

            [ 3574.966184] LustreError: 6741:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.7.66@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff97001a4b5300) refcount nonzero (1) after lock cleanup; forcing cleanup.
            

            which implies that test_25 or test_24 is the cause of test_26 timing out, since test_23d is itself unmounting mds1. It now seems that any test doing an MDS unmount afterward will also time out, since test_25 is the second-last subtest in replay-dual and test_28 doesn't do any unmounting of its own.
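
            A minimal sketch of that breadcrumb search, assuming an illustrative console log file name and the message formats quoted above:

            # Hypothetical file name; use the actual MDS or client console log.
            log=console.mds1.log
            # Pull the enqueue epochs out of the lock-timeout messages ...
            grep -o 'enqueued at [0-9]*' "$log" | sort -u
            # ... and compare them against the subtest start markers, whose
            # epoch appears in parentheses at the end of each marker line.
            grep 'DEBUG MARKER: == replay-dual test' "$log"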


            adilger Andreas Dilger added a comment -

            The test has failed 12 times since 2021-06-04, but only for patch review test sessions (where it runs after replay-dual), not for full test sessions. In the console logs I see an MGS lock timeout which predates the start of runtests in the logs:

            [ 4191.078567] LustreError: 14293:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319496, 300s ago), entering recovery for MGS@10.9.4.167@tcp ns: MGC10.9.4.167@tcp lock: ffff9920c80ca900/0x3c8702c41b0f74bc lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3c8702c41b0f74c3 expref: -99 pid: 14293 timeout: 0 lvb_type: 0
            [ 4223.498558] Lustre: DEBUG MARKER: == runtests test 1: All Runtests =================================== 10:10:28 (1623319828)
            [ 4491.085906] LustreError: 14293:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319796, 300s ago), entering recovery for MGS@10.9.4.167@tcp ns: MGC10.9.4.167@tcp lock: ffff9920c1c8d440/0x3c8702c41b0f77a9 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3c8702c41b0f77b0 expref: -99 pid: 14293 timeout: 0 lvb_type: 0
            [ 4491.092849] LustreError: 5423:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.4.167@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9920c6afce40) refcount nonzero (1) after lock cleanup; forcing cleanup.
            

            and on the other MDS:

            [ 4181.557888] LustreError: 24494:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319487, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff9920dcdbb440/0x3c8702c41b0f7461 lrc: 3/0,1 mode: --/EX res: [0x65727473756c:0x2:0x0].0x0 rrc: 16 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 24494 timeout: 0 lvb_type: 0
            [ 4181.564239] LustreError: dumping log to /tmp/lustre-log.1623319787.24494
            

            The '[0x65727473756c:0x2:0x0]' resource is "lustre" in ASCII, which I guess is the filesystem name enqueued for the configuration llog on the MGS?
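
            A quick check of that decoding (a minimal sketch): reading the hex value's bytes in reverse, i.e. little-endian, order gives the ASCII name.

            # Split 0x65727473756c into hex bytes, reverse them, and convert
            # the result back to ASCII.
            echo 65727473756c | fold -w2 | tac | tr -d '\n' | xxd -r -p; echo
            # prints: lustre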

            One possibility is that replay-dual may be exiting with the MGC unmounted, and this is somehow causing problems for runtests to unmount or remount the filesystem? The passing full-part-1 session has runtests immediately before replay-dual and runs with a single MDT. I suspect an "easy out" would be to reverse the order of these tests, but it may be that the timeout will move to the next test down the line...


            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 2
