Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.6
Labels: None
Severity: 3
Description
runtests test_1 hangs on MDS umount, but only on b2_12 with DNE. We have seen this hang six times, starting on 4 June 2021.
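For context, a minimal sketch of how a comparable run can be set up locally from a b2_12 build tree is below. The MDS count, backend fstype, and auster options are assumptions, not the exact autotest invocation; the console logs only tell us the targets are ZFS-backed and that at least MDT0000 and MDT0002 (mds3) exist.

    # Assumed local reproduction sketch (adjust to match the failing sessions).
    cd lustre/tests
    export MDSCOUNT=4        # DNE setup: more than one MDT
    export FSTYPE=zfs        # the failing runs export zpools, so ZFS-backed targets
    bash ./auster -r -v runtests   # -r reformat targets, -v verbose (option names assumed)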
Looking at the hang at https://testing.whamcloud.com/test_sets/e9a05afa-2c52-4e7a-9528-e85f89d9d571, the last thing we see in the suite_log is
CMD: trevis-45vm4 grep -c /mnt/lustre-mds3' ' /proc/mounts || true
Stopping /mnt/lustre-mds3 (opts:-f) on trevis-45vm4
CMD: trevis-45vm4 umount -d -f /mnt/lustre-mds3
Looking at the console log for MDS1/3 (vm4), we see the MDS hanging on unmount (many umount processes are hung with the following trace):
[ 4271.014683] Lustre: DEBUG MARKER: ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
[ 4271.014683] grep -q ^lustre-mdt1/ /proc/mounts ||
[ 4271.014683] zpool export lustre-mdt1
[ 4271.481809] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
[ 4271.483819] LustreError: Skipped 2 previous similar messages
[ 4271.484875] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 4271.487858] Lustre: Skipped 23 previous similar messages
[ 4271.489169] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 4271.492149] LustreError: Skipped 1108 previous similar messages
[ 4278.514239] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1623099860/real 1623099860] req@ffff9e71cdd2fa80 x1701939306318336/t0(0) o400->MGC10.9.6.22@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1623099867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 4278.518985] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
[ 4279.969273] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
[ 4280.342472] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
[ 4440.193242] INFO: task umount:25803 blocked for more than 120 seconds.
[ 4440.194850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4440.196316] umount          D ffff9e71ce9fe300     0 25803  25802 0x00000080
[ 4440.197835] Call Trace:
[ 4440.198401]  [<ffffffffb6b89e69>] schedule_preempt_disabled+0x29/0x70
[ 4440.199694]  [<ffffffffb6b87dc7>] __mutex_lock_slowpath+0xc7/0x1d0
[ 4440.200883]  [<ffffffffb6b8719f>] mutex_lock+0x1f/0x2f
[ 4440.201932]  [<ffffffffc0f904d7>] mgc_process_config+0x207/0x13f0 [mgc]
[ 4440.203416]  [<ffffffffc0d186d5>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
[ 4440.205020]  [<ffffffffc0baa177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4440.206335]  [<ffffffffc0d04cd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 4440.207675]  [<ffffffffc0d1994f>] lustre_end_log+0x1ff/0x550 [obdclass]
[ 4440.209047]  [<ffffffffc0d46bbe>] server_put_super+0x82e/0xd00 [obdclass]
[ 4440.210386]  [<ffffffffb6697359>] ? fsnotify_unmount_inodes+0x119/0x1d0
[ 4440.211643]  [<ffffffffb66507cd>] generic_shutdown_super+0x6d/0x100
[ 4440.212885]  [<ffffffffb6650bd2>] kill_anon_super+0x12/0x20
[ 4440.213994]  [<ffffffffc0d17d72>] lustre_kill_super+0x32/0x50 [obdclass]
[ 4440.215280]  [<ffffffffb6650fae>] deactivate_locked_super+0x4e/0x70
[ 4440.216493]  [<ffffffffb6651736>] deactivate_super+0x46/0x60
[ 4440.217607]  [<ffffffffb6670dcf>] cleanup_mnt+0x3f/0x80
[ 4440.218648]  [<ffffffffb6670e62>] __cleanup_mnt+0x12/0x20
[ 4440.219731]  [<ffffffffb64c28db>] task_work_run+0xbb/0xe0
[ 4440.220786]  [<ffffffffb642cc65>] do_notify_resume+0xa5/0xc0
[ 4440.221907]  [<ffffffffb6b962ef>] int_signal+0x12/0x17
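The stack ends in mutex_lock() inside mgc_process_config(), called from lustre_end_log() during server_put_super(), so the umount is waiting on the MGC configuration mutex rather than on storage. If the hang is caught live, additional state could be captured on the hung MDS with something like the commands below; these are generic kernel/Lustre interfaces, and the output file name is only an illustration.

    # Run as root on the hung MDS (e.g. trevis-45vm4) while umount is stuck.
    pid=$(pgrep -x umount | head -n1)    # pick one of the blocked umount processes
    cat /proc/$pid/stack                 # current kernel stack of that task
    echo w > /proc/sysrq-trigger         # dump all D-state tasks to the console log
    lctl dk /tmp/lustre-debug.log        # flush the Lustre kernel debug buffer to a file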
On the client (vm1), we see:
[ 4178.949972] echo Stopping client $(hostname) /mnt/lustre opts:;
[ 4178.949972] lsof /mnt/lustre || need_kill=no;
[ 4178.949972] if [ x != x -a x$need_kill != xno ]; then
[ 4178.949972] pids=$(lsof -t /mnt/lustre | sort -u);
[ 4178.949972] if [
[ 4320.175524] INFO: task umount:18671 blocked for more than 120 seconds.
[ 4320.176904] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4320.178305] umount          D ffff92277fc1acc0     0 18671  18662 0x00000080
[ 4320.179747] Call Trace:
[ 4320.180288]  [<ffffffff91589e69>] schedule_preempt_disabled+0x29/0x70
[ 4320.181455]  [<ffffffff91587dc7>] __mutex_lock_slowpath+0xc7/0x1d0
[ 4320.182589]  [<ffffffff9158719f>] mutex_lock+0x1f/0x2f
[ 4320.183545]  [<ffffffffc0a73297>] mgc_process_config+0x207/0x13f0 [mgc]
[ 4320.185118]  [<ffffffffc0773315>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
[ 4320.186590]  [<ffffffffc075fb99>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 4320.187966]  [<ffffffffc077458f>] lustre_end_log+0x1ff/0x550 [obdclass]
[ 4320.189271]  [<ffffffffc0bf49ee>] ll_put_super+0x8e/0x9b0 [lustre]
[ 4320.190454]  [<ffffffff90f598ad>] ? call_rcu_sched+0x1d/0x20
[ 4320.191554]  [<ffffffffc0c1c7cc>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[ 4320.192801]  [<ffffffff9106c31b>] ? destroy_inode+0x3b/0x60
[ 4320.193851]  [<ffffffff9106c455>] ? evict+0x115/0x180
[ 4320.194822]  [<ffffffff9106c503>] ? dispose_list+0x43/0x60
[ 4320.195865]  [<ffffffff91097279>] ? fsnotify_unmount_inodes+0x119/0x1d0
[ 4320.197108]  [<ffffffff910507cd>] generic_shutdown_super+0x6d/0x100
[ 4320.198280]  [<ffffffff91050bd2>] kill_anon_super+0x12/0x20
[ 4320.199352]  [<ffffffffc07729c5>] lustre_kill_super+0x45/0x50 [obdclass]
[ 4320.200597]  [<ffffffff91050fae>] deactivate_locked_super+0x4e/0x70
[ 4320.201784]  [<ffffffff91051736>] deactivate_super+0x46/0x60
[ 4320.202858]  [<ffffffff91070dbf>] cleanup_mnt+0x3f/0x80
[ 4320.203851]  [<ffffffff91070e52>] __cleanup_mnt+0x12/0x20
[ 4320.204889]  [<ffffffff90ec28db>] task_work_run+0xbb/0xe0
[ 4320.205935]  [<ffffffff90e2cc65>] do_notify_resume+0xa5/0xc0
[ 4320.207018]  [<ffffffff915962ef>] int_signal+0x12/0x17
[ 4367.764390] LustreError: 166-1: MGC10.9.6.22@tcp: Connection to MGS (at 10.9.6.22@tcp) was lost; in progress operations using this service will fail
[ 4367.766831] LustreError: Skipped 8 previous similar messages
[ 4367.767935] LustreError: 375:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623099556, 300s ago), entering recovery for MGS@10.9.6.22@tcp ns: MGC10.9.6.22@tcp lock: ffff9227613cf200/0x605efa0a5bb89566 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x8475cec67ff59212 expref: -99 pid: 375 timeout: 0 lvb_type: 0
[ 4367.776090] LustreError: 18684:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.6.22@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9227638ad600) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 4367.961924] Lustre: Unmounted lustre-client
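The client shows the same signature: umount blocked in mgc_process_config() via lustre_end_log(), plus a CR lock enqueue on the MGS configuration resource that times out after 300s before the client finally unmounts. If a hung session is caught live, the MGC import and lock state on the client could be inspected with something like the commands below; the parameter names follow the usual lctl layout and may differ slightly between versions.

    # Inspect the MGC device, its import, and its DLM namespace on the client.
    lctl dl | grep -i mgc                                       # confirm the MGC device name
    lctl get_param mgc.MGC10.9.6.22@tcp.import                  # connection/import state to the MGS
    lctl get_param ldlm.namespaces.MGC10.9.6.22@tcp.lock_count
    lctl get_param ldlm.namespaces.MGC10.9.6.22@tcp.resource_count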
Logs for other test sessions with this hang are at:
https://testing.whamcloud.com/test_sets/6cbb5be8-e342-41f1-99ac-0681b5db24a9
https://testing.whamcloud.com/test_sets/1649ef8e-387d-47f5-aff8-d1449a7c9d7c
https://testing.whamcloud.com/test_sets/6440455d-7572-44d3-ad4a-9a55e6bcaa4c