LU-14749: runtests test 1 hangs on MDS unmount

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Severity: 3

    Description

      runtests test_1 is hanging on MDS unmount, but only on b2_12 with DNE. We have seen this hang six times starting on 4 June 2021.

      Looking at the hang at https://testing.whamcloud.com/test_sets/e9a05afa-2c52-4e7a-9528-e85f89d9d571, the last thing we see in the suite_log is

      CMD: trevis-45vm4 grep -c /mnt/lustre-mds3' ' /proc/mounts || true
      Stopping /mnt/lustre-mds3 (opts:-f) on trevis-45vm4
      CMD: trevis-45vm4 umount -d -f /mnt/lustre-mds3
      
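      For reference, here is a minimal sketch of what that stop sequence amounts to on the MDS node (the actual test-framework helper may differ; the mount point is taken from the log above):

      # Hedged sketch of the stop sequence shown in the suite_log; the names
      # come from the log above, not from the test-framework source.
      mnt=/mnt/lustre-mds3
      # Count how many times the target appears in /proc/mounts; '|| true'
      # keeps the step from failing when the target is already unmounted.
      grep -c "$mnt " /proc/mounts || true
      # Force-unmount the MDT; this is the command the umount task hangs in.
      umount -d -f "$mnt"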

      Looking at the console log for MDS1/3 (vm4), we see the MDS hanging on unmount (many umount processes are hung with the following trace):

      [ 4271.014683] Lustre: DEBUG MARKER: ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
      [ 4271.014683] 			grep -q ^lustre-mdt1/ /proc/mounts ||
      [ 4271.014683] 			zpool export  lustre-mdt1
      [ 4271.481809] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
      [ 4271.483819] LustreError: Skipped 2 previous similar messages
      [ 4271.484875] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [ 4271.487858] Lustre: Skipped 23 previous similar messages
      [ 4271.489169] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 4271.492149] LustreError: Skipped 1108 previous similar messages
      [ 4278.514239] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1623099860/real 1623099860]  req@ffff9e71cdd2fa80 x1701939306318336/t0(0) o400->MGC10.9.6.22@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1623099867 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 4278.518985] Lustre: 14222:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
      [ 4279.969273] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds3' ' /proc/mounts || true
      [ 4280.342472] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds3
      [ 4440.193242] INFO: task umount:25803 blocked for more than 120 seconds.
      [ 4440.194850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4440.196316] umount          D ffff9e71ce9fe300     0 25803  25802 0x00000080
      [ 4440.197835] Call Trace:
      [ 4440.198401]  [<ffffffffb6b89e69>] schedule_preempt_disabled+0x29/0x70
      [ 4440.199694]  [<ffffffffb6b87dc7>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 4440.200883]  [<ffffffffb6b8719f>] mutex_lock+0x1f/0x2f
      [ 4440.201932]  [<ffffffffc0f904d7>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 4440.203416]  [<ffffffffc0d186d5>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
      [ 4440.205020]  [<ffffffffc0baa177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 4440.206335]  [<ffffffffc0d04cd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 4440.207675]  [<ffffffffc0d1994f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 4440.209047]  [<ffffffffc0d46bbe>] server_put_super+0x82e/0xd00 [obdclass]
      [ 4440.210386]  [<ffffffffb6697359>] ? fsnotify_unmount_inodes+0x119/0x1d0
      [ 4440.211643]  [<ffffffffb66507cd>] generic_shutdown_super+0x6d/0x100
      [ 4440.212885]  [<ffffffffb6650bd2>] kill_anon_super+0x12/0x20
      [ 4440.213994]  [<ffffffffc0d17d72>] lustre_kill_super+0x32/0x50 [obdclass]
      [ 4440.215280]  [<ffffffffb6650fae>] deactivate_locked_super+0x4e/0x70
      [ 4440.216493]  [<ffffffffb6651736>] deactivate_super+0x46/0x60
      [ 4440.217607]  [<ffffffffb6670dcf>] cleanup_mnt+0x3f/0x80
      [ 4440.218648]  [<ffffffffb6670e62>] __cleanup_mnt+0x12/0x20
      [ 4440.219731]  [<ffffffffb64c28db>] task_work_run+0xbb/0xe0
      [ 4440.220786]  [<ffffffffb642cc65>] do_notify_resume+0xa5/0xc0
      [ 4440.221907]  [<ffffffffb6b962ef>] int_signal+0x12/0x17
      

      On the client (vm1), we see

      [ 4178.949972] echo Stopping client $(hostname) /mnt/lustre opts:;
      [ 4178.949972] lsof /mnt/lustre || need_kill=no;
      [ 4178.949972] if [ x != x -a x$need_kill != xno ]; then
      [ 4178.949972]     pids=$(lsof -t /mnt/lustre | sort -u);
      [ 4178.949972]     if 
      [ 4320.175524] INFO: task umount:18671 blocked for more than 120 seconds.
      [ 4320.176904] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4320.178305] umount          D ffff92277fc1acc0     0 18671  18662 0x00000080
      [ 4320.179747] Call Trace:
      [ 4320.180288]  [<ffffffff91589e69>] schedule_preempt_disabled+0x29/0x70
      [ 4320.181455]  [<ffffffff91587dc7>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 4320.182589]  [<ffffffff9158719f>] mutex_lock+0x1f/0x2f
      [ 4320.183545]  [<ffffffffc0a73297>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 4320.185118]  [<ffffffffc0773315>] obd_process_config.constprop.14+0x75/0x210 [obdclass]
      [ 4320.186590]  [<ffffffffc075fb99>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 4320.187966]  [<ffffffffc077458f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 4320.189271]  [<ffffffffc0bf49ee>] ll_put_super+0x8e/0x9b0 [lustre]
      [ 4320.190454]  [<ffffffff90f598ad>] ? call_rcu_sched+0x1d/0x20
      [ 4320.191554]  [<ffffffffc0c1c7cc>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 4320.192801]  [<ffffffff9106c31b>] ? destroy_inode+0x3b/0x60
      [ 4320.193851]  [<ffffffff9106c455>] ? evict+0x115/0x180
      [ 4320.194822]  [<ffffffff9106c503>] ? dispose_list+0x43/0x60
      [ 4320.195865]  [<ffffffff91097279>] ? fsnotify_unmount_inodes+0x119/0x1d0
      [ 4320.197108]  [<ffffffff910507cd>] generic_shutdown_super+0x6d/0x100
      [ 4320.198280]  [<ffffffff91050bd2>] kill_anon_super+0x12/0x20
      [ 4320.199352]  [<ffffffffc07729c5>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 4320.200597]  [<ffffffff91050fae>] deactivate_locked_super+0x4e/0x70
      [ 4320.201784]  [<ffffffff91051736>] deactivate_super+0x46/0x60
      [ 4320.202858]  [<ffffffff91070dbf>] cleanup_mnt+0x3f/0x80
      [ 4320.203851]  [<ffffffff91070e52>] __cleanup_mnt+0x12/0x20
      [ 4320.204889]  [<ffffffff90ec28db>] task_work_run+0xbb/0xe0
      [ 4320.205935]  [<ffffffff90e2cc65>] do_notify_resume+0xa5/0xc0
      [ 4320.207018]  [<ffffffff915962ef>] int_signal+0x12/0x17
      [ 4367.764390] LustreError: 166-1: MGC10.9.6.22@tcp: Connection to MGS (at 10.9.6.22@tcp) was lost; in progress operations using this service will fail
      [ 4367.766831] LustreError: Skipped 8 previous similar messages
      [ 4367.767935] LustreError: 375:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623099556, 300s ago), entering recovery for MGS@10.9.6.22@tcp ns: MGC10.9.6.22@tcp lock: ffff9227613cf200/0x605efa0a5bb89566 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x8475cec67ff59212 expref: -99 pid: 375 timeout: 0 lvb_type: 0
      [ 4367.776090] LustreError: 18684:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.6.22@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9227638ad600) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 4367.961924] Lustre: Unmounted lustre-client
      

      Logs for other test sessions with this hang are at:
      https://testing.whamcloud.com/test_sets/6cbb5be8-e342-41f1-99ac-0681b5db24a9
      https://testing.whamcloud.com/test_sets/1649ef8e-387d-47f5-aff8-d1449a7c9d7c
      https://testing.whamcloud.com/test_sets/6440455d-7572-44d3-ad4a-9a55e6bcaa4c

    Activity


            adilger Andreas Dilger added a comment -

            Haven't seen this recently.

            adilger Andreas Dilger added a comment -

            This looks like a duplicate of LU-7372, which is listed as the reason why replay-dual test_26 is in the ALWAYS_EXCEPT list due to constant test failures.

            If you follow the breadcrumbs of the MGS timeout messages from runtests, "enqueued at nnn, 300s ago", it lands in replay-dual test_25 at:

            [ 3574.966184] LustreError: 6741:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.7.66@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff97001a4b5300) refcount nonzero (1) after lock cleanup; forcing cleanup.
            

            which implies that test_25 or test_24 is the cause of test_26 timing out, since test_23d is itself unmounting mds1. It now seems that any test doing an MDS unmount afterward will also time out, since test_25 is the second-last subtest in replay-dual and test_28 doesn't do any unmounting of its own.
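
            A minimal sketch of that breadcrumb search, assuming an illustrative console log file name and the message formats quoted above:

            # Hypothetical file name; use the actual MDS or client console log.
            log=console.mds1.log
            # Pull the enqueue epochs out of the lock-timeout messages ...
            grep -o 'enqueued at [0-9]*' "$log" | sort -u
            # ... and compare them against the subtest start markers, whose
            # epoch appears in parentheses at the end of each marker line.
            grep 'DEBUG MARKER: == replay-dual test' "$log"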


            adilger Andreas Dilger added a comment -

            The test has failed 12 times since 2021-06-04, but only for patch review test sessions (where it runs after replay-dual), not for full test sessions. In the console logs I see an MGS lock timeout which predates the start of runtests in the logs:

            [ 4191.078567] LustreError: 14293:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319496, 300s ago), entering recovery for MGS@10.9.4.167@tcp ns: MGC10.9.4.167@tcp lock: ffff9920c80ca900/0x3c8702c41b0f74bc lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3c8702c41b0f74c3 expref: -99 pid: 14293 timeout: 0 lvb_type: 0
            [ 4223.498558] Lustre: DEBUG MARKER: == runtests test 1: All Runtests =================================== 10:10:28 (1623319828)
            [ 4491.085906] LustreError: 14293:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319796, 300s ago), entering recovery for MGS@10.9.4.167@tcp ns: MGC10.9.4.167@tcp lock: ffff9920c1c8d440/0x3c8702c41b0f77a9 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3c8702c41b0f77b0 expref: -99 pid: 14293 timeout: 0 lvb_type: 0
            [ 4491.092849] LustreError: 5423:0:(ldlm_resource.c:1137:ldlm_resource_complain()) MGC10.9.4.167@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff9920c6afce40) refcount nonzero (1) after lock cleanup; forcing cleanup.
            

            and on the other MDS:

            [ 4181.557888] LustreError: 24494:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1623319487, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff9920dcdbb440/0x3c8702c41b0f7461 lrc: 3/0,1 mode: --/EX res: [0x65727473756c:0x2:0x0].0x0 rrc: 16 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 24494 timeout: 0 lvb_type: 0
            [ 4181.564239] LustreError: dumping log to /tmp/lustre-log.1623319787.24494
            

            The '[0x65727473756c:0x2:0x0]' resource is "lustre" in ASCII, which I guess is the filesystem name enqueued for the configuration llog on the MGS?
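
            A quick check of that decoding (a minimal sketch): reading the hex value's bytes in reverse, i.e. little-endian, order gives the ASCII name.

            # Split 0x65727473756c into hex bytes, reverse them, and convert
            # the result back to ASCII.
            echo 65727473756c | fold -w2 | tac | tr -d '\n' | xxd -r -p; echo
            # prints: lustre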

            One possibility is that replay-dual may be exiting with the MGC unmounted, and this is somehow causing problems for runtests to unmount or remount the filesystem? The passing full-part-1 session has runtests immediately before replay-dual and runs with a single MDT. I suspect an "easy out" would be to reverse the order of these tests, but it may be that the timeout will move to the next test down the line...


            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 2
