
MDT stuck in WAITING, abort_recov stuck too

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.3
    • None
    • 3

    Description

      A network switch went down, taking a number of compute nodes with it.
      A bunch of compute nodes were stuck, and the outage also caused some OSSes to hit
      an LBUG (LU-12906) and crash.

      The MDS went into soft lockups before crashing. When it came back, 3 out of 4 MDTs
      mounted and recovered, but one MDT went into WAITING and stayed there.
      lctl abort_recov had no effect on its status, so the MDS was rebooted.

      When the MDT again went into the WAITING state, a pre-emptive abort_recov was issued
      before it could time out, but it did not help and the MDT continued to try to recover.
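
A minimal sketch of how the stuck state could be detected from a script. It assumes the text comes from something like `lctl get_param -n mdt.<target>.recovery_status` for a single target, and that the output is one `key: value` pair per line with a `status` key. The function names are hypothetical and not part of Lustre.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict.

    Assumption: the input is the recovery_status text of ONE target,
    with one 'key: value' pair per line. Lines without a colon are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_stuck_waiting(text):
    """Return True if the parsed status field is exactly 'WAITING'."""
    return parse_recovery_status(text).get("status") == "WAITING"
```

Polling `is_stuck_waiting()` before and after issuing the abort would show whether the abort changed the status, which is the check that failed in this report.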

      2020-05-08 18:52:15 [ 1640.880984] Pid: 15922, comm: mdt02_020 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
      2020-05-08 18:52:15 [ 1640.892505] Call Trace:
      2020-05-08 18:52:15 [ 1640.895708]  [<ffffffffc169bdc0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
      2020-05-08 18:52:15 [ 1640.903498]  [<ffffffffc169c153>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      2020-05-08 18:52:15 [ 1640.911383]  [<ffffffffc1a9eaf3>] osp_remote_sync+0xd3/0x200 [osp]
      2020-05-08 18:52:15 [ 1640.918760]  [<ffffffffc1a84c63>] osp_attr_get+0x463/0x730 [osp]
      2020-05-08 18:52:15 [ 1640.925917]  [<ffffffffc1a818cd>] osp_object_init+0x16d/0x2d0 [osp]
      2020-05-08 18:52:15 [ 1640.933361]  [<ffffffffc141c59b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      2020-05-08 18:52:15 [ 1640.941977]  [<ffffffffc1420471>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      2020-05-08 18:52:15 [ 1640.950100]  [<ffffffffc1420d06>] lu_object_find+0x16/0x20 [obdclass]
      2020-05-08 18:52:15 [ 1640.957737]  [<ffffffffc194a01b>] mdt_object_find+0x4b/0x170 [mdt]
      2020-05-08 18:52:15 [ 1640.965074]  [<ffffffffc194cc38>] mdt_getattr_name_lock+0x848/0x1c30 [mdt]
      2020-05-08 18:52:15 [ 1640.973194]  [<ffffffffc1954d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      2020-05-08 18:52:15 [ 1640.980928]  [<ffffffffc1951bb5>] mdt_intent_policy+0x435/0xd80 [mdt]
      2020-05-08 18:52:15 [ 1640.988552]  [<ffffffffc1659d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      2020-05-08 18:52:15 [ 1640.996483]  [<ffffffffc1682366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.004811]  [<ffffffffc170ab02>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.012076]  [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.020213]  [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.029217]  [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      
      2020-05-08 18:52:15 [ 1641.056176] Pid: 14912, comm: mdt03_006 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
      2020-05-08 18:52:15 [ 1641.067667] Call Trace:
      2020-05-08 18:52:15 [ 1641.070833]  [<ffffffffc1672b96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.078951]  [<ffffffffc167492f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.087272]  [<ffffffffc167751e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.095108]  [<ffffffffc1a997f2>] osp_md_object_lock+0x162/0x2d0 [osp]
      2020-05-08 18:52:15 [ 1641.102832]  [<ffffffffc10cb193>] lod_object_lock+0xf3/0x7b0 [lod]
      2020-05-08 18:52:15 [ 1641.110179]  [<ffffffffc1a2eeee>] mdd_object_lock+0x3e/0xe0 [mdd]
      2020-05-08 18:52:15 [ 1641.117429]  [<ffffffffc194a341>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      2020-05-08 18:52:15 [ 1641.125926]  [<ffffffffc194a8da>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      2020-05-08 18:52:15 [ 1641.133847]  [<ffffffffc195f2ae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
      2020-05-08 18:52:15 [ 1641.141189]  [<ffffffffc1961605>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
      2020-05-08 18:52:15 [ 1641.148819]  [<ffffffffc196a693>] mdt_reint_rec+0x83/0x210 [mdt]
      2020-05-08 18:52:15 [ 1641.155965]  [<ffffffffc19471b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      2020-05-08 18:52:15 [ 1641.163689]  [<ffffffffc1952567>] mdt_reint+0x67/0x140 [mdt]
      2020-05-08 18:52:15 [ 1641.170469]  [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.178601]  [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.187639]  [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
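
The two traces above are easier to compare once reduced to (symbol, module) pairs. The following is a rough sketch that assumes the frame format seen in this report, `[<address>] symbol+0xoff/0xsize [module]`; the helper name `parse_frames` is made up for illustration. Read this way, the first thread's innermost frame is `ptlrpc_set_wait` under `osp_remote_sync`, and the second's is `ldlm_completion_ast` under `osp_md_object_lock`, so both appear to be waiting on a request that passes through the `osp` module.

```python
import re

# Matches one frame such as:
#   [<ffffffffc1a9eaf3>] osp_remote_sync+0xd3/0x200 [osp]
# The trailing "[module]" is optional (built-in kernel symbols have none).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return [(symbol, module_or_None), ...] in the order the frames appear."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("symbol"), match.group("module")))
    return frames
```

The first element of the returned list is the innermost frame, that is, where the thread was sleeping when the trace was dumped.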
      

          Activity

            [LU-13608] MDT stuck in WAITING, abort_recov stuck too
            utopiabound Nathaniel Clark made changes -
            Link New: This issue is related to EX-6680 [ EX-6680 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to EX-3792 [ EX-3792 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14119 [ LU-14119 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.12.7 [ 14793 ]

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41183/
            Subject: LU-13608 out: don't return einprogress error
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 7817acc39ee1d6859c2737f75619748dc8e37f95

            adilger Andreas Dilger added a comment -
            Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/39539
            Subject: LU-13608 tests: check MDS recovery hang
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2329127abf2ace5333a10352b77202c55f8da0aa

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14318 [ LU-14318 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to DDN-1804 [ DDN-1804 ]

            gerrit Gerrit Updater added a comment -
            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41183
            Subject: LU-13608 out: don't return einprogress error
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2c6b286c1e596b850cabe0b185c1552b0133496d

            adilger Andreas Dilger made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            People

              hongchao.zhang Hongchao Zhang
              hongchao.zhang Hongchao Zhang
              Votes: 0
              Watchers: 9
