LU-13608

MDT stuck in WAITING, abort_recov stuck too


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.3
    • None
    • 3
    • 9223372036854775807

    Description

      A network switch went down, taking a number of compute nodes with it.
      Several compute nodes hung, and some OSSes hit an LBUG (LU-12906) and
      crashed.

      The MDS went into soft lockups before crashing. When it came back up, 3 of the
      4 MDTs mounted and recovered, but one MDT entered WAITING and stayed there.
      lctl abort_recov had no effect on its status, so the MDS was rebooted.

      When the MDT went into the WAITING state again, a pre-emptive abort_recov was
      issued before recovery could time out. It did not help, and the MDT continued
      trying to recover.
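
      To see which MDT is the one sitting in WAITING, the recovery status of
      every target can be scanned in one pass. The following is a minimal
      sketch, not part of the original report. It assumes the text comes from
      `lctl get_param mdt.*.recovery_status`, that each target is introduced by
      a `<param>.recovery_status=` header line, and that the fields are
      `key: value` pairs including `status`. The function names
      `parse_recovery_status` and `stuck_targets` are hypothetical.

```python
import re

# Header line assumed to precede each target's fields, for example
# "mdt.<fsname>-MDT0001.recovery_status=" (layout is an assumption).
HEADER = re.compile(r"^(?P<param>\S+)\.recovery_status=\s*$")


def parse_recovery_status(text):
    """Return {target: {field: value}} parsed from recovery_status text."""
    targets = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            # Keep only the target name, dropping the leading "mdt." prefix.
            name = header.group("param").split(".", 1)[-1]
            current = targets.setdefault(name, {})
            continue
        if current is None or ":" not in line:
            continue  # skip anything that is not a "key: value" field
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return targets


def stuck_targets(text, state="WAITING"):
    """List the targets whose status field equals the given state."""
    return sorted(
        name
        for name, fields in parse_recovery_status(text).items()
        if fields.get("status") == state
    )
```

      Passing the captured command output to `stuck_targets` would name the
      target to direct abort_recov at; the other fields remain available in the
      dictionary returned by `parse_recovery_status`.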

      2020-05-08 18:52:15 [ 1640.880984] Pid: 15922, comm: mdt02_020 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
      2020-05-08 18:52:15 [ 1640.892505] Call Trace:
      2020-05-08 18:52:15 [ 1640.895708]  [<ffffffffc169bdc0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
      2020-05-08 18:52:15 [ 1640.903498]  [<ffffffffc169c153>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      2020-05-08 18:52:15 [ 1640.911383]  [<ffffffffc1a9eaf3>] osp_remote_sync+0xd3/0x200 [osp]
      2020-05-08 18:52:15 [ 1640.918760]  [<ffffffffc1a84c63>] osp_attr_get+0x463/0x730 [osp]
      2020-05-08 18:52:15 [ 1640.925917]  [<ffffffffc1a818cd>] osp_object_init+0x16d/0x2d0 [osp]
      2020-05-08 18:52:15 [ 1640.933361]  [<ffffffffc141c59b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      2020-05-08 18:52:15 [ 1640.941977]  [<ffffffffc1420471>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      2020-05-08 18:52:15 [ 1640.950100]  [<ffffffffc1420d06>] lu_object_find+0x16/0x20 [obdclass]
      2020-05-08 18:52:15 [ 1640.957737]  [<ffffffffc194a01b>] mdt_object_find+0x4b/0x170 [mdt]
      2020-05-08 18:52:15 [ 1640.965074]  [<ffffffffc194cc38>] mdt_getattr_name_lock+0x848/0x1c30 [mdt]
      2020-05-08 18:52:15 [ 1640.973194]  [<ffffffffc1954d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      2020-05-08 18:52:15 [ 1640.980928]  [<ffffffffc1951bb5>] mdt_intent_policy+0x435/0xd80 [mdt]
      2020-05-08 18:52:15 [ 1640.988552]  [<ffffffffc1659d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      2020-05-08 18:52:15 [ 1640.996483]  [<ffffffffc1682366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.004811]  [<ffffffffc170ab02>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.012076]  [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.020213]  [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.029217]  [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      
      2020-05-08 18:52:15 [ 1641.056176] Pid: 14912, comm: mdt03_006 3.10.0-1062.4.1.el7_lustre.x86_64 #1 SMP Mon Oct 28 01:39:05 UTC 2019
      2020-05-08 18:52:15 [ 1641.067667] Call Trace:
      2020-05-08 18:52:15 [ 1641.070833]  [<ffffffffc1672b96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.078951]  [<ffffffffc167492f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.087272]  [<ffffffffc167751e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.095108]  [<ffffffffc1a997f2>] osp_md_object_lock+0x162/0x2d0 [osp]
      2020-05-08 18:52:15 [ 1641.102832]  [<ffffffffc10cb193>] lod_object_lock+0xf3/0x7b0 [lod]
      2020-05-08 18:52:15 [ 1641.110179]  [<ffffffffc1a2eeee>] mdd_object_lock+0x3e/0xe0 [mdd]
      2020-05-08 18:52:15 [ 1641.117429]  [<ffffffffc194a341>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      2020-05-08 18:52:15 [ 1641.125926]  [<ffffffffc194a8da>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      2020-05-08 18:52:15 [ 1641.133847]  [<ffffffffc195f2ae>] mdt_rename_lock+0xbe/0x4b0 [mdt]
      2020-05-08 18:52:15 [ 1641.141189]  [<ffffffffc1961605>] mdt_reint_rename+0x2c5/0x2b90 [mdt]
      2020-05-08 18:52:15 [ 1641.148819]  [<ffffffffc196a693>] mdt_reint_rec+0x83/0x210 [mdt]
      2020-05-08 18:52:15 [ 1641.155965]  [<ffffffffc19471b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      2020-05-08 18:52:15 [ 1641.163689]  [<ffffffffc1952567>] mdt_reint+0x67/0x140 [mdt]
      2020-05-08 18:52:15 [ 1641.170469]  [<ffffffffc17112ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.178601]  [<ffffffffc16b629b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2020-05-08 18:52:15 [ 1641.187639]  [<ffffffffc16b9bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
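
      When many service threads dump traces at once, it helps to reduce each
      one to the function it is blocked in. The sketch below is an
      illustration only, assuming console lines in the same layout as the two
      traces above (a `Pid: ..., comm: ...` header followed by
      `[<address>] function+0xoff/0xsize [module]` frames). The names
      `summarize_traces` and `blocked_in` are hypothetical.

```python
import re

# Thread header, e.g. "... Pid: 15922, comm: mdt02_020 3.10.0-..."
THREAD = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")
# Stack frame, e.g. "[<ffffffffc169bdc0>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]".
# The "[<" prefix keeps the "[ 1640.895708]" timestamp from matching.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def summarize_traces(log_text):
    """Group stack frames under the thread header that precedes them."""
    threads = []
    for line in log_text.splitlines():
        thread = THREAD.search(line)
        if thread:
            threads.append(
                {"pid": int(thread["pid"]), "comm": thread["comm"], "frames": []}
            )
            continue
        frame = FRAME.search(line)
        if frame and threads:
            threads[-1]["frames"].append((frame["func"], frame["module"]))
    return threads


def blocked_in(log_text):
    """Map each thread name to its innermost (first-listed) function."""
    return {
        t["comm"]: t["frames"][0][0]
        for t in summarize_traces(log_text)
        if t["frames"]
    }
```

      Applied to the two traces above, `blocked_in` reports mdt02_020 in
      ptlrpc_set_wait and mdt03_006 in ldlm_completion_ast. Both threads reach
      those waits through osp frames (osp_attr_get and osp_md_object_lock),
      which suggests they are waiting on a reply from another MDT.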
      

            People

              hongchao.zhang Hongchao Zhang