Lustre / LU-8481

MDT hung in recovery


Details

    • Bug
    • Resolution: Cannot Reproduce
    • Trivial

    Description

      On our DNE testbed I have an MDT hung in recovery. The testbed is running LLNL Lustre tag 2.8.0_0.0.llnlpreview.30 (see the lustre-release-fe-llnl repo). This tag includes patches and debugging for issues LU-8370, LU-8422, and LU-7800; see the tag for specifics.

      One of the MDTs is hanging in recovery after time_remaining reaches 0, even though all of the other MDTs are alive and well. Attempts to abort recovery have no effect.

      Checking backtraces, I see the following recovery thread, which looks stuck waiting for a remote object lock while replaying a link request:

      PID: 14316  TASK: ffff883f1cb26780  CPU: 6   COMMAND: "tgt_recover_15"
       #0 [ffff883e760b37e8] __schedule+0x295 at ffffffff81651da5
       #1 [ffff883e760b3850] schedule+0x29 at ffffffff81652479
       #2 [ffff883e760b3860] ldlm_completion_ast+0x62d at ffffffffa0deb1cd [ptlrpc]
       #3 [ffff883e760b3900] ldlm_cli_enqueue_fini+0x938 at ffffffffa0dec958 [ptlrpc]
       #4 [ffff883e760b39a8] ldlm_cli_enqueue+0x2aa at ffffffffa0ded07a [ptlrpc]
       #5 [ffff883e760b3a50] osp_md_object_lock+0x154 at ffffffffa129b5c4 [osp]
       #6 [ffff883e760b3ad0] lod_object_lock+0xf0 at ffffffffa11d8310 [lod]
       #7 [ffff883e760b3b80] mdd_object_lock+0x3b at ffffffffa124070b [mdd]
       #8 [ffff883e760b3b90] mdt_remote_object_lock+0x1cf at ffffffffa10f563f [mdt]
       #9 [ffff883e760b3be8] mdt_object_lock_internal+0x15e at ffffffffa10f683e [mdt]
      #10 [ffff883e760b3c30] mdt_reint_object_lock+0x20 at ffffffffa10f6b50 [mdt]
      #11 [ffff883e760b3c40] mdt_reint_link+0x7e4 at ffffffffa110bd94 [mdt]
      #12 [ffff883e760b3cc8] mdt_reint_rec+0x80 at ffffffffa110e470 [mdt]
      #13 [ffff883e760b3cf0] mdt_reint_internal+0x5e1 at ffffffffa10f1971 [mdt]
      #14 [ffff883e760b3d28] mdt_reint+0x67 at ffffffffa10fb0d7 [mdt]
      #15 [ffff883e760b3d58] tgt_request_handle+0x915 at ffffffffa0e7d695 [ptlrpc]
      #16 [ffff883e760b3da0] handle_recovery_req+0x8b at ffffffffa0dda95b [ptlrpc]
      #17 [ffff883e760b3dc8] replay_request_or_update+0x4aa at ffffffffa0de499a [ptlrpc]
      #18 [ffff883e760b3e40] target_recovery_thread+0x617 at ffffffffa0de53c7 [ptlrpc]
      #19 [ffff883e760b3ec8] kthread+0xcf at ffffffff810a99bf
      #20 [ffff883e760b3f50] ret_from_fork+0x58 at ffffffff8165d9d8
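
For anyone triaging similar dumps, here is a minimal sketch of pulling the call chain out of a crash-style backtrace like the one above. The regular expression and the helper names (`parse_frames`, `first_frame_in`) are assumptions written for illustration, not part of any Lustre or crash tooling, and the sample contains only a few frames copied from the trace.

```python
import re

# Hypothetical pattern for crash-style frames such as:
#   #2 [ffff883e760b3860] ldlm_completion_ast+0x62d at ffffffffa0deb1cd [ptlrpc]
# The trailing "[module]" is optional; frames without it are core kernel code.
FRAME_RE = re.compile(
    r"#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+"
    r"(?P<func>\w+)\+0x(?P<off>[0-9a-f]+)\s+at\s+(?P<addr>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

# A few frames copied from the backtrace in this report.
SAMPLE = """\
 #0 [ffff883e760b37e8] __schedule+0x295 at ffffffff81651da5
 #2 [ffff883e760b3860] ldlm_completion_ast+0x62d at ffffffffa0deb1cd [ptlrpc]
 #5 [ffff883e760b3a50] osp_md_object_lock+0x154 at ffffffffa129b5c4 [osp]
 #8 [ffff883e760b3b90] mdt_remote_object_lock+0x1cf at ffffffffa10f563f [mdt]
#11 [ffff883e760b3c40] mdt_reint_link+0x7e4 at ffffffffa110bd94 [mdt]
"""


def parse_frames(text):
    """Return a list of (frame index, function, module) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            module = match.group("module") or "kernel"
            frames.append((int(match.group("idx")), match.group("func"), module))
    return frames


def first_frame_in(frames, module):
    """Return the innermost function belonging to the given module, or None."""
    for _idx, func, mod in frames:
        if mod == module:
            return func
    return None


if __name__ == "__main__":
    for idx, func, module in parse_frames(SAMPLE):
        print(f"#{idx:<3} {module:<8} {func}")
```

Listing the frames by module makes it easier to see where the thread crosses from the local MDT code into the OSP layer, which is the part of this trace that talks to the remote MDT.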
      

      Here is the recovery_status:

      [root@jet16:lquake-MDT000f]# cat recovery_status
      status: RECOVERING
      recovery_start: 1470427157
      time_remaining: 0
      connected_clients: 202/203
      req_replay_clients: 1
      lock_repay_clients: 1
      completed_clients: 201
      evicted_clients: 1
      replayed_requests: 0
      queued_requests: 0
      next_transno: 64483674941
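
The condition described here (status still RECOVERING with time_remaining at 0) can be checked mechanically. The following is a rough sketch that parses the `key: value` text shown above; the function names and the "stuck" heuristic are assumptions for illustration only, and the sample text is copied from this report rather than read from a live system.

```python
# Sample copied from the recovery_status output in this report.
SAMPLE_STATUS = """\
status: RECOVERING
recovery_start: 1470427157
time_remaining: 0
connected_clients: 202/203
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 201
evicted_clients: 1
replayed_requests: 0
queued_requests: 0
next_transno: 64483674941
"""


def parse_recovery_status(text):
    """Split each 'key: value' line into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck(fields):
    """Heuristic (an assumption): recovery is still running but the timer has expired."""
    still_recovering = fields.get("status") == "RECOVERING"
    timer_expired = int(fields.get("time_remaining", "1")) == 0
    return still_recovering and timer_expired


if __name__ == "__main__":
    status = parse_recovery_status(SAMPLE_STATUS)
    print("stuck:", looks_stuck(status))
    print("clients still replaying requests:", status.get("req_replay_clients"))
```

Values such as `connected_clients` are left as strings because they are not plain integers; a caller would split them on `/` if the two counts are needed separately.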
      

      The console is regularly noting the passed recovery deadline:

      [ 4472.930991] Lustre: lquake-MDT000f: Recovery already passed deadline 47:36, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      

            People

              yong.fan nasf (Inactive)
              morrone Christopher Morrone (Inactive)