Details
-
Bug
-
Resolution: Cannot Reproduce
-
Trivial
-
None
-
None
-
3
-
9223372036854775807
Description
On our DNE testbed I have an MDT hung in recovery. This is LLNL lustre tag 2.8.0_0.0.llnlpreview.30 (see the lustre-release-fe-llnl repo). This tag includes patches and debugging for issues LU-8370, LU-8422, and LU-7800. See the tag for specifics.
One of the MDTs is hanging in recovery after reaching 0, even though all of the other MDTs are alive and well. Trying to abort recovery does not work.
Checking backtraces, I see the following thread that looks stuck:
PID: 14316 TASK: ffff883f1cb26780 CPU: 6 COMMAND: "tgt_recover_15" #0 [ffff883e760b37e8] __schedule+0x295 at ffffffff81651da5 #1 [ffff883e760b3850] schedule+0x29 at ffffffff81652479 #2 [ffff883e760b3860] ldlm_completion_ast+0x62d at ffffffffa0deb1cd [ptlrpc] #3 [ffff883e760b3900] ldlm_cli_enqueue_fini+0x938 at ffffffffa0dec958 [ptlrpc] #4 [ffff883e760b39a8] ldlm_cli_enqueue+0x2aa at ffffffffa0ded07a [ptlrpc] #5 [ffff883e760b3a50] osp_md_object_lock+0x154 at ffffffffa129b5c4 [osp] #6 [ffff883e760b3ad0] lod_object_lock+0xf0 at ffffffffa11d8310 [lod] #7 [ffff883e760b3b80] mdd_object_lock+0x3b at ffffffffa124070b [mdd] #8 [ffff883e760b3b90] mdt_remote_object_lock+0x1cf at ffffffffa10f563f [mdt] #9 [ffff883e760b3be8] mdt_object_lock_internal+0x15e at ffffffffa10f683e [mdt] #10 [ffff883e760b3c30] mdt_reint_object_lock+0x20 at ffffffffa10f6b50 [mdt] #11 [ffff883e760b3c40] mdt_reint_link+0x7e4 at ffffffffa110bd94 [mdt] #12 [ffff883e760b3cc8] mdt_reint_rec+0x80 at ffffffffa110e470 [mdt] #13 [ffff883e760b3cf0] mdt_reint_internal+0x5e1 at ffffffffa10f1971 [mdt] #14 [ffff883e760b3d28] mdt_reint+0x67 at ffffffffa10fb0d7 [mdt] #15 [ffff883e760b3d58] tgt_request_handle+0x915 at ffffffffa0e7d695 [ptlrpc] #16 [ffff883e760b3da0] handle_recovery_req+0x8b at ffffffffa0dda95b [ptlrpc] #17 [ffff883e760b3dc8] replay_request_or_update+0x4aa at ffffffffa0de499a [ptlrpc] #18 [ffff883e760b3e40] target_recovery_thread+0x617 at ffffffffa0de53c7 [ptlrpc] #19 [ffff883e760b3ec8] kthread+0xcf at ffffffff810a99bf #20 [ffff883e760b3f50] ret_from_fork+0x58 at ffffffff8165d9d8
Here is the recovery_status:
[root@jet16:lquake-MDT000f]# cat recovery_status status: RECOVERING recovery_start: 1470427157 time_remaining: 0 connected_clients: 202/203 req_replay_clients: 1 lock_repay_clients: 1 completed_clients: 201 evicted_clients: 1 replayed_requests: 0 queued_requests: 0 next_transno: 64483674941
The console is regularly noting the passed recovery deadline:
[ 4472.930991] Lustre: lquake-MDT000f: Recovery already passed deadline 47:36, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Attachments
Issue Links
- is related to
-
LU-9049 DNE MDT Never completes recovery
- Resolved