[LU-8481] MDT hung in recovery Created: 05/Aug/16 Updated: 12/Jan/19 Resolved: 12/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Trivial |
| Reporter: | Christopher Morrone | Assignee: | nasf (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On our DNE testbed I have an MDT hung in recovery. This is LLNL Lustre tag 2.8.0_0.0.llnlpreview.30 (see the lustre-release-fe-llnl repo); the tag includes patches and debugging for other issues.

One of the MDTs is hanging in recovery after its time_remaining reached 0, even though all of the other MDTs are alive and well. Trying to abort recovery does not work. Checking backtraces, I see the following thread that looks stuck:

PID: 14316  TASK: ffff883f1cb26780  CPU: 6  COMMAND: "tgt_recover_15"
 #0 [ffff883e760b37e8] __schedule+0x295 at ffffffff81651da5
 #1 [ffff883e760b3850] schedule+0x29 at ffffffff81652479
 #2 [ffff883e760b3860] ldlm_completion_ast+0x62d at ffffffffa0deb1cd [ptlrpc]
 #3 [ffff883e760b3900] ldlm_cli_enqueue_fini+0x938 at ffffffffa0dec958 [ptlrpc]
 #4 [ffff883e760b39a8] ldlm_cli_enqueue+0x2aa at ffffffffa0ded07a [ptlrpc]
 #5 [ffff883e760b3a50] osp_md_object_lock+0x154 at ffffffffa129b5c4 [osp]
 #6 [ffff883e760b3ad0] lod_object_lock+0xf0 at ffffffffa11d8310 [lod]
 #7 [ffff883e760b3b80] mdd_object_lock+0x3b at ffffffffa124070b [mdd]
 #8 [ffff883e760b3b90] mdt_remote_object_lock+0x1cf at ffffffffa10f563f [mdt]
 #9 [ffff883e760b3be8] mdt_object_lock_internal+0x15e at ffffffffa10f683e [mdt]
#10 [ffff883e760b3c30] mdt_reint_object_lock+0x20 at ffffffffa10f6b50 [mdt]
#11 [ffff883e760b3c40] mdt_reint_link+0x7e4 at ffffffffa110bd94 [mdt]
#12 [ffff883e760b3cc8] mdt_reint_rec+0x80 at ffffffffa110e470 [mdt]
#13 [ffff883e760b3cf0] mdt_reint_internal+0x5e1 at ffffffffa10f1971 [mdt]
#14 [ffff883e760b3d28] mdt_reint+0x67 at ffffffffa10fb0d7 [mdt]
#15 [ffff883e760b3d58] tgt_request_handle+0x915 at ffffffffa0e7d695 [ptlrpc]
#16 [ffff883e760b3da0] handle_recovery_req+0x8b at ffffffffa0dda95b [ptlrpc]
#17 [ffff883e760b3dc8] replay_request_or_update+0x4aa at ffffffffa0de499a [ptlrpc]
#18 [ffff883e760b3e40] target_recovery_thread+0x617 at ffffffffa0de53c7 [ptlrpc]
#19 [ffff883e760b3ec8] kthread+0xcf at ffffffff810a99bf
#20 [ffff883e760b3f50] ret_from_fork+0x58 at ffffffff8165d9d8

Here is the recovery_status:

[root@jet16:lquake-MDT000f]# cat recovery_status
status: RECOVERING
recovery_start: 1470427157
time_remaining: 0
connected_clients: 202/203
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 201
evicted_clients: 1
replayed_requests: 0
queued_requests: 0
next_transno: 64483674941

The console is regularly noting the passed recovery deadline:

[ 4472.930991] Lustre: lquake-MDT000f: Recovery already passed deadline 47:36, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery. |
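The stuck signature above (status still RECOVERING while time_remaining is already 0 and req_replay_clients is non-zero) is easy to watch for. Below is a minimal, hypothetical userspace sketch, not part of Lustre; the default proc path and the field names are assumptions based on the output above and may differ between versions and target names:

/* check_recovery.c: hypothetical helper, not part of Lustre.  It scans a
 * recovery_status file and flags the pattern seen above: status RECOVERING
 * while time_remaining has already reached 0. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        /* Default path is an assumption; pass the real file as argv[1]. */
        const char *path = argc > 1 ? argv[1] :
                "/proc/fs/lustre/mdt/lquake-MDT000f/recovery_status";
        FILE *f = fopen(path, "r");
        char line[256];
        int recovering = 0;
        long time_remaining = -1;

        if (f == NULL) {
                perror(path);
                return 1;
        }

        while (fgets(line, sizeof(line), f) != NULL) {
                if (strstr(line, "status:") && strstr(line, "RECOVERING"))
                        recovering = 1;
                /* Only matches the "time_remaining: N" line. */
                sscanf(line, " time_remaining: %ld", &time_remaining);
        }
        fclose(f);

        if (recovering && time_remaining == 0)
                printf("%s: recovery looks stuck (RECOVERING, 0s left)\n", path);
        else
                printf("%s: no stuck-recovery signature\n", path);
        return 0;
}

Build with something like cc -o check_recovery check_recovery.c and point it at the target's recovery_status file. |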
| Comments |
| Comment by Christopher Morrone [ 05/Aug/16 ] |
|
Interestingly, /proc/fs/lustre/health_check says "healthy". |
| Comment by Christopher Morrone [ 05/Aug/16 ] |
|
On reboot the MDT was able to complete recovery the second time. That may indicate that this hang is racy. |
| Comment by Peter Jones [ 08/Aug/16 ] |
|
Fan Yong, could you please assist with this one? Thanks. Peter |
| Comment by nasf (Inactive) [ 09/Aug/16 ] |
|
The stack trace tells us that MDT000f was replaying a 'link' operation from the client, and the parent object that holds the target name entry was on another MDT; assume it was MDT000X. The replay triggered an LDLM enqueue RPC from MDT000f to MDT000X. At that time the parent object's lock was held by someone else, so MDT000X did not grant the lock to MDT000f immediately. The lock sponsor "tgt_recover_15" on MDT000f therefore waited inside ldlm_completion_ast(), as follows:

int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)
{
        ...
        if (ldlm_is_no_timeout(lock)) {
                LDLM_DEBUG(lock, "waiting indefinitely because of NO_TIMEOUT");
                lwi = LWI_INTR(interrupted_completion_wait, &lwd);
        } else {
                lwi = LWI_TIMEOUT_INTR(cfs_time_seconds(timeout),
                                       ldlm_expired_completion_wait,
                                       interrupted_completion_wait, &lwd);
        }
        ...
}
But for some unknown reason the lock was not granted in time. Possible causes:

1) The conflicting lock holder did not release its lock in time. That should have triggered an eviction on MDT000X when the blocking AST timed out. Please check the log messages on MDT000X.

2) The conflicting lock holder released its lock in time, but MDT000X failed to deliver the completion AST to MDT000f. If that happened, there should also be lock callback timeout messages on MDT000X. Again, please check the log messages there.

Either way, it also depends on how long MDT000f was left hanging. If it was long enough, MDT000X should give us more information; otherwise, if MDT000X never noticed a timeout (blocking AST or completion AST), then we cannot tell why "tgt_recover_15" on MDT000f was never woken up. |
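To make the two branches of ldlm_completion_ast() above concrete: the NO_TIMEOUT case simply sleeps until a completion AST wakes it, so if that wakeup is lost the thread sleeps forever, which matches the tgt_recover_15 backtrace. Here is a tiny userspace analogy using pthreads; it only illustrates the waiting pattern and is not the actual Lustre/libcfs primitives:

/* Analogy only: "granted" plays the role of the lock grant, and the signal
 * that never arrives plays the role of the lost completion AST. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  granted_cv = PTHREAD_COND_INITIALIZER;
static int granted;             /* never set: the completion AST is lost */

int main(void)
{
        struct timespec deadline;

        pthread_mutex_lock(&m);

        /* Timed wait, like the LWI_TIMEOUT_INTR branch: control returns
         * after the timeout so the waiter can log the expiry or retry. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;
        while (!granted) {
                if (pthread_cond_timedwait(&granted_cv, &m, &deadline) != 0) {
                        printf("timed waiter: not granted within 5s\n");
                        break;
                }
        }

        /* Indefinite wait, like the NO_TIMEOUT branch: with no signal this
         * would never return, which is the hang seen on tgt_recover_15.
         * Left commented out so the example terminates.
         *
         * while (!granted)
         *         pthread_cond_wait(&granted_cv, &m);
         */

        pthread_mutex_unlock(&m);
        return 0;
}

In the real code the wakeup comes from the server's completion AST; if MDT000X never sends it, or it is lost, a NO_TIMEOUT waiter has nothing to time out on. |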
| Comment by Christopher Morrone [ 09/Aug/16 ] |
|
From the "Recovery already passed deadline 47:36" message, we know that it was stuck for at least 47 minutes. I think I left it sitting like that for over an hour before I rebooted it. How do I identify MDT000X? And what am I looking for in the sea of Lustre console messages? |
| Comment by nasf (Inactive) [ 10/Aug/16 ] |
Honestly, from the given logs alone I cannot tell what the exact 'X' is. You will have to check each of the other MDTs one by one for a message like "lock callback timer expired after xxxx: evicting client at ..." |
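If it helps to narrow the search, the marker text itself is enough to scan captured console logs with. A trivial, hypothetical helper follows; the default log path is an assumption, and any saved console log will do:

/* scan_evictions.c: hypothetical helper, not part of Lustre.  Prints log
 * lines carrying the "lock callback timer expired" marker. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/var/log/messages";
        FILE *f = fopen(path, "r");
        char line[1024];

        if (f == NULL) {
                perror(path);
                return 1;
        }
        while (fgets(line, sizeof(line), f) != NULL)
                if (strstr(line, "lock callback timer expired"))
                        fputs(line, stdout);
        fclose(f);
        return 0;
}

Whichever MDT logs that line (together with "evicting client at ...") would be a good candidate for MDT000X. |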
| Comment by nasf (Inactive) [ 24/Aug/16 ] |
|
Any further input? Thanks! |
| Comment by Christopher Morrone [ 24/Aug/16 ] |
|
Not at this time. |
| Comment by nasf (Inactive) [ 05/Jun/18 ] |
|
Do you have more input for this ticket, or can we close it? I do not know whether the issue is still present, because a number of DNE recovery related patches have landed in the past 20 months. |
| Comment by Peter Jones [ 12/Jan/19 ] |
|
closing ancient ticket |