Details
- Bug
- Resolution: Unresolved
- Medium
- None
- Lustre 2.15.6
- None
- 3
Description
During DNE recovery, lod_sub_recovery_thread issues an osp_remote_sync call to fetch the attributes of the update log on another MDT. If that call never completes, it also never times out, and lod_sub_recovery_thread sleeps indefinitely on the RPC. Even after the recovery hard timeout is reached, recovery cannot progress because this thread never wakes up; it stays asleep in ptlrpc_set_wait.
We investigated one such MDT, which was stuck in the following stack trace:
Name: lod0003_rec0004 - stack
ptlrpc_set_wait+0x5a4/0x7c8 [ptlrpc]
ptlrpc_queue_wait+0xa4/0x364 [ptlrpc]
osp_remote_sync+0x1ac/0x27c [osp]
osp_attr_get+0x68c/0x9d0 [osp]
osp_object_init+0x1a4/0x340 [osp]
lu_object_start+0x84/0x154 [obdclass]
lu_object_find_at+0x37c/0x730 [obdclass]
dt_locate_at+0x28/0xc4 [obdclass]
llog_osd_get_cat_list+0xa8/0xc60 [obdclass]
lod_sub_prep_llog+0x1ac/0x888 [lod]
lod_sub_recovery_thread+0x49c/0x10ec [lod]
The kernel logs show recovery repeatedly trying to abort without success:
kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 115 previous similar messages
kernel: LustreError: 9776:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 9776:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 338 previous similar messages
kernel: LustreError: 7456:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7456:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 575 previous similar messages
kernel: LustreError: 7435:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7435:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 1159 previous similar messages
kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2337 previous similar messages
kernel: LustreError: 7438:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7438:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2712 previous similar messages
kernel: LustreError: 7436:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7436:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2701 previous similar messages
kernel: LustreError: 7452:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7452:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2741 previous similar messages
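To get a rough sense of how many abort attempts such a log represents, the printed and rate-limited messages can be tallied. The snippet below is only a sketch: the function name count_abort_attempts is hypothetical, the sample is the first four messages from the log above, and it assumes every "Skipped N previous similar messages" line refers to suppressed "Aborting recovery" messages from the same call site.

```python
import re

# First four messages from the kernel log quoted in this ticket.
LOG = """\
kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 115 previous similar messages
kernel: LustreError: 9776:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
kernel: LustreError: 9776:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 338 previous similar messages
"""


def count_abort_attempts(text):
    """Return printed 'Aborting recovery' messages plus rate-limited ones.

    Assumption: each 'Skipped N' line counts suppressed abort messages.
    """
    printed = len(re.findall(r"Aborting recovery", text))
    skipped = sum(
        int(n)
        for n in re.findall(r"Skipped (\d+) previous similar messages", text)
    )
    return printed + skipped


print(count_abort_attempts(LOG))
```

Because the regular expressions do not depend on line breaks, the same function works whether the log is pasted one message per line or as a single run of text.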
Regardless of the underlying problem that prevents the RPC from completing, a recovery abort must be able to interrupt this RPC; otherwise a filesystem can be stuck offline indefinitely.
This issue is very similar to LU-15517 and involves the same stack trace. However, that issue addresses a specific failure scenario, whereas this ticket addresses the error-handling side: recovery abort needs to cancel the in-flight RPCs so that it can clean up and finish the abort process.
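The requested behavior can be illustrated with a small userspace model. This is not Lustre code and does not reflect how ptlrpc implements waits; it is a sketch, assuming a waiter that rechecks an abort flag at a fixed interval. All names here (wait_for_reply, RpcAborted, recovery_thread, demo) are hypothetical.

```python
import threading


class RpcAborted(Exception):
    """Raised when the wait is cancelled by a recovery abort."""


def wait_for_reply(reply, abort, poll_interval=0.05):
    """Block until `reply` is set; raise RpcAborted once `abort` is set.

    The wait is bounded by `poll_interval`, so the abort flag is rechecked
    periodically instead of sleeping forever on the reply.
    """
    while not reply.wait(timeout=poll_interval):
        if abort.is_set():
            raise RpcAborted("recovery aborted while RPC was in flight")
    return "reply received"


def recovery_thread(reply, abort, result):
    """Stand-in for the recovery thread that waits on the remote RPC."""
    try:
        result.append(wait_for_reply(reply, abort))
    except RpcAborted:
        # Clean up and let the abort path finish.
        result.append("aborted")


def demo():
    reply = threading.Event()   # never set: models the RPC that never completes
    abort = threading.Event()   # set by the recovery abort path
    result = []
    worker = threading.Thread(target=recovery_thread,
                              args=(reply, abort, result))
    worker.start()
    abort.set()                 # recovery abort is requested
    worker.join(timeout=2.0)
    return worker.is_alive(), result


if __name__ == "__main__":
    print(demo())
```

Without the abort check inside the loop, the worker in this model would never exit, which mirrors the stuck lod_sub_recovery_thread described above.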
Issue Links
- is related to LU-15517 MDT stuck in recovery if one other MDT is failed over to partner node (Open)