LU-19565

MDT cannot abort recovery with in-flight osp_remote_sync

Details

    • Bug
    • Resolution: Unresolved
    • Medium
    • None
    • Lustre 2.15.6
    • None
    • 3

    Description

      During DNE recovery, lod_sub_recovery_thread issues an osp_remote_sync call to fetch the attributes of the update log on another MDT. If this call never completes, it also never times out, and lod_sub_recovery_thread sleeps indefinitely on the RPC. Even after the hard timeout is reached, recovery cannot progress because this thread never wakes up from its sleep in ptlrpc_set_wait.

      We investigated one such MDT stuck in the following trace:

      Name: lod0003_rec0004 - stack
      ptlrpc_set_wait+0x5a4/0x7c8 [ptlrpc]
      ptlrpc_queue_wait+0xa4/0x364 [ptlrpc]
      osp_remote_sync+0x1ac/0x27c [osp]
      osp_attr_get+0x68c/0x9d0 [osp]
      osp_object_init+0x1a4/0x340 [osp]
      lu_object_start+0x84/0x154 [obdclass]
      lu_object_find_at+0x37c/0x730 [obdclass]
      dt_locate_at+0x28/0xc4 [obdclass]
      llog_osd_get_cat_list+0xa8/0xc60 [obdclass]
      lod_sub_prep_llog+0x1ac/0x888 [lod]
      lod_sub_recovery_thread+0x49c/0x10ec [lod] 

      The kernel logs show recovery repeatedly attempting to abort without succeeding:

      kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 115 previous similar messages
      kernel: LustreError: 9776:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 9776:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 338 previous similar messages
      kernel: LustreError: 7456:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7456:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 575 previous similar messages
      kernel: LustreError: 7435:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7435:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 1159 previous similar messages
      kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7454:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2337 previous similar messages
      kernel: LustreError: 7438:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7438:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2712 previous similar messages
      kernel: LustreError: 7436:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7436:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2701 previous similar messages
      kernel: LustreError: 7452:0:(ldlm_lib.c:876:target_check_recovery_timer()) xyz-MDT0003: Aborting recovery
      kernel: LustreError: 7452:0:(ldlm_lib.c:876:target_check_recovery_timer()) Skipped 2741 previous similar messages
      

      Regardless of the underlying problem that prevents the RPC from completing, a recovery abort must be able to interrupt this RPC. Otherwise, a filesystem can be stuck offline indefinitely.

      This issue is very similar to LU-15517 and involves the same stack trace. However, LU-15517 addresses a specific failure scenario, whereas this ticket addresses the error-handling side: recovery abort needs to cancel the in-flight RPCs so that it can clean up and finish the abort process.
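
      The requested behavior can be illustrated with a small userspace model. This is only a sketch and not Lustre code: AbortableWait, complete, abort, run_recovery, and demo_abort are hypothetical names, and a Python condition variable stands in for the kernel wait in ptlrpc_set_wait. It assumes the fix amounts to having the waiter wake on either "reply arrived" or "recovery aborted" rather than on the reply alone.

```python
import threading


class AbortableWait:
    """Toy model of a thread waiting on an RPC reply that also honors an abort."""

    def __init__(self):
        self._cond = threading.Condition()
        self._reply = None
        self._done = False
        self._aborted = False

    def complete(self, reply):
        # Called when the (hypothetical) RPC reply arrives.
        with self._cond:
            self._reply = reply
            self._done = True
            self._cond.notify_all()

    def abort(self):
        # Called by the (hypothetical) recovery-abort path.
        with self._cond:
            self._aborted = True
            self._cond.notify_all()

    def wait(self):
        # Sleep until either condition holds; a reply wins if both are set.
        with self._cond:
            self._cond.wait_for(lambda: self._done or self._aborted)
            if self._done:
                return ("ok", self._reply)
            return ("aborted", None)


def run_recovery(waiter, results):
    # Stand-in for the recovery thread blocked on the remote attribute fetch.
    results.append(waiter.wait())


def demo_abort():
    # The reply never arrives; only the abort can release the waiting thread.
    waiter = AbortableWait()
    results = []
    thread = threading.Thread(target=run_recovery, args=(waiter, results))
    thread.start()
    waiter.abort()
    thread.join(timeout=5)
    return thread.is_alive(), results
```

      In this model, demo_abort() returns with the waiting thread finished and an "aborted" result even though no reply was ever delivered. That is the property this ticket asks for from the recovery thread when recovery is aborted.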

            People

              wc-triage WC Triage
              fvogdunc Duncan Vogel