  1. Lustre
  2. LU-16356

high contention on cdt_request_lock causes clients to hang

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • None
    • 3

    Description

      If there are more than a million entries in cdt_restore_handle_list, walking the whole list takes a long time:

       #4 [ffff8b71d270f900] memcmp at ffffffff8ab7fe1c
       #5 [ffff8b71d270f908] cdt_restore_handle_find at ffffffffc14e7ee9 [mdt]
       #6 [ffff8b71d270f938] mdt_hsm_restore_is_running at ffffffffc14df0c2 [mdt]
       #7 [ffff8b71d270f968] mdt_getattr_internal at ffffffffc14919a1 [mdt]
       #8 [ffff8b71d270f9e0] mdt_getattr_name_lock at ffffffffc1495a7d [mdt]
       #9 [ffff8b71d270fa90] mdt_intent_getattr at ffffffffc149d5d5 [mdt]
      #10 [ffff8b71d270fad0] mdt_intent_opc at ffffffffc14926ba [mdt]
      #11 [ffff8b71d270fb30] mdt_intent_policy at ffffffffc149a7f4 [mdt]
      #12 [ffff8b71d270fb70] ldlm_lock_enqueue at ffffffffc0ff852a [ptlrpc]
      #13 [ffff8b71d270fbf0] ldlm_handle_enqueue0 at ffffffffc1020f97 [ptlrpc]
      #14 [ffff8b71d270fc80] tgt_enqueue at ffffffffc10ab0f2 [ptlrpc]
      #15 [ffff8b71d270fca0] tgt_request_handle at ffffffffc10aff0a [ptlrpc]
      #16 [ffff8b71d270fd30] ptlrpc_server_handle_request at ffffffffc1055a56 [ptlrpc]
      #17 [ffff8b71d270fde8] ptlrpc_main at ffffffffc1059b35 [ptlrpc]
      #18 [ffff8b71d270fec8] kthread at ffffffff8a8c1f81
      #19 [ffff8b71d270ff50] ret_from_fork_nospec_begin at ffffffff8af77c1d   
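
The cost visible in the trace above (memcmp called from cdt_restore_handle_find) can be illustrated with a small userspace sketch. This is not the Lustre code: scan_fid, scan_handle and the scan_* helpers are hypothetical names invented for the illustration, and the FID layout is only assumed to be a fixed-size, padding-free key. The sketch shows that a lookup by FID in a plain linked list costs one comparison per entry, so a lookup that misses touches every entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a FID: 16 bytes, no padding (assumption). */
struct scan_fid {
	unsigned long long seq;
	unsigned int oid;
	unsigned int ver;
};

/* Hypothetical list entry, loosely modelled on a restore handle. */
struct scan_handle {
	struct scan_fid fid;
	struct scan_handle *next;
};

/* Build a list of n entries with unique FIDs (oid = 0 .. n-1).
 * Entries are prepended, so oid 0 ends up at the tail. */
static struct scan_handle *scan_list_build(size_t n)
{
	struct scan_handle *head = NULL;

	for (size_t i = 0; i < n; i++) {
		struct scan_handle *h = calloc(1, sizeof(*h));

		assert(h != NULL);
		h->fid.seq = 0x200000401ULL;
		h->fid.oid = (unsigned int)i;
		h->next = head;
		head = h;
	}
	return head;
}

/* Linear lookup by FID: one memcmp per visited entry.
 * The number of comparisons is reported through *cmps. */
static struct scan_handle *scan_list_find(struct scan_handle *head,
					  const struct scan_fid *fid,
					  size_t *cmps)
{
	*cmps = 0;
	for (struct scan_handle *h = head; h != NULL; h = h->next) {
		(*cmps)++;
		if (memcmp(&h->fid, fid, sizeof(*fid)) == 0)
			return h;
	}
	return NULL;
}

/* Release every entry of the list. */
static void scan_list_free(struct scan_handle *head)
{
	while (head != NULL) {
		struct scan_handle *next = head->next;

		free(head);
		head = next;
	}
}
```

With a list built by scan_list_build(1000), looking up a FID that is not in the list performs 1000 comparisons before returning NULL. If every such lookup runs under a single lock, the time spent holding the lock grows linearly with the list length.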

      The scanning thread holds cdt_restore_lock for the duration of the walk, which blocks other tasks that also need the lock:

      crash> bt 38174
      PID: 38174  TASK: ffff8b722d5fb0c0  CPU: 28  COMMAND: "hsm_cdtr"
       #0 [ffff8b71c31af778] __schedule at ffffffff8af6ab17
       #1 [ffff8b71c31af808] schedule_preempt_disabled at ffffffff8af6bf39
       #2 [ffff8b71c31af818] __mutex_lock_slowpath at ffffffff8af69e87
       #3 [ffff8b71c31af878] mutex_lock at ffffffff8af6926f
       #4 [ffff8b71c31af890] cdt_restore_handle_del at ffffffffc14e8008 [mdt]
       #5 [ffff8b71c31af8c0] mdt_cdt_started_cb at ffffffffc14e8393 [mdt]
       #6 [ffff8b71c31af940] mdt_coordinator_cb at ffffffffc14e8659 [mdt]
       #7 [ffff8b71c31af978] llog_process_thread at ffffffffc0d1a7ff [obdclass]
       #8 [ffff8b71c31afa88] llog_process_or_fork at ffffffffc0d1bae9 [obdclass]
       #9 [ffff8b71c31afaf0] llog_cat_process_cb at ffffffffc0d211ea [obdclass]
      #10 [ffff8b71c31afb40] llog_process_thread at ffffffffc0d1a7ff [obdclass]
      #11 [ffff8b71c31afc50] llog_process_or_fork at ffffffffc0d1bae9 [obdclass]
      #12 [ffff8b71c31afcb8] llog_cat_process_or_fork at ffffffffc0d1d961 [obdclass]
      #13 [ffff8b71c31afd30] llog_cat_process at ffffffffc0d1db0e [obdclass]
      #14 [ffff8b71c31afd50] cdt_llog_process at ffffffffc14da8be [mdt]
      #15 [ffff8b71c31afda0] mdt_coordinator at ffffffffc14e4621 [mdt]
      #16 [ffff8b71c31afec8] kthread at ffffffff8a8c1f81
      #17 [ffff8b71c31aff50] ret_from_fork_nospec_begin at ffffffff8af77c1d   

      While investigating a case from an HPE customer, I found more than a million RESTORE entries in cdt_restore_handle_list:

      crash> list -o cdt_restore_handle.crh_list -H 0xffff8b71c5ede158 | wc -l
      1218162  

      I also analyzed the entries, and there appear to be no duplicates in the list, i.e. every FID is unique.
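
Since every FID is unique, the list length reflects distinct entries, and the cost comes from the linear lookup itself. One common way to bound that cost is a lookup keyed by FID. The sketch below is a generic userspace illustration of that idea, not a description of the change that resolved this issue. All names (tbl_fid, tbl_node, tbl_table, TBL_BUCKETS and the tbl_* helpers) are hypothetical, and the hash function is deliberately simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TBL_BUCKETS 1024u	/* hypothetical bucket count */

/* Hypothetical stand-in for a FID. */
struct tbl_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

struct tbl_node {
	struct tbl_fid fid;
	struct tbl_node *next;
};

/* Chained hash table; must be zero-initialized before use. */
struct tbl_table {
	struct tbl_node *buckets[TBL_BUCKETS];
};

static int tbl_fid_eq(const struct tbl_fid *a, const struct tbl_fid *b)
{
	return a->seq == b->seq && a->oid == b->oid && a->ver == b->ver;
}

/* Toy hash: enough to spread consecutive oids across buckets. */
static unsigned int tbl_hash(const struct tbl_fid *fid)
{
	return (unsigned int)((fid->seq * 31 + fid->oid) % TBL_BUCKETS);
}

/* Insert a FID at the front of its bucket chain. */
static void tbl_insert(struct tbl_table *t, const struct tbl_fid *fid)
{
	struct tbl_node *n = malloc(sizeof(*n));
	unsigned int b = tbl_hash(fid);

	assert(n != NULL);
	n->fid = *fid;
	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/* Keyed lookup: only the entries of one bucket are compared.
 * The number of comparisons is reported through *cmps. */
static struct tbl_node *tbl_find(const struct tbl_table *t,
				 const struct tbl_fid *fid, size_t *cmps)
{
	*cmps = 0;
	for (struct tbl_node *n = t->buckets[tbl_hash(fid)]; n; n = n->next) {
		(*cmps)++;
		if (tbl_fid_eq(&n->fid, fid))
			return n;
	}
	return NULL;
}

/* Release every node and reset the buckets. */
static void tbl_free(struct tbl_table *t)
{
	for (unsigned int b = 0; b < TBL_BUCKETS; b++) {
		while (t->buckets[b] != NULL) {
			struct tbl_node *next = t->buckets[b]->next;

			free(t->buckets[b]);
			t->buckets[b] = next;
		}
	}
}
```

With 4096 FIDs that differ only in oid (0 to 4095), each of the 1024 buckets holds exactly 4 entries, so a lookup that misses performs 4 comparisons instead of 4096. A real implementation would also need to size or resize the table for the expected number of entries and to decide how the lookup is protected against concurrent updates.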

            People

              scherementsev Sergey Cheremencev