Details
Bug
Resolution: Fixed
Major
Description
If there are more than a million entries in cdt_restore_list, it takes a long time to walk the whole list:
#4 [ffff8b71d270f900] memcmp at ffffffff8ab7fe1c
#5 [ffff8b71d270f908] cdt_restore_handle_find at ffffffffc14e7ee9 [mdt]
#6 [ffff8b71d270f938] mdt_hsm_restore_is_running at ffffffffc14df0c2 [mdt]
#7 [ffff8b71d270f968] mdt_getattr_internal at ffffffffc14919a1 [mdt]
#8 [ffff8b71d270f9e0] mdt_getattr_name_lock at ffffffffc1495a7d [mdt]
#9 [ffff8b71d270fa90] mdt_intent_getattr at ffffffffc149d5d5 [mdt]
#10 [ffff8b71d270fad0] mdt_intent_opc at ffffffffc14926ba [mdt]
#11 [ffff8b71d270fb30] mdt_intent_policy at ffffffffc149a7f4 [mdt]
#12 [ffff8b71d270fb70] ldlm_lock_enqueue at ffffffffc0ff852a [ptlrpc]
#13 [ffff8b71d270fbf0] ldlm_handle_enqueue0 at ffffffffc1020f97 [ptlrpc]
#14 [ffff8b71d270fc80] tgt_enqueue at ffffffffc10ab0f2 [ptlrpc]
#15 [ffff8b71d270fca0] tgt_request_handle at ffffffffc10aff0a [ptlrpc]
#16 [ffff8b71d270fd30] ptlrpc_server_handle_request at ffffffffc1055a56 [ptlrpc]
#17 [ffff8b71d270fde8] ptlrpc_main at ffffffffc1059b35 [ptlrpc]
#18 [ffff8b71d270fec8] kthread at ffffffff8a8c1f81
#19 [ffff8b71d270ff50] ret_from_fork_nospec_begin at ffffffff8af77c1d
While it holds cdt_restore_lock, that thread blocks other tasks that also need cdt_restore_lock:
crash> bt 38174
PID: 38174 TASK: ffff8b722d5fb0c0 CPU: 28 COMMAND: "hsm_cdtr"
#0 [ffff8b71c31af778] __schedule at ffffffff8af6ab17
#1 [ffff8b71c31af808] schedule_preempt_disabled at ffffffff8af6bf39
#2 [ffff8b71c31af818] __mutex_lock_slowpath at ffffffff8af69e87
#3 [ffff8b71c31af878] mutex_lock at ffffffff8af6926f
#4 [ffff8b71c31af890] cdt_restore_handle_del at ffffffffc14e8008 [mdt]
#5 [ffff8b71c31af8c0] mdt_cdt_started_cb at ffffffffc14e8393 [mdt]
#6 [ffff8b71c31af940] mdt_coordinator_cb at ffffffffc14e8659 [mdt]
#7 [ffff8b71c31af978] llog_process_thread at ffffffffc0d1a7ff [obdclass]
#8 [ffff8b71c31afa88] llog_process_or_fork at ffffffffc0d1bae9 [obdclass]
#9 [ffff8b71c31afaf0] llog_cat_process_cb at ffffffffc0d211ea [obdclass]
#10 [ffff8b71c31afb40] llog_process_thread at ffffffffc0d1a7ff [obdclass]
#11 [ffff8b71c31afc50] llog_process_or_fork at ffffffffc0d1bae9 [obdclass]
#12 [ffff8b71c31afcb8] llog_cat_process_or_fork at ffffffffc0d1d961 [obdclass]
#13 [ffff8b71c31afd30] llog_cat_process at ffffffffc0d1db0e [obdclass]
#14 [ffff8b71c31afd50] cdt_llog_process at ffffffffc14da8be [mdt]
#15 [ffff8b71c31afda0] mdt_coordinator at ffffffffc14e4621 [mdt]
#16 [ffff8b71c31afec8] kthread at ffffffff8a8c1f81
#17 [ffff8b71c31aff50] ret_from_fork_nospec_begin at ffffffff8af77c1d
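The pattern visible in the two backtraces (a memcmp-based walk of the list under a mutex, with a second thread waiting on the same mutex) can be sketched in user space as follows. This is only an illustration, not the mdt code: struct fid, struct restore_handle, struct restore_list and the restore_list_* helpers are hypothetical names, and a pthread mutex stands in for cdt_restore_lock.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a FID; 16 bytes with no padding, so memcmp is safe. */
struct fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Hypothetical stand-in for one restore handle on the list. */
struct restore_handle {
	struct fid fid;
	struct restore_handle *next;
};

struct restore_list {
	pthread_mutex_t lock;		/* plays the role of cdt_restore_lock */
	struct restore_handle *head;
	size_t count;
};

static void restore_list_init(struct restore_list *rl)
{
	pthread_mutex_init(&rl->lock, NULL);
	rl->head = NULL;
	rl->count = 0;
}

/* Insert at the head of the list; returns 0 on success, -1 on allocation failure. */
static int restore_list_add(struct restore_list *rl, const struct fid *fid)
{
	struct restore_handle *h = calloc(1, sizeof(*h));

	if (h == NULL)
		return -1;
	h->fid = *fid;
	pthread_mutex_lock(&rl->lock);
	h->next = rl->head;
	rl->head = h;
	rl->count++;
	pthread_mutex_unlock(&rl->lock);
	return 0;
}

/*
 * Linear search by FID. The lock is held for the whole walk, so every other
 * user of the list waits while up to rl->count entries are compared.
 * If steps is not NULL, it receives the number of entries compared.
 */
static struct restore_handle *restore_list_find(struct restore_list *rl,
						const struct fid *fid,
						size_t *steps)
{
	struct restore_handle *h;
	size_t n = 0;

	pthread_mutex_lock(&rl->lock);
	for (h = rl->head; h != NULL; h = h->next) {
		n++;
		if (memcmp(&h->fid, fid, sizeof(*fid)) == 0)
			break;
	}
	pthread_mutex_unlock(&rl->lock);
	if (steps != NULL)
		*steps = n;
	return h;
}

/* Release every entry and reset the list. */
static void restore_list_free(struct restore_list *rl)
{
	pthread_mutex_lock(&rl->lock);
	while (rl->head != NULL) {
		struct restore_handle *h = rl->head;

		rl->head = h->next;
		free(h);
	}
	rl->count = 0;
	pthread_mutex_unlock(&rl->lock);
}
```

In this sketch a lookup for a FID that is absent, or that sits at the tail, compares every entry before the lock is released, so the cost of each lookup and the wait imposed on other lock users both grow linearly with the list length.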
While investigating a case from an HPE customer, I found more than a million RESTORE entries in cdt_restore_handle_list:
crash> list -o cdt_restore_handle.crh_list -H 0xffff8b71c5ede158 | wc -l
1218162
I also analyzed the entries, and the list appears to contain no duplicates, i.e. every FID in it is unique.
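The report does not say how the uniqueness check was done. One way to check a dumped set of FIDs for duplicates is to sort them and compare neighbours, sketched below under that assumption. struct fid_rec, fid_rec_cmp and count_duplicate_fids are hypothetical names introduced for illustration; they are not part of crash or of the mdt code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record holding one FID extracted from the dump. */
struct fid_rec {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Total order on FIDs: compare seq, then oid, then ver. */
static int fid_rec_cmp(const void *a, const void *b)
{
	const struct fid_rec *x = a;
	const struct fid_rec *y = b;

	if (x->seq != y->seq)
		return x->seq < y->seq ? -1 : 1;
	if (x->oid != y->oid)
		return x->oid < y->oid ? -1 : 1;
	if (x->ver != y->ver)
		return x->ver < y->ver ? -1 : 1;
	return 0;
}

/*
 * Sorts fids in place and returns how many entries repeat an earlier one.
 * A return value of 0 means every FID in the array is unique.
 */
static size_t count_duplicate_fids(struct fid_rec *fids, size_t n)
{
	size_t dups = 0;

	qsort(fids, n, sizeof(*fids), fid_rec_cmp);
	for (size_t i = 1; i < n; i++) {
		if (fid_rec_cmp(&fids[i - 1], &fids[i]) == 0)
			dups++;
	}
	return dups;
}
```

Sorting costs O(n log n) comparisons, which stays practical for a list of about 1.2 million entries, whereas comparing every pair would not.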