Description
On the CEA T100 system we have an issue similar to LU-5781: after a server node failover, the ptlrpcd_rcv thread on some client nodes uses 100% of one CPU. With perf we can see:
0.55% ptlrpcd_rcv [obdclass] [k] cl_page_at_trusted
    |
    --- cl_page_at_trusted
       |
       |--97.59%-- cl_page_gang_lookup
       |          osc_ldlm_weigh_ast
       |          osc_cancel_for_recovery
       |          ldlm_cancel_no_wait_policy
       |          ldlm_prepare_lru_list
       |          ldlm_cancel_lru_local
       |          ldlm_replay_locks
       |          ptlrpc_import_recovery_state_machine
       |          ptlrpc_connect_interpret
       |          ptlrpc_check_set
       |          ptlrpcd_check
       |          ptlrpcd
       |          kthread
       |          child_rip
       |
        --2.41%-- osc_ldlm_weigh_ast
                  osc_cancel_for_recovery
                  ldlm_cancel_no_wait_policy
                  ldlm_prepare_lru_list
                  ldlm_cancel_lru_local
                  ldlm_replay_locks
                  ptlrpc_import_recovery_state_machine
                  ptlrpc_connect_interpret
                  ptlrpc_check_set
                  ptlrpcd_check
                  ptlrpcd
                  kthread
                  child_rip
Some OSCs are in the following states:
/proc/fs/lustre/osc/ptmp2-OST0021-osc-ffff8801ff8b3800/state:current_state: CONNECTING
/proc/fs/lustre/osc/ptmp2-OST0022-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
/proc/fs/lustre/osc/ptmp2-OST0023-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
/proc/fs/lustre/osc/ptmp2-OST0024-osc-ffff8801ff8b3800/state:current_state: CONNECTING
/proc/fs/lustre/osc/ptmp2-OST0025-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
/proc/fs/lustre/osc/ptmp2-OST0026-osc-ffff8801ff8b3800/state:current_state: CONNECTING
/proc/fs/lustre/osc/ptmp2-OST0027-osc-ffff8801ff8b3800/state:current_state: CONNECTING
/proc/fs/lustre/osc/ptmp2-OST0028-osc-ffff8801ff8b3800/state:current_state: CONNECTING
/proc/fs/lustre/osc/ptmp2-OST0029-osc-ffff8801ff8b3800/state:current_state: REPLAY_LOCKS
/proc/fs/lustre/osc/ptmp2-OST002a-osc-ffff8801ff8b3800/state:current_state: CONNECTING
/proc/fs/lustre/osc/ptmp2-OST002b-osc-ffff8801ff8b3800/state:current_state: REPLAY_WAIT
In some cases, the import also returns to the FULL state after running either
- lctl set_param ldlm.namespaces.*.lru_size=clear
or
- echo 1 > /proc/sys/vm/drop_caches
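To see how many OSC imports are still stuck before and after trying either command, the per-OSC state listing can be summarised by state. This is only a minimal sketch: it assumes the listing comes from running grep over the state files (which is what the output format above suggests), and the function name summarise_osc_states is hypothetical, not part of Lustre.

```shell
# Count OSC imports per state.
# Reads lines of the form "<path>/state:current_state: <STATE>" on stdin,
# i.e. the format of the listing in this report (assumed to be grep output).
summarise_osc_states() {
    # The state name is the last whitespace-separated field on each line.
    awk '{ count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}

# Assumed usage on an affected client (path taken from the listing above):
#   grep -H current_state /proc/fs/lustre/osc/*/state | summarise_osc_states
```

An import that has recovered would show up under FULL; anything remaining under CONNECTING, REPLAY_LOCKS or REPLAY_WAIT is still in recovery.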
After an NMI, the ptlrpcd_rcv stack was:
crash> bt 15400
PID: 15400  TASK: ffff880c7c5a0100  CPU: 14  COMMAND: "ptlrpcd_rcv"
 #0 [ffff88088e4c7e90] crash_nmi_callback at ffffffff81030096
 #1 [ffff88088e4c7ea0] notifier_call_chain at ffffffff8152f9d5
 #2 [ffff88088e4c7ee0] atomic_notifier_call_chain at ffffffff8152fa3a
 #3 [ffff88088e4c7ef0] notify_die at ffffffff810a056e
 #4 [ffff88088e4c7f20] do_nmi at ffffffff8152d69b
 #5 [ffff88088e4c7f50] nmi at ffffffff8152cf60
    [exception RIP: cl_page_gang_lookup+292]
    RIP: ffffffffa04f18b4  RSP: ffff880c7cabb990  RFLAGS: 00000206
    RAX: 000000000000000a  RBX: 000000000000000b  RCX: 0000000000000000
    RDX: ffff880660a63da8  RSI: ffffffffa0af8740  RDI: ffff88065b15ae00
    RBP: ffff880c7cabba30   R8: 000000000000000e   R9: ffff880c7cabb950
    R10: 0000000000002362  R11: ffff88087a09e5d0  R12: ffff88065b15a800
    R13: ffff880660a63df8  R14: 000000000000000b  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880c7cabb990] cl_page_gang_lookup at ffffffffa04f18b4 [obdclass]
 #7 [ffff880c7cabba38] osc_ldlm_weigh_ast at ffffffffa095e9b7 [osc]
 #8 [ffff880c7cabbab8] osc_cancel_for_recovery at ffffffffa094305d [osc]
 #9 [ffff880c7cabbac8] ldlm_cancel_no_wait_policy at ffffffffa0637711 [ptlrpc]
#10 [ffff880c7cabbae8] ldlm_prepare_lru_list at ffffffffa063b61b [ptlrpc]
#11 [ffff880c7cabbb68] ldlm_cancel_lru_local at ffffffffa063ba34 [ptlrpc]
#12 [ffff880c7cabbb88] ldlm_replay_locks at ffffffffa063bbbc [ptlrpc]
#13 [ffff880c7cabbc08] ptlrpc_import_recovery_state_machine at ffffffffa06844f7 [ptlrpc]
#14 [ffff880c7cabbc68] ptlrpc_connect_interpret at ffffffffa0685659 [ptlrpc]
#15 [ffff880c7cabbd08] ptlrpc_check_set at ffffffffa065bbc1 [ptlrpc]
#16 [ffff880c7cabbda8] ptlrpcd_check at ffffffffa0687f9b [ptlrpc]
#17 [ffff880c7cabbe08] ptlrpcd at ffffffffa06884bb [ptlrpc]
#18 [ffff880c7cabbee8] kthread at ffffffff81099f56
#19 [ffff880c7cabbf48] kernel_thread at ffffffff8100c20a
Once the root cause is understood in LU-5781, we will need a version of the patch for Lustre 2.5.3.
Issue Links
- is related to LU-5781 endless loop in osc_lock_weight() (Resolved)