Details
-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
Lustre 2.12.4
-
None
-
3
-
9223372036854775807
Description
Switching back from "tbf uid" to fifo caused a soft lockup. I am including a backtrace of all threads from the crash dump.
From dmesg:
[-- MARK -- Mon Jan 25 15:00:00 2021]
[15694977.724675] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264]
[15694977.724677] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [mdt00_080:11250]
[15694977.724679] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [mdt00_109:11297]
[15694977.724681] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [mdt00_102:11285]
[15694977.724683] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [mdt00_034:11187]
[15694977.724685] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mdt00_016:11166]
[15694977.724687] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [mdt00_046:11201]
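For anyone triaging a similar log, the watchdog lines above can be turned into a list of stuck threads to inspect in crash. This is only a rough sketch, not part of the original report: the function name `parse_soft_lockups` and the regular expression are illustrative assumptions based on the message format shown here, and the sample is a two-line excerpt of the dmesg output.

```python
import re

# Matches one watchdog message of the form seen in this ticket, e.g.
# "BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

def parse_soft_lockups(dmesg_text):
    """Return (cpu, seconds, command, pid) tuples in log order.

    Hypothetical helper; works whether the messages are on separate
    lines or run together on one line.
    """
    return [
        (int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
        for m in SOFT_LOCKUP.finditer(dmesg_text)
    ]

# Two-message excerpt copied from the dmesg output in this ticket
sample = (
    "[15694977.724675] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264] "
    "[15694977.724681] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [mdt00_102:11285]"
)

for cpu, secs, comm, pid in parse_soft_lockups(sample):
    # Emit a crash command per stuck thread for follow-up inspection
    print(f"bt {pid}    # {comm} on CPU {cpu}, stuck for {secs}s")
```

Each printed line is a `bt <pid>` command that can be pasted into a crash session for the corresponding thread.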
I was able to get a crash dump.
All the hung threads are in the same state:
crash> bt 11285
PID: 11285 TASK: ffffa137e72d9070 CPU: 3 COMMAND: "mdt00_102"
#0 [ffffa117fecc8e48] crash_nmi_callback at ffffffffb7658017
#1 [ffffa117fecc8e58] nmi_handle at ffffffffb7d8593c
#2 [ffffa117fecc8eb0] do_nmi at ffffffffb7d85b5d
#3 [ffffa117fecc8ef0] end_repeat_nmi at ffffffffb7d84d9c
[exception RIP: native_queued_spin_lock_slowpath+344]
RIP: ffffffffb7717478 RSP: ffffa1375e5e3d38 RFLAGS: 00000202
RAX: 0000000000000101 RBX: ffffa117fb5e1108 RCX: 0000000000190000
RDX: 0000000000590101 RSI: 0000000000000101 RDI: ffffa117fb5e1108
RBP: ffffa1375e5e3d38 R8: ffffa117fecdb880 R9: 0000000000000000
R10: ffffffffc0d37e40 R11: ffffa117fb5e1108 R12: 0000000000000000
R13: ffffa0f8eb8a3b80 R14: ffffa0f8eb8a3b80 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffffa1375e5e3d38] native_queued_spin_lock_slowpath at ffffffffb7717478
#5 [ffffa1375e5e3d40] queued_spin_lock_slowpath at ffffffffb7d7546a
#6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350
#7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]
#8 [ffffa1375e5e3d98] ptlrpc_nrs_req_initialize at ffffffffc1039f13 [ptlrpc]
#9 [ffffa1375e5e3db0] ptlrpc_server_handle_req_in at ffffffffc1004c21 [ptlrpc]
#10 [ffffa1375e5e3df8] ptlrpc_main at ffffffffc1008d65 [ptlrpc]
#11 [ffffa1375e5e3ec8] kthread at ffffffffb76c61f1
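To check that several hung threads really share the same state, their backtraces can be reduced to a call chain and compared. The following is a minimal sketch under stated assumptions: `call_chain` and `first_module_frame` are hypothetical helpers, the frame regex is inferred from the crash output above, and the sample is a shortened excerpt of the backtrace for PID 11285.

```python
import re

# One crash "bt" frame, e.g.
# "#7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]"
# The trailing "[module]" is optional (core kernel frames have none).
FRAME = re.compile(r"#(\d+) \[[0-9a-f]+\] (\S+) at [0-9a-f]+(?: \[(\w+)\])?")

NMI_MARKER = "--- <NMI exception stack> ---"

def call_chain(bt_text):
    """Return [(function, module_or_None), ...], innermost frame first.

    Frames above the NMI marker belong to the NMI handler that fired on
    the stuck CPU, so only the text after the marker is parsed. If the
    marker is absent, the whole text is parsed.
    """
    _, _, tail = bt_text.rpartition(NMI_MARKER)
    return [(m.group(2), m.group(3)) for m in FRAME.finditer(tail)]

def first_module_frame(chain):
    """Return the innermost frame that lives in a loadable module, or None."""
    return next((frame for frame in chain if frame[1] is not None), None)

# Shortened excerpt of the backtrace in this ticket
sample = """\
#3 [ffffa117fecc8ef0] end_repeat_nmi at ffffffffb7d84d9c
--- <NMI exception stack> ---
#4 [ffffa1375e5e3d38] native_queued_spin_lock_slowpath at ffffffffb7717478
#5 [ffffa1375e5e3d40] queued_spin_lock_slowpath at ffffffffb7d7546a
#6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350
#7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]
#8 [ffffa1375e5e3d98] ptlrpc_nrs_req_initialize at ffffffffc1039f13 [ptlrpc]
"""

chain = call_chain(sample)
print(first_module_frame(chain))
```

Two threads whose `call_chain` results are equal are spinning at the same place; here the innermost module frame is the `_raw_spin_lock` caller, `nrs_resource_get_safe` in ptlrpc.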
Activity
Link | Original: This issue is related to JFC-21 [ JFC-21 ] |
Link | New: This issue is related to JFC-21 [ JFC-21 ] |
Assignee | Original: WC Triage [ wc-triage ] | New: Li Xi [ lixi_wc ] |
Description | Original: Switching back to "tbf uid" to fifo caused soft lockup. [...] All the threads are in the same state [...] | New: Switching back from "tbf uid" to fifo caused soft lockup. [...] All the hung threads are in the same state [...] | (The dmesg output and backtrace are identical in both versions and match the Description section.)
Summary | Original: Switching tbf uid to fifo caused soft lockup | New: Switching QoS from tbf uid to fifo caused soft lockup |
Summary | Original: Switching tbf uid to fifo causes soft lockup | New: Switching tbf uid to fifo caused soft lockup |
Summary | Original: Switching tbf uid to fio causes soft lockup | New: Switching tbf uid to fifo causes soft lockup |