[LU-14364] Switching QoS from tbf uid to fifo caused soft lockup Created: 26/Jan/21 Updated: 12/Nov/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Li Xi |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Switching back from "tbf uid" to fifo caused soft lockup. Including backtrace of all threads from the crash dump. From dmesg [-- MARK -- Mon Jan 25 15:00:00 2021] [15694977.724675] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264] [15694977.724677] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [mdt00_080:11250] [15694977.724679] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [mdt00_109:11297] [15694977.724681] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [mdt00_102:11285] [15694977.724683] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [mdt00_034:11187] [15694977.724685] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mdt00_016:11166] [15694977.724687] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [mdt00_046:11201] I was able to get a crash dump.
crash> bt 11285
PID: 11285 TASK: ffffa137e72d9070 CPU: 3 COMMAND: "mdt00_102"
#0 [ffffa117fecc8e48] crash_nmi_callback at ffffffffb7658017
#1 [ffffa117fecc8e58] nmi_handle at ffffffffb7d8593c
#2 [ffffa117fecc8eb0] do_nmi at ffffffffb7d85b5d
#3 [ffffa117fecc8ef0] end_repeat_nmi at ffffffffb7d84d9c
[exception RIP: native_queued_spin_lock_slowpath+344]
RIP: ffffffffb7717478 RSP: ffffa1375e5e3d38 RFLAGS: 00000202
RAX: 0000000000000101 RBX: ffffa117fb5e1108 RCX: 0000000000190000
RDX: 0000000000590101 RSI: 0000000000000101 RDI: ffffa117fb5e1108
RBP: ffffa1375e5e3d38 R8: ffffa117fecdb880 R9: 0000000000000000
R10: ffffffffc0d37e40 R11: ffffa117fb5e1108 R12: 0000000000000000
R13: ffffa0f8eb8a3b80 R14: ffffa0f8eb8a3b80 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffffa1375e5e3d38] native_queued_spin_lock_slowpath at ffffffffb7717478
#5 [ffffa1375e5e3d40] queued_spin_lock_slowpath at ffffffffb7d7546a
#6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350
#7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]
#8 [ffffa1375e5e3d98] ptlrpc_nrs_req_initialize at ffffffffc1039f13 [ptlrpc]
#9 [ffffa1375e5e3db0] ptlrpc_server_handle_req_in at ffffffffc1004c21 [ptlrpc]
#10 [ffffa1375e5e3df8] ptlrpc_main at ffffffffc1008d65 [ptlrpc]
#11 [ffffa1375e5e3ec8] kthread at ffffffffb76c61f1
|
| Comments |
| Comment by Peter Jones [ 26/Jan/21 ] |
|
Li Xi Could you please advise? Thanks Peter |
| Comment by Alex Zhuravlev [ 05/Nov/21 ] |
|
I see this very often in my testing [ 1090.821759] Lustre: DEBUG MARKER: == sanityn test 78: Enable policy and specify tunings right away ========================================================== 17:47:37 (1636134457) [ 1091.415104] BUG: sleeping function called from invalid context at /home/lustre/master-mine/libcfs/libcfs/hash.c:1155 [ 1091.415287] in_atomic(): 1, irqs_disabled(): 0, pid: 48457, name: lctl [ 1091.415326] INFO: lockdep is turned off. [ 1091.415348] CPU: 0 PID: 48457 Comm: lctl Tainted: P W O --------- --- 4.18.0 #43 [ 1091.415398] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 1091.415431] Call Trace: [ 1091.415451] dump_stack+0x85/0xc0 [ 1091.415479] ___might_sleep.cold.15+0xac/0xbc [ 1091.415522] cfs_hash_putref+0x277/0x4d0 [libcfs] [ 1091.415712] nrs_orr_stop+0x60/0x240 [ptlrpc] [ 1091.415858] nrs_policy_stop0+0x30/0x1b0 [ptlrpc] [ 1091.415945] nrs_policy_stop_primary.isra.3+0x17c/0x1c0 [ptlrpc] [ 1091.416035] nrs_policy_start_locked+0x466/0x670 [ptlrpc] [ 1091.416211] nrs_policy_ctl+0x1e9/0x2c0 [ptlrpc] [ 1091.416360] ptlrpc_nrs_policy_control+0x110/0x2f0 [ptlrpc] [ 1091.416509] ptlrpc_lprocfs_nrs_policies_seq_write+0x43c/0x590 [ptlrpc] [ 1091.416647] full_proxy_write+0x4b/0x70 [ 1091.416728] __vfs_write+0x1e/0x160 [ 1091.416807] ? rcu_sync_lockdep_assert+0x9/0x50 [ 1091.429057] ? __sb_start_write+0x147/0x1b0 [ 1091.429136] ? vfs_write+0x182/0x1b0 [ 1091.429208] vfs_write+0xb9/0x1b0 |
| Comment by Alex Zhuravlev [ 05/Nov/21 ] |
|
and even with cond_resched() removed it still fails: [ 1171.884805] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:748 [ 1171.885071] in_atomic(): 1, irqs_disabled(): 0, pid: 87261, name: lctl [ 1171.885187] INFO: lockdep is turned off. [ 1171.885260] CPU: 1 PID: 87261 Comm: lctl Tainted: P W O --------- --- 4.18.0 #44 [ 1171.885413] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 1171.885523] Call Trace: [ 1171.885577] dump_stack+0x85/0xc0 [ 1171.885650] ___might_sleep.cold.15+0xac/0xbc [ 1171.885742] __mutex_lock+0x42/0x9e0 [ 1171.885814] ? ___cache_free+0x323/0x4a0 [ 1171.885886] ? kmem_cache_destroy+0x1b/0xf0 [ 1171.885961] kmem_cache_destroy+0x1b/0xf0 [ 1171.886094] nrs_orr_stop+0x69/0x240 [ptlrpc] [ 1171.886236] nrs_policy_stop0+0x30/0x1b0 [ptlrpc] [ 1171.886376] nrs_policy_stop_primary.isra.3+0x17c/0x1c0 [ptlrpc] [ 1171.886536] nrs_policy_start_locked+0x466/0x670 [ptlrpc] [ 1171.886676] nrs_policy_ctl+0x1e9/0x2c0 [ptlrpc] [ 1171.886815] ptlrpc_nrs_policy_control+0x110/0x2f0 [ptlrpc] I think we should start to use CONFIG_DEBUG_* things in AT and reject changes breaking obvious rules like this. |
| Comment by Etienne Aujames [ 12/Nov/21 ] |
|
nrs_policy_stop0() should not be call with the "nrs->nrs_lock" spinlock because nrs_orr_stop()/nrs_tbf_stop() etc... free memory. |
| Comment by Li Xi [ 12/Nov/21 ] |
|
green Thanks for the dump stack. Yes, rhashtable_free_and_destroy() indeed can sleep. I will check how to fix it. |
| Comment by Gerrit Updater [ 12/Nov/21 ] |
|
"Li Xi <lixi@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45554 |