Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
Multiple nodes down with spinlock contention
c0-0c0s14n1-2204300051.cdump:
crash_x86_64> sys
KERNEL: vmlinux-5.3.18-59.34_7.0.4.6-cray_ari_c
DUMPFILE: c0-0c0s14n1-2204300051.cdump [PARTIAL DUMP]
CPUS: 256
DATE: Fri Apr 29 15:59:58 CDT 2022
UPTIME: 02:20:22
LOAD AVERAGE: 214.66, 137.29, 95.43
TASKS: 2710
NODENAME: nid00057
RELEASE: 5.3.18-59.34_7.0.4.6-cray_ari_c
VERSION: #1 SMP Wed Apr 27 03:52:58 UTC 2022 (cce0346)
MACHINE: x86_64 (1300 Mhz)
MEMORY: 95.9 GB
PANIC: ""
crash_x86_64>
crash_x86_64> epython rcu
sched struct rcu_state ffffffff82059000 in progress (last activity 4296989573/ period end 4296989561) for 8243 jiffies
show the queue length with -l
cpu 255 sched PENDING, not stalled yet (not completed)
** Execution took 0.02s (real) 0.02s (CPU)
crash_x86_64>
The rcu thread has not run for 8243 jiffies, or about 32 seconds.
There are 135 CPUs currently spinning on a lock; most of them are in the following stack trace:
crash_x86_64> bt -c 255
PID: 34632 TASK: ffff889708048940 CPU: 255 COMMAND: "ldlm_bl_10"
[exception RIP: queued_spin_lock_slowpath+377]
RIP: ffffffff810d1a49 RSP: ffffc9000fca7aa0 RFLAGS: 00000046
RAX: 0000000000000000 RBX: ffff88972ed9b060 RCX: 0000000004000000
RDX: ffff8897ddbe9dc0 RSI: 00000000000000ab RDI: ffff88972ed9b060
RBP: ffffc9000fca7aa0 R8: 0000000004000000 R9: 0000000000028a80
R10: ffffc9000fca79a0 R11: 000000000000010e R12: 0000000000000202
R13: 0000000000000202 R14: 0000000000000000 R15: 0000000000000003
CS: 0010 SS: 0018
#0 [ffffc9000fca7aa8] _raw_spin_lock_irqsave at ffffffff81703da7
#1 [ffffc9000fca7ac8] __wake_up_common_lock at ffffffff810c5dc3
#2 [ffffc9000fca7b38] __wake_up at ffffffff810c5e33
#3 [ffffc9000fca7b48] obd_put_mod_rpc_slot at ffffffffa04f5964 [obdclass]
#4 [ffffc9000fca7b68] ptlrpc_put_mod_rpc_slot at ffffffffa067fd34 [ptlrpc]
#5 [ffffc9000fca7b90] mdc_close at ffffffffa08783ac [mdc]
#6 [ffffc9000fca7be0] lmv_close at ffffffffa08b7d82 [lmv]
#7 [ffffc9000fca7c20] ll_close_inode_openhandle at ffffffffa08f7641 [lustre]
#8 [ffffc9000fca7c78] ll_md_real_close at ffffffffa08fa7ce [lustre]
#9 [ffffc9000fca7ca8] ll_md_blocking_ast at ffffffffa092db4d [lustre]
#10 [ffffc9000fca7d10] ldlm_cancel_callback at ffffffffa0660128 [ptlrpc]
#11 [ffffc9000fca7d68] ldlm_cli_cancel_local at ffffffffa066d8e5 [ptlrpc]
#12 [ffffc9000fca7d90] ldlm_cli_cancel at ffffffffa0672d2b [ptlrpc]
#13 [ffffc9000fca7df8] ll_md_blocking_ast at ffffffffa092d510 [lustre]
#14 [ffffc9000fca7e60] ldlm_handle_bl_callback at ffffffffa0675c30 [ptlrpc]
#15 [ffffc9000fca7e88] ldlm_bl_thread_main at ffffffffa067641a [ptlrpc]
#16 [ffffc9000fca7f08] kthread at ffffffff810a2400
#17 [ffffc9000fca7f50] ret_from_fork at ffffffff8180021a
crash_x86_64>
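The two figures quoted in the analysis (the stall duration in seconds and the number of CPUs spinning on a lock) can be cross-checked with a small script. This is only a sketch under stated assumptions: the tick rate of 250 Hz is assumed, since the ticket does not give the kernel's CONFIG_HZ, and the helper names `jiffies_to_seconds` and `count_exception_rips` are hypothetical. The second helper assumes the output of `bt -a` was saved as text outside of crash.

```python
import re
from collections import Counter

# Assumption: 250 ticks per second. The ticket does not state CONFIG_HZ;
# 250 is chosen only because it puts 8243 jiffies close to the quoted 32 s.
ASSUMED_HZ = 250


def jiffies_to_seconds(jiffies, hz=ASSUMED_HZ):
    """Convert a jiffies count to seconds for the given tick rate."""
    return jiffies / hz


# Matches lines such as: [exception RIP: queued_spin_lock_slowpath+377]
RIP_RE = re.compile(r"\[exception RIP: ([A-Za-z0-9_.]+)\+\d+\]")


def count_exception_rips(bt_text):
    """Count how many backtraces in bt_text stopped in each function."""
    return Counter(RIP_RE.findall(bt_text))


# Sample input taken from the single backtrace shown in this ticket.
sample = """\
PID: 34632 TASK: ffff889708048940 CPU: 255 COMMAND: "ldlm_bl_10"
[exception RIP: queued_spin_lock_slowpath+377]
"""

print(jiffies_to_seconds(8243))
print(count_exception_rips(sample).most_common(1))
```

Run against the full per-CPU backtrace output instead of the one-task sample, the counter should show how many CPUs are sitting in queued_spin_lock_slowpath, which is the number to compare against the 135 reported above.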