Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
Multiple nodes down with spinlock contention
c0-0c0s14n1-2204300051.cdump:

crash_x86_64> sys
      KERNEL: vmlinux-5.3.18-59.34_7.0.4.6-cray_ari_c
    DUMPFILE: c0-0c0s14n1-2204300051.cdump  [PARTIAL DUMP]
        CPUS: 256
        DATE: Fri Apr 29 15:59:58 CDT 2022
      UPTIME: 02:20:22
LOAD AVERAGE: 214.66, 137.29, 95.43
       TASKS: 2710
    NODENAME: nid00057
     RELEASE: 5.3.18-59.34_7.0.4.6-cray_ari_c
     VERSION: #1 SMP Wed Apr 27 03:52:58 UTC 2022 (cce0346)
     MACHINE: x86_64  (1300 Mhz)
      MEMORY: 95.9 GB
       PANIC: ""
crash_x86_64>
crash_x86_64> epython rcu sched struct rcu_state ffffffff82059000 in progress (last activity 4296989573/ period end 4296989561) for 8243 jiffies
show the queue length with -l
cpu 255 sched PENDING, not stalled yet (not completed)
** Execution took 0.02s (real) 0.02s (CPU)
crash_x86_64>

The RCU thread has not run for almost 8243 jiffies, or 32 seconds.

There are 135 CPUs currently spinning on some lock; most of these are in the following stack trace:

crash_x86_64> bt -c 255
PID: 34632  TASK: ffff889708048940  CPU: 255  COMMAND: "ldlm_bl_10"
    [exception RIP: queued_spin_lock_slowpath+377]
    RIP: ffffffff810d1a49  RSP: ffffc9000fca7aa0  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: ffff88972ed9b060  RCX: 0000000004000000
    RDX: ffff8897ddbe9dc0  RSI: 00000000000000ab  RDI: ffff88972ed9b060
    RBP: ffffc9000fca7aa0   R8: 0000000004000000   R9: 0000000000028a80
    R10: ffffc9000fca79a0  R11: 000000000000010e  R12: 0000000000000202
    R13: 0000000000000202  R14: 0000000000000000  R15: 0000000000000003
    CS: 0010  SS: 0018
 #0 [ffffc9000fca7aa8] _raw_spin_lock_irqsave at ffffffff81703da7
 #1 [ffffc9000fca7ac8] __wake_up_common_lock at ffffffff810c5dc3
 #2 [ffffc9000fca7b38] __wake_up at ffffffff810c5e33
 #3 [ffffc9000fca7b48] obd_put_mod_rpc_slot at ffffffffa04f5964 [obdclass]
 #4 [ffffc9000fca7b68] ptlrpc_put_mod_rpc_slot at ffffffffa067fd34 [ptlrpc]
 #5 [ffffc9000fca7b90] mdc_close at ffffffffa08783ac [mdc]
 #6 [ffffc9000fca7be0] lmv_close at ffffffffa08b7d82 [lmv]
 #7 [ffffc9000fca7c20] ll_close_inode_openhandle at ffffffffa08f7641 [lustre]
 #8 [ffffc9000fca7c78] ll_md_real_close at ffffffffa08fa7ce [lustre]
 #9 [ffffc9000fca7ca8] ll_md_blocking_ast at ffffffffa092db4d [lustre]
#10 [ffffc9000fca7d10] ldlm_cancel_callback at ffffffffa0660128 [ptlrpc]
#11 [ffffc9000fca7d68] ldlm_cli_cancel_local at ffffffffa066d8e5 [ptlrpc]
#12 [ffffc9000fca7d90] ldlm_cli_cancel at ffffffffa0672d2b [ptlrpc]
#13 [ffffc9000fca7df8] ll_md_blocking_ast at ffffffffa092d510 [lustre]
#14 [ffffc9000fca7e60] ldlm_handle_bl_callback at ffffffffa0675c30 [ptlrpc]
#15 [ffffc9000fca7e88] ldlm_bl_thread_main at ffffffffa067641a [ptlrpc]
#16 [ffffc9000fca7f08] kthread at ffffffff810a2400
#17 [ffffc9000fca7f50] ret_from_fork at ffffffff8180021a
crash_x86_64>
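
For reference, the conversion of 8243 jiffies to roughly 32 seconds can be checked with a short sketch. The dump does not show the kernel tick rate, so the value of 250 ticks per second used here (ASSUMED_HZ) is an assumption, chosen because it is consistent with the figure quoted in this ticket; the helper name jiffies_to_seconds is hypothetical and is not part of the crash tooling.

```python
# Assumption: CONFIG_HZ is 250 on this kernel. The ticket does not state it;
# this value is picked only because it matches the "32 seconds" quoted above.
ASSUMED_HZ = 250


def jiffies_to_seconds(jiffies: int, hz: int = ASSUMED_HZ) -> float:
    """Convert a jiffies count into seconds for a given tick rate."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return jiffies / hz


# Duration reported by the rcu script for the in-progress grace period.
stall_jiffies = 8243
stall_seconds = jiffies_to_seconds(stall_jiffies)
print(f"{stall_jiffies} jiffies at HZ={ASSUMED_HZ} is about {stall_seconds:.2f} s")
```

Under that assumption the grace period has been in progress for just under 33 seconds, which truncates to the 32 seconds quoted in the description. With a different tick rate the result would differ, so the actual CONFIG_HZ of the running kernel should be confirmed before relying on the number.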