[LU-15947] Spinlock contention during wake_up_all() in obd_put_mod_rpc_slot()
Created: 15/Jun/22
Updated: 11/Dec/23
Resolved: 09/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Major
Reporter: Shaun Tancheff
Assignee: Shaun Tancheff
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17197 Performance regression with "LU-15947... Resolved
is related to LU-16633 obd_get_mod_rpc_slot() is vulnerable ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Multiple nodes down with spinlock contention

c0-0c0s14n1-2204300051.cdump:crash_x86_64> sys
      KERNEL: vmlinux-5.3.18-59.34_7.0.4.6-cray_ari_c
    DUMPFILE: c0-0c0s14n1-2204300051.cdump  [PARTIAL DUMP]
        CPUS: 256
        DATE: Fri Apr 29 15:59:58 CDT 2022
      UPTIME: 02:20:22
LOAD AVERAGE: 214.66, 137.29, 95.43
       TASKS: 2710
    NODENAME: nid00057
     RELEASE: 5.3.18-59.34_7.0.4.6-cray_ari_c
     VERSION: #1 SMP Wed Apr 27 03:52:58 UTC 2022 (cce0346)
     MACHINE: x86_64  (1300 Mhz)
      MEMORY: 95.9 GB
       PANIC: ""
crash_x86_64>
crash_x86_64> epython rcu
sched struct rcu_state ffffffff82059000 in progress (last activity 4296989573/ period end 4296989561) for 8243 jiffies
  show the queue length with -l
cpu 255 sched PENDING, not stalled yet (not completed)
** Execution took   0.02s (real)   0.02s (CPU)
crash_x86_64>

The RCU thread has not run for almost 8243 jiffies, or 32 seconds. There are 135 CPUs currently spinning on a lock; most of them are in the following stack trace:

crash_x86_64> bt -c 255
PID: 34632  TASK: ffff889708048940  CPU: 255  COMMAND: "ldlm_bl_10"
    [exception RIP: queued_spin_lock_slowpath+377]
    RIP: ffffffff810d1a49  RSP: ffffc9000fca7aa0  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: ffff88972ed9b060  RCX: 0000000004000000
    RDX: ffff8897ddbe9dc0  RSI: 00000000000000ab  RDI: ffff88972ed9b060
    RBP: ffffc9000fca7aa0   R8: 0000000004000000   R9: 0000000000028a80
    R10: ffffc9000fca79a0  R11: 000000000000010e  R12: 0000000000000202
    R13: 0000000000000202  R14: 0000000000000000  R15: 0000000000000003
    CS: 0010  SS: 0018
 #0 [ffffc9000fca7aa8] _raw_spin_lock_irqsave at ffffffff81703da7
 #1 [ffffc9000fca7ac8] __wake_up_common_lock at ffffffff810c5dc3
 #2 [ffffc9000fca7b38] __wake_up at ffffffff810c5e33
 #3 [ffffc9000fca7b48] obd_put_mod_rpc_slot at ffffffffa04f5964 [obdclass]
 #4 [ffffc9000fca7b68] ptlrpc_put_mod_rpc_slot at ffffffffa067fd34 [ptlrpc]
 #5 [ffffc9000fca7b90] mdc_close at ffffffffa08783ac [mdc]
 #6 [ffffc9000fca7be0] lmv_close at ffffffffa08b7d82 [lmv]
 #7 [ffffc9000fca7c20] ll_close_inode_openhandle at ffffffffa08f7641 [lustre]
 #8 [ffffc9000fca7c78] ll_md_real_close at ffffffffa08fa7ce [lustre]
 #9 [ffffc9000fca7ca8] ll_md_blocking_ast at ffffffffa092db4d [lustre]
#10 [ffffc9000fca7d10] ldlm_cancel_callback at ffffffffa0660128 [ptlrpc]
#11 [ffffc9000fca7d68] ldlm_cli_cancel_local at ffffffffa066d8e5 [ptlrpc]
#12 [ffffc9000fca7d90] ldlm_cli_cancel at ffffffffa0672d2b [ptlrpc]
#13 [ffffc9000fca7df8] ll_md_blocking_ast at ffffffffa092d510 [lustre]
#14 [ffffc9000fca7e60] ldlm_handle_bl_callback at ffffffffa0675c30 [ptlrpc]
#15 [ffffc9000fca7e88] ldlm_bl_thread_main at ffffffffa067641a [ptlrpc]
#16 [ffffc9000fca7f08] kthread at ffffffff810a2400
#17 [ffffc9000fca7f50] ret_from_fork at ffffffff8180021a
crash_x86_64>
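
To make the contention pattern in the trace above easier to reason about, here is a toy model in Python. It is only a sketch and is not Lustre code or the landed patch: the function name simulate, the single-slot-per-release assumption, and the use of 135 as the waiter count (borrowed from the CPU count above purely for scale) are all illustrative assumptions. It counts how many wait-queue entries a releaser would have to visit while holding the wait-queue lock, first when every waiter is woken on each release (the wake_up_all() style), then when only one waiter is woken per freed slot.

```python
from collections import deque


def simulate(waiters: int, wake_all: bool) -> int:
    """Return how many wait-queue entries are visited to drain the queue.

    Hypothetical model: each step frees exactly one slot. The releaser
    walks the wait queue while holding its lock, so entries visited is a
    rough proxy for time spent under that lock.
    """
    queue = deque(range(waiters))
    visited = 0
    while queue:
        if wake_all:
            # Wake-all style: every sleeper is visited and woken,
            # although only one of them can take the freed slot.
            visited += len(queue)
        else:
            # Targeted style: only the head of the queue is woken.
            visited += 1
        queue.popleft()  # exactly one waiter obtains the freed slot
    return visited


if __name__ == "__main__":
    print("wake-all :", simulate(135, wake_all=True))
    print("targeted :", simulate(135, wake_all=False))
```

In this model the wake-all policy grows quadratically with the number of waiters, while the targeted policy grows linearly. Real behaviour also depends on how many slots are available and on how woken tasks are scheduled, which the sketch ignores.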


 Comments   
Comment by Gerrit Updater [ 15/Jun/22 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47634
Subject: LU-15947 ptlrpc: Sort waiters on close_req completion
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4eb788baf0bdd5241999f101e17c00c275e067e0

Comment by Gerrit Updater [ 08/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/44041/
Subject: LU-15947 obdclass: improve precision of wakeups for mod_rpcs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5243630b09d22e0b576d81390d604774881f63f7

Comment by Cory Spitz [ 09/Mar/23 ]

I think we can consider this resolved with the landing of https://review.whamcloud.com/c/fs/lustre-release/+/44041.

Comment by Cory Spitz [ 09/Mar/23 ]

stancheff, do you agree? ^^^
Please open a new ticket if there is remaining work.

Comment by Gerrit Updater [ 03/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51538
Subject: LU-15947 obdclass: improve precision of wakeups for mod_rpcs
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 54e1ad7ad3c8f894d0805d9765f341e112c38afe
