LU-15947

Spinlock contention during wake_up_all() in obd_put_mod_rpc_slot()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.16.0

    Description

      Multiple nodes went down with spinlock contention.

      c0-0c0s14n1-2204300051.cdump:
      crash_x86_64> sys
            KERNEL: vmlinux-5.3.18-59.34_7.0.4.6-cray_ari_c
          DUMPFILE: c0-0c0s14n1-2204300051.cdump  [PARTIAL DUMP]
              CPUS: 256
              DATE: Fri Apr 29 15:59:58 CDT 2022
            UPTIME: 02:20:22
      LOAD AVERAGE: 214.66, 137.29, 95.43
             TASKS: 2710
          NODENAME: nid00057
           RELEASE: 5.3.18-59.34_7.0.4.6-cray_ari_c
           VERSION: #1 SMP Wed Apr 27 03:52:58 UTC 2022 (cce0346)
           MACHINE: x86_64  (1300 Mhz)
            MEMORY: 95.9 GB
             PANIC: ""
      crash_x86_64>
      crash_x86_64> epython rcu
      sched struct rcu_state ffffffff82059000 in progress (last activity 4296989573/ period end 4296989561) for 8243 jiffies
        show the queue length with -l
      cpu 255 sched PENDING, not stalled yet (not completed)
      ** Execution took   0.02s (real)   0.02s (CPU)
      crash_x86_64>

      The RCU thread has not run for 8243 jiffies, roughly 32 seconds. 135 CPUs are currently spinning on a lock; most of them are in the following stack trace:

      crash_x86_64> bt -c 255
      PID: 34632  TASK: ffff889708048940  CPU: 255  COMMAND: "ldlm_bl_10"
          [exception RIP: queued_spin_lock_slowpath+377]
          RIP: ffffffff810d1a49  RSP: ffffc9000fca7aa0  RFLAGS: 00000046
          RAX: 0000000000000000  RBX: ffff88972ed9b060  RCX: 0000000004000000
          RDX: ffff8897ddbe9dc0  RSI: 00000000000000ab  RDI: ffff88972ed9b060
          RBP: ffffc9000fca7aa0   R8: 0000000004000000   R9: 0000000000028a80
          R10: ffffc9000fca79a0  R11: 000000000000010e  R12: 0000000000000202
          R13: 0000000000000202  R14: 0000000000000000  R15: 0000000000000003
          CS: 0010  SS: 0018
       #0 [ffffc9000fca7aa8] _raw_spin_lock_irqsave at ffffffff81703da7
       #1 [ffffc9000fca7ac8] __wake_up_common_lock at ffffffff810c5dc3
       #2 [ffffc9000fca7b38] __wake_up at ffffffff810c5e33
       #3 [ffffc9000fca7b48] obd_put_mod_rpc_slot at ffffffffa04f5964 [obdclass]
       #4 [ffffc9000fca7b68] ptlrpc_put_mod_rpc_slot at ffffffffa067fd34 [ptlrpc]
       #5 [ffffc9000fca7b90] mdc_close at ffffffffa08783ac [mdc]
       #6 [ffffc9000fca7be0] lmv_close at ffffffffa08b7d82 [lmv]
       #7 [ffffc9000fca7c20] ll_close_inode_openhandle at ffffffffa08f7641 [lustre]
       #8 [ffffc9000fca7c78] ll_md_real_close at ffffffffa08fa7ce [lustre]
       #9 [ffffc9000fca7ca8] ll_md_blocking_ast at ffffffffa092db4d [lustre]
      #10 [ffffc9000fca7d10] ldlm_cancel_callback at ffffffffa0660128 [ptlrpc]
      #11 [ffffc9000fca7d68] ldlm_cli_cancel_local at ffffffffa066d8e5 [ptlrpc]
      #12 [ffffc9000fca7d90] ldlm_cli_cancel at ffffffffa0672d2b [ptlrpc]
      #13 [ffffc9000fca7df8] ll_md_blocking_ast at ffffffffa092d510 [lustre]
      #14 [ffffc9000fca7e60] ldlm_handle_bl_callback at ffffffffa0675c30 [ptlrpc]
      #15 [ffffc9000fca7e88] ldlm_bl_thread_main at ffffffffa067641a [ptlrpc]
      #16 [ffffc9000fca7f08] kthread at ffffffff810a2400
      #17 [ffffc9000fca7f50] ret_from_fork at ffffffff8180021a
      crash_x86_64>
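
      One way to check how many CPUs share the stack above is to tally the per-CPU backtraces outside of crash. The following is a minimal sketch, assuming the output of `bt -a` has been saved as text in the same layout as the trace above. The helper names `parse_backtraces` and `spinning_cpus` and the trimmed `SAMPLE` text are hypothetical and not part of the original analysis.

```python
import re

# Header of each per-CPU backtrace, e.g.
# PID: 34632  TASK: ffff889708048940  CPU: 255  COMMAND: "ldlm_bl_10"
HEADER = re.compile(r'PID:\s*\d+\s+TASK:\s*[0-9a-f]+\s+CPU:\s*(\d+)\s+COMMAND:')
# Exception line, e.g. [exception RIP: queued_spin_lock_slowpath+377]
EXC_RIP = re.compile(r'\[exception RIP:\s*([\w.]+)')
# Frame line, e.g. #3 [ffffc9000fca7b48] obd_put_mod_rpc_slot at ffffffffa04f5964 [obdclass]
FRAME = re.compile(r'#\d+\s+\[[0-9a-f]+\]\s+([\w.]+)\s+at\s+[0-9a-f]+')


def parse_backtraces(bt_text):
    """Return {cpu: {"rip": symbol or None, "frames": [symbols]}}."""
    stacks = {}
    current = None
    for line in bt_text.splitlines():
        header = HEADER.search(line)
        if header:
            cpu = int(header.group(1))
            current = stacks.setdefault(cpu, {"rip": None, "frames": []})
            continue
        if current is None:
            continue  # skip anything before the first PID/CPU header
        rip = EXC_RIP.search(line)
        if rip:
            if current["rip"] is None:  # keep the first exception RIP seen
                current["rip"] = rip.group(1)
            continue
        frame = FRAME.search(line)
        if frame:
            current["frames"].append(frame.group(1))
    return stacks


def spinning_cpus(stacks, spin_symbol="queued_spin_lock_slowpath", caller=None):
    """CPUs whose exception RIP is spin_symbol, optionally filtered by a caller frame."""
    cpus = []
    for cpu, stack in stacks.items():
        if stack["rip"] != spin_symbol:
            continue
        if caller is not None and caller not in stack["frames"]:
            continue
        cpus.append(cpu)
    return sorted(cpus)


# Trimmed copy of the CPU 255 trace from this ticket, used only as sample input.
SAMPLE = """\
PID: 34632  TASK: ffff889708048940  CPU: 255  COMMAND: "ldlm_bl_10"
    [exception RIP: queued_spin_lock_slowpath+377]
 #0 [ffffc9000fca7aa8] _raw_spin_lock_irqsave at ffffffff81703da7
 #1 [ffffc9000fca7ac8] __wake_up_common_lock at ffffffff810c5dc3
 #2 [ffffc9000fca7b38] __wake_up at ffffffff810c5e33
 #3 [ffffc9000fca7b48] obd_put_mod_rpc_slot at ffffffffa04f5964 [obdclass]
"""

if __name__ == "__main__":
    stacks = parse_backtraces(SAMPLE)
    print(spinning_cpus(stacks, caller="obd_put_mod_rpc_slot"))
```

      Run against a full `bt -a` capture, the length of the list returned with `caller="obd_put_mod_rpc_slot"` would show how many of the spinning CPUs arrived through the wake-up path in this ticket.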
      
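
      The 8243-jiffy figure reported by `epython rcu` converts to wall-clock time by dividing by the kernel tick rate. The sketch below shows that arithmetic for the common CONFIG_HZ choices. The tick rate of this kernel is not stated in the ticket, so every HZ value here is an assumption; the helper name `jiffies_to_seconds` is hypothetical.

```python
def jiffies_to_seconds(jiffies, hz):
    """Convert a jiffy count to seconds for a given tick rate (ticks per second)."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return jiffies / hz


# Stall length taken from the epython rcu output in this ticket.
STALL_JIFFIES = 8243

if __name__ == "__main__":
    # Assumed tick rates: the usual CONFIG_HZ options, not values from the dump.
    for hz in (100, 250, 300, 1000):
        seconds = jiffies_to_seconds(STALL_JIFFIES, hz)
        print(f"HZ={hz}: {seconds:.1f} s")
```

      Of these assumed rates, only 250 ticks per second gives a result near the 32 seconds quoted above (about 33 seconds).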

            People

              Assignee: stancheff Shaun Tancheff
              Reporter: stancheff Shaun Tancheff