Lustre / LU-14364

Switching QoS from tbf uid to fifo caused soft lockup


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: Lustre 2.12.4

    Description

      Switching the NRS policy back from "tbf uid" to fifo caused a soft lockup. A backtrace of all threads from the crash dump is attached (bt.all).

      From dmesg:

       [-- MARK -- Mon Jan 25 15:00:00 2021]
      [15694977.724675] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264]
      [15694977.724677] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [mdt00_080:11250]
      [15694977.724679] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [mdt00_109:11297]
      [15694977.724681] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [mdt00_102:11285]
      [15694977.724683] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [mdt00_034:11187]
      [15694977.724685] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mdt00_016:11166]
      [15694977.724687] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [mdt00_046:11201]
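
      To summarize which CPUs and service threads are stuck, the watchdog lines above can be parsed mechanically. The following is a minimal sketch, not part of the original report: the names parse_soft_lockups, Lockup and SAMPLE are hypothetical, and the regular expression assumes the exact message format shown in this excerpt.

```python
import re
from collections import namedtuple

# Hypothetical record type for one watchdog message.
Lockup = namedtuple("Lockup", "timestamp cpu seconds thread pid")

# Assumes the "NMI watchdog: BUG: soft lockup" format seen in the excerpt.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<thread>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(dmesg_text):
    """Return one Lockup record per soft-lockup line found in dmesg_text."""
    records = []
    for line in dmesg_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            records.append(
                Lockup(
                    float(m.group("ts")),
                    int(m.group("cpu")),
                    int(m.group("secs")),
                    m.group("thread"),
                    int(m.group("pid")),
                )
            )
    return records


# The seven watchdog lines quoted in this issue.
SAMPLE = """\
[15694977.724675] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264]
[15694977.724677] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [mdt00_080:11250]
[15694977.724679] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [mdt00_109:11297]
[15694977.724681] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [mdt00_102:11285]
[15694977.724683] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [mdt00_034:11187]
[15694977.724685] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mdt00_016:11166]
[15694977.724687] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [mdt00_046:11201]
"""

if __name__ == "__main__":
    # Print the stuck threads ordered by CPU number.
    for rec in sorted(parse_soft_lockups(SAMPLE), key=lambda r: r.cpu):
        print(f"CPU {rec.cpu}: {rec.thread} (pid {rec.pid}) stuck {rec.seconds}s")
```

      On this excerpt the sketch reports seven records, one per CPU from 1 to 7, all of them mdt00_* threads.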
      

      I was able to get a crash dump.
      All the hung threads are in the same state, spinning on a spinlock in nrs_resource_get_safe():

      crash> bt 11285
      PID: 11285  TASK: ffffa137e72d9070  CPU: 3   COMMAND: "mdt00_102"
       #0 [ffffa117fecc8e48] crash_nmi_callback at ffffffffb7658017
       #1 [ffffa117fecc8e58] nmi_handle at ffffffffb7d8593c
       #2 [ffffa117fecc8eb0] do_nmi at ffffffffb7d85b5d
       #3 [ffffa117fecc8ef0] end_repeat_nmi at ffffffffb7d84d9c
          [exception RIP: native_queued_spin_lock_slowpath+344]
          RIP: ffffffffb7717478  RSP: ffffa1375e5e3d38  RFLAGS: 00000202
          RAX: 0000000000000101  RBX: ffffa117fb5e1108  RCX: 0000000000190000
          RDX: 0000000000590101  RSI: 0000000000000101  RDI: ffffa117fb5e1108
          RBP: ffffa1375e5e3d38   R8: ffffa117fecdb880   R9: 0000000000000000
          R10: ffffffffc0d37e40  R11: ffffa117fb5e1108  R12: 0000000000000000
          R13: ffffa0f8eb8a3b80  R14: ffffa0f8eb8a3b80  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #4 [ffffa1375e5e3d38] native_queued_spin_lock_slowpath at ffffffffb7717478
       #5 [ffffa1375e5e3d40] queued_spin_lock_slowpath at ffffffffb7d7546a
       #6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350
       #7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]
       #8 [ffffa1375e5e3d98] ptlrpc_nrs_req_initialize at ffffffffc1039f13 [ptlrpc]
       #9 [ffffa1375e5e3db0] ptlrpc_server_handle_req_in at ffffffffc1004c21 [ptlrpc]
      #10 [ffffa1375e5e3df8] ptlrpc_main at ffffffffc1008d65 [ptlrpc]
      #11 [ffffa1375e5e3ec8] kthread at ffffffffb76c61f1
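
      One way to check that all hung threads share this stack is to group the per-thread backtraces by their sequence of function names. The sketch below is an illustration under assumptions, not the method used for this report: it assumes the text is crash "bt" output in the format shown above, the helper name group_by_stack is hypothetical, and it compares only the frames below the "exception stack" marker so that NMI handler frames do not split the groups. The sample omits the register dump for brevity.

```python
import re
from collections import defaultdict

# Header line of each backtrace, as printed by crash "bt".
PID_RE = re.compile(
    r'^\s*PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]*)"'
)
# Frame line: "#N [stack-address] function at text-address".
FRAME_RE = re.compile(r"^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def group_by_stack(bt_text):
    """Map each distinct call stack (tuple of function names) to the
    list of (pid, command) pairs that share it."""
    groups = defaultdict(list)
    current = None  # (pid, command) of the backtrace being read
    frames = []
    for line in bt_text.splitlines():
        header = PID_RE.match(line)
        if header:
            if current is not None:
                groups[tuple(frames)].append(current)
            current = (int(header.group(1)), header.group(2))
            frames = []
            continue
        if "exception stack> ---" in line:
            # Drop the NMI/exception frames collected so far and keep
            # only the task's own stack that follows the marker.
            frames = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            frames.append(frame.group(1))
    if current is not None:
        groups[tuple(frames)].append(current)
    return dict(groups)


# Backtrace of PID 11285 from this issue (register dump omitted).
SAMPLE_BT = """\
PID: 11285  TASK: ffffa137e72d9070  CPU: 3   COMMAND: "mdt00_102"
 #0 [ffffa117fecc8e48] crash_nmi_callback at ffffffffb7658017
 #1 [ffffa117fecc8e58] nmi_handle at ffffffffb7d8593c
 #2 [ffffa117fecc8eb0] do_nmi at ffffffffb7d85b5d
 #3 [ffffa117fecc8ef0] end_repeat_nmi at ffffffffb7d84d9c
    [exception RIP: native_queued_spin_lock_slowpath+344]
--- <NMI exception stack> ---
 #4 [ffffa1375e5e3d38] native_queued_spin_lock_slowpath at ffffffffb7717478
 #5 [ffffa1375e5e3d40] queued_spin_lock_slowpath at ffffffffb7d7546a
 #6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350
 #7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]
 #8 [ffffa1375e5e3d98] ptlrpc_nrs_req_initialize at ffffffffc1039f13 [ptlrpc]
 #9 [ffffa1375e5e3db0] ptlrpc_server_handle_req_in at ffffffffc1004c21 [ptlrpc]
#10 [ffffa1375e5e3df8] ptlrpc_main at ffffffffc1008d65 [ptlrpc]
#11 [ffffa1375e5e3ec8] kthread at ffffffffb76c61f1
"""

if __name__ == "__main__":
    for stack, tasks in group_by_stack(SAMPLE_BT).items():
        print(f"{len(tasks)} thread(s): {' <- '.join(stack)}")
```

      Run against a file containing every thread's backtrace, a single large group ending in _raw_spin_lock called from nrs_resource_get_safe would support the observation that the hung threads are all waiting on the same kind of lock.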
      

      Attachments

        1. bt.all
          872 kB
          Mahmoud Hanafi

            People

              Assignee: Li Xi (lixi_wc)
              Reporter: Mahmoud Hanafi (mhanafi)