  1. Lustre
  2. LU-14364

Switching QoS from tbf uid to fifo caused soft lockup

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.12.4
    • None
    • 3
    • 9223372036854775807

    Description

      Switching back from "tbf uid" to fifo caused a soft lockup. A backtrace of all threads from the crash dump is included.

      From dmesg:

       [-- MARK -- Mon Jan 25 15:00:00 2021]
      [15694977.724675] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264]
      [15694977.724677] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [mdt00_080:11250]
      [15694977.724679] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [mdt00_109:11297]
      [15694977.724681] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [mdt00_102:11285]
      [15694977.724683] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [mdt00_034:11187]
      [15694977.724685] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mdt00_016:11166]
      [15694977.724687] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [mdt00_046:11201]
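
When many threads are reported at once, it can help to pull the CPU, thread name, and PID out of the watchdog lines so that each PID can be passed to `bt` in crash. A minimal sketch, assuming the dmesg text is available as a string; the function name `parse_soft_lockups` is hypothetical and not part of any Lustre or crash tooling:

```python
import re

# Matches watchdog lines of the form quoted above, e.g.
# "... BUG: soft lockup - CPU#2 stuck for 23s! [mdt00_088:11264]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(dmesg_text):
    """Return (cpu, seconds, comm, pid) tuples, one per soft lockup line."""
    hits = []
    for line in dmesg_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append(
                (int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
            )
    return hits


if __name__ == "__main__":
    sample = (
        "[15694977.724675] NMI watchdog: BUG: soft lockup - "
        "CPU#2 stuck for 23s! [mdt00_088:11264]\n"
        "[15694977.724681] NMI watchdog: BUG: soft lockup - "
        "CPU#3 stuck for 23s! [mdt00_102:11285]\n"
    )
    # Emit one crash command per stuck thread, ordered by CPU number.
    for cpu, secs, comm, pid in sorted(parse_soft_lockups(sample)):
        print(f"bt {pid}    # CPU {cpu}, {comm}, stuck for {secs}s")
```

Lines that are not soft lockup reports, such as the `-- MARK --` line, are skipped.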
      

      I was able to get a crash dump.
      All the hung threads are in the same state:

      crash> bt 11285
      PID: 11285  TASK: ffffa137e72d9070  CPU: 3   COMMAND: "mdt00_102"
       #0 [ffffa117fecc8e48] crash_nmi_callback at ffffffffb7658017
       #1 [ffffa117fecc8e58] nmi_handle at ffffffffb7d8593c
       #2 [ffffa117fecc8eb0] do_nmi at ffffffffb7d85b5d
       #3 [ffffa117fecc8ef0] end_repeat_nmi at ffffffffb7d84d9c
          [exception RIP: native_queued_spin_lock_slowpath+344]
          RIP: ffffffffb7717478  RSP: ffffa1375e5e3d38  RFLAGS: 00000202
          RAX: 0000000000000101  RBX: ffffa117fb5e1108  RCX: 0000000000190000
          RDX: 0000000000590101  RSI: 0000000000000101  RDI: ffffa117fb5e1108
          RBP: ffffa1375e5e3d38   R8: ffffa117fecdb880   R9: 0000000000000000
          R10: ffffffffc0d37e40  R11: ffffa117fb5e1108  R12: 0000000000000000
          R13: ffffa0f8eb8a3b80  R14: ffffa0f8eb8a3b80  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #4 [ffffa1375e5e3d38] native_queued_spin_lock_slowpath at ffffffffb7717478
       #5 [ffffa1375e5e3d40] queued_spin_lock_slowpath at ffffffffb7d7546a
       #6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350
       #7 [ffffa1375e5e3d60] nrs_resource_get_safe at ffffffffc1039402 [ptlrpc]
       #8 [ffffa1375e5e3d98] ptlrpc_nrs_req_initialize at ffffffffc1039f13 [ptlrpc]
       #9 [ffffa1375e5e3db0] ptlrpc_server_handle_req_in at ffffffffc1004c21 [ptlrpc]
      #10 [ffffa1375e5e3df8] ptlrpc_main at ffffffffc1008d65 [ptlrpc]
      #11 [ffffa1375e5e3ec8] kthread at ffffffffb76c61f1
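
The observation that all hung threads are in the same state can be checked mechanically by grouping backtraces by their call chain. A rough sketch, assuming the output of several `bt` commands (for example from `foreach bt` in crash) has been saved as text; `group_by_stack` is a hypothetical helper, and the regular expressions are only fitted to the output format shown above:

```python
import re
from collections import defaultdict

# Header line of each backtrace, e.g.
# 'PID: 11285  TASK: ffffa137e72d9070  CPU: 3   COMMAND: "mdt00_102"'
PID_LINE = re.compile(r"^\s*PID:\s*(\d+)\s")
# Frame line, e.g.
# ' #6 [ffffa1375e5e3d50] _raw_spin_lock at ffffffffb7d83350'
FRAME_LINE = re.compile(r"^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def group_by_stack(bt_text):
    """Map each distinct call chain (tuple of function names) to its PIDs."""
    stacks = {}      # pid -> function names in frame order
    current = None   # pid of the backtrace currently being read
    for line in bt_text.splitlines():
        m = PID_LINE.match(line)
        if m:
            current = int(m.group(1))
            stacks[current] = []
            continue
        m = FRAME_LINE.match(line)
        if m and current is not None:
            stacks[current].append(m.group(1))
    groups = defaultdict(list)
    for pid, frames in stacks.items():
        groups[tuple(frames)].append(pid)
    return dict(groups)
```

If every stuck `mdt00_*` thread lands in a single group whose chain passes through `nrs_resource_get_safe` and `_raw_spin_lock`, that is consistent with all of them spinning on the same lock path. Register dump lines and the `--- <NMI exception stack> ---` marker are ignored because they do not match the frame pattern.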
      

          Activity

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14976 [ LU-14976 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14698 [ LU-14698 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-21 [ JFC-21 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-21 [ JFC-21 ]
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Li Xi [ lixi_wc ]
            mhanafi Mahmoud Hanafi made changes -
            Description Original: Switching back to "tbf uid" to fifo caused soft lockup. [...] All the threads are in the same state [...]
            New: Switching back from "tbf uid" to fifo caused soft lockup. [...] All the hung threads are in the same state [...]
            (The dmesg output and the backtrace of PID 11285 are identical in both versions and match the Description above.)
            mhanafi Mahmoud Hanafi made changes -
            Summary Original: Switching tbf uid to fifo caused soft lockup New: Switching QoS from tbf uid to fifo caused soft lockup
            mhanafi Mahmoud Hanafi made changes -
            Summary Original: Switching tbf uid to fifo causes soft lockup New: Switching tbf uid to fifo caused soft lockup
            mhanafi Mahmoud Hanafi made changes -
            Summary Original: Switching tbf uid to fio causes soft lockup New: Switching tbf uid to fifo causes soft lockup
            mhanafi Mahmoud Hanafi created issue -

            People

              Assignee: lixi_wc Li Xi
              Reporter: mhanafi Mahmoud Hanafi