Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.3.0
    • None
    • None
    • lustre 2.1.0-21chaos (github.com/chaos/lustre)
    • 2
    • 4614

    Description

      While I was investigating LU-1087, the ll_mgs_* threads suddenly went nuts and shot the load through the roof, to the point where the node is almost completely unresponsive, and a "top" that I had running is only able to redraw every minute or so.

      The console is mostly unresponsive, but it did respond to a sysrq-l, so I can see that they are all in a backtrace similar to this:

      Call Trace:
       [<ffffffffa06da060>] lock_res_and_lock+0x30/0x40 [ptlrpc]
       [<ffffffffa06deca3>] ldlm_lock_enqueue+0x453/0x7e0 [ptlrpc]
       [<ffffffffa06fd206>] ldlm_handle_enqueue0+0x406/0xd70 [ptlrpc]
       [<ffffffffa06fdbd6>] ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
       [<ffffffffa06fdbe0>] ? ldlm_server_completion_ast+0x0/0x590 [ptlrpc]
       [<ffffffffa06fe170>] ? ldlm_server_blocking_ast+0x0/0x740 [ptlrpc]
       [<ffffffffa0b55245>] mgs_handle+0x545/0x1350 [mgs]
       [<ffffffffa04933f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
       [<ffffffffa04933f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
       [<ffffffffa0723181>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
       [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
       [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Attachments

        Activity

          [LU-1088] mgs threads go nuts
          pjones Peter Jones added a comment -

          Landed for 2.3

          laisiyao Lai Siyao added a comment -

          Review is on http://review.whamcloud.com/#change,2250 .
          laisiyao Lai Siyao added a comment -

          It looks okay to use RCU for the resource lock dump, and ldlm_lock_dump() is inefficient compared to ldlm_lock_debug(), so I'll replace ldlm_lock_dump() with ldlm_lock_debug().
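
          A minimal sketch of that substitution in the granted-lock loop of
          ldlm_resource_dump(), for illustration only; the loop shape and the
          LDLM_DEBUG_LIMIT() wrapper around ldlm_lock_debug() are assumptions
          here, not the code from the patch under review:

              /* Sketch: dump each granted lock through the ldlm_lock_debug()
               * path (one formatted CDEBUG per lock) instead of the older
               * multi-line ldlm_lock_dump() helper. */
              list_for_each_entry(lock, &res->lr_granted, l_res_link) {
                      /* old: ldlm_lock_dump(level, lock, ++pos); */
                      LDLM_DEBUG_LIMIT(level, lock, "### granted lock");
              }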


          morrone Christopher Morrone (Inactive) added a comment -

          Yes, I had recently added D_INFO to get a look at what the spinning mdt thread was doing for LU-1087.

          We have a few thousand clients. Most of them should not be disconnected. There are on the order of a couple hundred that might reboot and reconnect at any time (BGP nodes).

          I think that we definitely need ldlm_resource_dump changed. I certainly accept that performance is reduced when I enable higher logging levels, but I don't expect a denial-of-service attack.

          green Oleg Drokin added a comment -

          From the stack trace:
          PID: 21012 TASK: ffff880812dd7500 CPU: 8 COMMAND: "ll_mgs_22"
          #0 [ffff88043f407e90] crash_nmi_callback at ffffffff81029746
          #1 [ffff88043f407ea0] notifier_call_chain at ffffffff814f3eb5
          #2 [ffff88043f407ee0] atomic_notifier_call_chain at ffffffff814f3f1a
          #3 [ffff88043f407ef0] notify_die at ffffffff81096a6e
          #4 [ffff88043f407f20] do_nmi at ffffffff814f1b33
          #5 [ffff88043f407f50] nmi at ffffffff814f1440
          [exception RIP: vsnprintf+779]
          RIP: ffffffff81276c2b RSP: ffff880812da5810 RFLAGS: 00000206
          RAX: ffff880812da5a40 RBX: ffff8805aa88a529 RCX: 0000000000000002
          RDX: ffffffffa076a2fc RSI: ffff8805aa88b000 RDI: ffff8805aa88a529
          RBP: ffff880812da58a0 R8: 0000000000000073 R9: 0000000000000000
          R10: 0000000000000001 R11: 00000000000000dc R12: ffffffffa075b626
          R13: ffffffffa075b624 R14: ffff880812da5980 R15: ffff8805aa88b000
          ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
          — <NMI exception stack> —
          #6 [ffff880812da5810] vsnprintf at ffffffff81276c2b
          #7 [ffff880812da5868] cfs_set_ptldebug_header at ffffffffa048919b [libcfs]
          #8 [ffff880812da58a8] libcfs_debug_vmsg2 at ffffffffa0492b13 [libcfs]
          #9 [ffff880812da5a08] libcfs_debug_vmsg1 at ffffffffa04933f1 [libcfs]
          #10 [ffff880812da5a68] ldlm_lock_dump at ffffffffa06da97e [ptlrpc]
          #11 [ffff880812da5aa8] ldlm_resource_dump at ffffffffa06e0ae3 [ptlrpc]
          #12 [ffff880812da5b08] ldlm_granted_list_add_lock at ffffffffa06dd280 [ptlrpc]
          #13 [ffff880812da5b48] ldlm_grant_lock at ffffffffa06dd83c [ptlrpc]
          #14 [ffff880812da5ba8] ldlm_process_plain_lock at ffffffffa06edc1a [ptlrpc]
          #15 [ffff880812da5c48] ldlm_lock_enqueue at ffffffffa06deb6d [ptlrpc]
          #16 [ffff880812da5cb8] ldlm_handle_enqueue0 at ffffffffa06fd206 [ptlrpc]
          #17 [ffff880812da5d28] ldlm_handle_enqueue at ffffffffa06fdbd6 [ptlrpc]
          #18 [ffff880812da5d68] mgs_handle at ffffffffa0b55245 [mgs]
          #19 [ffff880812da5df8] ptlrpc_main at ffffffffa0723181 [ptlrpc]
          #20 [ffff880812da5f48] kernel_thread at ffffffff8100c14a

          So I imagine you have a lot of clients (tens of thousands?), and once all of them somehow got disconnected, they all come rushing back to reconnect and get their config lock again (all on the same resource).
          Now, in the middle of that locking we have this ldlm_resource_dump() call that prints every lock on the resource, which I imagine takes quite a while, and meanwhile everybody else is spinning on the resource spinlock.

          That ldlm_resource_dump is at D_INFO, which is probably a bad idea and should be D_DLMTRACE too. What Lustre debug level are you running at?
          I wonder if we should just change ldlm_resource_dump() to only print the individual locks if the refcount is below some threshold, since otherwise it is just too much info, I suspect.
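
          For illustration, a rough sketch of that threshold idea; the
          LDLM_RES_DUMP_LIMIT constant, the exact field names, and the
          debug-mask check are assumptions from memory, not the eventual fix:

              /* Sketch only: stop dumping individual locks once a resource
               * holds too many of them, since this all runs while every other
               * enqueue spins in lock_res_and_lock() on the same resource. */
              #define LDLM_RES_DUMP_LIMIT 64          /* assumed threshold */

              void ldlm_resource_dump(int level, struct ldlm_resource *res)
              {
                      struct ldlm_lock *lock;
                      int pos = 0;

                      if (!((libcfs_debug | D_ERROR) & level))
                              return;

                      CDEBUG(level, "--- Resource: %p (rc: %d)\n",
                             res, atomic_read(&res->lr_refcount));

                      list_for_each_entry(lock, &res->lr_granted, l_res_link) {
                              if (++pos > LDLM_RES_DUMP_LIMIT) {
                                      CDEBUG(level, "only dumped first %d granted locks\n",
                                             LDLM_RES_DUMP_LIMIT);
                                      break;
                              }
                              ldlm_lock_dump(level, lock, pos);
                      }
              }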

          pjones Peter Jones added a comment -

          Added Alex as a watcher so he is aware of Chris's answer


          Attach "foreach bt" from momus mds.


          morrone Christopher Morrone (Inactive) added a comment -

          Alex, we have a crash dump. If you want backtraces from all tasks, we'll get you that. No need to wait for another instance.


          bzzz Alex Zhuravlev added a comment -

          If you hit this problem again, please grab all the traces and attach them here.


          morrone Christopher Morrone (Inactive) added a comment -

          MDS went through recovery and seems happy for the moment.


          People

            Assignee:
            laisiyao Lai Siyao
            Reporter:
            morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 6
