Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.3.0
    • None
    • None
    • lustre 2.1.0-21chaos (github.com/chaos/lustre)
    • 2
    • 4614

    Description

      While I was investigating LU-1087, the ll_mgs_* threads suddenly went nuts and shot the load through the roof, to the point where the node is almost completely unresponsive, and a "top" that I had running is only able to redraw every minute or so.

      The console is mostly unresponsive, but it did respond to a sysrq-l, so I can see that they are all in a backtrace similar to this:

      Call Trace:
       [<ffffffffa06da060>] lock_res_and_lock+0x30/0x40 [ptlrpc]
       [<ffffffffa06deca3>] ldlm_lock_enqueue+0x453/0x7e0 [ptlrpc]
       [<ffffffffa06fd206>] ldlm_handle_enqueue0+0x406/0xd70 [ptlrpc]
       [<ffffffffa06fdbd6>] ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
       [<ffffffffa06fdbe0>] ? ldlm_server_completion_ast+0x0/0x590 [ptlrpc]
       [<ffffffffa06fe170>] ? ldlm_server_blocking_ast+0x0/0x740 [ptlrpc]
       [<ffffffffa0b55245>] mgs_handle+0x545/0x1350 [mgs]
       [<ffffffffa04933f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
       [<ffffffffa04933f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
       [<ffffffffa0723181>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
       [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
       [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Attachments

        Activity

          [LU-1088] mgs threads go nuts
          pjones Peter Jones added a comment -

          Landed for 2.3

          laisiyao Lai Siyao added a comment -

          Review is on http://review.whamcloud.com/#change,2250 .
          laisiyao Lai Siyao added a comment -

          It looks okay to use RCU for the resource lock dump, and ldlm_lock_dump() is inefficient compared to ldlm_lock_debug(), so I'll replace ldlm_lock_dump() with ldlm_lock_debug().
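
          A minimal sketch of that substitution in the granted-lock loop of
          ldlm_resource_dump(), for illustration only; the loop shape and the
          LDLM_DEBUG_LIMIT() wrapper around ldlm_lock_debug() are assumptions
          here, not the code from the patch under review:

              /* Sketch: dump each granted lock through the ldlm_lock_debug()
               * path (one formatted CDEBUG per lock) instead of the older
               * multi-line ldlm_lock_dump() helper. */
              list_for_each_entry(lock, &res->lr_granted, l_res_link) {
                      /* old: ldlm_lock_dump(level, lock, ++pos); */
                      LDLM_DEBUG_LIMIT(level, lock, "### granted lock");
              }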


          morrone Christopher Morrone (Inactive) added a comment -

          Yes, I had recently added D_INFO to get a look at what the spinning mdt thread was doing for LU-1087.

          We have a few thousand clients. Most of them should not be disconnected. There are on the order of a couple hundred that might reboot and reconnect at any time (BGP nodes).

          I think that we definitely need ldlm_resource_dump changed. I certainly accept that performance is reduced when I enable higher logging levels, but I don't expect a denial-of-service attack.

          green Oleg Drokin added a comment -

          From the stack trace:
          PID: 21012 TASK: ffff880812dd7500 CPU: 8 COMMAND: "ll_mgs_22"
          #0 [ffff88043f407e90] crash_nmi_callback at ffffffff81029746
          #1 [ffff88043f407ea0] notifier_call_chain at ffffffff814f3eb5
          #2 [ffff88043f407ee0] atomic_notifier_call_chain at ffffffff814f3f1a
          #3 [ffff88043f407ef0] notify_die at ffffffff81096a6e
          #4 [ffff88043f407f20] do_nmi at ffffffff814f1b33
          #5 [ffff88043f407f50] nmi at ffffffff814f1440
          [exception RIP: vsnprintf+779]
          RIP: ffffffff81276c2b RSP: ffff880812da5810 RFLAGS: 00000206
          RAX: ffff880812da5a40 RBX: ffff8805aa88a529 RCX: 0000000000000002
          RDX: ffffffffa076a2fc RSI: ffff8805aa88b000 RDI: ffff8805aa88a529
          RBP: ffff880812da58a0 R8: 0000000000000073 R9: 0000000000000000
          R10: 0000000000000001 R11: 00000000000000dc R12: ffffffffa075b626
          R13: ffffffffa075b624 R14: ffff880812da5980 R15: ffff8805aa88b000
          ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
          — <NMI exception stack> —
          #6 [ffff880812da5810] vsnprintf at ffffffff81276c2b
          #7 [ffff880812da5868] cfs_set_ptldebug_header at ffffffffa048919b [libcfs]
          #8 [ffff880812da58a8] libcfs_debug_vmsg2 at ffffffffa0492b13 [libcfs]
          #9 [ffff880812da5a08] libcfs_debug_vmsg1 at ffffffffa04933f1 [libcfs]
          #10 [ffff880812da5a68] ldlm_lock_dump at ffffffffa06da97e [ptlrpc]
          #11 [ffff880812da5aa8] ldlm_resource_dump at ffffffffa06e0ae3 [ptlrpc]
          #12 [ffff880812da5b08] ldlm_granted_list_add_lock at ffffffffa06dd280 [ptlrpc]
          #13 [ffff880812da5b48] ldlm_grant_lock at ffffffffa06dd83c [ptlrpc]
          #14 [ffff880812da5ba8] ldlm_process_plain_lock at ffffffffa06edc1a [ptlrpc]
          #15 [ffff880812da5c48] ldlm_lock_enqueue at ffffffffa06deb6d [ptlrpc]
          #16 [ffff880812da5cb8] ldlm_handle_enqueue0 at ffffffffa06fd206 [ptlrpc]
          #17 [ffff880812da5d28] ldlm_handle_enqueue at ffffffffa06fdbd6 [ptlrpc]
          #18 [ffff880812da5d68] mgs_handle at ffffffffa0b55245 [mgs]
          #19 [ffff880812da5df8] ptlrpc_main at ffffffffa0723181 [ptlrpc]
          #20 [ffff880812da5f48] kernel_thread at ffffffff8100c14a

          So I imagine you have a lot of clients (tens of thousands?), and once all of them somehow got disconnected, they all come rushing back to reconnect and get their config lock again (all on the same resource).
          Now, in the middle of that locking we have this ldlm_resource_dump() call that prints every lock on the resource, which I imagine takes quite a while, and meanwhile everybody else is spinning on the resource spinlock.

          That ldlm_resource_dump is at D_INFO, which is probably a bad idea and should be D_DLMTRACE too. What Lustre debug level are you running at?
          I wonder if we should just change ldlm_resource_dump() to only print the individual locks if the refcount is below some threshold, since otherwise it is just too much info, I suspect.
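
          For illustration, a rough sketch of that threshold idea; the
          LDLM_RES_DUMP_LIMIT constant, the exact field names, and the
          debug-mask check are assumptions from memory, not the eventual fix:

              /* Sketch only: stop dumping individual locks once a resource
               * holds too many of them, since this all runs while every other
               * enqueue spins in lock_res_and_lock() on the same resource. */
              #define LDLM_RES_DUMP_LIMIT 64          /* assumed threshold */

              void ldlm_resource_dump(int level, struct ldlm_resource *res)
              {
                      struct ldlm_lock *lock;
                      int pos = 0;

                      if (!((libcfs_debug | D_ERROR) & level))
                              return;

                      CDEBUG(level, "--- Resource: %p (rc: %d)\n",
                             res, atomic_read(&res->lr_refcount));

                      list_for_each_entry(lock, &res->lr_granted, l_res_link) {
                              if (++pos > LDLM_RES_DUMP_LIMIT) {
                                      CDEBUG(level, "only dumped first %d granted locks\n",
                                             LDLM_RES_DUMP_LIMIT);
                                      break;
                              }
                              ldlm_lock_dump(level, lock, pos);
                      }
              }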

          pjones Peter Jones added a comment -

          Added Alex as a watcher so he is aware of Chris's answer


          Attach "foreach bt" from momus mds.


          morrone Christopher Morrone (Inactive) added a comment -

          Alex, we have a crash dump. If you want backtraces from all tasks, we'll get you that. No need to wait for another instance.


          bzzz Alex Zhuravlev added a comment -

          If you hit this problem again, please grab all the traces and attach them here.


          morrone Christopher Morrone (Inactive) added a comment -

          MDS went through recovery and seems happy for the moment.


          People

            Assignee:
            laisiyao Lai Siyao
            Reporter:
            morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 6
