[LU-4195] MDT Slow with ptlrpcd using 100% cpu. Created: 01/Nov/13  Updated: 15/Sep/16  Resolved: 29/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

server running 2.1.5-2nas
our source is at git://github.com/jlan/lustre-nas.git


Attachments: File bt.LU4195     File meminfo.LU4195    
Severity: 3
Epic: hang, server
Rank (Obsolete): 11369

 Description   

MDT response very slow. top showed ptlrpcd running at 100% CPU. Console showed errors. Was able to run a debug trace. See attached files.

Lustre: Service thread pid 7065 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7065, comm: mdt_93

Call Trace:
[<ffffffff810539b2>] ? enqueue_task_fair+0x142/0x490
[<ffffffff81096f5d>] ? sched_clock_cpu+0xcd/0x110
[<ffffffffa04f960e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa0a9f6de>] qos_statfs_update+0x7fe/0xa70 [lov]
[<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
[<ffffffffa0aa00fd>] alloc_qos+0x1ad/0x21a0 [lov]
[<ffffffffa0aa306c>] qos_prep_create+0x1ec/0x2380 [lov]
[<ffffffffa04f98be>] ? cfs_free+0xe/0x10 [libcfs]
[<ffffffffa0a9c63a>] lov_prep_create_set+0xea/0x390 [lov]
[<ffffffffa0a84b0c>] lov_create+0x1ac/0x1400 [lov]
[<ffffffffa0d8a50f>] ? obd_iocontrol+0xef/0x390 [mdd]
[<ffffffffa0d8f90e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
[<ffffffffa0da1871>] mdd_create+0xf81/0x1a90 [mdd]
[<ffffffffa0ea9df3>] ? osd_oi_lookup+0x83/0x110 [osd_ldiskfs]
[<ffffffffa0ea456c>] ? osd_object_init+0xdc/0x3e0 [osd_ldiskfs]
[<ffffffffa0eda3f7>] cml_create+0x97/0x250 [cmm]
[<ffffffffa0e165e1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
[<ffffffffa0e2c06e>] mdt_reint_open+0x1aae/0x28a0 [mdt]
[<ffffffffa078f724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
[<ffffffffa0da456e>] ? md_ucred+0x1e/0x60 [mdd]
[<ffffffffa0e14c81>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0e0bed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0e0c53d>] mdt_intent_reint+0x1ed/0x530 [mdt]
[<ffffffffa0e0ac09>] mdt_intent_policy+0x379/0x690 [mdt]
[<ffffffffa074b351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[<ffffffffa07711ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[<ffffffffa0e0b586>] mdt_enqueue+0x46/0x130 [mdt]
[<ffffffffa0e00772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0e01665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20



 Comments   
Comment by Mahmoud Hanafi [ 01/Nov/13 ]

Uploaded the following files to the FTP site.

LU4195.lustre-log.dump.selected.tgz
LU4195.ptlrpcd.debug2.out.gz

I also have crash dump that can be uploaded if needed.

Comment by Peter Jones [ 02/Nov/13 ]

Lai

What do you advise here?

Peter

Comment by Niu Yawei (Inactive) [ 07/Nov/13 ]

Mahmoud, do you know what kind of application/operation caused the problem? I see quite a lot of transaction commits in the log; it looks like the MDS was under heavy load.

Is it possible to get a full stack trace (especially the stack trace for ptlrpcd) via sysrq when this happens again? Thanks.
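
For reference, a minimal way to capture all-task stack traces via sysrq (a sketch; it assumes the sysrq functions are enabled on the MDS and that the console or dmesg output is being captured):

# enable all sysrq functions if they are not already on
echo 1 > /proc/sys/kernel/sysrq
# dump the stack of every task to the kernel log
echo t > /proc/sysrq-trigger
# save the kernel ring buffer for upload
dmesg > /tmp/all-task-stacks.txt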

Comment by Mahmoud Hanafi [ 07/Nov/13 ]

Hope this helps

crash> bt 5560
PID: 5560 TASK: ffff881fb5e87540 CPU: 8 COMMAND: "ptlrpcd"
#0 [ffff8820b0c07e90] crash_nmi_callback at ffffffff81029956
#1 [ffff8820b0c07ea0] notifier_call_chain at ffffffff815222f5
#2 [ffff8820b0c07ee0] atomic_notifier_call_chain at ffffffff8152235a
#3 [ffff8820b0c07ef0] notify_die at ffffffff81095f1e
#4 [ffff8820b0c07f20] do_nmi at ffffffff8151ff2f
#5 [ffff8820b0c07f50] nmi at ffffffff8151f7e0
[exception RIP: _spin_lock+33]
RIP: ffffffff8151f051 RSP: ffff881fac2b3bf0 RFLAGS: 00000297
RAX: 0000000000000d28 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000d27 RSI: 0000000000000007 RDI: ffffffffa05ada40
RBP: ffff881fac2b3bf0 R8: 0000000000000246 R9: 0000000000000001
R10: ffff883343e44cc0 R11: 0000000000000000 R12: 000500000a971b57
R13: 0000000000003039 R14: 0000000000000000 R15: 000527015f9078c7
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff881fac2b3bf0] _spin_lock at ffffffff8151f051
#7 [ffff881fac2b3bf8] LNetMEAttach at ffffffffa057f99f [lnet]
#8 [ffff881fac2b3c68] ptl_send_rpc at ffffffffa078b532 [ptlrpc]
#9 [ffff881fac2b3d28] ptlrpc_send_new_req at ffffffffa0781993 [ptlrpc]
#10 [ffff881fac2b3d98] ptlrpc_check_set at ffffffffa07840a8 [ptlrpc]
#11 [ffff881fac2b3e38] ptlrpcd_check at ffffffffa07b4ed0 [ptlrpc]
#12 [ffff881fac2b3e68] ptlrpcd at ffffffffa07b516e [ptlrpc]
#13 [ffff881fac2b3f48] kernel_thread at ffffffff8100c0ca

crash> bt 5560 -l
PID: 5560 TASK: ffff881fb5e87540 CPU: 8 COMMAND: "ptlrpcd"
#0 [ffff8820b0c07e90] crash_nmi_callback at ffffffff81029956
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/include/asm/paravirt.h: 115
#1 [ffff8820b0c07ea0] notifier_call_chain at ffffffff815222f5
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/notifier.c: 95
#2 [ffff8820b0c07ee0] atomic_notifier_call_chain at ffffffff8152235a
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/notifier.c: 192
#3 [ffff8820b0c07ef0] notify_die at ffffffff81095f1e
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/notifier.c: 573
#4 [ffff8820b0c07f20] do_nmi at ffffffff8151ff2f
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/kernel/traps.c: 513
#5 [ffff8820b0c07f50] nmi at ffffffff8151f7e0
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86_64/kernel/entry.S
[exception RIP: _spin_lock+33]
RIP: ffffffff8151f051 RSP: ffff881fac2b3bf0 RFLAGS: 00000297
RAX: 0000000000000d28 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000d27 RSI: 0000000000000007 RDI: ffffffffa05ada40
RBP: ffff881fac2b3bf0 R8: 0000000000000246 R9: 0000000000000001
R10: ffff883343e44cc0 R11: 0000000000000000 R12: 000500000a971b57
R13: 0000000000003039 R14: 0000000000000000 R15: 000527015f9078c7
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff881fac2b3bf0] _spin_lock at ffffffff8151f051
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/include/asm/spinlock.h: 127
#7 [ffff881fac2b3bf8] LNetMEAttach at ffffffffa057f99f [lnet]
/usr/src/redhat/BUILD/lustre-2.1.5/lnet/lnet/lib-me.c: 158
#8 [ffff881fac2b3c68] ptl_send_rpc at ffffffffa078b532 [ptlrpc]
/usr/src/redhat/BUILD/lustre-2.1.5/lustre/ptlrpc/niobuf.c: 614
#9 [ffff881fac2b3d28] ptlrpc_send_new_req at ffffffffa0781993 [ptlrpc]
/usr/src/redhat/BUILD/lustre-2.1.5/lustre/ptlrpc/client.c: 1437
#10 [ffff881fac2b3d98] ptlrpc_check_set at ffffffffa07840a8 [ptlrpc]
/usr/src/redhat/BUILD/lustre-2.1.5/lustre/ptlrpc/client.c: 1468
#11 [ffff881fac2b3e38] ptlrpcd_check at ffffffffa07b4ed0 [ptlrpc]
/usr/src/redhat/BUILD/lustre-2.1.5/lustre/ptlrpc/ptlrpcd.c: 240
#12 [ffff881fac2b3e68] ptlrpcd at ffffffffa07b516e [ptlrpc]
/usr/src/redhat/BUILD/lustre-2.1.5/lustre/ptlrpc/ptlrpcd.c: 326
#13 [ffff881fac2b3f48] kernel_thread at ffffffff8100c0ca
/usr/src/debug/kernel-lustre215-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/kernel/entry_64.S: 1213

Comment by Niu Yawei (Inactive) [ 08/Nov/13 ]

Could you provide the full stack trace (for all tasks on all CPUs) as well? What about the memory usage? Can the system recover from this situation?

Comment by Lai Siyao [ 08/Nov/13 ]

In debug logs there are lots of messages like this:

00000400:02000400:2.0:1383201992.510338:0:7620:0:(lib-move.c:1454:lnet_send()) No route to 12345-10.153.1.199@o2ib233 via 10.151.27.60@o2ib (all routers down)

It looks like the routers are down, and all connections are down too; clients kept reconnecting to the MDS (and failing), so the ptlrpcd threads are always 100% busy.

Comment by Mahmoud Hanafi [ 08/Nov/13 ]

We were not able to recover from this and had to dump the system.
See the attached files for memory and stack trace info.

We have 7 other filesystems that use the same routers and clients, and they were not experiencing this issue.

There is "some" evidence that this may have been triggered by running a "lfs setquota" command.

Comment by Niu Yawei (Inactive) [ 11/Nov/13 ]

I agree with Lai: the log shows that there is no route from o2ib to o2ib233, and the stack trace shows all ptlrpcd threads were busy acquiring LNET_LOCK (which I think is caused by the router problem).

Mahmoud, I didn't see anything related to quota in the log. What evidence indicates that it was triggered by the 'lfs setquota' command?

Comment by Jodi Levi (Inactive) [ 13/Nov/13 ]

Amir,
Could you take a look at this one?
Thank you!

Comment by Amir Shehata (Inactive) [ 14/Nov/13 ]

Would it be possible to grab the route configuration for one of the nodes that have the problem? Please also highlight the problematic routes.

As a basic sanity check, please make sure that the routers are actually configured as routers, i.e. forwarding="enabled". If not, the node will drop all messages not destined to itself.
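
A minimal sketch of that check (the module option syntax and /proc paths below are from the 2.1.x era; adjust for your setup):

# on each router, forwarding must be enabled via the lnet module options,
# e.g. in /etc/modprobe.d/lustre.conf:
#   options lnet forwarding="enabled"
# then verify the routing table and router state as LNet currently sees them:
cat /proc/sys/lnet/routes
cat /proc/sys/lnet/routers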

This error message is hit whenever LNet tries to send a message to a final destination on a net that has no usable route from the net the current node is on, so an appropriate route cannot be chosen.

Furthermore, the routes can exist but be down because the specified NID is not reachable. Is it possible to try an 'lctl ping <nid>' from the node that's having the problem? That should return the list of NIDs of the target router and their status (up/down). If one of these NIDs is down and avoid_asym_router_failure is 1 (which it is by default), then the entire router is considered down and we'll hit the above error when sending messages.
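
For example, from the affected MDS (a sketch; <router-nid> is a placeholder for one of the o2ib routers, and the parameter path assumes avoid_asym_router_failure is exposed as a regular lnet module parameter on this build):

# ping a router by NID; this lists the router's NIDs and fails if it is unreachable
lctl ping <router-nid>
# check the asymmetric router failure setting (default is 1)
cat /sys/module/lnet/parameters/avoid_asym_router_failure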

NOTE: if you have a router with multiple NIDs, but one of the NIDs is "unused" (i.e. it sends/receives no messages), that NID will be considered down, which leads to the scenario described above.

Comment by John Fuchs-Chesney (Inactive) [ 12/Mar/14 ]

Hello Mahmoud,
Just wanted to check in to see whether this is still a 'live' issue for you.
If the problem has been dealt with, can we mark this issue as 'resolved'?
Thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 29/Jul/14 ]

Hello again Mahmoud,
Please advise whether you want us to keep this ticket open.

If we don't hear back, we'll mark it as resolved with no fix; it can be re-opened on request.

Thanks,
~ jfc.

Comment by Mahmoud Hanafi [ 29/Oct/14 ]

please close.

Comment by Peter Jones [ 29/Oct/14 ]

ok thanks Mahmoud
