[LU-4195] MDT Slow with ptlrpcd using 100% cpu. Created: 01/Nov/13 Updated: 15/Sep/16 Resolved: 29/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server running 2.1.5-2nas |
||
| Attachments: |
|
| Severity: | 3 |
| Epic: | hang, server |
| Rank (Obsolete): | 11369 |
| Description |
|
MDT response very slow. Top showed ptlrpcd running at 100% CPU. Console showed errors. Was able to run a debug trace; see attached files.

Lustre: Service thread pid 7065 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Call Trace: |
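|
For reference, a debug trace like the ones attached is typically captured with lctl's debug_kernel (dk) command. A minimal sketch, assuming the Lustre modules are loaded; the debug mask value and output path are illustrative:

    # enable all Lustre debug subsystems (-1 sets every bit in the mask)
    lctl set_param debug=-1
    # dump the in-kernel debug buffer to a file for offline inspection
    lctl dk /tmp/lustre-debug.dump
|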
| Comments |
| Comment by Mahmoud Hanafi [ 01/Nov/13 ] |
|
Uploaded the following to the FTP site: LU4195.lustre-log.dump.selected.tgz. I also have a crash dump that can be uploaded if needed. |
| Comment by Peter Jones [ 02/Nov/13 ] |
|
Lai, what do you advise here?

Peter |
| Comment by Niu Yawei (Inactive) [ 07/Nov/13 ] |
|
Mahmoud, do you know what kind of application/operation caused the problem? I see quite a lot of transaction commits in the log, so it looks like the MDS was under heavy load. Would it be possible to get a full stack trace (especially for the ptlrpcd threads) via sysrq when this happens again? Thanks. |
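|
A full task-state dump via sysrq, as requested above, can usually be captured as follows. A sketch; it assumes console access and a kernel log buffer large enough to hold the dump:

    # make sure the magic-sysrq interface is enabled
    echo 1 > /proc/sys/kernel/sysrq
    # 't' writes the stack of every task to the kernel log
    echo t > /proc/sysrq-trigger
    # collect the result
    dmesg > /tmp/sysrq-task-dump.txt
|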
| Comment by Mahmoud Hanafi [ 07/Nov/13 ] |
|
Hope this helps.

crash> bt 5560
crash> bt 5560 -l |
| Comment by Niu Yawei (Inactive) [ 08/Nov/13 ] |
|
Could you provide the full stack trace (for all tasks on all CPUs) as well? What about the memory usage? Can this situation be recovered from? |
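|
Since a crash dump is already available, both pieces of information can be extracted with crash(8). A sketch; the output filename is illustrative:

    crash> foreach bt > /tmp/bt-all.txt    # backtrace of every task on the system
    crash> kmem -i                         # summary of overall memory usage
|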
| Comment by Lai Siyao [ 08/Nov/13 ] |
|
In the debug logs there are lots of messages like this:

00000400:02000400:2.0:1383201992.510338:0:7620:0:(lib-move.c:1454:lnet_send()) No route to 12345-10.153.1.199@o2ib233 via 10.151.27.60@o2ib (all routers down)

It looks like the routers are down, so all connections are down too, and the clients kept reconnecting to the MDS (and failing), which keeps the ptlrpcd threads 100% busy. |
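|
On a 2.1.x node, the routing table and gateway status that this message depends on can be inspected directly. A minimal sketch; <router-nid> is a placeholder for one of the gateways listed in the routes table:

    # list configured routes; each entry shows its gateway and up/down state
    cat /proc/sys/lnet/routes
    # check reachability of a specific gateway
    lctl ping <router-nid>
|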
| Comment by Mahmoud Hanafi [ 08/Nov/13 ] |
|
We were not able to recover from this and had to dump the system. We have 7 other filesystems that use the same routers and clients, and they were not experiencing this issue. There is "some" evidence that this may have been triggered by running an "lfs setquota" command. |
| Comment by Niu Yawei (Inactive) [ 11/Nov/13 ] |
|
I agree with Lai: the log shows there is no route from o2ib to o2ib233, and the stack trace shows all ptlrpcd threads busy acquiring LNET_LOCK (which I think is a consequence of the router problem). Mahmoud, I didn't see anything related to quota in the log; what evidence indicates this was triggered by an 'lfs setquota' command? |
| Comment by Jodi Levi (Inactive) [ 13/Nov/13 ] |
|
Amir, |
| Comment by Amir Shehata (Inactive) [ 14/Nov/13 ] |
|
Would it be possible to grab the route configuration from one of the nodes that has the problem? Also, please highlight the problematic routes.

As a basic sanity check, please make sure the routers are actually configured as routers, i.e. forwarding="enabled". If not, the node will drop all messages not destined to itself.

This error message is hit whenever LNET tries to send a message to a final destination on a net that has no route from the net the current node is on, so no appropriate route can be chosen. The routes can also exist but be down because the specified NID is not reachable.

Is it possible to run lctl ping <nid> from the node that's having the problem? That should return a list of the target router's NIDs and their status (up/down). If one of those NIDs is down and avoid_asym_router_failure is 1 (which it is by default), the entire router is considered down, and sending messages will hit the above error.

NOTE: if you have a router with multiple NIDs but one of them is "unused" (i.e. it sends/receives no messages), that NID will be considered down, which leads to the scenario described above. |
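|
The checks described above might look like the following on a 2.1-era node. A sketch; the /sys paths can vary by kernel and Lustre version, and <router-nid> is a placeholder:

    # on the router itself: verify it is configured to forward LNET messages
    cat /sys/module/lnet/parameters/forwarding            # should print "enabled"
    # on the affected node: check the asymmetric-router setting (default 1)
    cat /sys/module/lnet/parameters/avoid_asym_router_failure
    # ping the router; returns its NIDs with their up/down status
    lctl ping <router-nid>
|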
| Comment by John Fuchs-Chesney (Inactive) [ 12/Mar/14 ] |
|
Hello Mahmoud, |
| Comment by John Fuchs-Chesney (Inactive) [ 29/Jul/14 ] |
|
Hello again Mahmoud,

If we don't hear back we'll mark this as resolved with no fix; we can re-open it if requested.

Thanks, |
| Comment by Mahmoud Hanafi [ 29/Oct/14 ] |
|
Please close. |
| Comment by Peter Jones [ 29/Oct/14 ] |
|
ok thanks Mahmoud |