[LU-4006] LNET Messages staying in Queue Created: 25/Sep/13 Updated: 26/Jan/24 Resolved: 06/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9, Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jason Hill (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 | ||
| Environment: | RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client |
| Severity: | 3 |
| Rank (Obsolete): | 10722 |
| Description |
|
We'll need some more information on what data to gather server side, but when the Titan compute platform is shut down, the count of queued messages reported in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients using SLES11 SP2. We have a home-grown monitor for the backlog of messages for a particular server (and LNET RTR, but at the time of reporting the LNET RTRs are all down from a hardware perspective) – we can attach that script if it would be useful. Please let us know which data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday. While there are a large number of LNET messages queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or through other routers), which is why I marked this as Minor. |
| Comments |
| Comment by James Nunez (Inactive) [ 25/Sep/13 ] |
|
Jason, Also, is this the latest 2.4.0 that you are using, i.e. 2.4.1-RC2? Thanks, |
| Comment by Jason Hill (Inactive) [ 25/Sep/13 ] |
|
James, The 2.4.X variant was actually a 2.4.0 RC or pre-RC. I will inquire with my colleagues who were running the test; I don't have direct knowledge on that side. |
| Comment by Liang Zhen (Inactive) [ 25/Sep/13 ] |
|
/proc/sys/lnet/stats does not show queued messages; it is a set of counters for bytes/messages already sent and received, and it will remain constant if all peers are down and no I/O goes through LNet. |
| Comment by Blake Caldwell [ 25/Sep/13 ] |
|
Here is what we observed with 2.4.0RC. All Titan osc clients were evicted at this point, but we noticed messages on the client side coming from the servers:

[2013-09-03 15:58:54][c13-0c1s0n0]LNet: 15622:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.226.46@o2ib235 portal 4 match 1443971358080092 offset 192 length 192: 2

On the server side, at 04:36:07 PM, the number of refs and queued messages was decreasing; some peers had more than 8k messages queued. There were 11 servers with messages still in their queue at 4:40 PM. This entry would appear every 100s in the lctl dk debug output:

00000800:00000100:3.0:1378240617.211971:0:3150:0:(o2iblnd_cb.c:2844:kiblnd_cm_callback()) 10.36.230.181@o2ib234: ADDR ERROR -110

At 04:48:54 PM connectivity was restored to this LNet router (up vs. down as above) and all LNet messages had drained. |
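A minimal sketch of one way to pull those per-peer queue depths from /proc/sys/lnet/peers (the header-driven column lookup and the 1000-message threshold are assumptions, since the column layout varies between Lustre versions):

    # list peers with a deep send queue, locating the "queue" column from the header line
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "queue") q = i; next }
         q && $q > 1000 { print $1, $q }' /proc/sys/lnet/peers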
| Comment by Jodi Levi (Inactive) [ 25/Sep/13 ] |
|
Liang, |
| Comment by Jason Hill (Inactive) [ 25/Sep/13 ] |
|
Liang: on pages 32-36 of the 1.8 Lustre Operations Manual, routerstat references /proc/sys/lnet/stats – the first field is msgs_alloc, which as we understood it is the count of messages currently being processed by a host. We are looking at a more "global" scale – knowing what is happening on a per-peer basis is interesting but harder to alert and monitor on. Are we incorrect about the first field in /proc/sys/lnet/stats? |
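As a quick illustration, treating the first whitespace-separated field of the stats file as msgs_alloc (which is exactly the interpretation being asked about here):

    # print the first field of /proc/sys/lnet/stats, read here as msgs_alloc
    awk '{ print "msgs_alloc:", $1 }' /proc/sys/lnet/stats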
| Comment by David Dillow [ 25/Sep/13 ] |
|
It is worth noting that the machine was taken down just after 8am on Sep 3rd, and we still had several thousand messages queued to the machine that afternoon; even some of the evictions were happening way late. |
| Comment by David Dillow [ 25/Sep/13 ] |
|
Also, the messages that had been queued up were being sent to the routers after they came back, except they were addressed to LNets that no longer existed on Titan – this is how we noticed the problem during the Atlas test on the 3rd. Titan had been down long enough (6+ hours) that it should have been completely evicted long before then, and none of this traffic should have made it into the Gemini side. |
| Comment by Liang Zhen (Inactive) [ 26/Sep/13 ] |
|
Yes, you are right – the allocated messages value in lnet/stats is the number of messages globally (all together). However, LNet itself will not automatically send a huge number of messages; the router checker will send some ping messages to routers, but at a low frequency, e.g. every 60 seconds by default. So I think this traffic is not really from LNet itself but from upper-layer services. BTW, you can see the queued messages on each NI (or LNet network) by checking lnet/nis. |
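A minimal sketch of that per-NI check (reading negative tx credits as queued messages is an assumption based on how LNet credits generally behave, and the exact column layout varies between versions):

    # per-NI view: when the tx credits column goes negative, its absolute value
    # roughly corresponds to the number of messages queued on that NI
    cat /proc/sys/lnet/nis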
| Comment by Jason Hill (Inactive) [ 01/Oct/13 ] |
|
Gathering data now from a handful of servers. Our monitoring alerts when the allocated message count exceeds 30,000 for 2 samples, each sample being taken every 30 seconds. Data upload forthcoming. I used the following to generate the data:

echo +rpctrace > /proc/sys/lnet/debug; lctl dk > /dev/null; sleep 60; lctl dk > /tmp/rpctrace.$HOSTNAME.20131010; echo -rpctrace > /proc/sys/lnet/debug; lctl dk > /dev/null |
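A rough sketch of the kind of threshold alert described above, assuming the first field of /proc/sys/lnet/stats is the allocated message count (per the earlier discussion); the 30,000 threshold and 30-second sample interval are the values quoted:

    #!/bin/sh
    # alert when the allocated message count stays above the threshold for
    # two consecutive 30-second samples
    THRESHOLD=30000
    OVER=0
    while true; do
        ALLOC=$(awk '{ print $1 }' /proc/sys/lnet/stats)
        if [ "$ALLOC" -gt "$THRESHOLD" ]; then
            OVER=$((OVER + 1))
        else
            OVER=0
        fi
        if [ "$OVER" -ge 2 ]; then
            echo "$(hostname): msgs_alloc=$ALLOC above $THRESHOLD for $OVER samples"
        fi
        sleep 30
    done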
| Comment by Jason Hill (Inactive) [ 01/Oct/13 ] |
|
output of lctl dk with +rpctrace enabled in the debug flag. |
| Comment by Isaac Huang (Inactive) [ 04/Oct/13 ] |
|
To summarize, my understanding is that after a client cluster (both the clients and the routers that connect them to the servers) disappears all at once, queued messages on the servers drain out very, very slowly. Please correct me if this is wrong. I believe there are two things that caused the problem. As for how to fix them: I think we should fix 2 first. If messages still stay around for a long time, then the upper layers aren't unlinking them fast enough, e.g. very slow evictions, in which case we should also look at the upper layers for potential problems. In contrast, if we fix 1 first, then such potential problems at the upper layers won't have a chance to show up at all. However, neither fix would be trivial. |
| Comment by Liang Zhen (Inactive) [ 05/Oct/13 ] |
|
Isaac, thanks for looking into it. I agree that it's because of the very slow draining of queued messages; there can be many clients and far fewer routers, so each router ends up with thousands of queued messages. |
| Comment by Isaac Huang (Inactive) [ 10/Oct/13 ] |
|
Liang, I haven't looked at the locking changes, but I have no plan to queue messages on the MD. How often does Lustre/PTLRPC send multiple messages from the same MD? A quick code scan under lustre/ptlrpc/ showed zero such uses. The FF server collectives do this, but only after gossip confirms the target node to be alive, which minimizes the chance of sending a message to a dead node in the first place. My current plan is to simply save a pointer to the latest message on the MD: NULL the pointer when the message goes to the LND, replace it when another message is sent over the same MD, and complete the message, if any, on LNetMDUnlink. If Lustre in the future does send multiple messages from the same MD, then we'd still have a chance to abort the most recent message, which is the one most likely still on the queue. This should cover most cases. Also, I don't see a need to add a new MD flag; this should always be the unlink behavior. |
| Comment by Liang Zhen (Inactive) [ 11/Oct/13 ] |
|
One concern of mine is that if we don't add any new flag (or an explicit API for abort), then we are changing the semantics of the current API. At present LNetMDUnlink will not implicitly abort any in-flight PUT/GET (although Lustre doesn't rely on this), so would it be reasonable to add a new flag to allow Unlink to abort in-flight messages? |
| Comment by Isaac Huang (Inactive) [ 16/Oct/13 ] |
|
If I haven't missed something, I don't see any semantic change. Currently callers would eventually get a SENT event with the unlinked flag set and likely an error status code. With the proposed change, it's still the same thing, i.e. a SENT event with unlinked=1 and status=-ECANCELED; the only difference is that now the event could come much sooner. Callers are supposed to handle that event anyway, sooner or later. I don't think this is a semantic change – the API semantics never say (and can't say) how soon such a piggybacked unlink event would happen. If for some reason Lustre can't handle an instantaneous piggybacked unlink, then it's a bug in Lustre that should be fixed. |
| Comment by Isaac Huang (Inactive) [ 22/Oct/13 ] |
|
Patch posted: http://review.whamcloud.com/#/c/8041/ Though the idea was straightforward, the original approach I suggested turned out to be too messy to implement: the MD needs to keep a pointer to a message that is not reference counted, it's very tricky to remove a message from its peer/NI queue, and so on. Instead I chose to abort messages in lnet_post_send_locked(). The drawback is that it'd take an additional LND timeout for messages to be aborted, but the advantages are much, much simpler code, and that all queued messages on an unlinked MD will be aborted rather than just one. I think this is a good trade-off between instantaneous unlink and code complexity. |
| Comment by Liang Zhen (Inactive) [ 22/Oct/13 ] |
|
Yes, I think this way is better. The other reason I proposed using a new flag on the MD is that even if we only track the last message associated with the MD, we still need some complex locking, because different messages are protected by different locks; this way completely avoids the complex lock operations. |
| Comment by Jason Hill (Inactive) [ 22/Oct/13 ] |
|
Can I get an idea of the time to solution? Are we looking at something that will be tested and ready to install in the next month or something longer? Thx. |
| Comment by James A Simmons [ 22/Oct/13 ] |
|
There is a patch ready for testing |
| Comment by Isaac Huang (Inactive) [ 22/Oct/13 ] |
|
Liang, yes, it's a change in the case you described. But I'd rather regard that as an abuse of an obscure part of the API semantics than as a valid use case we'd support: unlinking an MD while there are active messages you don't want to give up. I've pushed an update that adds a comment above LNetMDUnlink: "As a result, active messages associated with the MD may get aborted.". |
| Comment by Isaac Huang (Inactive) [ 30/Oct/13 ] |
|
Patch updated: we can't use LNET_MD_FLAG_ZOMBIE, which can be set as a result of MD auto-unlink, where it's not an abort and active messages should not be canceled as a result (e.g. a REPLY message from an MD exhausted by the corresponding GET). The only way is to add a new flag, LNET_MD_FLAG_ABORTED, set by LNetM[DE]Unlink. |
| Comment by Isaac Huang (Inactive) [ 01/Nov/13 ] |
|
Hi, we've tested the patch on our internal test systems, and there will be some additional testing at Hyperion. Meanwhile, if you want to test it at ORNL, I'd suggest to:
|
| Comment by James A Simmons [ 09/Dec/13 ] |
|
Will this patch be cherry picked to b2_4 and b2_5? |
| Comment by James Nunez (Inactive) [ 09/Dec/13 ] |
|
Patch http://review.whamcloud.com/#/c/8041/ landed to master. I'll look into plans for landing in b2_4 and b2_5. |
| Comment by James Nunez (Inactive) [ 09/Dec/13 ] |
|
James, Has anyone at ORNL tested the patch and, if so, did it fix or reduce the number of messages in the queue? Feedback from you will help us determine whether the patch is cherry-picked/back-ported to b2_4 and/or b2_5. Thanks, |
| Comment by James A Simmons [ 11/Dec/13 ] |
|
Is it a server-side-only patch? If so, I can arrange to have it tested. |
| Comment by James Nunez (Inactive) [ 16/Dec/13 ] |
|
James, The patch can be applied to the servers only and you should see a benefit from it, but the patch is for both clients and servers. This configuration is fine for testing, but we recommend patching both clients and servers in production. Thanks, |
| Comment by James Nunez (Inactive) [ 31/Dec/13 ] |
|
Jason or James, Is this still an issue or should we close this ticket? Thanks, |
| Comment by James A Simmons [ 02/Jan/14 ] |
|
I plan to test this patch at scale with 2.4 clients. |
| Comment by James A Simmons [ 09/Jan/14 ] |
|
Had a discussion at work about testing this patch. It was decided not to test with this patch in the near future since it is a small case. You can close this ticket. If we run into it in the future, we can reopen this ticket. |
| Comment by James Nunez (Inactive) [ 09/Jan/14 ] |
|
James, James |
| Comment by James A Simmons [ 22/Jan/14 ] |
|
On January 21 we performed a test shot with our Lustre 2.4 file system with 2.4 clients; before that we were on 1.8 clients. During startup we ran into this issue. We have another test February 4th and will perform the upgrade on the 28th. I plan to run with this patch server side to see if it resolves the issues we are seeing. |