[LU-4006] LNET Messages staying in Queue Created: 25/Sep/13  Updated: 26/Jan/24  Resolved: 06/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9, Lustre 2.4.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Minor
Reporter: Jason Hill (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4
Environment:

RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client


Attachments: Text File LU-4006.tgz     File lnet_stats.sh    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 10722

 Description   

We'll need some guidance on what data to gather server side, but when the Titan compute platform is shut down, the queued-message value in /proc/sys/lnet/stats on the server remains constant until the platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients using SLES11 SP2.

We have a home-grown monitor for the backlog of messages to a particular server (and LNET router, though at the time of reporting the LNET routers are all down from a hardware perspective). We can attach that script if it would be useful.

Please advise what data-gathering techniques we should employ to make problem diagnosis more informative. We will likely have a chance at data gathering every Tuesday.

While there are a large number of LNET messages queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or through other routers), which is why I marked this as Minor.
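
For illustration only (this is not the attached lnet_stats.sh, just a minimal sketch of the kind of sampling described above, assuming the first field of /proc/sys/lnet/stats is the allocated-message count discussed later in this ticket):

# Record the first field of /proc/sys/lnet/stats every 30 seconds so a
# backlog that stays constant is easy to spot in the log.
while true; do
    echo "$(date '+%F %T') msgs_alloc=$(awk '{print $1}' /proc/sys/lnet/stats)"
    sleep 30
done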



 Comments   
Comment by James Nunez (Inactive) [ 25/Sep/13 ]

Jason,
Please attach your script. It might be helpful to see how you are monitoring the messages.

Also, is this the latest 2.4 release that you are using, i.e. 2.4.1-RC2?

Thanks,
James

Comment by Jason Hill (Inactive) [ 25/Sep/13 ]

James,

The 2.4.x variant was actually a 2.4.0 RC or pre-RC. I will inquire with my colleagues who were running the test; I don't have direct knowledge on that side.

Comment by Liang Zhen (Inactive) [ 25/Sep/13 ]

/proc/sys/lnet/stats does not show queued messages; it holds counters for bytes already sent/received, and it will remain constant if all peers are down and no I/O goes through LNet.
Queued messages can be observed via /proc/sys/lnet/peers. Also, the upper layers limit the number of messages queued to LNet (both the mdc and osc have their own message queues).
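
For example, a minimal sketch of checking the per-peer queues (assuming the /proc/sys/lnet/peers column layout shown in the output below, where the last column is the queued byte count):

# Print peers that still have data queued; skip the header line.
awk 'NR > 1 && $NF > 0 {printf "%-30s state=%s queue=%s\n", $1, $3, $NF}' /proc/sys/lnet/peers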

Comment by Blake Caldwell [ 25/Sep/13 ]

Here is what we observed with 2.4.0RC. All Titan osc clients were evicted at this point, but we noticed messages on the client side coming from the servers.

[2013-09-03 15:58:54][c13-0c1s0n0]LNet: 15622:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.226.46@o2ib235 portal 4 match 1443971358080092 offset 192 length 192: 2
[2013-09-03 15:58:54][c13-0c1s0n0]LNet: 15622:0:(lib-move.c:1828:lnet_parse_put()) Skipped 16 previous similar messages

On the server side at 04:36:07 PM:
[root@atlas-oss4f7 ~]# cat /proc/sys/lnet/peers
nid refs state last max rtr min tx min queue
[snip]
10.36.230.208@o2ib234 3201 down -1 8 8 8 -3187 -124974 843288

The refs and queue values were decreasing. Some had more than 8k messages queued. There were 11 servers with messages still in their queue at 4:40 PM.

This entry would appear every 100s in the lctl dk debug output:

00000800:00000100:3.0:1378240617.211971:0:3150:0:(o2iblnd_cb.c:2844:kiblnd_cm_callback()) 10.36.230.181@o2ib234: ADDR ERROR -110
00000800:00000100:3.0:1378240617.211980:0:3150:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 10.36.230.181@o2ib234: connection failed

At 04:48:54 PM connectivity was restored to this LNET router ("up" vs. "down" as above) and all LNET messages had drained.
10.36.230.208@o2ib234 6 up -1 8 8 8 8 -124974 0
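
A hedged sketch of tracking how such a queue drains over time (same peers layout assumed as above; the NID is just the example from this comment):

# Poll the queue column for one peer once a minute to see the drain rate.
nid="10.36.230.208@o2ib234"   # example NID from the output above
while true; do
    echo "$(date '+%F %T') $(awk -v nid="$nid" '$1 == nid {print "state=" $3, "queue=" $NF}' /proc/sys/lnet/peers)"
    sleep 60
done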

Comment by Jodi Levi (Inactive) [ 25/Sep/13 ]

Liang,
Would you be able to comment on this one?
Thank you!

Comment by Jason Hill (Inactive) [ 25/Sep/13 ]

Liang:

On pages 32-36 of the 1.8 Lustre Operations Manual, routerstat references /proc/sys/lnet/stats; the first field is msgs_alloc, which we understood to be the count of messages currently being processed on a host. We are looking at a more "global" scale – knowing what is happening on a per-peer basis is interesting but harder to alert and monitor on. Are we incorrect about the first field in /proc/sys/lnet/stats?


-Jason
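
For reference, a hedged sketch of reading that first field (the full field order below is an assumption based on what routerstat reports; only the first field, msgs_alloc, is confirmed in this ticket):

# Read /proc/sys/lnet/stats into named variables; msgs_alloc is the gauge
# of currently allocated (in-flight/queued) LNET messages on this node.
read -r msgs_alloc msgs_max errors send_count recv_count route_count drop_count \
        send_length recv_length route_length drop_length < /proc/sys/lnet/stats
echo "msgs_alloc=$msgs_alloc msgs_max=$msgs_max"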

Comment by David Dillow [ 25/Sep/13 ]

It is worth noting that the machine was taken down just after 8am on Sep 3rd, and we still had several thousand messages queued to the machine that afternoon; even some of the evictions were happening way late.

Comment by David Dillow [ 25/Sep/13 ]

Also, the messages that had been queued up were being sent to the routers after they came back, except they were destined for LNET networks that no longer existed on Titan – this is how we noticed the problem during the Atlas test on the 3rd. Titan had been down long enough (6+ hours) that it should have been completely evicted long before then, and none of this traffic should have made it into the Gemini side.

Comment by Liang Zhen (Inactive) [ 26/Sep/13 ]

Yes, you are right: the allocated-message count in lnet/stats is the global number of messages (all together). However, LNet itself will not automatically send a huge number of messages; the router checker sends some ping messages to routers, but at low frequency (every 60 seconds by default), so I think that traffic is not really from LNet itself but from upper-layer services. By the way, you can see the queued messages on each NI (i.e. each LNet network) by checking lnet/nis.
Also, it would be helpful if you could capture D_RPCTRACE from the server, to see what kind of messages are queued to LNet, so we can analyse what's going on.
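
A sketch of checking per-NI queues as suggested (the column layout of /proc/sys/lnet/nis is an assumption here - nid status alive refs peer rtr max tx min - with tx dropping below max, and min going negative, when messages are queued on that NI):

# Show only NIs whose tx credits are currently consumed, i.e. messages
# are queued or in flight on that network interface.
awk 'NR > 1 && $8 < $7' /proc/sys/lnet/nis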

Comment by Jason Hill (Inactive) [ 01/Oct/13 ]

Gathering data now from a handful of servers. Our monitoring alerts when the allocated message count exceeds 30,000 for 2 samples, each sample being taken every 30 seconds. Data upload forthcoming. I used the following to generate the data:

echo +rpctrace > /proc/sys/lnet/debug; lctl dk > /dev/null; sleep 60; lctl dk > /tmp/rpctrace.$HOSTNAME.20131010; echo -rpctrace > /proc/sys/lnet/debug; lctl dk > /dev/null
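
A sketch of that alerting logic, combining the stated 30,000-message/2-sample/30-second policy with the capture command above (the threshold and commands come from this comment; everything else is illustrative):

#!/bin/bash
# Fire when msgs_alloc (first field of /proc/sys/lnet/stats) exceeds the
# threshold for two consecutive 30-second samples, then grab an rpctrace dump.
threshold=30000
over=0
while true; do
    alloc=$(awk '{print $1}' /proc/sys/lnet/stats)
    if [ "$alloc" -gt "$threshold" ]; then
        over=$((over + 1))
    else
        over=0
    fi
    if [ "$over" -ge 2 ]; then
        echo "$(date): msgs_alloc=$alloc for 2 samples, capturing rpctrace"
        echo +rpctrace > /proc/sys/lnet/debug
        lctl dk > /dev/null
        sleep 60
        lctl dk > /tmp/rpctrace.$HOSTNAME.$(date +%Y%m%d)
        echo -rpctrace > /proc/sys/lnet/debug
        lctl dk > /dev/null
        over=0
    fi
    sleep 30
done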

Comment by Jason Hill (Inactive) [ 01/Oct/13 ]

Output of lctl dk with +rpctrace enabled in the debug mask.

Comment by Isaac Huang (Inactive) [ 04/Oct/13 ]

To summarize, my understanding is that after a client cluster (both the clients and routers that connect them to servers) disappears all at once, queued messages on servers drain out very very slowly. Please correct me if this is wrong.

I believe there are two things that caused the problem:
1. When a router is dead, messages already queued on that router stay on the queue, and it takes the LND a very long time to drain it. As Blake pointed out previously, there was a connection error about once every 100s, so the LND drained about 8 messages every 100 seconds. It would take hours and hours to drain thousands of queued messages if the router stays down.
2. Even after the upper layers have already evicted the clients, the messages still stay queued. This is because LNetMDUnlink cannot unlink the MDs while the queued messages still hold references on them. This is very wasteful: we let the network drain messages that the upper layers have already aborted.

How to fix them:
1. When a router becomes dead, all messages queued on it, except those whose final destination is the router itself (e.g. router pinger traffic), should be retried on other available routes, and completed immediately with an error if no route is available.
2. When an upper layer calls LNetMDUnlink and the MD is busy, we should try to abort the messages that reference the MD rather than simply wait for the network to complete them. If a message is already queued at the LND, it is difficult to abort; but if it is still queued on the next-hop peer at the LNet layer, it should not be hard to complete it immediately with an error. Since at most peer_credits messages can be queued at the LND at a time, the majority of queued messages would be drained immediately - low-hanging fruit that is much easier to pick than aborting messages already queued at the LNDs.

I think we should fix 2 first. If messages still stay queued for a long time after that, then the upper layers aren't unlinking them fast enough (e.g. very slow evictions), in which case we should also look at the upper layers for potential problems. In contrast, if we fix 1 first, such potential upper-layer problems won't have a chance to show up at all. However, neither fix is trivial.

Comment by Liang Zhen (Inactive) [ 05/Oct/13 ]

Isaac, thanks for looking into it. I agree that it's because of very slow draining of queued messages; there can be many clients and far fewer routers, so each router ends up with thousands of queued messages.
I'm thinking that if we want a relatively simple way of aborting an LNet message, we probably need a new flag for the MD, e.g. LNET_MD_CAN_ABORT (or LNET_MD_EXCL), and LNet would allow exactly one message to be associated with such an MD (LNet would return an error if the user tries to send multiple messages on an MD with this flag). That way we don't need to maintain a list of messages on each MD and scan all of them on unlink, which is more complex in the latest LNet because of the locking changes. This should satisfy the current use case in Lustre/ptlrpc, although it is still not trivial.
Also, we would need to add a new RPC API in ptlrpc to allow aborting.

Comment by Isaac Huang (Inactive) [ 10/Oct/13 ]

Liang, I haven't looked at the locking changes, but I have no plan to queue messages on the MD. How often does Lustre/PTLRPC send multiple messages from the same MD? A quick code scan under lustre/ptlrpc/ showed zero such uses. The FF server collectives do this, but only after gossip confirms the target node to be alive, which minimizes the chance of sending a message to a dead node in the first place.

My current plan is to simply save a pointer to the latest message on the MD: NULL the pointer when the message goes to the LND, replace it when another message is sent over the same MD, and complete the message, if any, on LNetMDUnlink. If Lustre does send multiple messages from the same MD in the future, we would still have a chance to abort the most recent message, which is the one most likely still on the queue. This should cover most cases. Also, I don't see a need to add a new MD flag; this should always be the unlink behavior.

Comment by Liang Zhen (Inactive) [ 11/Oct/13 ]

One concern of mine is that if we don't add a new flag (or an explicit abort API), then we are changing the semantics of the current API: today LNetMDUnlink will not implicitly abort any in-flight PUT/GET (although Lustre doesn't rely on this). So would it be reasonable to add a new flag to allow Unlink to abort in-flight messages?

Comment by Isaac Huang (Inactive) [ 16/Oct/13 ]

If I haven't missed something, I don't see any semantic change. Currently callers eventually get a SENT event with the unlinked flag set and likely an error status code. With the proposed change it's still the same thing, i.e. a SENT event with unlinked=1 and status=-ECANCELED; the only difference is that the event could now come much sooner. Callers are supposed to handle that event anyway, sooner or later. I don't think this is a semantic change. The API semantics never say (and can't say) how soon such a piggybacked unlink event will happen. If for some reason Lustre can't handle an instantaneous piggybacked unlink, then that's a bug in Lustre that should be fixed.

Comment by Isaac Huang (Inactive) [ 22/Oct/13 ]

Patch posted: http://review.whamcloud.com/#/c/8041/

Though the idea was straightforward, the original approach I suggested turned out to be too messy to implement: the MD needs to keep a pointer to a message that is not reference counted, it is very tricky to remove a message from its peer/NI queue, and so on.

Instead I chose to abort messages in lnet_post_send_locked(). The drawback is that it takes an additional LND timeout for messages to be aborted. But the advantages are much, much simpler code, and that all queued messages on an unlinked MD will be aborted rather than just one. I think this is a good trade-off between instantaneous unlink and code complexity.

Comment by Liang Zhen (Inactive) [ 22/Oct/13 ]

Yes, I think this way is better. The other reason I proposed a new flag for the MD is that, even if we only track the last message associated with the MD, we still need some complex locking because different messages are protected by different locks; this approach avoids those complex lock operations entirely.
Still, I think that unlink implying abort is a slight semantic change. For example, if a user creates an MD with threshold == LNET_MD_THRESH_INF, pings an arbitrary number of peers, unlinks immediately, and relies on the callback to count results, that just works with current LNet, but it might fail some pings if unlink implied aborting queued messages. But it is probably fine if nobody relies on this semantic (at least Lustre doesn't), so I don't insist on this.

Comment by Jason Hill (Inactive) [ 22/Oct/13 ]

Can I get an idea of the time to solution? Are we looking at something that will be tested and ready to install in the next month or something longer?

Thx.

-Jason

Comment by James A Simmons [ 22/Oct/13 ]

There is a patch ready for testing.

Comment by Isaac Huang (Inactive) [ 22/Oct/13 ]

Liang, yes, it's a change in the case you described. But I'd regard unlinking an MD while there are active messages you don't want to give up as an abuse of an obscure part of the API semantics rather than as a valid use case we'd support. I've pushed an update that adds a comment above LNetMDUnlink: "As a result, active messages associated with the MD may get aborted."

Comment by Isaac Huang (Inactive) [ 30/Oct/13 ]

Patch updated: we can't use LNET_MD_FLAG_ZOMBIE, which can be set as a result of MD auto-unlink, where it is not an abort and active messages should not be canceled as a result (e.g. a REPLY message from an MD exhausted by the corresponding GET). The only way is to add a new flag, LNET_MD_FLAG_ABORTED, set by LNetM[DE]Unlink.

Comment by Isaac Huang (Inactive) [ 01/Nov/13 ]

Hi, we've tested the patch on our internal test systems, and there will be some additional testing at Hyperion. Meanwhile, if you want to test it at ORNL, I'd suggest:

  • Put it on a few clients first. Although it's supposed to solve a server-side problem, it changes the code path for every outgoing message.
  • Then put it on a couple of servers, avoiding the MDS/MGS.

Comment by James A Simmons [ 09/Dec/13 ]

Will this patch be cherry picked to b2_4 and b2_5?

Comment by James Nunez (Inactive) [ 09/Dec/13 ]

Patch http://review.whamcloud.com/#/c/8041/ landed to master.

I'll look into plans for landing in b2_4 and b2_5.

Comment by James Nunez (Inactive) [ 09/Dec/13 ]

James,

Has anyone at ORNL tested the patch and, if so, did it fix or reduce the number of messages in the queue? Your feedback will help us determine whether the patch should be cherry-picked/back-ported to b2_4 and/or b2_5.

Thanks,
James

Comment by James A Simmons [ 11/Dec/13 ]

Is it a server-side-only patch? If so, I can arrange to have it tested.

Comment by James Nunez (Inactive) [ 16/Dec/13 ]

James,

The patch can be applied to the servers only and you should still see a benefit, but it touches both client and server code. A server-only configuration is fine for testing; in production we recommend patching both clients and servers.

Thanks,
James

Comment by James Nunez (Inactive) [ 31/Dec/13 ]

Jason or James,

Is this still an issue or should we close this ticket?

Thanks,
James

Comment by James A Simmons [ 02/Jan/14 ]

I plan to test this patch at scale with 2.4 clients.

Comment by James A Simmons [ 09/Jan/14 ]

We had a discussion at work about testing this patch. It was decided not to test with it in the near future since it is a corner case. You can close this ticket; if we run into the problem again we can reopen it.

Comment by James Nunez (Inactive) [ 09/Jan/14 ]

James,
Thank you for the update. I'm going to close this ticket, but we can reopen it if this patch does not solve your problem.

James

Comment by James A Simmons [ 22/Jan/14 ]

On January 21 we performed a test shot with our Lustre 2.4 file system and 2.4 clients; previously we were on 1.8 clients. During startup we ran into this issue. We have another test shot February 4th and will perform the upgrade on the 28th. I plan to run with this patch on the server side to see if it resolves the issues we are seeing.
