[LU-2187] Why are we losing messages? Created: 15/Oct/12 Updated: 13/Feb/14 Resolved: 13/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | sequoia |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5230 |
| Description |
|
According to the analysis of Why are messages being lost that can result in |
| Comments |
| Comment by Doug Oucharek (Inactive) [ 18/Oct/12 ] |
|
Question: Do we know if the Lustre messages are being "lost" or are just delayed to the point that they are considered lost by the Lustre code in question? The IB network is reliable, but can delay messages to the point we take action assuming the message is lost. |
| Comment by Christopher Morrone [ 18/Oct/12 ] |
|
I can't say definitively, but I strongly suspect that the messages aren't really lost. More likely this is bad Lustre behaviour, such as poor adaptive timeout (AT) assumptions or delayed sending of messages. |
| Comment by Isaac Huang (Inactive) [ 21/Oct/12 ] |
|
There are a few things to do to diagnose it:
|
| Comment by Christopher Morrone [ 22/Oct/12 ] |
|
Isaac, I think it is very unlikely that we'll see anything useful in there. I'll certainly keep an eye on it, but I do not believe that we're seeing real network problems here. We've investigated the network pretty thoroughly and it all checks out. I think we need to look higher in the stack. I don't think we've really lost the message. My suspicion is that the server sent the reply too slowly, and the client sent the retry before the server had ever replied once. So how do we investigate that problem? |
| Comment by Doug Oucharek (Inactive) [ 25/Oct/12 ] |
|
When this happens, are there any ptlrpc timeout logs in the console? Theoretically, if there is a resend, there should be a timeout log. |
| Comment by Oleg Drokin [ 08/Nov/12 ] |
|
I remember a set of patches from LLNL that disabled all that "noise" like printing about resent RPCs. I wonder if this is applied by default on their deployments? |
| Comment by Doug Oucharek (Inactive) [ 13/Nov/12 ] |
|
Chris: Have you been able to "turn up" the RPC logging to see if there are resend logs? |
| Comment by Johann Lombardi (Inactive) [ 14/Nov/12 ] |
|
Chris, could you please advise whether you quiet the message in ptlrpc_expire_one_request() displayed when a timeout happens? |
| Comment by Christopher Morrone [ 14/Nov/12 ] |
|
We do change it from D_WARNING to D_NETERROR. Our code is in git on github.com/chaos/lustre. Our most recent branch is 2.3.54-llnl. |
| Comment by Isaac Huang (Inactive) [ 16/Nov/12 ] |
|
Chris, D_NETERROR messages only go to the Lustre debug log without an explicit 'echo +neterror > /proc/sys/lnet/printk'. |
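For reference, a minimal sketch of how one might check for the quieted timeout messages. It assumes the `/proc/sys/lnet/printk` path quoted above and uses `lctl dk` to dump the Lustre debug log. The helper names (`enable_neterror_printk`, `filter_timeouts`, `dump_timeouts`) and the `PRINTK` override variable are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Path taken from the comment above; overridable via PRINTK (hypothetical
# variable) so the sketch can be exercised on a machine without Lustre.
PRINTK="${PRINTK:-/proc/sys/lnet/printk}"

# Add neterror to the console (printk) mask so D_NETERROR messages
# are also printed to the console, not only to the debug log.
enable_neterror_printk() {
    echo +neterror > "$PRINTK"
}

# Keep only lines that mention the timeout function discussed in this thread.
filter_timeouts() {
    grep ptlrpc_expire_one_request
}

# Dump the kernel debug log to the file given as $1, then show
# only the request timeout lines from it.
dump_timeouts() {
    lctl dk "$1" >/dev/null && filter_timeouts < "$1"
}
```

On an affected node this could be run as root, for example `enable_neterror_printk` before reproducing the problem, or `dump_timeouts /tmp/lustre-debug.log` afterwards (the output file name is arbitrary).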
| Comment by Christopher Morrone [ 13/Feb/14 ] |
|
We will never get back to this one. |