[LU-2187] Why are we losing messages? Created: 15/Oct/12  Updated: 13/Feb/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Doug Oucharek (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: sequoia

Issue Links:
Related
is related to LU-1717 mdt_recovery.c:611:mdt_steal_ack_lock... Resolved
Severity: 3
Rank (Obsolete): 5230

 Description   

According to the analysis in LU-1717, we are frequently losing Lustre messages on Sequoia's IB network. We have no LNet routers, and IB is a reliable network. We are not seeing any timeouts or LNet errors that would suggest IB transmission problems.

Why are messages being lost that can result in LU-1717 error messages? I'm worried that we're papering over a larger problem by silencing those errors.



 Comments   
Comment by Doug Oucharek (Inactive) [ 18/Oct/12 ]

Question: Do we know if the Lustre messages are being "lost" or are just delayed to the point that they are considered lost by the Lustre code in question?

The IB network is reliable, but can delay messages to the point we take action assuming the message is lost.

Comment by Christopher Morrone [ 18/Oct/12 ]

I can't say definitively, but I strongly suspect that the messages aren't really lost; more likely this is bad Lustre behaviour making poor AT (adaptive timeout) assumptions, delaying the send of messages, or something along those lines.

Comment by Isaac Huang (Inactive) [ 21/Oct/12 ]

There are a few things we can do to diagnose this (a combined command sketch follows the list):

  • /proc/sys/lnet/* would be very useful. Please gather them once such errors happen again.
  • "lctl --net o2ib0 conn_list (or list_conn)" gives useful data at o2iblnd layer. Please run it once such errors happen again.
  • Many LNet/LND errors don't go to dmesg by default, please on a node where such errors have occurred, do a 'echo +neterror > /proc/sys/lnet/printk' after each reboot.
  • Please also check IB errors by running 'ibcheckerrors'.
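
Put together, the capture on an affected node might look roughly like this (a sketch; the specific files under /proc/sys/lnet/ and the output paths are only illustrative):

    # Snapshot LNet state (stats, peers, nis, ...)
    cat /proc/sys/lnet/stats /proc/sys/lnet/peers /proc/sys/lnet/nis > /tmp/lnet-state.txt

    # o2iblnd connection details for the o2ib0 network
    lctl --net o2ib0 conn_list

    # Make LNet/LND network errors visible in dmesg (must be repeated after every reboot)
    echo +neterror > /proc/sys/lnet/printk

    # Check the IB fabric for port errors
    ibcheckerrors
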
Comment by Christopher Morrone [ 22/Oct/12 ]

Isaac, I think it is very unlikely that we'll see anything useful in there. I'll certainly keep an eye on it, but I do not believe that we're seeing real network problems here. We've investigated the network pretty thoroughly and it all checks out.

I think we need to look higher in the stack. I don't think we've really lost the message. But my suspicion is that the server sent the reply too slowly, and the client sent the retry before the server ever replied.

So how do we investigate that problem?
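
One place to start (a hedged sketch; the exact parameter paths vary between Lustre versions) would be to look at the adaptive-timeout state on the clients around the time of the resends:

    # On a client: per-import AT estimates (network latency and service time)
    lctl get_param mdc.*.timeouts osc.*.timeouts

    # Global adaptive-timeout tunables
    lctl get_param at_min at_max at_history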

Comment by Doug Oucharek (Inactive) [ 25/Oct/12 ]

When this happens, are there any ptlrpc timeout logs in the console? Theoretically, if there is a resend, there should be a timeout log.

Comment by Oleg Drokin [ 08/Nov/12 ]

I remember a set of patches from LLNL that disabled all that "noise" like printing about resent RPCs. I wonder if this is applied by default on their deployments?

Comment by Doug Oucharek (Inactive) [ 13/Nov/12 ]

Chris: Have you been able to "turn up" the RPC logging to see if there are resend logs?
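
For reference, a sketch of what "turning it up" could look like (the debug buffer size here is just an example):

    # Add RPC tracing to the Lustre debug mask and enlarge the debug buffer
    lctl set_param debug=+rpctrace
    lctl set_param debug_mb=256

    # Reproduce the problem, then dump and clear the kernel debug buffer
    lctl dk /tmp/lustre-rpctrace.log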

Comment by Johann Lombardi (Inactive) [ 14/Nov/12 ]

Chris, could you please advise whether you quiet the message displayed by ptlrpc_expire_one_request() when a timeout happens?

Comment by Christopher Morrone [ 14/Nov/12 ]

We do change it from D_WARNING to D_NETERROR. Our code is in git on github.com/chaos/lustre. Our most recent branch is 2.3.54-llnl.

Comment by Isaac Huang (Inactive) [ 16/Nov/12 ]

Chris, D_NETERROR messages only go to the Lustre debug log without an explicit 'echo +neterror > /proc/sys/lnet/printk'.
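
So to actually see them you would need something like the following (a sketch; the grep pattern is only an illustration of what to look for):

    # Route D_NETERROR messages to the console (lost again on reboot)
    echo +neterror > /proc/sys/lnet/printk

    # Or pull them out of the Lustre debug buffer after the fact
    lctl dk /tmp/lustre-debug.log
    grep -i 'timed out' /tmp/lustre-debug.log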

Comment by Christopher Morrone [ 13/Feb/14 ]

We will never get back to this one.
