[LU-2418] Add Way to Detect Dropped Packets on Production Systems Created: 30/Nov/12  Updated: 13/Jun/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Doug Oucharek (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-8223 De-Noise LNet neterr logs so they can... Open
Rank (Obsolete): 5735

 Description   

Today, it is very difficult to confirm whether timeouts in Lustre are due to dropped packets in LNet. There are two reasons:

1- neterrors are off by default so logging does not show dropped packets.
2- the errors counter is never incremented (see LU-2223).

My understanding is that neterrors are off by default because there is too much "noise" when they are on. That raises the question: how can messages that are logged that frequently be considered errors?

I think this issue can be addressed in one of two ways:

1- Clean up the neterror logs so they are not noisy and then leave neterrors on by default.
2- Add a set of new counters to LNet to count the reasons for dropped packets.
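
To illustrate option 2, here is a minimal sketch of per-reason drop counters. It is written in Python for brevity and is not LNet code: the `DropReason` values and the `DropCounters` class are hypothetical names chosen for the example, and the real set of drop reasons would have to come from the LNet drop paths.

```python
from collections import Counter
from enum import Enum


class DropReason(Enum):
    # Hypothetical reasons, chosen for illustration only.
    NO_ROUTE = "no_route"
    PEER_DOWN = "peer_down"
    NO_MATCH = "no_match"
    TIMEOUT = "timeout"


class DropCounters:
    """Count dropped packets by reason instead of logging each one."""

    def __init__(self):
        self._counts = Counter()

    def record(self, reason: DropReason) -> None:
        # Called once at each point where a packet is dropped.
        self._counts[reason] += 1

    def total(self) -> int:
        return sum(self._counts.values())

    def snapshot(self) -> dict:
        # Report every reason, including those that never occurred.
        return {r.value: self._counts[r] for r in DropReason}


counters = DropCounters()
counters.record(DropReason.PEER_DOWN)
counters.record(DropReason.PEER_DOWN)
counters.record(DropReason.TIMEOUT)
```

Counters like these stay cheap on a production system, and comparing two snapshots taken around a timeout would show whether packets were dropped and why, without turning on any logging.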



 Comments   
Comment by Isaac Huang (Inactive) [ 04/Dec/12 ]

As I remember, they were too noisy at sites where tens of nodes are always down for maintenance, for example, yet something keeps trying to communicate with them, such as the router pinger or upper layers. If the same error happened with 50 nodes, there would be 50 such error messages instead of one saying that this error happened with these 50 nodes.

It became even worse at sites where console outputs of servers were gathered into one place.
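
The "one message for 50 nodes" idea could be sketched as follows. This is an illustrative Python sketch under assumptions, not existing LNet behavior: `ErrorAggregator`, the error text, and the node names are all hypothetical.

```python
from collections import defaultdict


class ErrorAggregator:
    """Group identical errors and report the affected peers together."""

    def __init__(self):
        self._peers_by_error = defaultdict(set)

    def report(self, error: str, peer: str) -> None:
        # Remember the peer instead of logging a line immediately.
        self._peers_by_error[error].add(peer)

    def flush(self) -> list:
        # Emit one summary line per distinct error, then reset.
        lines = []
        for error, peers in sorted(self._peers_by_error.items()):
            names = ", ".join(sorted(peers))
            lines.append(f"{error}: {len(peers)} peer(s): {names}")
        self._peers_by_error.clear()
        return lines


agg = ErrorAggregator()
for i in range(50):
    agg.report("peer unreachable", f"node{i:02d}")
summary = agg.flush()
```

Flushing on a timer would turn 50 console lines into one per interval, which matters most where the console output of many servers is gathered in one place.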

Comment by Doug Oucharek (Inactive) [ 13/Jun/16 ]

LU-8223 implements one part of this solution: turning neterrors on by default.
