[LU-2418] Add Way to Detect Dropped Packets on Production Systems Created: 30/Nov/12 Updated: 13/Jun/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major |
| Reporter: | Doug Oucharek (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Rank (Obsolete): | 5735 |
| Description |
|
Today, it is very difficult to confirm whether timeouts in Lustre are caused by dropped packets in LNet. This is due to two reasons: 1- neterrors are off by default, so the logs do not show dropped packets. My understanding is that neterrors are off by default because they produce too much "noise" when they are on. That raises the question: how can messages that are logged that frequently be considered errors? I think this issue can be addressed in one of two ways: 1- Clean up the neterror logs so they are not noisy, and then leave neterrors on by default. |
| Comments |
| Comment by Isaac Huang (Inactive) [ 04/Dec/12 ] |
|
As I recall, they were too noisy at sites where there are always tens of nodes down (for maintenance, for example) while something, such as the router pinger or the upper layers, repeatedly attempts to communicate with them. So if the same error happened with 50 nodes, there would be 50 such error messages instead of one saying that this error happened with these 50 nodes. It became even worse at sites where the console output of the servers was gathered into one place. |
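
The aggregation described above (one message per error that lists the affected nodes, rather than one message per node) could be sketched roughly as follows. This is only an illustration in Python, not LNet code: the function name `aggregate_errors`, the `(error, node)` event format, the error text, and the example NIDs are all assumptions made for the sketch.

```python
from collections import defaultdict


def aggregate_errors(events):
    """Collapse (error, node) pairs into one summary line per distinct error.

    `events` is a hypothetical iterable of (error_text, node_id) tuples;
    real neterror messages would first have to be parsed into this shape.
    """
    nodes_by_error = defaultdict(set)
    for error, node in events:
        # A set drops repeats, so retries against the same node count once.
        nodes_by_error[error].add(node)

    # Emit one line per error, listing every affected node in sorted order.
    return [
        f"{error}: {len(nodes)} peer(s) affected: {', '.join(sorted(nodes))}"
        for error, nodes in sorted(nodes_by_error.items())
    ]


if __name__ == "__main__":
    # Made-up sample: the same error seen on three peers, one of them twice.
    sample = [("peer not alive", f"10.0.0.{i}@tcp") for i in range(1, 4)]
    sample.append(("peer not alive", "10.0.0.1@tcp"))
    for line in aggregate_errors(sample):
        print(line)
```

With this kind of grouping, 50 down nodes would produce a single line per error type instead of 50 lines, which is the reduction in console noise the comment is asking for.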
| Comment by Doug Oucharek (Inactive) [ 13/Jun/16 ] |
|
LU-8223 implements one part of this solution: turning neterrors on by default. |