  1. Lustre
  2. LU-2418

Add Way to Detect Dropped Packets on Production Systems

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: None
    • Labels:
      None
    • Rank (Obsolete):
      5735

      Description

      Today, it is very difficult to confirm whether timeouts in Lustre are due to dropped packets in LNet. This is due to two reasons:

      1- neterrors are off by default, so the logs do not show dropped packets.
      2- The errors counter is never incremented (see LU-2223).

      My understanding is that neterrors are off by default because they generate too much "noise" when enabled. That raises the question: how can messages that are logged that frequently be considered errors?

      I think this issue can be addressed in one of two ways:

      1- Clean up the neterror logs so they are not noisy and then leave neterrors on by default.
      2- Add a set of new counters to LNet to count the reasons for dropped packets.
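
      To make option 2 more concrete, here is a rough sketch of what per-reason drop counters might look like. Every name in it (drop_reason, drop_counters, and the individual reasons) is hypothetical and not taken from the LNet source. It is plain userspace C with no locking; kernel code would presumably need atomic or per-CPU counters instead.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical drop reasons, for illustration only (not from LNet). */
enum drop_reason {
    DROP_NO_ROUTE = 0,
    DROP_PEER_DOWN,
    DROP_NO_BUFFER,
    DROP_BAD_HEADER,
    DROP_REASON_MAX
};

/* One counter per reason. Not thread-safe in this sketch. */
struct drop_counters {
    uint64_t count[DROP_REASON_MAX];
};

static const char *const drop_reason_name[DROP_REASON_MAX] = {
    "no_route", "peer_down", "no_buffer", "bad_header"
};

/* Record one dropped packet; out-of-range reasons are ignored. */
void drop_counters_record(struct drop_counters *dc, enum drop_reason r)
{
    if ((unsigned int)r < DROP_REASON_MAX)
        dc->count[r]++;
}

/* Sum of all drops, regardless of reason. */
uint64_t drop_counters_total(const struct drop_counters *dc)
{
    uint64_t total = 0;
    int i;

    for (i = 0; i < DROP_REASON_MAX; i++)
        total += dc->count[i];
    return total;
}

/* Print one "name value" line per reason. */
void drop_counters_print(const struct drop_counters *dc, FILE *out)
{
    int i;

    for (i = 0; i < DROP_REASON_MAX; i++)
        fprintf(out, "%s %llu\n", drop_reason_name[i],
                (unsigned long long)dc->count[i]);
}
```

      The idea is that each place a packet is dropped would call something like drop_counters_record() with its reason. An administrator could then read the counters on a production system without turning on neterror logging.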

              People

              • Assignee:
                wc-triage WC Triage
                Reporter:
                doug Doug Oucharek (Inactive)