Lustre / LU-2418

Add Way to Detect Dropped Packets on Production Systems

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.4.0
    • None
    • 5735

    Description

      Today, it is very difficult to confirm whether timeouts in Lustre are due to dropped packets in LNet. This is due to two reasons:

      1- neterrors are off by default, so logging does not show dropped packets.
      2- the errors counter is never incremented (see LU-2223).

      My understanding is that neterrors are off by default because there is too much "noise" when they are on. That raises the question: how can messages that are logged that frequently be considered errors?

      I think this issue can be addressed in one of two ways:

      1- Clean up the neterror logs so they are not noisy and then leave neterrors on by default.
      2- Add a set of new counters to LNet to count the reasons for dropped packets.
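
As a rough illustration of option 2, per-reason drop counters could look something like the sketch below. This is not LNet code: the reason names, `count_drop()`, `drops_for()`, `drops_total()` and `dump_drops()` are all hypothetical, and the real set of drop reasons would have to come from the LNet drop paths themselves. The point is only that a cheap counter per reason can stay enabled on production systems, unlike noisy logging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical drop reasons, for illustration only; not LNet identifiers. */
enum drop_reason {
    DROP_NO_ROUTE,
    DROP_PEER_DOWN,
    DROP_NO_BUFFER,
    DROP_BAD_HEADER,
    DROP_REASON_COUNT
};

static const char *const drop_reason_names[] = {
    "no_route", "peer_down", "no_buffer", "bad_header"
};

/* Compile-time check that every reason has a printable name. */
static_assert(sizeof(drop_reason_names) / sizeof(drop_reason_names[0]) ==
              DROP_REASON_COUNT, "each drop reason needs a name");

/* One counter per reason; static storage means they start at zero. */
static atomic_ulong drop_counters[DROP_REASON_COUNT];

/* Called on a drop path; a relaxed increment keeps the hot path cheap. */
static void count_drop(enum drop_reason reason)
{
    if ((unsigned)reason < DROP_REASON_COUNT)
        atomic_fetch_add_explicit(&drop_counters[reason], 1,
                                  memory_order_relaxed);
}

/* Current count for a single reason (0 for an out-of-range reason). */
static unsigned long drops_for(enum drop_reason reason)
{
    if ((unsigned)reason >= DROP_REASON_COUNT)
        return 0;
    return atomic_load_explicit(&drop_counters[reason], memory_order_relaxed);
}

/* Sum over all reasons. */
static unsigned long drops_total(void)
{
    unsigned long total = 0;
    for (int i = 0; i < DROP_REASON_COUNT; i++)
        total += drops_for((enum drop_reason)i);
    return total;
}

/* Print one "name count" line per reason, e.g. for a stats file. */
static void dump_drops(FILE *out)
{
    for (int i = 0; i < DROP_REASON_COUNT; i++)
        fprintf(out, "%s %lu\n", drop_reason_names[i],
                drops_for((enum drop_reason)i));
}
```

With counters like these, an administrator could compare the per-reason values before and after a timeout to see whether drops occurred, and why, without turning neterrors on.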

            People

              wc-triage WC Triage
              doug Doug Oucharek (Inactive)