Add Way to Detect Dropped Packets on Production Systems

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.4.0
    • 5735

    Description

      Today, it is very difficult to confirm whether timeouts in Lustre are due to dropped packets in LNet. This is due to two reasons:

      1- neterrors are off by default so logging does not show dropped packets.
      2- the errors counter is never incremented (see LU-2223).

      My understanding is that neterrors are off by default because they produce too much "noise" when they are on. That raises the question: how can messages that are logged that frequently be considered errors?

      I think this issue can be addressed in one of two ways:

      1- Clean up the neterror logs so they are not noisy and then leave neterrors on by default.
      2- Add a set of new counters to LNet to count the reasons for dropped packets.
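
The second option could look roughly like the sketch below. It is written in Python for brevity and is not LNet code; the reason names (`PEER_DOWN`, `NO_ROUTE`, `TIMEOUT`) and the `DropStats` class are hypothetical, chosen only to illustrate counting drops per reason instead of logging each one.

```python
from collections import Counter
from enum import Enum


class DropReason(Enum):
    # Hypothetical drop reasons, for illustration only; not taken from LNet.
    PEER_DOWN = "peer_down"
    NO_ROUTE = "no_route"
    TIMEOUT = "timeout"


class DropStats:
    """Counts dropped packets per reason without logging each drop."""

    def __init__(self):
        self._counts = Counter()

    def record_drop(self, reason: DropReason) -> None:
        # One cheap increment per drop; no console message is emitted.
        self._counts[reason] += 1

    def snapshot(self) -> dict:
        # Report every known reason, including those that never occurred.
        return {reason.value: self._counts[reason] for reason in DropReason}


stats = DropStats()
stats.record_drop(DropReason.PEER_DOWN)
stats.record_drop(DropReason.PEER_DOWN)
stats.record_drop(DropReason.TIMEOUT)
drop_snapshot = stats.snapshot()
```

Counters of this kind stay silent on the console, so they could be left enabled on production systems and read only when a timeout needs to be explained.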

          Activity

            [LU-2418] Add Way to Detect Dropped Packets on Production Systems

            doug Doug Oucharek (Inactive) added a comment - LU-8223 implements one part of this solution: turning neterrors on by default.
            isaac Isaac Huang (Inactive) added a comment - - edited

            As I remember, they were too noisy at sites where tens of nodes are always down (for maintenance, for example) while something keeps trying to communicate with them, such as the router pinger or upper layers. So if the same error happened with 50 nodes, there would be 50 such error messages instead of one saying that the error happened with these 50 nodes.

            It became even worse at sites where console output from the servers was gathered into one place.
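
The aggregation described in this comment (one message per error, listing the affected nodes) might be sketched as follows. This is an illustrative Python sketch, not existing Lustre or LNet code; the `summarize_errors` helper, the error string, and the node names are all made up for the example.

```python
from collections import defaultdict


def summarize_errors(events):
    """Group (error, node) events so that each error yields one message."""
    nodes_by_error = defaultdict(set)
    for error, node in events:
        # A set collapses repeated attempts against the same node.
        nodes_by_error[error].add(node)
    return [
        f"{error}: {len(nodes)} nodes ({', '.join(sorted(nodes))})"
        for error, nodes in sorted(nodes_by_error.items())
    ]


# Hypothetical events: the same error seen on three nodes, one of them twice.
events = [("no route to peer", f"node{i:02d}") for i in range(3)]
events.append(("no route to peer", "node00"))
summary = summarize_errors(events)
```

With this grouping, four raw events collapse into a single line naming the three distinct nodes, which is the behavior the comment asks for.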


            People

              wc-triage WC Triage
              doug Doug Oucharek (Inactive)