Lustre / LU-4006

LNET Messages staying in Queue

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 1.8.9, Lustre 2.4.1
    • Environment: RHEL5 server with SLES11 SP1 router/client, as well as RHEL6 server with SLES11 SP1 or SP2 client
    • Severity: 3
    • Rank (Obsolete): 10722

    Description

      We still need guidance on which server-side data to gather, but here is what we observe: when the Titan compute platform is shut down, the queued-message value in /proc/sys/lnet/stats on the server remains constant until the platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients running SLES11 SP2.

      We have a home-grown monitor for the backlog of messages on a particular server (and on the LNET RTRs, but at the time of reporting the LNET RTRs are all down from a hardware perspective). We can attach that script if it would be useful.
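
For readers without access to the attached script, here is a minimal sketch of the kind of check such a monitor could perform. It is not the attached lnet_stats.sh. It assumes /proc/sys/lnet/stats holds a single line of 11 integers and that the first field (called msgs_alloc here) is the queued-message value discussed above. The field names and the function names parse_lnet_stats, read_stats, and backlog_is_stuck are illustrative assumptions.

```python
from typing import Dict, List

# ASSUMPTION: field order of the single line in /proc/sys/lnet/stats.
# Verify against the LNET source for your Lustre version before relying on it.
FIELDS = [
    "msgs_alloc", "msgs_max", "errors",
    "send_count", "recv_count", "route_count", "drop_count",
    "send_length", "recv_length", "route_length", "drop_length",
]


def parse_lnet_stats(text: str) -> Dict[str, int]:
    """Parse one whitespace-separated stats line into a name -> value dict."""
    values = [int(token) for token in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(values))
        )
    return dict(zip(FIELDS, values))


def read_stats(path: str = "/proc/sys/lnet/stats") -> Dict[str, int]:
    """Read and parse the stats file on a node with LNET loaded."""
    with open(path) as handle:
        return parse_lnet_stats(handle.read())


def backlog_is_stuck(samples: List[Dict[str, int]], min_samples: int = 3) -> bool:
    """Return True if the last `min_samples` samples all show the same
    nonzero msgs_alloc value, i.e. the backlog is neither growing nor draining."""
    if len(samples) < min_samples:
        return False
    recent = [sample["msgs_alloc"] for sample in samples[-min_samples:]]
    return recent[0] > 0 and len(set(recent)) == 1
```

In practice one would call read_stats() at a fixed interval on the server, append each result to a list, and flag the server when backlog_is_stuck() returns True. The interval and the number of samples are site-specific choices.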

      Please provide the data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday.

      While a large number of LNET messages are queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or reached through other routers), which is why I marked this as Minor.

      Attachments

        1. lnet_stats.sh
          2 kB
          Jason Hill
        2. LU-4006.tgz
          0.2 kB
          Jason Hill

          People

            Assignee: Isaac Huang (Inactive)
            Reporter: Jason Hill (Inactive)