Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 1.8.9, Lustre 2.4.1
    • Environment: RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client
    • 3
    • 10722

    Description

      We will need guidance on which server-side data to gather, but the behavior is this: when the Titan compute platform is shut down, the queued-messages value in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients using SLES11 SP2.

      We have a home-grown monitor for the backlog of messages on a particular server (and on the LNET RTRs, but at the time of reporting the LNET RTRs are all down from a hardware perspective). We can attach that script if it would be useful.
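      A backlog check of this kind could be sketched roughly as below. This is not the attached lnet_stats.sh; it is a minimal illustration that assumes /proc/sys/lnet/stats holds a single line of eleven counters in the order listed in FIELDS, and that the first counter (msgs_alloc) is the "queued messages" value discussed here. The function names are hypothetical, and the field order should be checked against the Lustre version in use.

```python
import time
from typing import Dict, List

# ASSUMPTION: field order of the single line in /proc/sys/lnet/stats.
FIELDS = [
    "msgs_alloc", "msgs_max", "errors",
    "send_count", "recv_count", "route_count", "drop_count",
    "send_length", "recv_length", "route_length", "drop_length",
]


def parse_lnet_stats(text: str) -> Dict[str, int]:
    """Map the whitespace-separated counters onto the assumed field names."""
    values = [int(tok) for tok in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} counters, got {len(values)}")
    return dict(zip(FIELDS, values))


def backlog_is_stuck(samples: List[Dict[str, int]]) -> bool:
    """True when msgs_alloc is non-zero and identical in every sample.

    At least two samples are required before anything is reported.
    """
    if len(samples) < 2:
        return False
    counts = {s["msgs_alloc"] for s in samples}
    return len(counts) == 1 and next(iter(counts)) > 0


def sample_stats(path: str = "/proc/sys/lnet/stats",
                 count: int = 3, interval: float = 10.0) -> List[Dict[str, int]]:
    """Read the stats file `count` times, sleeping `interval` seconds between reads."""
    samples = []
    for i in range(count):
        with open(path) as fh:
            samples.append(parse_lnet_stats(fh.read()))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

      On a server, backlog_is_stuck(sample_stats()) would flag the situation described above, where the value stays constant while the compute platform is down. A constant non-zero value over a short window is only a hint, so the sampling interval and count are worth tuning.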

      Please tell us which data-gathering techniques we should use to make problem diagnosis more informative. We will likely have a chance to gather data every Tuesday.

      Although a large number of LNET messages are queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers, whether directly connected or reached through other routers, which is why I marked this as Minor.
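      One way to test the assumption that the backlog belongs to the router peers is to look at the per-peer queue values. The sketch below assumes /proc/sys/lnet/peers is a table whose header row starts with "nid" and ends with "queue", and that the last column reports the amount queued for that peer. The function name is hypothetical, and the column layout should be verified on the running system.

```python
from typing import Dict


def queued_by_peer(peers_text: str) -> Dict[str, int]:
    """Return {nid: queue} for every peer whose queue column is non-zero.

    ASSUMPTION: the first column is the peer NID and the last column is "queue".
    """
    rows = [ln.split() for ln in peers_text.splitlines() if ln.strip()]
    if not rows or rows[0][0] != "nid" or rows[0][-1] != "queue":
        raise ValueError("unexpected header in peers table")
    backlog = {}
    for fields in rows[1:]:
        queue = int(fields[-1])
        if queue > 0:
            backlog[fields[0]] = queue
    return backlog
```

      Comparing the NIDs returned here against the known router NIDs would show whether the stuck messages are all destined for the routers that are down.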

      Attachments

        1. lnet_stats.sh
          2 kB
          Jason Hill
        2. LU-4006.tgz
          0.2 kB
          Jason Hill

        Activity

          [LU-4006] LNET Messages staying in Queue
          adilger Andreas Dilger made changes -
          Link New: This issue is related to EX-9066 [ EX-9066 ]
          pjones Peter Jones made changes -
          Link Original: This issue is related to LDEV-38 [ LDEV-38 ]
          pjones Peter Jones made changes -
          Link New: This issue is related to LDEV-38 [ LDEV-38 ]
          pjones Peter Jones made changes -
          Link New: This issue is related to DDN-158 [ DDN-158 ]
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.5.3 [ 11100 ]
          Labels Original: mn4 mq314 New: mn4
          pjones Peter Jones made changes -
          Labels Original: mn4 mq214 New: mn4 mq314
          pjones Peter Jones made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
          pjones Peter Jones made changes -
          Labels Original: mn4 New: mn4 mq214
          pjones Peter Jones made changes -
          Resolution Original: Fixed [ 1 ]
          Status Original: Closed [ 6 ] New: Reopened [ 4 ]

          simmonsja James A Simmons added a comment - On January 21 we performed a test shot with our Lustre 2.4 file system and 2.4 clients. Previously we were on 1.8 clients. During startup we ran into this issue. We have another test on February 4th and will perform the upgrade on the 28th. I plan to run with this patch on the server side to see if it resolves the issues we are seeing.

          People

            isaac Isaac Huang (Inactive)
            hilljjornl Jason Hill (Inactive)
            Votes: 0
            Watchers: 10
