Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.6.0, Lustre 2.5.3
    • Lustre 1.8.9, Lustre 2.4.1
    • RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client
    • 3
    • 10722

    Description

      We'll need some more information on data to gather server side, but when the Titan compute platform is shut down the value of the queued messages in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large scale test shot with 2.4.0 Servers and 2.4.0 clients using SLES11 SP2.

      We have a home-grown monitor for the backlog of messages for a particular server (and LNET RTR, but at the time of reporting the LNET RTRs are all down from a hardware perspective). We can attach that script if it may be useful.
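For readers without access to the attached script, the kind of backlog check described above could be sketched roughly as follows. This is not the reporter's lnet_stats.sh. It assumes /proc/sys/lnet/stats contains one line of eleven whitespace-separated counters in the order listed in FIELDS, and that the first counter (called msgs_alloc here) is the "queued messages" value referred to in this ticket. Both points are assumptions to verify against the Lustre version in use. The function names parse_lnet_stats, is_stuck and poll are hypothetical.

```python
"""Illustrative sketch of an LNet backlog check; not the attached lnet_stats.sh."""
import time

# ASSUMPTION: field order of the single line in /proc/sys/lnet/stats.
# Verify against the LNet source for your Lustre version before relying on it.
FIELDS = (
    "msgs_alloc", "msgs_max", "errors",
    "send_count", "recv_count", "route_count", "drop_count",
    "send_length", "recv_length", "route_length", "drop_length",
)


def parse_lnet_stats(text):
    """Map the whitespace-separated counters in `text` onto the FIELDS names."""
    values = [int(tok) for tok in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def is_stuck(samples, threshold=0):
    """True if msgs_alloc is above `threshold` and identical in every sample."""
    allocs = [s["msgs_alloc"] for s in samples]
    return len(allocs) > 1 and allocs[0] > threshold and len(set(allocs)) == 1


def poll(path="/proc/sys/lnet/stats", interval=10.0, count=6):
    """Read the stats file `count` times, sleeping `interval` seconds between reads."""
    samples = []
    for i in range(count):
        with open(path) as fh:
            samples.append(parse_lnet_stats(fh.read()))
        if i + 1 < count:
            time.sleep(interval)
    return samples
```

On a server, `is_stuck(poll())` would flag the symptom reported here: a non-zero message count that stays constant across the sampling window. A constant value on an otherwise idle server is not necessarily a fault, so the threshold and window length are judgment calls.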

      Please provide the data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday.

      While a large number of LNET messages are queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or through other routers), which is why I marked this as Minor.

      Attachments

        1. lnet_stats.sh
          2 kB
        2. LU-4006.tgz
          0.2 kB

        Activity

          [LU-4006] LNET Messages staying in Queue

          On January 21 we performed a test shot with our Lustre 2.4 file system with 2.4 clients. Before that we were on 1.8 clients. During startup we ran into this issue. We have another test on February 4th and will perform the upgrade on the 28th. I plan to run with this patch server side to see if it resolves the issues we are seeing.

          simmonsja James A Simmons added a comment

          James,
          Thank you for the update. I'm going to close this ticket, but we can reopen it if this patch does not solve your problem.

          James

          jamesanunez James Nunez (Inactive) added a comment

          We had a discussion at work about testing this patch. It was decided not to test with this patch in the near future since it is a small case. You can close this ticket. If we run into it in the future we can reopen the ticket.

          simmonsja James A Simmons added a comment

          I plan to test this patch at scale with 2.4 clients.

          simmonsja James A Simmons added a comment

          Jason or James,

          Is this still an issue or should we close this ticket?

          Thanks,
          James

          jamesanunez James Nunez (Inactive) added a comment

          James,

          The patch can be applied to the servers only and you should see a benefit from it, but the patch is for both clients and servers. This configuration is fine for testing, but we recommend patching both clients and servers in production.

          Thanks,
          James

          jamesanunez James Nunez (Inactive) added a comment

          Is it a server-side-only patch? If so I can arrange to have it tested.

          simmonsja James A Simmons added a comment

          James,

          Has anyone at ORNL tested the patch and, if so, did it fix or reduce the number of messages in the queue? Feedback from you will help us determine whether the patch is cherry-picked/backported to b2_4 and/or b2_5.

          Thanks,
          James

          jamesanunez James Nunez (Inactive) added a comment

          Patch http://review.whamcloud.com/#/c/8041/ landed to master.

          I'll look into plans for landing in b2_4 and b2_5.

          jamesanunez James Nunez (Inactive) added a comment

          Will this patch be cherry-picked to b2_4 and b2_5?

          simmonsja James A Simmons added a comment

          People

            isaac Isaac Huang (Inactive)
            hilljjornl Jason Hill (Inactive)
            Votes:
            0
            Watchers:
            10

            Dates

              Created:
              Updated:
              Resolved: