Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 1.8.9, Lustre 2.4.1
    • Environment: RHEL5 server, SLES11 SP1 router/client, as well as RHEL6 server with SLES11 SP1 or SP2 clients
    • 3
    • 10722

    Description

      We'll need guidance on what data to gather server-side, but when the Titan compute platform is shut down, the queued-message count in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients using SLES11 SP2.
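For context, the counter in question is the first field of /proc/sys/lnet/stats (msgs_alloc, the number of currently allocated, i.e. queued or in-flight, messages). A minimal sketch of pulling it out, assuming the usual one-line field layout of that file:

```shell
#!/bin/sh
# Minimal sketch: read the queued-message count from the LNET stats file.
# Assumes the usual one-line layout, where the first field is msgs_alloc
# (messages currently allocated, i.e. queued or in flight).
STATS_FILE="${1:-/proc/sys/lnet/stats}"

queued_msgs() {
    # Print the first whitespace-separated field of the given stats file.
    awk '{ print $1 }' "$1"
}

# Only attempt the read when the stats file is actually present.
if [ -r "$STATS_FILE" ]; then
    queued_msgs "$STATS_FILE"
fi
```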

      We have a home-grown monitor for the backlog of messages for a particular server (and LNET RTR, though at the time of reporting the LNET RTRs are all down from a hardware perspective). We can attach that script if it would be useful.
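Such a monitor amounts to sampling that counter and flagging a backlog that never drains. A rough sketch of the idea follows; this is not the attached lnet_stats.sh, and the interval/threshold/sample-count values are purely illustrative:

```shell
#!/bin/sh
# Rough sketch of a backlog monitor (not the attached lnet_stats.sh):
# warn when the queued-message count stays at or above a threshold for
# several consecutive samples. All tunables here are illustrative.
STATS_FILE="${STATS_FILE:-/proc/sys/lnet/stats}"
INTERVAL="${INTERVAL:-60}"      # seconds between samples
THRESHOLD="${THRESHOLD:-1000}"  # queued messages considered a backlog
NSAMPLES="${NSAMPLES:-5}"       # consecutive high samples before warning

# Succeed only if every sample in "$1" (space-separated) is >= THRESHOLD,
# i.e. the queue never dipped below the threshold across the window.
stuck() {
    for v in $1; do
        [ "$v" -ge "$THRESHOLD" ] || return 1
    done
    return 0
}

# Sample forever; report whenever a full window of samples stayed high.
monitor() {
    window=""
    n=0
    while :; do
        q=$(awk '{ print $1 }' "$STATS_FILE")
        window="$window $q"
        n=$((n + 1))
        if [ "$n" -ge "$NSAMPLES" ]; then
            if stuck "$window"; then
                echo "$(date): LNET message backlog:$window" >&2
            fi
            window=""
            n=0
        fi
        sleep "$INTERVAL"
    done
}
```

Running `monitor` on a server (or router) would log whenever the queue sat above the threshold for the whole sampling window, which is the symptom described above.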

      Please provide the data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday.

      While there are a large number of LNET messages queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or reached through other routers), which is why I marked this as Minor.

      Attachments

        1. lnet_stats.sh
          2 kB
        2. LU-4006.tgz
          0.2 kB

        Activity

          [LU-4006] LNET Messages staying in Queue

          I plan to test this patch at scale with 2.4 clients.

          simmonsja James A Simmons added a comment

          Jason or James,

          Is this still an issue or should we close this ticket?

          Thanks,
          James

          jamesanunez James Nunez (Inactive) added a comment

          James,

          The patch can be applied to the servers only and you should see a benefit from it, but the patch is for both clients and servers. This configuration is fine for testing, but we recommend patching both clients and servers in production.

          Thanks,
          James

          jamesanunez James Nunez (Inactive) added a comment

          Is it a server-side-only patch? If so, I can arrange to have it tested.

          simmonsja James A Simmons added a comment

          James,

          Has anyone at ORNL tested the patch and, if so, did it fix or reduce the number of messages in the queue? Feedback from you will help us determine whether the patch should be cherry-picked/back-ported to b2_4 and/or b2_5.

          Thanks,
          James

          jamesanunez James Nunez (Inactive) added a comment

          Patch http://review.whamcloud.com/#/c/8041/ landed to master.

          I'll look into plans for landing in b2_4 and b2_5.

          jamesanunez James Nunez (Inactive) added a comment

          Will this patch be cherry-picked to b2_4 and b2_5?

          simmonsja James A Simmons added a comment

          Hi, we've tested the patch on our internal test systems, and there will be some additional testing at Hyperion. Meanwhile, if you want to test it at ORNL, I'd suggest:

          • Put it on a few clients first. Although it's supposed to solve a server-side problem, it changes the code path for every outgoing message.
          • Then put it on a couple of servers, avoiding the MDS/MGS.
          isaac Isaac Huang (Inactive) added a comment

          Patch updated: we can't use LNET_MD_FLAG_ZOMBIE, which can be set as a result of MD auto-unlink. That case is not an abort, and active messages should not be canceled as a result (e.g. a REPLY message from an MD exhausted by the corresponding GET). The only way is to add a new flag, LNET_MD_FLAG_ABORTED, set by LNetM[DE]Unlink.

          isaac Isaac Huang (Inactive) added a comment

          Liang, yes, it's a change in the case you described. But I'd regard unlinking an MD while there's an active message you don't want to give up as an abuse of an obscure part of the API semantics rather than a valid use case we'd support. I've pushed an update that adds a comment above LNetMDUnlink: "As a result, active messages associated with the MD may get aborted.".

          isaac Isaac Huang (Inactive) added a comment

          There is a patch ready for testing.

          simmonsja James A Simmons added a comment

          People

            Assignee: isaac Isaac Huang (Inactive)
            Reporter: hilljjornl Jason Hill (Inactive)
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved: