Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 1.8.9, Lustre 2.4.1
    • Environment: RHEL5 server, SLES11 SP1 router/client as well as RHEL6 server w/ SLES 11 SP1 or SP2 client
    • Severity: 3
    • 10722

    Description

      We'll need some more information on what data to gather server side, but when the Titan compute platform is shut down, the value of the queued messages in /proc/sys/lnet/stats on the server remains constant until the target platform returns to service. We have seen this during the weekly maintenance on Titan as well as during a large-scale test shot with 2.4.0 servers and 2.4.0 clients using SLES11 SP2.

      We have a home-grown monitor for the backlog of messages for a particular server (and LNET RTR, though at the time of reporting the LNET RTRs are all down from a hardware perspective). We can attach that script if it would be useful.

      Please provide the data gathering techniques we should employ to make problem diagnosis more informative. We will likely have a shot at data gathering every Tuesday.

      While there are a large number of LNET messages queued (to what I assume are the LNET peers for the routers), LNET messages continue to be processed for other peers (either directly connected or through other routers), which is why I marked this as Minor.
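
      For reference, a minimal sketch of how that queued-message counter could be sampled outside the attached script. It assumes the first two fields of /proc/sys/lnet/stats are msgs_alloc and msgs_max (the lnet_counters_t layout); that layout should be verified against the running release.

        /* Minimal sketch: sample /proc/sys/lnet/stats and report the
         * message backlog. ASSUMPTION: the first two fields are
         * msgs_alloc and msgs_max, as in lnet_counters_t. */
        #include <stdio.h>

        int main(void)
        {
                FILE *fp = fopen("/proc/sys/lnet/stats", "r");
                long msgs_alloc, msgs_max;

                if (fp == NULL) {
                        perror("open /proc/sys/lnet/stats");
                        return 1;
                }
                if (fscanf(fp, "%ld %ld", &msgs_alloc, &msgs_max) != 2) {
                        fprintf(stderr, "unexpected stats format\n");
                        fclose(fp);
                        return 1;
                }
                fclose(fp);
                printf("active/queued messages: %ld (peak %ld)\n",
                       msgs_alloc, msgs_max);
                return 0;
        }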

      Attachments

        1. lnet_stats.sh
          2 kB
        2. LU-4006.tgz
          0.2 kB

        Activity

          [LU-4006] LNET Messages staying in Queue

          Patch http://review.whamcloud.com/#/c/8041/ landed to master.

          I'll look into plans for landing in b2_4 and b2_5.

          jamesanunez James Nunez (Inactive) added a comment

          Will this patch be cherry picked to b2_4 and b2_5?

          simmonsja James A Simmons added a comment

          Hi, we've tested the patch on our internal test systems, and there will be some additional testing at Hyperion. Meanwhile, if you want to test it at ORNL, I'd suggest:

          • Put it on a few clients first. Although it's supposed to solve a server-side problem, it changes the code path for every outgoing message.
          • Then put it on a couple of servers, avoiding the MDS/MGS though.
          isaac Isaac Huang (Inactive) added a comment

          Patch updated: we can't use LNET_MD_FLAG_ZOMBIE, which can be set as a result of MD auto unlink, where it's not an abort and active messages should not be canceled as a result (e.g. a REPLY message for an MD exhausted by the corresponding GET). The only way is to add a new flag, LNET_MD_FLAG_ABORTED, set by LNetM[DE]Unlink.

          isaac Isaac Huang (Inactive) added a comment
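
          A rough sketch of the distinction, with stand-in definitions rather than the actual patch; it assumes md_flags on lnet_libmd_t and LNET_MD_FLAG_ZOMBIE being set by auto unlink, and the LNET_MD_FLAG_ABORTED bit value and helper names are only illustrative.

            /* Stand-ins for the lnet/lib-types.h definitions; values and
             * helper names are illustrative, not the landed patch. */
            #define LNET_MD_FLAG_ZOMBIE   (1 << 0)
            #define LNET_MD_FLAG_ABORTED  (1 << 2)   /* proposed new bit */

            struct lnet_libmd {
                    unsigned int md_flags;
                    /* other fields elided */
            };

            /* MD exhausted (auto unlink), e.g. threshold used up by a GET:
             * mark it zombie but do NOT abort, since an active message such
             * as the corresponding REPLY must still be allowed to complete. */
            static void md_auto_unlink(struct lnet_libmd *md)
            {
                    md->md_flags |= LNET_MD_FLAG_ZOMBIE;
            }

            /* Explicit LNetMDUnlink/LNetMEUnlink: the caller is giving up,
             * so queued messages referencing this MD may be canceled. */
            static void md_explicit_unlink(struct lnet_libmd *md)
            {
                    md->md_flags |= LNET_MD_FLAG_ZOMBIE | LNET_MD_FLAG_ABORTED;
            }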

          Liang, yes, it's a change in the case you described. But I'd regard unlinking an MD while there are still active messages you don't want to give up as an abuse of an obscure part of the API semantics, rather than as a valid use case we'd support. I've pushed an update that adds a comment above LNetMDUnlink: "As a result, active messages associated with the MD may get aborted.".

          isaac Isaac Huang (Inactive) added a comment
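
          Roughly how that added sentence would sit over the prototype; only the quoted sentence comes from the update above, while the placement and surrounding comment text are assumed.

            /* (surrounding doc comment text assumed)
             * ...
             * As a result, active messages associated with the MD may get
             * aborted. */
            int LNetMDUnlink(lnet_handle_md_t mdh);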

          There is a patch ready for testing

          simmonsja James A Simmons added a comment

          Can I get an idea of the time to solution? Are we looking at something that will be tested and ready to install in the next month or something longer?

          Thx.

          -Jason

          hilljjornl Jason Hill (Inactive) added a comment

          Yes, I think this way is better. The other reason I proposed using a new flag on the MD is that, even if we only track the last message associated with the MD, we would still need some complex locking because different messages are protected by different locks; this way completely avoids those complex lock operations.
          Still, I think that having unlink imply abort is a slight semantic change. For example, if a user creates an MD with threshold == LNET_MD_THRESH_INF, pings an arbitrary number of peers, unlinks immediately, and relies on the callback to count results, it will just work with current LNet, but some pings might fail if unlink implied aborting queued messages. But I think it is probably fine if nobody relies on this semantic (at least Lustre doesn't), so I don't insist on this.

          liang Liang Zhen (Inactive) added a comment

          Patch posted: http://review.whamcloud.com/#/c/8041/

          Though the idea was straightforward, the original approach I suggested turned out to be too messy to implement: the MD would need to keep a pointer to a message that is not reference counted, it is very tricky to remove a message from its peer/NI queue, and so on.

          Instead I chose to abort messages in lnet_post_send_locked(). The drawback is that it would take an additional LND timeout for messages to be aborted. But the advantages are much, much simpler code, and that all queued messages on an unlinked MD will be aborted rather than just one. I think this is a good trade-off between instantaneous unlink and code complexity.

          isaac Isaac Huang (Inactive) added a comment
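
          A simplified sketch of where such a check could sit; the stand-in types, the msg_md back-pointer, the flag value, and the helper name are assumptions for illustration, not the landed change.

            #include <errno.h>
            #include <stddef.h>

            /* Stand-ins for the lnet/lib-types.h definitions. */
            #define LNET_MD_FLAG_ABORTED  (1 << 2)   /* assumed bit value */

            struct lnet_libmd { unsigned int md_flags; };
            struct lnet_msg   { struct lnet_libmd *msg_md; };

            /* Early-abort check that lnet_post_send_locked() could make
             * before the message sits waiting for peer/NI credits: if the
             * MD behind the message has been explicitly unlinked, fail the
             * send with a cancellation status instead of leaving it queued
             * until the remote side comes back. */
            static int msg_abort_check(struct lnet_msg *msg)
            {
                    if (msg->msg_md != NULL &&
                        (msg->msg_md->md_flags & LNET_MD_FLAG_ABORTED))
                            return -ECANCELED; /* seen as SENT, unlinked=1 */
                    return 0;
            }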

          If I haven't missed something, I don't see any semantic change. Currently callers would eventually get a SENT event with the unlinked flag set and likely an error status code. With the proposed change, it's still the same thing, i.e. a SENT event with unlinked=1 and status=-ECANCELED; the only difference is that now the event could come much sooner. Callers are supposed to handle that event anyway, sooner or later. I don't think this is a semantic change. The API semantics never say (and can't say) how soon such a piggybacked unlink event will happen. If for some reason Lustre can't handle an instantaneous piggybacked unlink, then it's a bug in Lustre that should be fixed.

          isaac Isaac Huang (Inactive) added a comment
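
          From the caller's point of view the handling is identical either way; a small sketch of an event-queue callback, using stand-in definitions for the LNet event type and its fields (type, status, unlinked), which are assumptions here rather than code from Lustre's handlers.

            #include <stdio.h>

            enum { LNET_EVENT_SEND = 1 };        /* stand-in value */

            struct lnet_event {                  /* subset of lnet_event_t */
                    int type;
                    int status;    /* 0 on success, negative errno on failure */
                    int unlinked;  /* 1 when this event also unlinked the MD  */
            };

            /* A canceled send is just the normal SENT completion with a
             * non-zero status and unlinked set; whether it arrives after an
             * LND timeout or right at unlink time, the handler is the same. */
            static void eq_callback(struct lnet_event *ev)
            {
                    if (ev->type == LNET_EVENT_SEND && ev->status != 0)
                            fprintf(stderr, "send aborted: status %d unlinked %d\n",
                                    ev->status, ev->unlinked);
            }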

          One concern of mine is that if we don't add any new flag (or an explicit API for abort), then we are changing the semantics of the current API: right now LNetMDUnlink will not implicitly abort any in-flight PUT/GET (although Lustre doesn't rely on this). So would it be reasonable to add a new flag to allow Unlink to abort in-flight messages?

          liang Liang Zhen (Inactive) added a comment

          People

            Assignee: isaac Isaac Huang (Inactive)
            Reporter: hilljjornl Jason Hill (Inactive)
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved: