More timely notification of clients in case of eviction

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • 6985

    Description

      There have been periodic complaints that a Lustre client does not really know when it has been evicted from a server node, because it can only find out when it sends an RPC.
      Frequently this would be handled by the periodic ping, but as pinging is being scaled back to happen in fewer cases, it more and more often turns into an application issuing an RPC and suddenly finding itself evicted because of an eviction that happened quite a while ago.

      As such, we probably need a better way of notifying clients of their eviction, so that they can reconnect more eagerly and with less damage to whatever might be running in userspace.
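
      The delay described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyClient`, its fields, and the timing values are hypothetical and are not Lustre data structures or defaults. It shows how long an eviction goes unnoticed with and without periodic pings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyClient:
    """Hypothetical client model for illustration; not a Lustre structure."""

    # Seconds between pings; None means pinging is disabled.
    ping_interval: Optional[float]

    def eviction_noticed_at(self, evicted_at: float, next_app_rpc_at: float) -> float:
        """Return the time at which the client first learns it was evicted."""
        # An application RPC can only reveal the eviction once it has happened.
        rpc_time = max(next_app_rpc_at, evicted_at)
        if self.ping_interval is None:
            # Without pings, only the application's next RPC reveals it.
            return rpc_time
        # Pings are assumed to go out at multiples of ping_interval;
        # take the first one strictly after the eviction.
        pings_so_far = int(evicted_at // self.ping_interval)
        next_ping = (pings_so_far + 1) * self.ping_interval
        return min(next_ping, rpc_time)


if __name__ == "__main__":
    # Arbitrary example values: evicted at t=100s, next application RPC at t=3600s.
    for interval in (25.0, None):
        client = ToyClient(ping_interval=interval)
        print(interval, client.eviction_noticed_at(100.0, 3600.0))
```

      In this model a pinging client notices the eviction within one ping interval, while a non-pinging client only notices when the application next touches the filesystem.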

          Activity

            [LU-2898] More timely notification of clients in case of eviction
            nozaki Hiroya Nozaki (Inactive) added a comment - - edited

            Recovering servers try to retrieve the clients' information from the last_rcvd files and check whether those clients had been connected. Next, the servers send callback pings to the clients in order to make them reconnect.
            This is the basic recovery procedure in FEFS, though it also includes lots of minor functions.

            Oh, and I want you to know one thing: when we handle a large system like K, ping often eats up LNet resources such as credits (I'm not so good in that area, though), so I think you'll need a measure against that problem. That is why we restrict the number of callback ping retries to 5.
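
            A minimal sketch of a bounded callback ping loop like the one described above. This is not FEFS code: `notify_clients` and `send_ping` are hypothetical names, and treating the limit of 5 as the total number of attempts per client is an assumption.

```python
from typing import Callable, Iterable, List

# Limit quoted in the comment above; counting it as total attempts is an assumption.
MAX_CALLBACK_PING_ATTEMPTS = 5


def notify_clients(
    clients: Iterable[str],
    send_ping: Callable[[str], bool],
    max_attempts: int = MAX_CALLBACK_PING_ATTEMPTS,
) -> List[str]:
    """Send a callback ping to each client, giving up after max_attempts.

    send_ping is a caller-supplied (hypothetical) function that returns True
    when the client acknowledged the ping. Returns the clients that never did.
    """
    unreachable = []
    for client in clients:
        for _attempt in range(max_attempts):
            if send_ping(client):
                break  # client answered, stop spending network resources on it
        else:
            # Loop finished without a break: every attempt failed.
            unreachable.append(client)
    return unreachable
```

            Capping the attempts bounds the worst-case number of pings at `max_attempts` per client, which is the property that matters when network credits are scarce.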

            rread Robert Read added a comment -

            I see. Well, that's not ideal, but at least we know what the reason is.

            BTW, if the clients are not pinging, how did they all know to reconnect to the recovering server?

            nozaki Hiroya Nozaki (Inactive) added a comment - - edited

            Hi, Robert.

            I've often seen lots of clients being evicted during server recovery. It appears that a server cannot keep up with the huge number of incoming reconnect requests, about 90k * (target-disks).
            As a result, clients whose reconnect requests haven't been handled by the server are evicted.
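
            A back-of-the-envelope sketch of the effect described above, assuming a simple model in which the server handles reconnect requests at a fixed rate for a fixed recovery window. The function name, the rate, and the window are hypothetical; only the "clients * targets" request count comes from the comment.

```python
def unhandled_reconnects(
    clients: int,
    targets: int,
    handled_per_second: float,
    window_seconds: float,
) -> int:
    """Estimate how many reconnect requests are left unhandled.

    Assumes one reconnect request per client per target and a server that
    processes requests at a constant rate until the window closes. Both
    assumptions are simplifications for illustration.
    """
    total_requests = clients * targets
    capacity = int(handled_per_second * window_seconds)
    return max(0, total_requests - capacity)


if __name__ == "__main__":
    # Hypothetical numbers: 90k clients, 4 targets, 1000 requests/s, 300 s window.
    print(unhandled_reconnects(90_000, 4, 1000.0, 300.0))
```

            Under this model, any request beyond the server's capacity within the window corresponds to a client that would be evicted.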

            rread Robert Read added a comment -

            I agree that reconnects are probably valid, but I'm not sure all evicts are necessarily valid or unavoidable. If they are occurring frequently then we should at least try to find out what is causing them.

            green Oleg Drokin added a comment -

            There might be many reasons for a reconnect, I guess. All of them are valid one way or another, such as a one-off AST loss.

            rread Robert Read added a comment - - edited

            My first thought was that this does seem like a special case of imperative recovery, but limited to a specific client, and we could call it "imperative reconnect." But perhaps a simpler ldlm callback is sufficient since if there is a network split the client wouldn't be able to reconnect anyway.

            Do we understand why these seemingly idle clients are being evicted in the first place? Is there an issue there?

            green Oleg Drokin added a comment -

            Fujitsu, as the first site to disable pinging in most cases, hit this especially often, so they created a patch to avert the issue: servers notify the MGS of an eviction, and the MGS in turn sends messages to the clients telling them to contact the servers and reconnect as needed (sort of like reverse imperative recovery, I guess).
            The contributed patch against FEFS is here: http://review.whamcloud.com/#change,5457 (it is not directly applicable to the master tree, but gives an idea of how they did it).

            I imagine it would have been much easier to just send a specially crafted ldlm callback to the client to let it know we are evicting it (and this would require far fewer infrastructure changes), but that would not handle the case of severed communication between this particular server and client, whereas the MGS connectivity of both would remain unaffected.
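
            The trade-off between the two notification paths can be sketched with a toy connectivity model. This is an illustration only, not Lustre or FEFS code: the node names, `link`, and `eviction_notice_delivered` are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

# An undirected link between two named nodes.
Link = FrozenSet[str]


def link(a: str, b: str) -> Link:
    """Build an order-independent key for the link between a and b."""
    return frozenset((a, b))


def eviction_notice_delivered(
    client: str, server: str, mgs: str, severed: Set[Link]
) -> Tuple[bool, bool]:
    """Return (direct_ok, relay_ok) for an eviction notice in a toy network.

    direct_ok: the server can reach the client itself (ldlm-callback style).
    relay_ok:  the server can reach the MGS and the MGS can reach the client.
    """
    direct_ok = link(server, client) not in severed
    relay_ok = (
        link(server, mgs) not in severed
        and link(mgs, client) not in severed
    )
    return direct_ok, relay_ok


if __name__ == "__main__":
    # Only the server<->client path is broken; both still reach the MGS.
    broken = {link("oss1", "client7")}
    print(eviction_notice_delivered("client7", "oss1", "mgs", broken))
```

            With only the server-to-client link severed, the direct path fails while the relay through the MGS still succeeds, which is the case the simpler callback approach would miss.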

            Additionally, since the case outlined as most severe by Fujitsu is that of a newly started application, there is a possible workaround: run "df" before a new job starts, from whatever job scheduling framework is in use. Still, there is a feeling that this case should be handled more transparently inside Lustre.
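
            A minimal sketch of the "df" workaround as a job prolog step, assuming the scheduler can run a Python hook before each job. `poke_filesystem` and the mount point are hypothetical. `os.statvfs` issues the same kind of filesystem statistics call that "df" relies on; whether that call actually reaches every server depends on client-side caching, which is not covered here.

```python
import os
import sys


def poke_filesystem(mount_point: str) -> bool:
    """Query filesystem statistics for mount_point, as "df" would.

    Returns True if the call succeeded. The intent is to trigger client
    traffic before the job starts, so that a stale eviction is discovered
    by this hook instead of by the application.
    """
    try:
        os.statvfs(mount_point)
    except OSError as exc:
        print(f"statvfs({mount_point!r}) failed: {exc}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    # Hypothetical mount point; pass the real one as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/lustre"
    sys.exit(0 if poke_filesystem(target) else 1)
```

            The non-zero exit status lets the scheduling framework decide whether to delay or retry the job when the filesystem is not usable.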


            People

              wc-triage WC Triage
              green Oleg Drokin