More timely notification of clients in case of eviction

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • 6985

    Description

      There have been periodic complaints that a Lustre client does not know it has been evicted by a server node until it sends that server an RPC.
      In the past a periodic ping usually caught this, but as pinging is scaled back to happen less often, the more common outcome is that an application issues an RPC and suddenly finds itself evicted because of an eviction that happened quite a while ago.

      As such, we probably need a better way of notifying clients of their eviction, so that they can reconnect more eagerly and with less damage to whatever is running in userspace.
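
To make the delay concrete, here is a rough model (not Lustre code) of how long an eviction can go unnoticed when the client only learns of it on its next RPC. The function name, its parameters, and the assumption that pings fire at exact multiples of the ping interval are all illustrative.

```python
def eviction_detection_delay(evicted_at, ping_interval=None, next_app_rpc_at=None):
    """Seconds between an eviction and the first client RPC that reveals it.

    Hypothetical model: pings fire at t = 0, T, 2T, ... where T is
    ping_interval; ping_interval=None means pinging is disabled.
    Returns None if the client never sends another RPC.
    """
    candidates = []
    if ping_interval is not None:
        # First ping at or after the eviction time (ceiling division).
        periods = -(-evicted_at // ping_interval)
        candidates.append(periods * ping_interval)
    if next_app_rpc_at is not None and next_app_rpc_at >= evicted_at:
        # The application's own RPC also exposes the eviction.
        candidates.append(next_app_rpc_at)
    if not candidates:
        return None
    return min(candidates) - evicted_at
```

In this model, an idle client with pinging disabled only finds out when the application next touches the filesystem, which is the situation the description complains about.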

          Activity

            [LU-2898] More timely notification of clients in case of eviction
            mrasobarnett Matt Rásó-Barnett made changes -
            Link New: This issue is related to EXR-427 [ EXR-427 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LR-9 [ LR-9 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LR-3 [ LR-3 ]
            nozaki Hiroya Nozaki (Inactive) added a comment - - edited

            Recovering servers try to retrieve clients' information from the last_rcvd files and see whether they had been connected. Next, the servers send callback pings to those clients in order to make them reconnect.
            This is the basic recovery behavior in FEFS, though lots of trivial functions are included in it.

            Oh, and I want you to know one thing: when we handle a large system like K, ping often eats up LNet resources such as credits ... I'm not so good there, though ... so I think you'll need a measure against that problem. That is why we restrict the number of callback ping retries to 5.
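
The bounded callback ping described above might look roughly like the following sketch. This is not FEFS or Lustre code: `send_callback_pings`, `send_ping`, and the client list are hypothetical, and treating "5 retries" as at most five attempts in total is an assumption.

```python
def send_callback_pings(clients, send_ping, max_retries=5):
    """Ping each previously connected client, giving up after max_retries tries.

    clients   -- client identifiers, e.g. as recovered from last_rcvd (assumed)
    send_ping -- hypothetical callable returning True once the client responds
    Returns (reached, unreached) lists.
    """
    reached, unreached = [], []
    for client in clients:
        for _attempt in range(max_retries):
            if send_ping(client):
                reached.append(client)
                break
        else:
            # Retry budget exhausted; stop spending network credits on this client.
            unreached.append(client)
    return reached, unreached
```

Capping the attempts bounds the total number of pings at `max_retries * len(clients)`, which is the point of the limit on a system the size of K.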

            rread Robert Read added a comment -

            I see. Well, that's not ideal, but at least we know what the reason is.

            BTW, if the clients are not pinging, how did they all know to reconnect to the recovering server?

            nozaki Hiroya Nozaki (Inactive) added a comment - - edited

            Hi, Robert.

            I've often seen lots of clients evicted while a server is recovering. It appears that a server cannot keep up with the great number of incoming reconnect requests, about 90k * (target-disks).
            As a result, clients whose reconnect requests haven't been handled by the server are evicted.

            rread Robert Read added a comment -

            I agree that reconnects are probably valid, but I'm not sure all evicts are necessarily valid or unavoidable. If they are occurring frequently then we should at least try to find out what is causing them.

            green Oleg Drokin added a comment -

            There might be many reasons for reconnect, I guess. All of them are valid one way or another. Like one-off AST loss or such.

            rread Robert Read added a comment - - edited

            My first thought was that this does seem like a special case of imperative recovery, but limited to a specific client, and we could call it "imperative reconnect." But perhaps a simpler ldlm callback is sufficient since if there is a network split the client wouldn't be able to reconnect anyway.

            Do we understand why these seemingly idle clients are being evicted in the first place? Is there an issue there?

            green Oleg Drokin made changes -
            Link New: This issue is related to LU-2467 [ LU-2467 ]

            People

              wc-triage WC Triage
              green Oleg Drokin