LU-18698

Client/Server grant disagreement leads to hang

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      The client and server can disagree about grant, leading to messages like:

      LustreError: 3095:0:(tgt_grant.c:764:tgt_grant_check()) lustre-OST0001: cli edf3483d-7u19-944c-348c-7aa6fgr68648 claims 917387 GRANT, real grant 0

      Often, this self-mitigates. However, in rare cases, the client hangs indefinitely, receiving ENOSPC, and this message repeats continuously on the server. Other clients appear to make progress, and the server has sufficient free space.

      The root cause of this behavior is not clear, but we can at least allow the client to continue writing in exceptional cases.
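      One way to tell a self-mitigating mismatch from a stuck client is to count how often the server logs the message per client. The following is a minimal sketch, assuming the console line always has the shape of the single sample quoted above. The regular expression and the function name `count_grant_mismatches` are illustrative and not part of Lustre.

```python
import re
from collections import Counter

# Pattern modeled on the one sample line in this ticket (assumption):
# other Lustre versions may format the message differently.
GRANT_MISMATCH = re.compile(
    r"tgt_grant_check\(\)\) (?P<target>\S+): cli (?P<client>\S+) "
    r"claims (?P<claimed>\d+) GRANT, real grant (?P<real>\d+)"
)


def count_grant_mismatches(lines):
    """Return a Counter mapping (target, client) to the number of mismatch messages."""
    counts = Counter()
    for line in lines:
        match = GRANT_MISMATCH.search(line)
        if match:
            counts[(match["target"], match["client"])] += 1
    return counts


sample = (
    "LustreError: 3095:0:(tgt_grant.c:764:tgt_grant_check()) lustre-OST0001: "
    "cli edf3483d-7u19-944c-348c-7aa6fgr68648 claims 917387 GRANT, real grant 0"
)
counts = count_grant_mismatches([sample, sample, "unrelated line"])
```

      A client whose count keeps growing while other clients stay quiet would match the hang described here.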

        Activity

          paf0186 Patrick Farrell added a comment -

          I think that's what you'll need to do: the client should do that, and you'll have to take a look. We're in "bug" rather than "understood behavior" territory.
          timday Tim Day added a comment -

          Yeah, that's what it seems like. I'm not sure why the client doesn't fall back to sync write. Is there some way for the server to force this behavior on the client side? Once I have some cycles, I want to add more debug to this and try to get a reliable log of grant state changes.

          paf0186 Patrick Farrell added a comment -

          I would prefer, if possible, to use the sync write behavior, but it seems like maybe the server is refusing to give the client more grant...?
          timday Tim Day added a comment -

          Definitely open to alternative ideas.

          timday Tim Day added a comment -

          Intuitively, I agree. The client ought to continue writing normally. But I have seen the opposite. It's not clear why this happens, and it's fairly rare. My current thought is to just give the client grant if it looks stuck and hope it unsticks. In most cases where there's a mismatch, I see the client and server self-mitigate, and I'm hoping to force that behavior.
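          The "give the client grant if it looks stuck" idea can be illustrated with a toy model. This is only a sketch of the general approach, not the code in the proposed patch: the threshold, the `ClientGrantState` class, and the `check_grant` function are all hypothetical, and real grant accounting in `tgt_grant.c` is considerably more involved.

```python
from dataclasses import dataclass

# Hypothetical threshold: the ticket does not say how many repeated
# errors should count as "stuck".
STUCK_THRESHOLD = 3


@dataclass
class ClientGrantState:
    real_grant: int = 0              # grant the server believes the client holds
    consecutive_mismatches: int = 0  # mismatches seen in a row for this client


def check_grant(state, claimed, inflate_by):
    """Toy server-side grant check with an escape hatch for a stuck client.

    Returns the grant the server records for the client after this request.
    """
    if claimed <= state.real_grant:
        # Client and server agree (or the client claims less): reset the streak.
        state.consecutive_mismatches = 0
        return state.real_grant
    state.consecutive_mismatches += 1
    if state.consecutive_mismatches >= STUCK_THRESHOLD:
        # The mismatch is not self-mitigating, so inflate the recorded grant.
        state.real_grant += inflate_by
        state.consecutive_mismatches = 0
    return state.real_grant


state = ClientGrantState()
results = [check_grant(state, 917387, 1 << 20) for _ in range(3)]
```

          In this model the first two mismatches are only counted, and the third triggers the inflation, after which the client's claim no longer exceeds the server's view.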

          gerrit Gerrit Updater added a comment -

          "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57981
          Subject: LU-18698 target: inflate grant on consistent grant error
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: fd8c2c87dbe7d8f067b2173aec99fc9d4918759d

          paf0186 Patrick Farrell added a comment -

          Certainly the client could continue writing synchronously in this case; the point of grant is to allow async writes, but you can do sync writes without grant. That's actually what it should do, not hang. (I.e., that's what I would've expected...)

          The grant code is a bit complex and probably still has an arithmetic/conversion error in it somewhere.
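          The expected fallback described in this comment can be sketched as a simple decision: write asynchronously while grant covers the write, otherwise write synchronously instead of waiting. This is a simplified illustration only. The function `choose_write_mode` is hypothetical, and the real client tracks grant with more detail than a single byte counter.

```python
def choose_write_mode(available_grant, write_bytes):
    """Pick a write mode and return (mode, remaining_grant).

    Async (cached) writes consume grant; sync writes need none, so a
    client without enough grant can still make progress.
    """
    if available_grant >= write_bytes:
        return "async", available_grant - write_bytes
    return "sync", available_grant


with_grant = choose_write_mode(available_grant=1 << 20, write_bytes=4096)
without_grant = choose_write_mode(available_grant=0, write_bytes=4096)
```

          With zero grant the model selects the sync path rather than blocking, which is the behavior the comment expects in place of the hang.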

          People

            timday Tim Day
            timday Tim Day
            Votes: 0
            Watchers: 4
