RPC from evicted client can corrupt data

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • Lustre 2.17.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      When a client gets evicted, the OST cancels its locks but doesn't wait for its RPCs to complete. Another client can then get a conflicting lock and modify data, while an in-progress RPC from the evicted client can still modify data as well. The result is that the healthy client holding the LDLM lock has data/state in its cache that doesn't match the actual data stored on the OST.
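The sequence described above can be replayed deterministically with a toy model. This is only an illustrative sketch: `ToyOST`, `grant`, `evict`, and `simulate_race` are hypothetical names and do not correspond to Lustre code; the single exclusive lock stands in for an LDLM extent lock.

```python
class ToyOST:
    """Toy object storage target: one object guarded by one exclusive lock."""

    def __init__(self):
        self.data = b"old"
        self.lock_holder = None

    def grant(self, client):
        # Grant only if nobody currently holds the exclusive lock.
        if self.lock_holder is not None:
            raise RuntimeError("conflicting lock held by " + self.lock_holder)
        self.lock_holder = client

    def evict(self, client):
        # Eviction cancels the client's lock but does NOT wait for
        # RPCs from that client that are already being processed.
        if self.lock_holder == client:
            self.lock_holder = None


def simulate_race():
    ost = ToyOST()
    ost.grant("A")

    # Client A's write RPC has been accepted and is being processed,
    # but its payload has not reached the object yet.
    in_flight_payload = b"from-A"

    ost.evict("A")            # A is evicted, its lock is cancelled
    ost.grant("B")            # B obtains the conflicting lock
    ost.data = b"from-B"      # B writes under its lock
    cache_b = ost.data        # B caches what it believes is on disk

    ost.data = in_flight_payload  # A's stale RPC finally completes

    return cache_b, ost.data


cache_b, on_disk = simulate_race()
print("B's cache:", cache_b, "on disk:", on_disk)
```

In this model client B still holds a valid lock at the end, yet its cached data differs from what is stored on the toy OST.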

          Activity

            bzzz Alex Zhuravlev added a comment

            > Alex Zhuravlev, just to clarify, I think this situation has existed forever, and is not a new regression introduced by a recent change?

            Yes, I think it's a very old issue from the early days.

            > Shouldn't the evicted client have to drop its cached locks/dirty data, and OST would bump the export generation and block any writes in progress with the old generation?

            It's the reverse problem: the OST has evicted the client and cancelled its locks (which can then be re-granted), but the client's RPC is already being processed and there is no way to interrupt it.

            adilger Andreas Dilger added a comment

            bzzz, just to clarify, I think this situation has existed forever, and is not a new regression introduced by a recent change? I'm all for fixing this, of course. Shouldn't the evicted client have to drop its cached locks/dirty data, and OST would bump the export generation and block any writes in progress with the old generation?

            Of course, the reverse issue is also true: in most cases only the evicted client is writing the object (or at least that offset), so evicting it and returning an error to userspace (which probably doesn't check the result) is more likely to cause data loss than this "multiple writers to same offset during eviction" case.
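The export-generation idea raised in this comment, and the objection that the RPC is already in progress, can be illustrated with a small sketch. Everything here is a hypothetical model rather than Lustre's actual export handling: `ToyExport`, `ToyWriteRPC`, `StaleGeneration`, and `run` are made-up names, and the assumption is that an RPC records the export generation when it arrives.

```python
class StaleGeneration(Exception):
    """Raised when an RPC belongs to an export generation that was evicted."""


class ToyExport:
    def __init__(self):
        self.generation = 1

    def evict(self):
        # Hypothetical: eviction bumps the generation so older RPCs look stale.
        self.generation += 1


class ToyWriteRPC:
    def __init__(self, export, payload):
        self.export = export
        self.payload = payload
        self.generation = export.generation  # captured when the RPC arrives

    def check(self):
        if self.generation != self.export.generation:
            raise StaleGeneration("RPC generation %d, export generation %d"
                                  % (self.generation, self.export.generation))


def run(check_before_commit):
    export = ToyExport()
    disk = {"data": b"old"}

    rpc = ToyWriteRPC(export, b"from-evicted")
    rpc.check()                      # admission check passes: not evicted yet

    export.evict()                   # eviction happens while the RPC is in progress
    disk["data"] = b"from-healthy"   # another client writes under a new lock

    try:
        if check_before_commit:
            rpc.check()              # second check right before the data is written
        disk["data"] = rpc.payload
    except StaleGeneration:
        pass                         # stale write is dropped
    return disk["data"]


print(run(check_before_commit=False))
print(run(check_before_commit=True))
```

In this model a check at admission alone does not help, because the RPC was admitted before the eviction. A check immediately before the write drops the stale payload, but only if nothing can slip in between that check and the write, which the sketch does not attempt to model.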

            shadow Alexey Lyashkov added a comment

            Alex,

            If I understand correctly, this problem should be solved in a different way, and the fix would be much simpler.
            The BRW code should take an extra LDLM lock reference while the IO is being processed.
            A conflicting lock then can't be granted until the BRW code releases its own reference, so there is no data consistency problem.

            What do you think about it?
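A minimal sketch of the reference-holding idea proposed in this comment, under the assumption that cancellation is deferred while any IO reference is held. `ToyLock`, `io_get`, `io_put`, `cancel`, and `can_grant_conflicting` are hypothetical names for illustration, not LDLM functions.

```python
class ToyLock:
    """Toy lock whose cancellation is deferred while IO holds a reference."""

    def __init__(self, owner):
        self.owner = owner
        self.io_refs = 0
        self.cancel_pending = False
        self.cancelled = False

    def io_get(self):
        # Bulk IO takes an extra reference for the duration of the transfer.
        if self.cancelled or self.cancel_pending:
            raise RuntimeError("lock is being cancelled, refusing new IO")
        self.io_refs += 1

    def io_put(self):
        self.io_refs -= 1
        # The last IO reference completes a cancellation that was deferred.
        if self.io_refs == 0 and self.cancel_pending:
            self.cancelled = True

    def cancel(self):
        # Called on eviction: finish immediately only if no IO is in progress.
        self.cancel_pending = True
        if self.io_refs == 0:
            self.cancelled = True
        return self.cancelled


def can_grant_conflicting(lock):
    # A conflicting lock may be granted only once the old one is fully gone.
    return lock.cancelled


lock = ToyLock("evicted-client")
lock.io_get()                        # write RPC starts processing
print(lock.cancel())                 # eviction: cancellation is deferred
print(can_grant_conflicting(lock))   # no conflicting grant yet
lock.io_put()                        # write RPC finishes
print(can_grant_conflicting(lock))   # now the lock can be re-granted
```

The design point is that the conflicting grant waits for the in-progress IO rather than the IO being interrupted.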

            bzzz Alex Zhuravlev added a comment

            The approach taken in the patch has a problem: the MDS can get stuck if the RPC being processed needs to evict its own client. Not sure how to handle this yet; still thinking.

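The hang described in this comment can be sketched as a self-wait, assuming (this is an assumption, not a statement about the patch) that eviction waits until the export has no RPCs in progress. `ToyWaitingExport` and `handle_rpc_that_evicts_own_client` are hypothetical names, and a timeout stands in for what would otherwise be an indefinite wait.

```python
import threading


class ToyWaitingExport:
    """Toy export whose eviction waits for in-progress RPCs to drain."""

    def __init__(self):
        self.cond = threading.Condition()
        self.in_flight = 0

    def rpc_start(self):
        with self.cond:
            self.in_flight += 1

    def rpc_done(self):
        with self.cond:
            self.in_flight -= 1
            self.cond.notify_all()

    def evict_wait(self, timeout):
        # Returns True if all RPCs drained, False if the wait timed out.
        with self.cond:
            return self.cond.wait_for(lambda: self.in_flight == 0,
                                      timeout=timeout)


def handle_rpc_that_evicts_own_client(export):
    # The handler is itself one of the export's in-progress RPCs, so waiting
    # for the count to reach zero from inside it can never succeed.
    export.rpc_start()
    try:
        return export.evict_wait(timeout=0.05)
    finally:
        export.rpc_done()


export = ToyWaitingExport()
print(handle_rpc_that_evicts_own_client(export))  # wait times out
print(export.evict_wait(timeout=0.05))            # succeeds once the RPC is done
```

Without the timeout, the first call would block forever in this model, which matches the "can get stuck" concern; the sketch does not suggest how the real code should resolve it.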
            pjones Peter Jones added a comment (edited)

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48102
            Subject: LU-16064 ldlm: block lock cancellation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1c767540c426e6b0bf13772f1a2f73b8f50cc6c2

            People

              bzzz Alex Zhuravlev
              bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 11
