[LU-15433] evicted client can corrupt mirrored file Created: 11/Jan/22  Updated: 26/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: flr-improvement

Severity: 3
Rank (Obsolete): 9223372036854775807
Epic Link: FLR tech debt review

 Description   
  • With LU-14642 landed, the MDS no longer transfers the new layout to the OST.
  • lfs mirror resync doesn't write to the primary replica, so the primary replica's object still has the old layout version.
  • A client evicted from the MDS thinks its layout lock is still granted and sends OST_WRITE with the old layout version to the primary replica's object, which still has the old version as well.
  • The OST accepts the write and modifies that object.
  • The replicas are now out of sync.
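
The sequence above can be sketched as a toy model. This is an illustration only, not Lustre code: the `OstObject` class, the `simulate` function, and the rule that an OST object rejects only writes carrying a layout version older than the one it has stored are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class OstObject:
    """Toy OST object: holds data and the last layout version it has seen."""
    layout_version: int
    data: bytes = b""

    def write(self, client_version: int, payload: bytes) -> bool:
        # Assumed check: only a write with an OLDER layout version is rejected.
        if client_version < self.layout_version:
            return False
        self.layout_version = client_version
        self.data = payload
        return True


def simulate() -> dict:
    # Two mirrors, in sync, both objects at layout version 1.
    primary = OstObject(layout_version=1, data=b"v1")
    secondary = OstObject(layout_version=1, data=b"v1")

    # The client cached layout version 1 and is then evicted by the MDS,
    # but has not noticed the eviction yet.
    stale_client_version = 1

    # Resync runs with a new layout version. It reads from the primary and
    # writes to the secondary, so only the secondary object learns version 2.
    new_layout_version = 2
    secondary.write(new_layout_version, primary.data)

    # The evicted client writes to the primary with its old version. The
    # primary object is still at version 1, so the write is not rejected.
    accepted = primary.write(stale_client_version, b"stale write")

    return {
        "accepted": accepted,
        "primary_version": primary.layout_version,
        "primary": primary.data,
        "secondary": secondary.data,
    }
```

In this model `simulate()` reports the stale write as accepted and the two mirrors end up with different contents, which is the corruption described in this ticket.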


 Comments   
Comment by Andreas Dilger [ 11/Jan/22 ]

Is this really possible? The MDS would mark the other replicas stale as soon as the client tried to write, so the client would have to be evicted, "lfs mirror resync" run and finish on the file to clear the STALE flag from the mirrors, and then the client would write to the primary before it detected that it was evicted? It seems like a long time for the client to not detect that it is evicted?

One possible solution would be for the MDS to bump the layout version of a file if it evicts a client that is writing to it. Then the evicted client would have to re-fetch the layout before it could write again. Alternately, flag the primary with an "EVICTED" flag that only triggers the layout version to be increased when "lfs mirror resync" is actually run. That would avoid bumping the version repeatedly and hurting other clients writing to the same file, without any danger that the STALE flag is cleared.
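
As a rough illustration of the second alternative, here is a toy sketch in which the MDS only records an "EVICTED" flag at eviction time and bumps the layout version when resync runs. All names (`MirroredFile`, `evict_writer`, `resync`, `ost_accepts`) are hypothetical, and the sketch assumes the bumped version is also made known to the primary's OST object, which this ticket does not specify how to do.

```python
from dataclasses import dataclass


@dataclass
class MirroredFile:
    """Toy MDS-side view of a mirrored file (hypothetical, not Lustre code)."""
    layout_version: int = 1
    evicted_writer: bool = False      # hypothetical "EVICTED" flag
    primary_object_version: int = 1   # version known to the primary's OST object

    def evict_writer(self) -> None:
        # Only remember that a writer was evicted; do not bump the version
        # here, so repeated evictions do not disturb other writers.
        self.evicted_writer = True

    def resync(self) -> None:
        # Bump the layout version once, only if a writer was evicted.
        if self.evicted_writer:
            self.layout_version += 1
            # Assumption: the new version reaches the primary's object.
            self.primary_object_version = self.layout_version
            self.evicted_writer = False

    def ost_accepts(self, client_version: int) -> bool:
        # Assumed OST check: reject writes with an older layout version.
        return client_version >= self.primary_object_version


def demo() -> tuple:
    f = MirroredFile()
    cached = f.layout_version   # version cached by the client before eviction
    f.evict_writer()
    f.evict_writer()            # a second eviction does not bump anything
    f.resync()
    return f.layout_version, f.ost_accepts(cached), f.ost_accepts(f.layout_version)
```

Under these assumptions the evicted client's cached version is rejected after resync, while a client that re-fetches the layout can write again.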

Comment by Alex Zhuravlev [ 11/Jan/22 ]

That's the point, AFAIU: it takes time for an evicted client to notice the eviction, and this window is more than enough for something like a resync if the file is small enough.
Another important thing is that LU-14642 changes the original model, where the layout version was distributed by the MDS, to a model where the distribution is done by the client. I guess Bobi Jam can comment on this better.

Comment by Colin Faber [ 25/Jul/22 ]

Given that most of the fixes for LU-14642 have now been merged into master / master-next, what do you guys want to do with this issue?

Comment by Alex Zhuravlev [ 26/Jul/22 ]

Will try https://review.whamcloud.com/#/c/46707/ again.

Generated at Sat Feb 10 03:18:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.