[LU-16064] RPC from evicted client can corrupt data Created: 02/Aug/22  Updated: 01/Dec/23

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16224 rw_seq_cst_vs_drop_caches dies with S... Resolved
is related to LU-16345 ofd_commitrw_read() can be passed non... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When a client gets evicted, the OST cancels its locks but doesn't wait for its RPCs to complete. This way another client can get a conflicting lock and modify data, but then an in-progress RPC from the evicted client can modify the data as well. We then get a situation where the healthy client holding the LDLM lock has some data/state in its cache which doesn't match the actual data stored on the OST.
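
The ordering described above can be reproduced with a small toy model. This is only a sketch for illustration: `ToyOST`, its methods and `demo_race` are hypothetical names invented here, not Lustre code or APIs, and the "in-flight RPC" is modelled as a queued write that gets applied later.

```python
# Toy model only: none of these names are Lustre APIs.
class ToyOST:
    def __init__(self):
        self.data = b"old"
        self.lock_owner = None
        self.in_flight = []  # writes accepted under a lock but not yet applied

    def grant_lock(self, client):
        assert self.lock_owner is None, "conflicting lock still held"
        self.lock_owner = client

    def start_write(self, client, payload):
        # The RPC is accepted while the client still holds the lock.
        assert self.lock_owner == client
        self.in_flight.append((client, payload))

    def write_now(self, client, payload):
        assert self.lock_owner == client
        self.data = payload

    def evict(self, client):
        # Models the reported behaviour: the lock is cancelled,
        # but in-flight RPCs from the client are not waited for.
        if self.lock_owner == client:
            self.lock_owner = None

    def complete_in_flight(self):
        # The stale RPCs finally reach storage.
        for _client, payload in self.in_flight:
            self.data = payload
        self.in_flight.clear()


def demo_race():
    ost = ToyOST()
    ost.grant_lock("A")
    ost.start_write("A", b"from-A")   # RPC from A is now in progress
    ost.evict("A")                    # lock cancelled, RPC still pending
    ost.grant_lock("B")               # healthy client gets the conflicting lock
    ost.write_now("B", b"from-B")
    cache_b = b"from-B"               # what B believes is on disk
    ost.complete_in_flight()          # A's stale write lands afterwards
    return cache_b, ost.data


if __name__ == "__main__":
    cached, stored = demo_race()
    print("B's cache:", cached, "OST data:", stored)
```

In this model client B still holds the lock at the end, yet its cached view no longer matches what is stored, which is the inconsistency the ticket describes.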



 Comments   
Comment by Peter Jones [ 25/Aug/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48102
Subject: LU-16064 ldlm: block lock cancellation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1c767540c426e6b0bf13772f1a2f73b8f50cc6c2

Comment by Alex Zhuravlev [ 28/Nov/22 ]

The approach taken in the patch has a problem: the MDS can get stuck if the RPC being processed needs to evict its own client. Not sure how to handle this yet, still thinking.

Comment by Alexey Lyashkov [ 01/Dec/23 ]

Alex,

If I understand correctly, this problem should be solved in a different way, and the fix will be much simpler.
The BRW code should pick up an extra LDLM lock reference while the IO is being processed.
So a conflicting lock can't be granted until the BRW code releases its own reference, and there is no data consistency problem.

What do you think about it?
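
One possible reading of the suggestion in this comment, as a toy model: the IO path takes an extra reference on the lock, and eviction only marks the lock for cancellation, so the lock is actually released (and a conflicting lock grantable) once the last IO reference is dropped. All names here (`ToyLock`, `ToyResource`, `brw_begin`, `brw_end`, `demo_refcount`) are hypothetical and do not correspond to Lustre functions; the sketch also does not address the stuck-MDS case raised in the earlier comment.

```python
# Toy model only: none of these names are Lustre APIs.
class ToyLock:
    def __init__(self, owner):
        self.owner = owner
        self.io_refs = 0             # extra references held by in-progress IO
        self.cancel_pending = False  # set when the owner is evicted


class ToyResource:
    def __init__(self):
        self.data = b"old"
        self.lock = None

    def grant(self, client):
        """Return True if the lock was granted, False if the caller must wait."""
        if self.lock is not None:
            return False
        self.lock = ToyLock(client)
        return True

    def brw_begin(self, client):
        # The IO path pins the lock for the duration of the write.
        assert self.lock is not None and self.lock.owner == client
        self.lock.io_refs += 1
        return self.lock

    def brw_end(self, lock, payload):
        self.data = payload
        lock.io_refs -= 1
        self._try_release(lock)

    def evict(self, client):
        lock = self.lock
        if lock is not None and lock.owner == client:
            lock.cancel_pending = True
            self._try_release(lock)

    def _try_release(self, lock):
        # The lock only goes away once no IO holds a reference to it.
        if self.lock is lock and lock.cancel_pending and lock.io_refs == 0:
            self.lock = None


def demo_refcount():
    res = ToyResource()
    assert res.grant("A")
    io_lock = res.brw_begin("A")       # write from A in progress
    res.evict("A")                     # cancellation is deferred
    granted_early = res.grant("B")     # B has to wait
    res.brw_end(io_lock, b"from-A")    # A's write completes, lock released
    granted_late = res.grant("B")      # now B can get the lock
    return granted_early, granted_late, res.data


if __name__ == "__main__":
    print(demo_refcount())
```

With this ordering the evicted client's write is already on storage before the second client is granted the lock, so whatever that client reads or caches afterwards reflects the stored data.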

Generated at Sat Feb 10 03:23:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.