[LU-8379] Possible badness with VBR and lock replay Created: 07/Jul/16  Updated: 08/Jul/16  Resolved: 08/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As part of working on LU-8347, Jinshan raises some valid points I think.

Let's say we have a VBR-enabled system where recovery has completed, but a client that had outstanding write locks and some dirty data failed to join.

Now, before this client rejoins, another client reads some data from a file that was modified (but not yet replayed) by this first client.

Questions:
1. When the first client reconnects, would it be accepted? (The file version is unchanged because there were no writes, so I think YES?)
2. If it is accepted, that means the second client actually accessed stale data. This is bad, right?
3. As part of the read, the second client also obtained some read locks on the data that client one is going to change. This is not good.
4. What happens with replayed locks from client 1 in the presence of already granted conflicting write locks?
5. In fact, if somebody has a write lock on a file, that does not change the file version yet, right? Only writes do. So what prevents conflicting writes from being in flight for such files?
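
To make the concern in questions 1 and 5 concrete, here is a toy model, not Lustre code. It assumes that a late replay would be admitted at all and that admission depends only on comparing a recorded pre-operation version with the current one. The names ToyObject and version_check_passes are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    """Hypothetical stand-in for a file; not a Lustre structure."""
    version: int = 0
    readers: set = field(default_factory=set)

    def write(self) -> None:
        # Only a committed write bumps the version.
        self.version += 1

    def read(self, client: str) -> None:
        # A read (and the read lock behind it) leaves the version untouched.
        self.readers.add(client)


def version_check_passes(obj: ToyObject, pre_version: int) -> bool:
    """Assumed version-only check: accept the replay if the version is unchanged."""
    return obj.version == pre_version


obj = ToyObject()
pre_version = obj.version   # client 1 recorded this before its unreplayed write
obj.read("client2")         # client 2 reads while client 1 is absent
late_replay_ok = version_check_passes(obj, pre_version)
```

In this model late_replay_ok ends up true even though client2 has already read the object, which is exactly the stale-read situation the questions describe. A check based on versions alone cannot see reads or granted locks.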

I am capturing this here to better understand all these scenarios before categorizing this with a priority accordingly.



 Comments   
Comment by Oleg Drokin [ 07/Jul/16 ]

Mike, as the VBR expert, I hope you can provide some insight here

Comment by Mikhail Pershin [ 08/Jul/16 ]

If the client missed the recovery window, it will be evicted and not allowed to rejoin. The scenario you are thinking of is so-called 'delayed recovery', which is a possible option with VBR but was never accepted into production because of similar situations caused by late replays. So today VBR helps joined clients complete their recovery if some client is missing, but it doesn't allow missed clients to rejoin later.

Meanwhile, just in theory, I think the write lock for such a file shouldn't be replayed late if any other client already holds a newer PR/PW lock on the same resource; such a delayed client has to be considered 'unlucky' for delayed replay and be evicted. Delayed recovery as a whole was considered possible only if the client is lucky and its resources are untouched, i.e. it is allowed only if there are no conflicts; otherwise the client is evicted. That means the main task for delayed recovery is to determine conflicts.
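
The theoretical rule described in this comment can be sketched as follows. This is an illustration under stated assumptions, not how Lustre implements anything: ToyLock, the granted_at logical timestamp, and delayed_replay_decision are all hypothetical names, and "newer" is modeled simply as a larger timestamp.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ToyLock:
    """Hypothetical lock record; fields are assumptions for this sketch."""
    client: str
    resource: str
    mode: str        # "PR" (read) or "PW" (write)
    granted_at: int  # assumed logical timestamp; larger means newer


def delayed_replay_decision(replayed: ToyLock, granted: Iterable[ToyLock]) -> str:
    """Return "replay" if the resource is untouched, otherwise "evict"."""
    for lock in granted:
        touches_same_resource = lock.resource == replayed.resource
        from_other_client = lock.client != replayed.client
        is_newer = lock.granted_at > replayed.granted_at
        if (touches_same_resource and from_other_client
                and lock.mode in ("PR", "PW") and is_newer):
            # Another client already holds a newer PR/PW lock: the late
            # client is "unlucky" and is evicted rather than replayed.
            return "evict"
    return "replay"


late_write = ToyLock("client1", "file-A", "PW", granted_at=1)
newer_read = ToyLock("client2", "file-A", "PR", granted_at=2)
unrelated = ToyLock("client2", "file-B", "PR", granted_at=2)

unlucky = delayed_replay_decision(late_write, [newer_read])
lucky = delayed_replay_decision(late_write, [unrelated])
```

Here unlucky evaluates to "evict" because client2 holds a newer PR lock on the same resource, while lucky evaluates to "replay" because only an unrelated resource was touched.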

Comment by Oleg Drokin [ 08/Jul/16 ]

OK, in light of that I am going to close this ticket.

Thanks!
