[LU-11836] DOM read-open resend vs getattr deadlock Created: 06/Jan/19  Updated: 11/Sep/19  Resolved: 11/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: DoM2

Issue Links:
Related
is related to LU-11952 open+create resend can recreate a fil... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

DOM read-on-open may cause a resend when the reply buffer is larger than the client buffer. That is OK in general: the client just re-allocates the buffer and resends the request. The problem occurs when a new request on the same file, e.g. getattr, arrives between the first reply and the resend.
The whole scenario in that case:
1. OPEN takes the PARENT WRITE lock and a new CHILD PR/PW lock.
2. The CHILD lock on the server gets the PARENT handle from the client as its remote handle (resource change).
3. Due to the resend condition in reply_in_callback(), the client has not finished that resource replacement, so its lock handle still refers to the PARENT lock, while on the server it is the CHILD lock.
4. Getattr on the server locks the CHILD and causes a BL AST to the PR/PW lock from OPEN.
5. The client gets the BL AST, but the lock handle refers to the PARENT lock, so the CHILD lock on the server will never receive a cancel from that BL AST.
6. Meanwhile, the OPEN resend arrives on the server and tries to get the WRITE lock on PARENT, but it is blocked by the getattr process waiting for the CHILD cancel. The OPEN resend therefore waits on the PARENT lock and cannot complete OPEN to send the reply with the blocked CHILD lock. Deadlock.

That specific combination exists only with DOM files (PR/PW modes cause conflicts with getattr) and only with the read-on-open feature, because it produces a resend without a reconnect.
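
For illustration only, the six-step scenario can be modeled as a small wait-for graph. This is a simplified sketch, not Lustre code: the actor names (open_resend, getattr, child_cancel), the handle_resource_updated flag and both functions are hypothetical, and the edge from child_cancel back to open_resend is one reading of steps 3, 5 and 6 (the client can only cancel the CHILD lock once the OPEN reply lets it finish the resource replacement).

```python
from typing import Dict, List, Optional


def build_waits_for(handle_resource_updated: bool) -> Dict[str, str]:
    """Build a toy wait-for graph for the scenario (hypothetical model).

    Each key is an actor; the value is the actor it is blocked on.
    """
    waits = {
        # Step 6: the OPEN resend needs the PARENT lock held by getattr.
        "open_resend": "getattr",
        # Step 4: getattr needs the CHILD PR/PW lock to be cancelled.
        "getattr": "child_cancel",
    }
    if not handle_resource_updated:
        # Steps 3 and 5: the client handle still refers to PARENT, so the
        # CHILD cancel cannot happen until the OPEN reply is processed.
        waits["child_cancel"] = "open_resend"
    return waits


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of actors if the graph deadlocks, else None."""
    for start in waits_for:
        path: List[str] = []
        seen = set()
        node = start
        # Follow the single outgoing edge of each actor until we either
        # reach an unblocked actor or revisit one already on the path.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            return path[path.index(node):]
    return None
```

Calling find_cycle(build_waits_for(False)) reports the three actors blocking each other, while find_cycle(build_waits_for(True)) finds no cycle, which matches the observation that the deadlock needs the unfinished resource replacement on the client.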



 Comments   
Comment by Mikhail Pershin [ 06/Jan/19 ]

This issue happens from time to time in racer.sh with DOM files. I have a reproducer for that scenario and am working on a patch.

Comment by Gerrit Updater [ 20/Jan/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34072
Subject: LU-11836 ldlm: fix enqueue reply vs bl_ast race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e52819e9792f865391b280d1c6b2f862823d91e7

Comment by Gerrit Updater [ 15/Feb/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34264
Subject: LU-11836 ldlm: don't convert wrong resource
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 599247c27bc6f6351d7f2c4c1313e0686635893d

Comment by Mikhail Pershin [ 15/Feb/19 ]

This issue should be resolved with proper open resend/reconstruct handling. As noted by Vitaly, it is just not right to take the parent lock on the server again while we already hold the child lock; that causes reverse lock ordering. Meanwhile, this intersects with LU-11952, which also requires similar fixes in OPEN reconstruct.
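
A minimal sketch of the lock-ordering point, assuming the normal order is PARENT before CHILD (as in step 1 of the description). The LOCK_ORDER table and violates_order() helper are hypothetical and only illustrate why re-taking the parent lock while holding the child lock is a reverse ordering; they are not part of the ldlm API.

```python
from typing import Iterable

# Assumed ordering for illustration: lower rank must be taken first.
LOCK_ORDER = {"PARENT": 0, "CHILD": 1}


def violates_order(held: Iterable[str], wanted: str) -> bool:
    """Return True if taking `wanted` while holding `held` reverses the order."""
    return any(LOCK_ORDER[h] > LOCK_ORDER[wanted] for h in held)
```

With this model, violates_order(["CHILD"], "PARENT") is true (the OPEN resend case), while violates_order(["PARENT"], "CHILD") is false (the original OPEN).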

Comment by Gerrit Updater [ 15/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34264/
Subject: LU-11836 ldlm: don't convert wrong resource
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2bc71659db69335ba1c93dab44dc733dc0849d0c

Comment by Peter Jones [ 16/Mar/19 ]

Landed for 2.13

Comment by Mikhail Pershin [ 16/Mar/19 ]

Re-opening the ticket; there are still things to resolve.

Comment by Peter Jones [ 11/Sep/19 ]

It looks like the remaining work would be landed under LU-11952. If a separate ticket is needed please open one and link to this ticket - thanks
