[LU-10669] Potential race condition when unlinking MD Created: 14/Feb/18  Updated: 05/Dec/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-11734 LNet crashes with 2.12.0-rc1: lnet_at... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A potential race condition could cause an MD to be unlinked twice. The first unlink would decrement the reference counter, and the second would trigger an assert on the reference counter.

Two unlink paths can be hit at the same time: one when a request expires and one when a transfer finishes.

These code paths need to be investigated in more detail.



 Comments   
Comment by Amir Shehata (Inactive) [ 15/Feb/18 ]

After more investigation, I don't see a possible scenario where the MD reference count can be decremented because of an RPC expiry. When an RPC expires, the actual cleanup happens in ptlrpc_expire_one_request(). Two functions are called: ptlrpc_unregister_reply() and ptlrpc_unregister_bulk(). Both of these functions end up calling LNetMDUnlink(). LNetMDUnlink() doesn't free the MD unless there are no more references on it. All this processing is done under the resource lock. lnet_finalize() is the only path where the refcount is decremented. Therefore, for the MD refcount to be < 0, lnet_finalize() must have been called on the same msg/md pair twice.
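The reasoning above can be sketched as a toy model. This is not the LNet code: struct toy_md, toy_md_unlink() and toy_msg_finalize() are hypothetical stand-ins for the MD, the LNetMDUnlink() path and the lnet_finalize() path, and the resource lock is only noted in comments. It assumes the behavior described in this comment: unlink never changes the refcount, and only finalize decrements it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an LNet MD; not the real structure. */
struct toy_md {
	int  refcount; /* messages still attached to this MD */
	bool zombie;   /* unlink was requested, free is deferred */
	bool freed;    /* stands in for actually freeing the MD */
};

/*
 * Models the unlink path (assumed behavior): mark the MD for removal
 * and free it only when nothing references it. The refcount is never
 * modified here. Real code would hold the resource lock throughout.
 */
void toy_md_unlink(struct toy_md *md)
{
	md->zombie = true;
	if (md->refcount == 0)
		md->freed = true;
}

/*
 * Models the finalize path (assumed behavior): the only place where
 * the refcount is decremented. Returns false if the decrement would
 * underflow, i.e. the same msg/md pair was finalized twice.
 */
bool toy_msg_finalize(struct toy_md *md)
{
	if (md->refcount <= 0)
		return false;

	md->refcount--;
	if (md->refcount == 0 && md->zombie)
		md->freed = true;
	return true;
}
```

In this model, calling toy_md_unlink() any number of times cannot drive the refcount negative. Only a second toy_msg_finalize() on the same MD hits the underflow case.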

Since this issue has only been seen on OPA, I suspect there is a scenario where the OPA driver notifies the LND twice about the same message. I'm adding a patch to not assert in this scenario, but rather print some information to verify that we're hitting this case.
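As a rough illustration of the approach described here (log instead of assert), the following sketch uses hypothetical names: struct demo_md and demo_msg_detach_md() are not the real LNet types or the contents of the actual patch, and fprintf() stands in for whatever logging the kernel code uses.

```c
#include <stdio.h>

/* Hypothetical stand-in for an MD; only the refcount matters here. */
struct demo_md {
	int refcount;
};

/*
 * Detach a message from its MD. Instead of asserting that the
 * refcount is positive, report the unexpected state and leave the
 * counter untouched. Returns 0 on a normal detach, -1 when a
 * duplicate detach is detected.
 */
int demo_msg_detach_md(struct demo_md *md, unsigned long msg_id)
{
	if (md->refcount <= 0) {
		fprintf(stderr,
			"msg %lu: md %p has refcount %d, possible duplicate completion\n",
			msg_id, (void *)md, md->refcount);
		return -1;
	}

	md->refcount--;
	return 0;
}
```

The trade-off is that the node keeps running and records which message triggered the duplicate, at the cost of continuing after a state the original assert treated as fatal.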

Comment by Gerrit Updater [ 15/Feb/18 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/31313
Subject: LU-10669 lnet: do not assert in lnet_msg_detach_md()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: efc0fce69188d92d200d280e7db80d4031396610
