[LU-13807] LNet: Net delete deadlock Created: 20/Jul/20  Updated: 07/Aug/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12233 Deadlock on LNet shutdown Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While using an LUTF script which deletes a net dynamically via

lnetctl import --del config.yaml

we ran into a deadlock.

It appears that a local NI is in recovery while a message is about to be sent to that NI. The send path calls lnet_nid2peerni_locked(), which locks the mutex. However, this mutex is already held by the process trying to delete the net, leading to a deadlock.

The code should first clean up the recovery queue, i.e. remove all instances of this NI from the recovery queue before deleting the NI.

Similar processing is done in LNetNIFini().
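
The ordering proposed above (purge the NI from the recovery queue first, then delete it) can be sketched in plain userspace C. This is only an illustration under stated assumptions: every name below (demo_ni, demo_recovery_queue, demo_queue_purge, demo_ni_delete) is hypothetical and is not part of the LNet API, the queue is a fixed-size array rather than the kernel's list, and locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_MAX 16

/* Hypothetical stand-in for a local NI; not the real struct lnet_ni. */
struct demo_ni {
	int id;
	bool deleted;
};

/* Hypothetical recovery queue; an NI may be queued more than once. */
struct demo_recovery_queue {
	struct demo_ni *entries[DEMO_QUEUE_MAX];
	size_t count;
};

/* Queue an NI for recovery. Returns false if the queue is full. */
static bool demo_queue_push(struct demo_recovery_queue *q, struct demo_ni *ni)
{
	assert(q != NULL && ni != NULL);
	if (q->count == DEMO_QUEUE_MAX)
		return false;
	q->entries[q->count++] = ni;
	return true;
}

/* Report whether any queue entry still refers to this NI. */
static bool demo_queue_contains(const struct demo_recovery_queue *q,
				const struct demo_ni *ni)
{
	for (size_t i = 0; i < q->count; i++) {
		if (q->entries[i] == ni)
			return true;
	}
	return false;
}

/* Remove every entry that refers to this NI; returns how many were removed. */
static size_t demo_queue_purge(struct demo_recovery_queue *q,
			       const struct demo_ni *ni)
{
	size_t kept = 0;
	size_t removed;

	for (size_t i = 0; i < q->count; i++) {
		if (q->entries[i] != ni)
			q->entries[kept++] = q->entries[i];
	}
	removed = q->count - kept;
	q->count = kept;
	return removed;
}

/*
 * Delete an NI: purge it from the recovery queue first, so that nothing
 * can pick it up for recovery afterwards, and only then mark it deleted.
 */
static size_t demo_ni_delete(struct demo_recovery_queue *q, struct demo_ni *ni)
{
	size_t removed = demo_queue_purge(q, ni);

	ni->deleted = true;
	return removed;
}
```

In this sketch, calling demo_ni_delete() on an NI that was queued twice removes both entries and leaves the other NIs queued. A real fix would also need to decide which lock protects the queue during the purge, which this example does not attempt to model.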


Generated at Sat Feb 10 03:04:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.