[LU-5601] LFSCK 5: detach the child object (shard) from the parent object when destroying the striped file Created: 10/Sep/14  Updated: 30/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Unresolved Votes: 1
Labels: None

Severity: 3
Rank (Obsolete): 15669

 Description   

As described in LU-5395, there is a potential deadlock caused by a race between object destroy and the layout LFSCK. Consider the following scenario:

1) The LFSCK thread obtains the parent object first; at that point, the parent object has not been destroyed yet.

2) An RPC service thread destroys the parent and all of its child objects. Because the LFSCK thread is referencing the parent object, the parent object can only be marked as dying in RAM. And since the parent object references all of its child objects, the child objects are marked as dying in RAM as well.

3) The LFSCK thread then tries to find a child object while still holding its reference on the parent object, and finds that the child object is dying. According to the object visibility rules, an object with the dying flag set cannot be returned to callers, so the LFSCK thread has to wait until the dying object has been purged from RAM before it can allocate a new in-RAM object with the same FID. Unfortunately, the LFSCK thread itself is referencing the parent object, which prevents the parent from being purged, which in turn prevents the child from being purged. The LFSCK thread therefore deadlocks.
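
To make the cycle in step 3) concrete, below is a simplified, illustrative model of the lookup behaviour described above (the real logic lives in the lu_object cache lookup path, e.g. lu_object_find_at()). The helpers cache_lookup(), object_is_dying(), wait_for_purge() and allocate_new() are hypothetical placeholders used only to model the visibility rule; they are not Lustre APIs:

        /* Illustration only: cache_lookup, object_is_dying, wait_for_purge
         * and allocate_new are hypothetical placeholders that model the
         * visibility rule; they are not real Lustre functions. */
        static struct lu_object *find_child(const struct lu_env *env,
                                            struct lu_device *dev,
                                            const struct lu_fid *fid)
        {
                struct lu_object *child;

                while (1) {
                        child = cache_lookup(dev, fid);
                        if (child == NULL)
                                /* not cached: allocate a fresh in-RAM object */
                                return allocate_new(env, dev, fid);

                        if (!object_is_dying(child))
                                /* visible object, return it to the caller */
                                return child;

                        /*
                         * A dying object is never returned.  The thread must
                         * wait until it is purged from RAM and then retry.
                         * In step 3) this wait never ends: the LFSCK thread's
                         * own reference on the parent keeps the parent (and
                         * hence the child) pinned, so the purge cannot happen.
                         */
                        wait_for_purge(child);
                }
        }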

To avoid the above deadlock, we can detach the child objects from the parent object in LOD when the striped directory is destroyed. Such detachment must be done carefully to avoid racing with other users; a sketch of the idea follows.
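
A minimal sketch of what such a detachment could look like on the LOD side, assuming the 2.6-era struct lod_object fields ldo_stripe and ldo_stripenr. The lod_striping_lock()/lod_striping_unlock() helpers are hypothetical placeholders for whatever serialization is needed against concurrent users of the striping, and this is not the actual patch:

        /* Sketch only, not the in-tree fix.  ldo_stripe/ldo_stripenr follow
         * the 2.6-era struct lod_object; lod_striping_lock()/unlock() are
         * hypothetical placeholders for the required serialization. */
        static void lod_detach_stripes(const struct lu_env *env,
                                       struct lod_object *lo)
        {
                int i;

                lod_striping_lock(lo);                  /* hypothetical */
                for (i = 0; i < lo->ldo_stripenr; i++) {
                        if (lo->ldo_stripe[i] == NULL)
                                continue;
                        /* Drop the parent's reference so the child (shard) is
                         * no longer pinned by the dying parent and can be
                         * purged from RAM independently. */
                        lu_object_put(env, &lo->ldo_stripe[i]->do_lu);
                        lo->ldo_stripe[i] = NULL;
                }
                lo->ldo_stripenr = 0;
                lod_striping_unlock(lo);                /* hypothetical */
        }

Such a helper would presumably be called from LOD's ->do_destroy handler for the striped object, so that by the time the LFSCK thread looks up a shard in step 3) the shard is no longer pinned by the dying parent and can be purged normally.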



 Comments   
Comment by Andreas Dilger [ 08/Oct/14 ]

Alex, when you have some time, could you please describe what needs to be done to fix this issue? We are trying to scope the amount of effort needed to address the LFSCK4 technical debt, and without knowing how to fix this problem we can't estimate how long it will take.
