Details
Bug
Resolution: Unresolved
Major
None
Lustre 2.6.0
None
3
15669
Description
As described in LU-5395, there is a potential deadlock between object destroy and layout LFSCK. Consider the following scenario:
1) The LFSCK thread obtains a reference to the parent object first; at that point the parent object has not yet been destroyed.
2) An RPC service thread destroys the parent and all of its child objects. Because the LFSCK thread still holds a reference to the parent object, the parent cannot be freed and is only marked as dying in RAM. The parent object in turn references all of its child objects, so those are marked as dying in RAM as well.
3) The LFSCK thread, still holding its reference to the parent object, tries to look up one of the child objects and finds that it is dying. According to the object visibility rules, an object with the dying flag set cannot be returned to other users, so the LFSCK thread has to wait until the dying object has been purged from RAM before it can allocate a new object (with the same FID) in RAM. Unfortunately, the LFSCK thread's own reference prevents the parent object from being purged, which in turn prevents the child object from being purged. The LFSCK thread therefore deadlocks.
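The reference cycle in the three steps above can be replayed with a small single-threaded toy model. This is only a sketch for illustration: `CachedObject`, `ObjectCache` and their methods are hypothetical names, not Lustre structures or APIs, and a lookup that would block forever in a real thread is represented here by returning `None`.

```python
class CachedObject:
    """Toy stand-in for an in-RAM object (hypothetical, not a Lustre type)."""

    def __init__(self, fid, children=()):
        self.fid = fid
        self.refcount = 0
        self.dying = False
        self.children = list(children)
        for child in self.children:
            child.refcount += 1  # the parent pins each of its children


class ObjectCache:
    """Toy object cache that applies the 'dying objects are invisible' rule."""

    def __init__(self):
        self.objects = {}

    def add(self, obj):
        self.objects[obj.fid] = obj

    def lookup(self, fid):
        """Return a referenced object, or None where a real thread would wait."""
        obj = self.objects.get(fid)
        if obj is None:
            obj = CachedObject(fid)  # nothing cached: allocate a new object
            self.objects[fid] = obj
        elif obj.dying:
            return None  # must wait until the dying object is purged
        obj.refcount += 1
        return obj

    def put(self, obj):
        obj.refcount -= 1
        self.purge(obj)

    def destroy(self, obj):
        obj.dying = True
        for child in obj.children:
            child.dying = True
        self.purge(obj)

    def purge(self, obj):
        """Drop a dying, unreferenced object and unpin its children."""
        if obj.dying and obj.refcount == 0 and self.objects.get(obj.fid) is obj:
            del self.objects[obj.fid]
            for child in obj.children:
                child.refcount -= 1
                self.purge(child)


def replay_scenario():
    cache = ObjectCache()
    child = CachedObject("child")
    parent = CachedObject("parent", [child])
    cache.add(child)
    cache.add(parent)

    cache.lookup("parent")  # step 1: LFSCK takes a reference on the parent
    cache.destroy(parent)   # step 2: parent and child are only marked dying
    blocked = cache.lookup("child") is None  # step 3: child lookup cannot proceed
    return blocked, parent.refcount, child.refcount
```

In this model the child lookup stays blocked for as long as the LFSCK reference on the parent is held, because the parent is the only thing keeping the child's reference count above zero.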
To avoid this deadlock, we can detach the child objects from the parent object in LOD when the striped directory is destroyed. The detachment must be done carefully to avoid racing with other users.
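A minimal sketch of the detach idea, again as a single-threaded toy model rather than actual LOD code: `Obj`, `Cache` and `destroy_with_detach` are hypothetical names, and the locking a real implementation would need to avoid races with other users is deliberately left out.

```python
class Obj:
    """Toy in-RAM object (hypothetical, not a Lustre type)."""

    def __init__(self, fid):
        self.fid = fid
        self.refcount = 0
        self.dying = False
        self.children = []

    def attach(self, child):
        self.children.append(child)
        child.refcount += 1  # the parent pins the child


class Cache:
    def __init__(self):
        self.objects = {}

    def add(self, obj):
        self.objects[obj.fid] = obj

    def lookup(self, fid):
        """Return a referenced object, or None where a real thread would wait."""
        obj = self.objects.get(fid)
        if obj is None:
            obj = Obj(fid)  # old object already purged: allocate a new one
            self.objects[fid] = obj
        elif obj.dying:
            return None
        obj.refcount += 1
        return obj

    def purge(self, obj):
        if obj.dying and obj.refcount == 0 and self.objects.get(obj.fid) is obj:
            del self.objects[obj.fid]

    def destroy_with_detach(self, parent):
        # Detach first, so the children no longer depend on the parent's lifetime.
        # A real implementation would need to serialize this against other users.
        children, parent.children = parent.children, []
        parent.dying = True
        for child in children:
            child.dying = True
            child.refcount -= 1  # drop the reference the parent was holding
            self.purge(child)
        self.purge(parent)


def replay_with_detach():
    cache = Cache()
    parent, child = Obj("parent"), Obj("child")
    parent.attach(child)
    cache.add(parent)
    cache.add(child)

    cache.lookup("parent")             # LFSCK holds the parent
    cache.destroy_with_detach(parent)  # child is purged despite that reference
    fresh = cache.lookup("child")      # new object with the same FID
    return fresh, child, parent
```

Here the dying child is purged as soon as the destroy path drops the parent's reference on it, so the later lookup allocates a new object with the same FID even though the LFSCK reference on the parent is still held.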