Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0
    • Labels: None
    • 12608

    Description

      Currently, there are race conditions between LFSCK and recovery, especially on the OST side: LFSCK may set the OST read-only and cause recovery to fail. These races need more consideration; perhaps recovery should run first, then LFSCK.

      Attachments

        Activity

          [LU-4609] handle race between LFSCK and recovery

          The patch has been landed to master.

          yong.fan nasf (Inactive) added a comment -
          yong.fan nasf (Inactive) added a comment - The patch: http://review.whamcloud.com/#/c/10010/

          First, new RPCs will be blocked by the current recovery mechanism automatically, whether they are for LFSCK (via OUT RPCs) or for other normal operations.

          Second, when should a local LFSCK be auto-resumed after a server restart/remount: before recovery has finished, or after? There are two cases:

          1) OSS restart/remount. The local LFSCK on OSS is mainly for rebuilding LAST_ID files.

          1.1) If, before the OSS restart/remount, we already knew that some LAST_ID files were crashed and should be rebuilt, should we then allow recovery to re-create objects based on the wrong LAST_ID? If yes, it may cause more damage, right?

          1.2) If no crashed LAST_ID files were found during LFSCK scanning before the OSS restart/remount, then after the restart/remount, as long as LFSCK is not misguided by objects re-created during recovery, it is fine for LFSCK and recovery to run in parallel.

          2) MDS restart/remount. As explained earlier, the only case in which LFSCK creates an MDT-object is while handling orphan OST-objects. Such MDT-objects will not be in the recovery queue.

          I agree that the above are the currently known situations, and there may be more in the future. But we can try to resolve new cases as they appear. In the worst case, if a new case proves very difficult to handle, we can consider pausing the LFSCK at that time. Until such a case happens, I do not see a strong reason to do so now.

          yong.fan nasf (Inactive) added a comment -

          The problem is that these are only the situations that you know about now. There may be other cases, now or in the future, that can cause inconsistency. It would simply be safer to block all new LFSCK RPCs at the target until it has finished recovery and the local MDT has finished orphan recovery.

          adilger Andreas Dilger added a comment -

          Two main race conditions should be considered:

          1) LFSCK creates an MDT-object that should instead be created by a client recovery RPC. In fact, such a race does not exist for the current LFSCK, because LFSCK will not create an MDT-object except when handling an orphan OST-object. In the current implementation, when a file is created, its OST-object(s) are recorded by LFSCK, so an OST-object will not be an orphan unless its MDT-object was already committed to storage before the LFSCK started. In that case, it does NOT need to be recovered.

          2) A client recovery RPC creates a missing OST-object, which may misguide LFSCK into regarding the LAST_ID files as crashed and then setting the OST device read-only. In fact, this race can be resolved by updating the LAST_ID file before object creation. I have made a patch for that:

          http://review.whamcloud.com/10010

          yong.fan nasf (Inactive) added a comment -

          LFSCK, being a normal service, shouldn't worry about this - corresponding requests are put on hold during recovery.

          bzzz Alex Zhuravlev added a comment -

          Consider the DNE case:

          1) both MDT0 and MDT1, and OST0/OST1 work well.

          2) the admin triggers layout LFSCK on OST0/OST1 via MDT0, and MDT1 does not take part in the LFSCK scanning.

          3) MDT1 crashes for some reason.

          4) MDT1 restarts and runs recovery with OST0/OST1, while the layout LFSCK is still running on OST0/OST1.

          Then how can we make recovery run before LFSCK on OST0/OST1? If we paused the layout LFSCK to give way to recovery, the LFSCK might have to resume from the beginning, because we do not know how recovery will affect the layout. On the other hand, in a large cluster it is normal for some OST/MDT to fail during the LFSCK, so it is normal for recovery to happen during the LFSCK. If the LFSCK had to be paused and resumed from the beginning whenever recovery happened, it would become almost unusable.

          According to our original design, the LFSCK should be robust enough to allow MDTs/OSTs to leave/join the LFSCK dynamically.

          yong.fan nasf (Inactive) added a comment -

          I agree with Alex, that LFSCK should NOT be modifying the filesystem during recovery. This might accidentally trigger a full LFSCK run because of transient inconsistencies which may impact usability significantly (e.g. make OST read-only, impact performance significantly), and it may also cause recovery to fail (e.g. modify the MDS namespace to create parent inodes for orphan objects before a client can replay) due to conflicts and prevent further recovery from being done by the client.

          Instead, recovery should be allowed to complete, and LFSCK RPCs should not be processed during this time (like any new RPCs from clients during recovery), nor should local modifications be done while the target is in recovery. If an inconsistency is detected during recovery, then this should be saved and re-checked after recovery has completed. Since recovery is normally expected to be relatively fast (done within a few minutes) compared to a full LFSCK (may take many hours), this will not slow down LFSCK processing significantly.

          adilger Andreas Dilger added a comment -

          People

            yong.fan nasf (Inactive)
            yong.fan nasf (Inactive)
            Votes: 0
            Watchers: 3
