[LU-8288] handle error due to file with "no stripe info" rewritten before lfsck is run

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.10.0
    • Lustre 2.7.0

    Description

      This is a followup on the filesystem recovery efforts from LU-8071, in particular the comment:

      If you think that the layout LFSCK made wrong decision when re-generated the
      "nagtest.toobig.stripes" LOV EA, we need to make new patch to recover it. 
      

      More than just making a wrong decision, lfsck can actually corrupt files when it is run. The case is where the MDT loses stripe information, and then the file is rewritten (or appended to?), and then lfsck is run.

      In general, it would be good if lfsck can handle "conflicts" more gracefully. I understand that it may not know which object is the right one, but it should not pick them arbitrarily since that can result in a mixed-data file. Additionally, at the time when lfsck is run, it has information about what file an object is associated with, and that could be exposed to the user in the name of the file placed in lost+found.

          Activity

            mdiep Minh Diep added a comment -

            Landed for 2.10

            jaylan Jay Lan (Inactive) added a comment - - edited

            Could you port this patch to b2_7_fe and land to b2_9_fe? Thanks!


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21562/
            Subject: LU-8288 lfsck: handle dangling LOV EA reference
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 17cc912fd5b40965d14a89a268cbf2d63b2fe21b

            adilger Andreas Dilger added a comment -

            Per earlier discussion in this ticket, it would be worthwhile to backport the PFL patches to increase the MDT and OST inode size, as well as the patch to improve the fid xattr to store the total stripe count and stripe size on each OST object. That would allow LFSCK to reconstruct the layout properly, even in the case where some OST objects are totally missing. Having clients send this information with each write will ensure that this information is stored on each OST object for later use if needed.
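To illustrate why storing the total stripe count and stripe size on each OST object helps, here is a minimal sketch of rebuilding a layout when some objects are missing. It is not LFSCK code: the `OstObjectInfo` record and `rebuild_layout` function are hypothetical names, and the record does not reflect the real on-disk fid xattr format.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class OstObjectInfo:
    """Hypothetical per-object record, not Lustre's on-disk xattr format."""
    ost_index: int
    object_id: int
    stripe_index: int
    stripe_count: int  # total stripes in the file's layout
    stripe_size: int   # bytes per stripe


def rebuild_layout(
    objects: Sequence[OstObjectInfo],
) -> Tuple[int, List[Optional[OstObjectInfo]]]:
    """Return (stripe_size, slots); slots[i] is None where stripe i is missing."""
    if not objects:
        raise ValueError("no OST objects to rebuild from")
    stripe_count = objects[0].stripe_count
    stripe_size = objects[0].stripe_size
    slots: List[Optional[OstObjectInfo]] = [None] * stripe_count
    for obj in objects:
        # Every surviving object must agree on the layout parameters.
        if (obj.stripe_count, obj.stripe_size) != (stripe_count, stripe_size):
            raise ValueError("objects disagree on stripe count or size")
        if not 0 <= obj.stripe_index < stripe_count:
            raise ValueError("stripe index outside the recorded stripe count")
        if slots[obj.stripe_index] is not None:
            raise ValueError("two objects claim the same stripe index")
        slots[obj.stripe_index] = obj
    return stripe_size, slots
```

Because the total stripe count comes from the surviving objects themselves, a missing stripe shows up as an explicit hole rather than silently shrinking the layout.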
            pjones Peter Jones added a comment -

            That matches my understanding, Nathan.


            ndauchy Nathan Dauchy (Inactive) added a comment -

            It looks like there were many iterations on this patch, but it is ready for final review and then landing. Please confirm.

            Also, once the patch is finalized, we will need a backport to the 2.7 FE branch as well as master. Thanks!

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21562
            Subject: LU-8288 lfsck: handle dangling LOV EA reference
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 61dc2ac65258fceb30bf0549e76b8ff7eace2d29

            yong.fan nasf (Inactive) added a comment -

            I'm also a bit confused why the "CLIENT step5" layout was not reconstructed with all 12 of the original stripes? Was lfsck still running on the other OSTs?

            Only an OST-object that has been modified (write/setattr) after creation has a PFID EA; the LFSCK then handles it as an orphan if no MDT-object references it. In this case, I am not sure whether all 12 of the original stripes had been modified before the MDT-object's LOV EA was removed.

            Why were there two objects allocated on OST index 0, and why was a new object allocated on OST index 0 (objid 491045) in place of the manually recreated

            That is also my concern. Currently, a given striped file has at most one OST-object on any specified OST. In this case, I am afraid that the wrong OST-object was written?
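The orphan rule described in the comment above can be sketched as follows. This is only an illustration under assumptions: `OstObject`, `parent_fid`, and `classify_unreferenced` are hypothetical names, and `parent_fid` being `None` stands in for an object that was never written or setattr'd and so carries no PFID EA.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class OstObject:
    object_id: int
    # Hypothetical field: None models an object with no PFID EA.
    parent_fid: Optional[str]


def classify_unreferenced(
    ost_objects: Iterable[OstObject], referenced_ids: Set[int]
) -> Tuple[List[OstObject], List[OstObject]]:
    """Split objects that no MDT-object references into two groups.

    orphans: carry a PFID EA, so they can be tied back to a parent file.
    unattributed: never modified, so nothing says which file owned them.
    """
    orphans: List[OstObject] = []
    unattributed: List[OstObject] = []
    for obj in ost_objects:
        if obj.object_id in referenced_ids:
            continue  # still referenced by some LOV EA, not an orphan
        if obj.parent_fid is not None:
            orphans.append(obj)
        else:
            unattributed.append(obj)
    return orphans, unattributed
```

Under this model, a stripe that was created but never modified would not be recovered as an orphan, which is one possible reason a layout could come back with fewer stripes than the original.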

            adilger Andreas Dilger added a comment -

            The problem that was seen in "CLIENT step5" could be fixed with the fidea changes being implemented for PFL Phase 3a. In particular, the current fidea does not store the total number of stripes in the layout, so old stripes found on the OST (e.g. with "stripe_idx = 4" in this case) would currently be added to the file layout and the stripe count increased. With the new PFL fidea the total stripe count is also saved with each OST object, which could be used in this case to determine whether the orphan OST objects are part of the same layout or not.

            In the common case, the use of default stripe counts may mean that the total stripe count is the same between multiple sets of orphan OST objects. However, I don't think that would be a problem. It would avoid the case seen here where stale objects with a higher stripe index are added to the recreated file with fewer stripes. If the file was recreated, then all objects should be present, so if old orphan objects have the same stripe count they will not be added to the layout and will be put into lost+found instead. If the old orphan objects have a different stripe count then they should not be added to the existing file.

            I'm also a bit confused why the "CLIENT step5" layout was not reconstructed with all 12 of the original stripes? Was lfsck still running on the other OSTs? Why were there two objects allocated on OST index 0, and why was a new object allocated on OST index 0 (objid 491045) in place of the manually recreated object (objid 491044)?
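A rough sketch of the stripe-count check discussed above, assuming each orphan carries its stripe index and the total stripe count of the layout it came from. The function name `place_orphan` and the string results are hypothetical, and treating an empty slot with a matching stripe count as safe to fill is an assumption of this sketch rather than something stated in the ticket.

```python
from typing import List, Optional


def place_orphan(
    layout_slots: List[Optional[int]],
    orphan_stripe_index: int,
    orphan_stripe_count: int,
) -> str:
    """Decide where an orphan OST object should go.

    layout_slots[i] holds the object id for stripe i of the existing
    file, or None if that stripe has no object.
    """
    # Different total stripe count: the orphan belongs to another layout,
    # so it must not be merged into this file.
    if orphan_stripe_count != len(layout_slots):
        return "lost+found"
    if not 0 <= orphan_stripe_index < len(layout_slots):
        return "lost+found"
    # Same stripe count but the slot is already populated (for example,
    # the file was recreated): keep the existing object.
    if layout_slots[orphan_stripe_index] is not None:
        return "lost+found"
    return "attach"
```

With this check, a stale object recorded as stripe 4 of a 12-stripe layout would not be appended to a recreated 2-stripe file, which is the failure described for "CLIENT step5".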
            yong.fan nasf (Inactive) added a comment - - edited

            2.1) If nobody has modified the newly created OST-object before the layout LFSCK finds the real orphan OST-object, then the layout LFSCK will drop the newly created OST-object and replace it with the real orphan OST-object. Otherwise,
            2.2) Since the newly created OST-object contains new data, we cannot drop it. To make the user aware that there was a conflict, the layout LFSCK will generate a new file under .lustre/lost+found with the name $FID-$infix-$conflict_version that contains the old data.

            In fact, the key issue is that during the layout LFSCK first-phase scanning (orphan OST-objects are detected in the second-phase scanning), if it finds that some LOV EA references a non-existent OST-object, it does not know whether the OST-object was lost or the LOV EA is corrupted. In the former case, re-creating the lost OST-object makes the system available as quickly as possible; in the latter case, correcting the LOV EA is the better choice. So there are two possible solutions:

            1) Postpone the layout LFSCK repair decision for the dangling reference case until the orphan OST-objects have been handled properly. That means introducing a third-phase scan, which would significantly affect the whole LFSCK framework.

            2) Never re-create the lost OST-object.

            Andreas, what do you think about that?

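Cases 2.1 and 2.2 above can be sketched as a small decision function. This is an illustration only: the `Action` enum and the function names are hypothetical, the infix value is a placeholder, and only the `$FID-$infix-$conflict_version` name pattern comes from the comment itself.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    REPLACE_WITH_ORPHAN = "replace"   # case 2.1
    KEEP_NEW_SAVE_OLD = "keep-new"    # case 2.2


def conflict_name(fid: str, infix: str, conflict_version: int) -> str:
    """Build a lost+found name following $FID-$infix-$conflict_version."""
    return f"{fid}-{infix}-{conflict_version}"


def resolve_conflict(
    new_object_modified: bool, fid: str, infix: str, conflict_version: int
) -> Tuple[Action, Optional[str]]:
    """Return the action and, for case 2.2, the name for the old data."""
    if not new_object_modified:
        # 2.1: the re-created object holds no data, so it can be dropped
        # and the real orphan takes its place.
        return Action.REPLACE_WITH_ORPHAN, None
    # 2.2: the re-created object holds new data and must be kept; the old
    # data is exposed under .lustre/lost+found with a conflict name.
    return Action.KEEP_NEW_SAVE_OLD, conflict_name(fid, infix, conflict_version)
```

Neither branch merges old and new objects into one file, so the mixed-data outcome described in the issue does not arise in this model.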

            People

              yong.fan nasf (Inactive)
              ndauchy Nathan Dauchy (Inactive)