  1. Lustre
  2. LU-4701 LFSCK phase II technical debts
  3. LU-4895

LFSCK should not create empty OST objects for dangling layout references by default

Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0
    • Lustre 2.6.0
    • None
    • 13531

    Description

      Per discussion today, LFSCK phase 2 should not create empty OST objects by default for MDT LOV layouts that reference missing OST objects. By default LFSCK should log an error (see LU-4610) and leave the LOV layout referencing the missing OST object (read/write/stat to the object should continue to return an error).

      The administrator can specify an option to delete files with dangling links, or create empty objects to fix the dangling reference. Otherwise, it should leave the dangling reference unfixed.

      There should be a generic mechanism for specifying different repair options, including a way of specifying defaults for all of the repair options in a file.
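
A generic repair-option mechanism with file-based defaults could look roughly like the sketch below. It is only an illustration of the idea: the option name `dangling`, its values (`report`, `create`, `delete`), the `key = value` file syntax, and every function name are assumptions made for this example, not the LFSCK interface.

```python
from dataclasses import dataclass, replace

# Hypothetical option table (assumption): option name -> allowed values.
KNOWN_OPTIONS = {"dangling": {"report", "create", "delete"}}


@dataclass(frozen=True)
class RepairOptions:
    # Default mirrors the description: log the error and leave the
    # dangling reference unfixed.
    dangling: str = "report"


def apply_option(opts: RepairOptions, key: str, value: str) -> RepairOptions:
    """Return a copy of opts with one validated option overridden."""
    if key not in KNOWN_OPTIONS:
        raise ValueError(f"unknown repair option: {key!r}")
    if value not in KNOWN_OPTIONS[key]:
        raise ValueError(f"invalid value {value!r} for option {key!r}")
    return replace(opts, **{key: value})


def parse_defaults(text: str) -> RepairOptions:
    """Parse 'key = value' lines; '#' starts a comment."""
    opts = RepairOptions()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {raw!r}")
        opts = apply_option(opts, key.strip(), value.strip())
    return opts


# Site-wide defaults from a file, then a per-run override by the administrator.
site_defaults = parse_defaults("# site defaults\ndangling = report\n")
this_run = apply_option(site_defaults, "dangling", "create")
```

Keeping the options immutable and validating every override in one place means an unknown or misspelled option fails loudly instead of silently falling back to a repair action the administrator did not ask for.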

          Activity

            [LU-4895] LFSCK should not create empty OST objects for dangling layout references by default

            yong.fan nasf (Inactive) added a comment - The patch has landed on master.
            yong.fan nasf (Inactive) added a comment - Here is the patch: http://review.whamcloud.com/9989

            yong.fan nasf (Inactive) added a comment -

            LFSCK will then offer two options for the dangling reference case:

            1) Report the error via the LFSCK log, currently CDEBUG(D_LFSCK). (default)
            2) Re-create the lost OST-object.

            In general there should be a third option to delete the file with the dangling reference, but because the layout LFSCK does not understand the namespace, we can consider adding it in LFSCK phase 3.
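
The two behaviours can be contrasted with a small in-memory model. This is a sketch under stated assumptions, not the LFSCK code: OST objects are modelled as a plain dictionary, and `DanglingPolicy`, `check_layout` and `read_stripe` are hypothetical names introduced for illustration.

```python
import logging
from enum import Enum

log = logging.getLogger("lfsck_sketch")


class DanglingPolicy(Enum):
    REPORT = "report"      # default: log the error, leave the reference alone
    RECREATE = "recreate"  # re-create the lost OST-object as an empty object


def check_layout(layout, ost_objects, policy=DanglingPolicy.REPORT):
    """Check one file's layout against the existing OST objects.

    layout      -- list of object IDs referenced by the file's layout
    ost_objects -- dict mapping object ID -> bytes (toy stand-in for OSTs)
    Returns the list of dangling object IDs that were found.
    """
    dangling = [oid for oid in layout if oid not in ost_objects]
    for oid in dangling:
        if policy is DanglingPolicy.RECREATE:
            ost_objects[oid] = b""  # empty object; the original data is gone
            log.warning("re-created empty object %s", oid)
        else:
            log.error("dangling reference to object %s left unfixed", oid)
    return dangling


def read_stripe(ost_objects, oid):
    """Reading a missing object fails; reading a re-created one succeeds."""
    if oid not in ost_objects:
        raise FileNotFoundError(f"OST object {oid} is missing")
    return ost_objects[oid]
```

With the default policy, `read_stripe` on the missing object keeps raising an error, which matches the description's requirement that read/write/stat continue to fail. With `RECREATE`, the same read succeeds and returns empty data.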

            adilger Andreas Dilger added a comment - The bad effect of automatically creating objects for dangling layout references is that it hides filesystem corruption from users, who may then read bad (zero) data from the repaired file. That may cause an application to compute a wrong result instead of failing with an error that alerts the user that the file data was lost.
            yong.fan nasf (Inactive) added a comment - edited

            Sorry, I still do not quite understand the shortcoming (or bad effect) of creating the lost OST-object. On the other hand, as I understand it, the option of deleting a file just because it has lost some OST-objects may not be a good choice. In a distributed filesystem, losing one stripe does not mean losing everything. We should try to keep as much data as possible instead of destroying it, in line with the requirement in the solution architecture document.

            Another reason for NOT deleting a file that has lost some of its stripes: some of the LOV EA slots may be wrong, meaning that the LOV EA may be invalid and reference a non-existent OST-object, but as LFSCK proceeds we can find the lost OST-object when handling orphans. If LFSCK deletes the file during the first-stage scanning, we lose the chance to repair the bad LOV EA.

            The third reason: the layout LFSCK does not understand the namespace, so it is something of a hack for the layout LFSCK to remove the name entry from its parent directory (in the worst case the file may have no [valid] linkEA), especially for files with multiple links.

            So my suggestion is: if we really want to give the administrator an option to delete a file that has lost some of its stripe(s), then we can link the file into .lustre/lost+found/MDTxxxx/ under a special name. If LFSCK is eventually able to repair it, the file is unlinked from .lustre/lost+found/MDTxxxx/, and its original name entry is still in the normal namespace; otherwise, the administrator can easily see which files have dangling references and can destroy them manually if desired.

            So there would be two options for the dangling reference case:

            1) Link the file into .lustre/lost+found/MDTxxxx/ under a special name without creating the lost OST-object. (default)

            2) Keep it in the namespace and re-create the lost OST-object (as the current LFSCK does).

            The same applies to MDT-MDT consistency. The MDT-MDT LFSCK can recover as many linkEA entries as possible (unless the linkEA entries exceed the size limit). Then, after the whole LFSCK run, the administrator can easily check what is under .lustre/lost+found/MDTxxxx.

            Honestly, I would prefer that LFSCK NOT leave dangling inconsistencies in place with only an error reported, unless "dryrun" is specified. Otherwise, adding an entry under .lustre/lost+found/MDTxxxx/ is more convenient for the administrator, because the error log can be removed or overwritten, whereas the record under .lustre/lost+found/ stays in place.
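
The lost+found proposal in this comment can be sketched as a two-pass toy model. Everything here is an assumption made for illustration: the dictionaries stand in for the namespace, the OSTs and .lustre/lost+found/MDTxxxx/, and the `dangling-<fid>` naming scheme and function names are hypothetical rather than part of LFSCK.

```python
def has_dangling(layout, ost_objects):
    """True if the layout references at least one missing OST object."""
    return any(oid not in ost_objects for oid in layout)


def first_stage_scan(files, ost_objects, lost_found):
    """Link files with dangling references into lost+found.

    files       -- dict: file ID -> list of referenced object IDs
    ost_objects -- dict: object ID -> bytes
    lost_found  -- dict: special name -> file ID (stand-in for the
                   .lustre/lost+found/MDTxxxx/ directory)
    The original entries in `files` are never removed.
    """
    for fid, layout in files.items():
        if has_dangling(layout, ost_objects):
            lost_found[f"dangling-{fid}"] = fid  # hypothetical special name


def recheck_after_orphans(files, ost_objects, lost_found):
    """Unlink entries whose missing objects were found during orphan handling."""
    for name, fid in list(lost_found.items()):
        if not has_dangling(files[fid], ost_objects):
            del lost_found[name]


# Toy run: f1 references object 2, which is missing at first.
files = {"f1": [1, 2], "f2": [3]}
ost_objects = {1: b"x", 3: b"y"}
lost_found = {}

first_stage_scan(files, ost_objects, lost_found)
marked = dict(lost_found)            # f1 is now listed under lost+found

ost_objects[2] = b"z"                # orphan handling recovers object 2
recheck_after_orphans(files, ost_objects, lost_found)
```

Whatever remains in `lost_found` after the second pass is the list of files the administrator would need to inspect, and because `files` is never modified, the normal namespace entry survives in both outcomes.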


            People

              yong.fan nasf (Inactive)
              adilger Andreas Dilger