
debug and cleanup of corrupted PFID, unmatched MDT-object and OST-object pairs

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.7.0, Lustre 2.9.0
    • None
    • servers: lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64
      clients: lustre-client-2.9.0-2.3nasC_mofed34v1_4.4.74_92.32.1.20170808_nasa.x86_64
    • 3
    • 9223372036854775807

    Description

      This ticket is created to handle NASA-specific debugging of corrupted PFIDs discussed in LU-10248, as well as ensure ports of a fix for lfsck handling of repaired_unmatched_pair, and any other related questions to ensure proper running of lfsck and cleanup of the filesystem.

      servers: lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64 (basically the old FE branch, plus several cherry-picked patches)

      clients: lustre-client-2.9.0-2.3nasC_mofed34v1_4.4.74_92.32.1.20170808_nasa.x86_64

      We hope to upgrade both to a 2.10.2-based build in the near future.

          Activity

            [LU-10349] debug and cleanup of corrupted PFID, unmatched MDT-object and OST-object pairs

            jaylan Jay Lan (Inactive) added a comment -

            The above patch #30612 for b2_10 is at
            https://review.whamcloud.com/#/c/30628/1
            Could you land this?

            yong.fan nasf (Inactive) added a comment -

            ndauchy,

            The patch https://review.whamcloud.com/#/c/30612/, which fixes the unexpected inconsistent owner issue, has already landed on master. It has also been ported to the b2_7_fe branch via the patch https://review.whamcloud.com/30613. You can use the related patch to resolve your system trouble. Please let me know what else you need.

            yong.fan nasf (Inactive) added a comment -

            The patch to repair the unexpected inconsistent owner on b2_7_fe:
            https://review.whamcloud.com/30613

            ndauchy Nathan Dauchy (Inactive) added a comment -

            > So I would suggest keeping the system unchanged, since the inconsistencies have no impact for now.

            OK, we will proceed with removing the additional OSTs from the file system, wait for the patched lfsck to be installed on the 2.7.3 servers, and perform the cleanup later. Thanks!

            yong.fan nasf (Inactive) added a comment -

            > Regardless, it will be a while before we can complete the server rebuild and take a downtime to apply it. In the meantime I need to move forward with removing additional OSTs from the file system to free up the hardware for spares. Do you recommend we run the current lfsck in non-dry-run mode, or just go ahead and remove the OST now and wait for the lfsck updates?

            There are two kinds of inconsistency reported by the layout LFSCK. One is inconsistent owner information, which is caused by a known layout LFSCK issue and can be ignored. The other is the unmatched MDT-object and OST-object pairs. Such inconsistency will NOT affect normal system access unless I/O verification (disabled by default) is explicitly enabled. So I would suggest keeping the system unchanged, since the inconsistencies have no impact for now.

            > Review #16135 has been in 'Need Code-Review' state for more than 1 year. Is it OK to cherry-pick as it is, Fan?

            Yes, I think so. That patch has already landed on b2_8_fe and master. It is probably not on b2_7_fe because b2_7 was already rather old at that time and the issue was not considered very serious.
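The distinction nasf draws above (owner inconsistencies that can be ignored versus unmatched MDT-object/OST-object pairs) can be checked against the counters a layout LFSCK run reports. The following is only a minimal sketch: it assumes the status text is a series of "key: value" lines such as `lctl get_param -n mdd.<fsname>-MDT0000.lfsck_layout` prints, and the helper names (`parse_counters`, `triage`) and the sample values are hypothetical, not taken from this file system.

```python
# Sketch: pull the two counters discussed in this ticket out of layout LFSCK
# status text. Assumption: the text consists of "key: value" lines.

# How the comment above treats each counter.
TRIAGE = {
    "repaired_inconsistent_owner": "known layout LFSCK issue, can be ignored",
    "repaired_unmatched_pair": "unmatched MDT-object/OST-object pairs, "
                               "harmless unless I/O verification is enabled",
}


def parse_counters(status_text):
    """Return {key: int} for every line whose value is a plain integer."""
    counters = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def triage(status_text):
    """Map each counter of interest to (count, note); missing counters give 0."""
    counters = parse_counters(status_text)
    return {name: (counters.get(name, 0), note) for name, note in TRIAGE.items()}


# Hypothetical sample status text, for illustration only.
SAMPLE = """\
name: lfsck_layout
status: completed
repaired_inconsistent_owner: 120
repaired_unmatched_pair: 7
"""

if __name__ == "__main__":
    for name, (count, note) in triage(SAMPLE).items():
        print(f"{name}: {count} ({note})")
```

Non-numeric fields such as `status` are skipped, so the same function could be pointed at a saved copy of the real status output without further changes, provided the "key: value" assumption holds.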

            jaylan Jay Lan (Inactive) added a comment -

            We do not have either patch.

            Review #16135 has been in 'Need Code-Review' state for more than 1 year. Is it OK to cherry-pick as it is, Fan?

            ndauchy Nathan Dauchy (Inactive) added a comment -

            Since we are based on 2.7.3 FE branch, I'm pretty confident we have the first of those patches, but not the second since it looks like you just created the backport (and it failed a build test). Will confirm though.

            Regardless, it will be a while before we can complete the server rebuild and take a downtime to apply it. In the meantime I need to move forward with removing additional OSTs from the file system to free up the hardware for spares. Do you recommend we run the current lfsck in non-dry-run mode, or just go ahead and remove the OST now and wait for the lfsck updates?

            Thanks!

            yong.fan nasf (Inactive) added a comment -

            ndauchy, would you please check whether your branch has the following two patches:
            https://review.whamcloud.com/#/c/16135/
            https://review.whamcloud.com/#/c/30447/

            If not, please try to apply them and re-run the layout LFSCK in dry-run mode. Thanks!
            pjones Peter Jones added a comment -

            Fan Yong is already assisting here

            ndauchy Nathan Dauchy (Inactive) added a comment -

            At this time, it sounds like we should go with "choice 2" in the comment from nasf:
            https://jira.hpdd.intel.com/browse/LU-10248?focusedCommentId=215528&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-215528

            We will leave the inconsistencies the way they are until the lfsck patch is available and we can upgrade the servers to include it.

            The only caveat is that we will be continuing to drain and completely remove multiple OSTs from this file system (to free up the hardware for spares), and I want to make sure that won't combine with the PFID issues and confuse lfsck. One OST is already gone, and the last dry-run lfsck did not appear to die on it, so hopefully we are OK on that front. The one concern is that the debug logs were being overwritten with the repaired_inconsistent_owner errors, which might have masked other problems. Please let me know ASAP if I should not remove additional OSTs.

            People

              yong.fan nasf (Inactive)
              ndauchy Nathan Dauchy (Inactive)