[LU-10349] debug and cleanup of corrupted PFID, unmatched MDT-object and OST-object pairs Created: 07/Dec/17  Updated: 26/Mar/18  Resolved: 26/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Nathan Dauchy (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

servers: lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64
clients: lustre-client-2.9.0-2.3nasC_mofed34v1_4.4.74_92.32.1.20170808_nasa.x86_64


Issue Links:
Duplicate
duplicates LU-10422 layout LFSCK try to fix consistent ow... Resolved
duplicates LU-6420 layout LFSCK fixing dangling/unmatche... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This ticket was created to handle NASA-specific debugging of the corrupted PFIDs discussed in LU-10248, to ensure that a fix for lfsck handling of repaired_unmatched_pair gets ported, and to cover any related questions about running lfsck properly and cleaning up the file system.

servers: lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64 (basically the old FE branch, plus several cherry-picked patches)

clients: lustre-client-2.9.0-2.3nasC_mofed34v1_4.4.74_92.32.1.20170808_nasa.x86_64

We hope to upgrade both to a 2.10.2-based build in the near future.



 Comments   
Comment by Nathan Dauchy (Inactive) [ 07/Dec/17 ]

At this time, it sounds like we should go with "choice 2" in the comment from nasf:
https://jira.hpdd.intel.com/browse/LU-10248?focusedCommentId=215528&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-215528

We will leave the inconsistencies the way they are until the lfsck patch is available and we can upgrade the servers to include it.

The only caveat is that we will continue to drain and completely remove multiple OSTs from this file system (to free up the hardware for spares), and I want to make sure that won't combine with the PFID issues and confuse lfsck. One OST is already gone, and the last dry-run lfsck did not appear to die on it, so hopefully we are OK on that front. However, the debug logs being overwritten by the repaired_inconsistent_owner errors might have masked other problems. Please let me know ASAP if I should not remove additional OSTs.

 

Comment by Peter Jones [ 07/Dec/17 ]

Fan Yong is already assisting here

Comment by nasf (Inactive) [ 08/Dec/17 ]

ndauchy, would you please check whether your branch has the following two patches:
https://review.whamcloud.com/#/c/16135/
https://review.whamcloud.com/#/c/30447/

If not, please apply them and re-run the layout LFSCK in dry-run mode. Thanks!

Comment by Nathan Dauchy (Inactive) [ 08/Dec/17 ]

Since we are based on 2.7.3 FE branch, I'm pretty confident we have the first of those patches, but not the second since it looks like you just created the backport (and it failed a build test). Will confirm though.

Regardless, it will be a while before we can complete the server rebuild and take a downtime to apply it. In the meantime I need to move forward with removing additional OSTs from the file system to free up the hardware for spares. Do you recommend we run the current lfsck in non-dry-run mode, or just go ahead and remove the OST now and wait for the lfsck updates?

Thanks!

Comment by Jay Lan (Inactive) [ 08/Dec/17 ]

We do not have either.

Review #16135 has been in the 'Need Code-Review' state for more than a year. Is it OK to cherry-pick it as is, Fan?

Comment by nasf (Inactive) [ 12/Dec/17 ]

"Regardless, it will be a while before we can complete the server rebuild and take a downtime to apply it. In the meantime I need to move forward with removing additional OSTs from the file system to free up the hardware for spares. Do you recommend we run the current lfsck in non-dry-run mode, or just go ahead and remove the OST now and wait for the lfsck updates?"

The layout LFSCK reports two kinds of inconsistency. One is inconsistent owner information, which is caused by a known layout LFSCK issue and can be ignored. The other is unmatched MDT-object and OST-object pairs. This inconsistency will NOT affect normal system access unless I/O verification (disabled by default) is explicitly enabled. So I would suggest keeping the system unchanged, since the inconsistencies have no effect for now.
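
As a rough illustration of the triage described above, the following Python sketch groups LFSCK counters into "can be ignored" and "leave until the patched lfsck is installed". It assumes the layout LFSCK status is available as plain "name: value" text lines. The function names, the grouping sets, and the sample numbers are hypothetical and are not taken from this ticket; only the counter names repaired_inconsistent_owner and repaired_unmatched_pair appear in it.

```python
from typing import Dict

# Counter names come from the ticket; the grouping mirrors the advice above
# and is an assumption of this sketch, not part of any Lustre tool.
IGNORABLE = {"repaired_inconsistent_owner"}   # known layout LFSCK issue
DEFERRED = {"repaired_unmatched_pair"}        # fix later with patched lfsck


def parse_counters(status_text: str) -> Dict[str, int]:
    """Collect every 'name: integer' line; other lines are skipped."""
    counters: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def summarize(counters: Dict[str, int]) -> Dict[str, int]:
    """Sum the 'repaired_*' counters into three triage buckets."""
    summary = {"ignorable": 0, "deferred": 0, "other": 0}
    for name, count in counters.items():
        if not name.startswith("repaired_"):
            continue  # not a repair counter
        if name in IGNORABLE:
            summary["ignorable"] += count
        elif name in DEFERRED:
            summary["deferred"] += count
        else:
            summary["other"] += count
    return summary


if __name__ == "__main__":
    # Invented sample input for illustration only.
    sample = (
        "status: completed\n"
        "repaired_inconsistent_owner: 12\n"
        "repaired_unmatched_pair: 3\n"
    )
    print(summarize(parse_counters(sample)))
```

A non-zero "other" bucket would be the signal worth a closer look, since those counts fall outside the two kinds discussed here.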

"Review #16135 has been in the 'Need Code-Review' state for more than a year. Is it OK to cherry-pick it as is, Fan?"

Yes, I think so. That patch has already landed on b2_8_fe and master. It is probably not on b2_7_fe because b2_7 was already fairly old at the time and the issue is not very serious.

Comment by Nathan Dauchy (Inactive) [ 12/Dec/17 ]

"So I would suggest keeping the system unchanged, since the inconsistencies have no effect for now."

OK, we will proceed with removing the additional OSTs from the file system, then wait until the patched lfsck is installed on the 2.7.3 servers and perform the cleanup afterwards. Thanks!

Comment by nasf (Inactive) [ 20/Dec/17 ]

The patch for the unexpected inconsistent owner repair issue, for b2_7_fe:
https://review.whamcloud.com/30613

Comment by nasf (Inactive) [ 26/Jan/18 ]

ndauchy,

The patch https://review.whamcloud.com/#/c/30612/ for fixing the unexpected inconsistent owner issue has already landed on master, and it has been ported to the b2_7_fe branch via the patch https://review.whamcloud.com/30613. You can use the related patch to resolve the trouble on your system. Please let me know what else you need.

Comment by Jay Lan (Inactive) [ 29/Jan/18 ]

The b2_10 port of the above patch #30612 is at
https://review.whamcloud.com/#/c/30628/1
Could you land this?

Comment by Peter Jones [ 29/Jan/18 ]

Jay

That is being tracked under LU-10422, and it will land as soon as the reviews have completed.

Peter

Comment by Peter Jones [ 26/Mar/18 ]

AFAICT this is now resolved, with the LU-10422 fix landed on b2_10.
