[LU-10349] debug and cleanup of corrupted PFID, unmatched MDT-object and OST-object pairs Created: 07/Dec/17 Updated: 26/Mar/18 Resolved: 26/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Nathan Dauchy (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
servers: lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64 |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
This ticket is created to handle NASA-specific debugging of corrupted PFIDs discussed in servers: lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64 (basically the old FE branch, plus several cherry-picked patches) clients: lustre-client-2.9.0-2.3nasC_mofed34v1_4.4.74_92.32.1.20170808_nasa.x86_64 We hope to upgrade both to a 2.10.2-based build in the near future. |
| Comments |
| Comment by Nathan Dauchy (Inactive) [ 07/Dec/17 ] |
|
At this time, it sounds like we should go with "choice 2" in the comment from nasf: We will leave the inconsistencies the way they are until the lfsck patch is available and we can upgrade the servers to include it. The only caveat is that we will be continuing to drain and completely remove multiple OSTs from this file system (to free up the hardware for spares) and I want to make sure that won't combine with the PFID issues and confuse lfsck. One OST is already gone, and the last dry-run lfsck did not appear to die on it, so hopefully we are OK on that front. The only concern is that the debug logs being overwritten with the repaired_inconsistent_owner errors might have masked other problems. Please let me know ASAP if I should not remove additional OSTs.
|
| Comment by Peter Jones [ 07/Dec/17 ] |
|
Fan Yong is already assisting here |
| Comment by nasf (Inactive) [ 08/Dec/17 ] |
|
ndauchy, would you please check whether your branch has the following two patches: If not, please try to apply them and re-run the layout LFSCK in dry-run mode. Thanks! |
| Comment by Nathan Dauchy (Inactive) [ 08/Dec/17 ] |
|
Since we are based on the 2.7.3 FE branch, I'm pretty confident we have the first of those patches, but not the second, since it looks like you just created the backport (and it failed a build test). Will confirm though. Regardless, it will be a while before we can complete the server rebuild and take a downtime to apply it. In the meantime I need to move forward with removing additional OSTs from the file system to free up the hardware for spares. Do you recommend we run the current lfsck in non-dry-run mode, or just go ahead and remove the OST now and wait for the lfsck updates? Thanks! |
| Comment by Jay Lan (Inactive) [ 08/Dec/17 ] |
|
We do not have either. Review #16135 has been in the 'Need Code-Review' state for more than 1 year. Is it OK to cherry-pick it as is, Fan? |
| Comment by nasf (Inactive) [ 12/Dec/17 ] |
There are two kinds of inconsistency reported by the layout LFSCK. One is inconsistent owner information, which is caused by a known layout LFSCK issue and can be ignored. The other is the unmatched MDT-object and OST-object pairs. Such inconsistency will NOT affect normal system access unless I/O verification (disabled by default) is explicitly enabled. So I would suggest keeping the system unchanged, since the inconsistency has no impact for now.
Yes, I think so. That patch has already landed on b2_8_fe and master. It is not on b2_7_fe, perhaps because b2_7 was already somewhat old at that time and the issue is not very serious. |
| Comment by Nathan Dauchy (Inactive) [ 12/Dec/17 ] |
OK, we will proceed with removing the additional OSTs from the file system, wait for the patched lfsck to be installed on the 2.7.3 servers, and perform the cleanup later. Thanks! |
| Comment by nasf (Inactive) [ 20/Dec/17 ] |
|
The patch to repair the unexpected inconsistent owner issue on b2_7_fe: |
| Comment by nasf (Inactive) [ 26/Jan/18 ] |
|
The patch https://review.whamcloud.com/#/c/30612/ for fixing the unexpected inconsistent owner issue has already landed on master, and it has been ported to the b2_7_fe branch via the patch https://review.whamcloud.com/30613. You can use the related patch to resolve the problem on your system. Please let me know what else you need. |
| Comment by Jay Lan (Inactive) [ 29/Jan/18 ] |
|
The above patch #30612 for b2_10 is at |
| Comment by Peter Jones [ 29/Jan/18 ] |
|
Jay That is being tracked under Peter |
| Comment by Peter Jones [ 26/Mar/18 ] |
|
AFAICT this is now resolved with the |