[LU-15016] OI Scrub backup and rebuild Created: 17/Sep/21  Updated: 28/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: LFSCK

Issue Links:
Related
is related to LU-12268 LDISKFS-fs error: ldiskfs_find_dest_d... Resolved
is related to LU-12265 LustreError: 141027:0:(osd_iam_lfix.c... Reopened
Rank (Obsolete): 9223372036854775807

 Description   

There are two separate changes that could be made to improve this situation in the future, instead of the MDT being taken offline and waiting for a full OI rebuild to finish:

  • Handle this IAM error more gracefully by resetting the IAM block with the corrupt magic, and possibly scanning the rest of the IAM file to recover any unlinked IAM blocks (though this may be no better than rebuilding the whole IAM file, together with the next option). Then trigger an internal OI Scrub to re-insert any missing FIDs into the existing OI file. That should be done under LU-12265.
  • Have the "resetoi" code save a backup of the OI files (e.g. oi.16.N.bak) and use it for FID->inode lookups that are missing from the new OI file while the new OI files are being rebuilt. That would allow most FID lookups to be served from the old OI during the rebuild (though not all, if the old OI had errors). The OI backups would be deleted after the OI Scrub has finished.
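
The lookup order proposed in the second item can be modelled roughly as follows. This is a minimal sketch in Python, not Lustre code: the class name OIWithBackup, its methods, and the use of plain dictionaries in place of IAM files are all assumptions made for illustration, and FIDs are treated as opaque strings.

```python
from typing import Dict, Optional


class OIWithBackup:
    """Toy model of an OI being rebuilt while an old copy serves as fallback.

    Hypothetical names throughout; dicts stand in for the on-disk OI files.
    """

    def __init__(self, old_oi: Dict[str, int]) -> None:
        self.new_oi: Dict[str, int] = {}          # rebuilt by OI Scrub
        self.backup: Optional[Dict[str, int]] = dict(old_oi)  # oi.16.N.bak

    def lookup(self, fid: str) -> Optional[int]:
        """Return the inode for fid, preferring the new OI over the backup."""
        ino = self.new_oi.get(fid)
        if ino is not None:
            return ino
        if self.backup is not None:
            # Entry not re-inserted yet: fall back to the old mapping.
            return self.backup.get(fid)
        return None

    def scrub_insert(self, fid: str, ino: int) -> None:
        """Called as the scrub re-discovers a FID->inode mapping."""
        self.new_oi[fid] = ino

    def scrub_done(self) -> None:
        """Drop the backup once the rebuild has finished."""
        self.backup = None


oi = OIWithBackup({"[0x200000401:0x1:0x0]": 12})
print(oi.lookup("[0x200000401:0x1:0x0]"))  # served from the backup
```

A lookup that misses in both maps returns None in this sketch, which corresponds to the "though not all" case above, where the old OI had lost the entry.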

Once these functions are implemented separately, it should be possible to combine them and add an osd-ldiskfs.*.resetoi=N parameter that can trigger "rename oi.16.N to oi.16.N.bak and rebuild" transparently to the running system.
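
The "rename and rebuild" sequence could look roughly like the sketch below, shown here on ordinary files in a temporary directory. The helper names reset_oi and finish_scrub are hypothetical, and the real operation would happen inside osd-ldiskfs on the backing filesystem rather than through userspace file calls; only the file naming (oi.16.N, oi.16.N.bak) comes from the description above.

```python
import tempfile
from pathlib import Path


def reset_oi(root: Path, n: int) -> Path:
    """Rename oi.16.N to oi.16.N.bak and create an empty replacement.

    Hypothetical helper; returns the path of the backup file.
    """
    live = root / f"oi.16.{n}"
    backup = root / f"oi.16.{n}.bak"
    live.rename(backup)   # old OI stays available for fallback lookups
    live.touch()          # new, empty OI to be filled by the OI Scrub
    return backup


def finish_scrub(backup: Path) -> None:
    """Delete the backup once the OI Scrub has completed."""
    backup.unlink()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "oi.16.3").write_bytes(b"old index contents")
    bak = reset_oi(root, 3)
    print(sorted(p.name for p in root.iterdir()))
    finish_scrub(bak)
    print(sorted(p.name for p in root.iterdir()))
```

Keeping the backup until the scrub completes is what lets lookups continue during the rebuild; deleting it earlier would bring back the offline window this ticket is trying to avoid.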



 Comments   
Comment by Andreas Dilger [ 28/Jan/22 ]

"Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45071
Subject: LU-12265 osd: fix corrupted OI file online
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c0b2d11c325e042f724447ee45bc1ca1d2ff5379
