[LU-16689] upgrade to 2.15.2 lost several top-level directories Created: 30/Mar/23 Updated: 13/May/23 Resolved: 13/May/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Andreas Dilger |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
After upgrading the filesystem from 2.12 to 2.15.2, several top-level directories were corrupted.
[root@nbp11-srv1 ~]# ls -l /nobackupp11/
debugfs: stat ylin4
Not thinking I delete these via ldiskfs. The data is still there; how can we recover the directory data?
|
| Comments |
| Comment by Mahmoud Hanafi [ 30/Mar/23 ] |
|
I started an LFSCK dry-run. On MDT0 I am getting a lot of these errors, which are for files with hard links:
Mar 30 12:32:28 nbp11-srv1 kernel: ret_from_fork+0x1f/0x40
On MDT2 I am getting these errors:
Mar 30 12:33:43 nbp11-srv5 kernel: Lustre: nbp11-MDT0002-osd: layout LFSCK master found bad lmm_oi for [0x2400ecb78:0x1e8bb:0x0]: rc = 56
These are the files for the bad directories. |
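To see which objects the layout LFSCK is complaining about, the "bad lmm_oi" messages can be collected from the kernel log. The following is a minimal sketch that assumes the message format matches the one line quoted above; the function name `find_bad_lmm_oi` and the regular expression are illustrative and are not part of any Lustre tool.

```python
import re

# Pattern modelled on the single log line quoted in this ticket (an assumption:
# other Lustre versions may word the message differently).
BAD_LMM_OI_RE = re.compile(
    r"(?P<target>\S+)-osd: layout LFSCK master found bad lmm_oi for "
    r"(?P<fid>\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\]): "
    r"rc = (?P<rc>-?\d+)"
)


def find_bad_lmm_oi(lines):
    """Return (target, fid, rc) tuples for every matching log line."""
    hits = []
    for line in lines:
        match = BAD_LMM_OI_RE.search(line)
        if match:
            hits.append((match.group("target"), match.group("fid"), int(match.group("rc"))))
    return hits


if __name__ == "__main__":
    sample = (
        "Mar 30 12:33:43 nbp11-srv5 kernel: Lustre: nbp11-MDT0002-osd: layout LFSCK "
        "master found bad lmm_oi for [0x2400ecb78:0x1e8bb:0x0]: rc = 56"
    )
    for target, fid, rc in find_bad_lmm_oi([sample]):
        print(target, fid, rc)
```

The extracted FIDs could then be resolved to paths to check whether they fall under the affected directories.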
| Comment by Mahmoud Hanafi [ 30/Mar/23 ] |
|
I recovered the files. I found the parent FID, cd'd into /fs/.lustre/fid/fidnum, and then moved all contents to a newly created directory. I would still like to understand what caused the corruption. |
| Comment by Dongyang Li [ 30/Mar/23 ] |
|
Hi Mahmoud, 2 questions: |
| Comment by Andreas Dilger [ 31/Mar/23 ] |
|
This looks like |
| Comment by Andreas Dilger [ 31/Mar/23 ] |
|
It may be that mounting the MDT with "-o resetoi" would have rebuilt the OI files without having to move them from lost+found, in case someone finds this ticket in the future. |
| Comment by Mahmoud Hanafi [ 31/Mar/23 ] |
|
I used debugfs to dump all FIDs in /REMOTE_DIR on each MDT. Then I ran fid2path lookups on those FIDs to match the directories that were missing. I then cd'd into /fs/.lustre/fid/fidnum and moved all contents to the new location. The dry-run LFSCK is still running and finding lots of these
These are files under the directories that were corrupted. |
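The move step described above can be sketched as follows. This is only an illustration under stated assumptions: `/fs` stands for the client mount point as in the comment, the FID is already known from the debugfs and fid2path steps, and the helper names (`fid_dir`, `plan_moves`, `recover_directory`) are hypothetical. It defaults to a dry run that only lists what would be moved.

```python
import os
import shutil


def fid_dir(mount_point, fid):
    """Build the .lustre/fid path for a FID (brackets stripped; an assumption)."""
    return os.path.join(mount_point, ".lustre", "fid", fid.strip("[]"))


def plan_moves(src_dir, dest_dir):
    """List (source, destination) pairs for every entry directly under src_dir."""
    return [
        (os.path.join(src_dir, name), os.path.join(dest_dir, name))
        for name in sorted(os.listdir(src_dir))
    ]


def recover_directory(mount_point, fid, dest_dir, dry_run=True):
    """Move the contents of the directory addressed by FID into dest_dir.

    With dry_run=True (the default) nothing is changed; the planned moves
    are only returned so they can be reviewed first.
    """
    moves = plan_moves(fid_dir(mount_point, fid), dest_dir)
    if not dry_run:
        os.makedirs(dest_dir, exist_ok=True)
        for src, dst in moves:
            shutil.move(src, dst)
    return moves
```

Reviewing the returned list before calling the function again with `dry_run=False` keeps the operation as cautious as doing the same steps by hand.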
| Comment by Andreas Dilger [ 31/Mar/23 ] |
|
Mahmoud, do you have any logs from the mount after the upgrade that indicate OI Scrub has been run/completed on the MDTs? It would be worthwhile to check the state of the OI files on the MDTs to confirm that they are correct:
mds# lctl get_param osd-ldiskfs.*.oi_scrub
osd-ldiskfs.testfs-MDT0000.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: completed
flags:
param:
time_since_last_completed: 6 seconds
time_since_latest_start: 6 seconds
time_since_last_checkpoint: 6 seconds
:
The important information here is that it reports oi_files: 64 and not some other number (which is what the
If this is showing "oi_files: 1" or 2 or similar, my recommendation would be to mount the MDTs with "-o resetoi" to force a rebuild of the OI files, or alternately mount MDTs with ldiskfs and move the "oi.16.X" files out of the filesystem and then remount as Lustre and it should rebuild them automatically at mount (this will take a few minutes). Having a small number of OI files will cause scalability/performance issues. |
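The oi_files check described above can be automated across several MDTs. The sketch below parses saved output of `lctl get_param osd-ldiskfs.*.oi_scrub` and flags any target whose oi_files value is not 64, the value this comment treats as healthy. The function names are illustrative, and the second target in the sample (testfs-MDT0001 with oi_files: 1) is hypothetical data added to show a flagged case.

```python
import re

# Matches the per-target header line printed by lctl get_param.
HEADER_RE = re.compile(r"^osd-ldiskfs\.(.+)\.oi_scrub=$")


def parse_oi_scrub(text):
    """Parse oi_scrub output into {target_name: {field: value}}."""
    targets = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = targets.setdefault(header.group(1), {})
            continue
        key, sep, value = line.partition(":")
        # Skip lines before the first header and lines without a field name.
        if current is not None and sep and key.strip():
            current[key.strip()] = value.strip()
    return targets


def targets_needing_attention(targets, expected_oi_files=64):
    """Return (target, oi_files) pairs whose oi_files differs from the expected count."""
    flagged = []
    for name, fields in sorted(targets.items()):
        if fields.get("oi_files") != str(expected_oi_files):
            flagged.append((name, fields.get("oi_files")))
    return flagged


# First target copied from the comment above; second target is hypothetical.
SAMPLE = """\
osd-ldiskfs.testfs-MDT0000.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: completed
osd-ldiskfs.testfs-MDT0001.oi_scrub=
name: OI_scrub
oi_files: 1
status: completed
"""

if __name__ == "__main__":
    for name, count in targets_needing_attention(parse_oi_scrub(SAMPLE)):
        print(f"{name}: oi_files={count}")
```

Any target reported by `targets_needing_attention` would be a candidate for the "-o resetoi" remount suggested in this comment.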
| Comment by Peter Jones [ 21/Apr/23 ] |
|
Mahmoud,
I'm just checking in on this one. Presumably you have the
Peter |
| Comment by Peter Jones [ 13/May/23 ] |
|
Closing this as a duplicate of |