  Lustre / LU-4102

lots of multiply-claimed blocks in e2fsck

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.8
    • e2fsprogs 1.41.90.wc2
    • 3
    • 11017

    Description

      After a power loss, an older e2fsck (e2fsprogs 1.41.90.wc2) was run on the OSTs. It found tons of multiply-claimed blocks, including for the /O directory. Here's an example of one of the inodes:

      File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
        has 1 multiply-claimed block(s), shared with 1 file(s):
              /O (inode #84934657, mod time Wed Aug 15 19:02:25 2012)
      Clone multiply-claimed blocks? yes
      
      Inode 17825793 doesn't have an associated directory entry; it eventually gets put into lost+found.
      

      So the questions are:

      • how could this have happened? My slightly-informed-probably-wrong theory is that the journal got corrupted and it replayed some old inodes back into existence. I noticed there were a lot of patches dealing with journal checksums committed after 1.41.90.
      • what's the best way to deal with these? Cloning takes forever when you are talking about TB-sized files. I tested the shared=delete extended option, and it looks like it deletes both sides of the file. It would be nice if it just deleted the unlinked side. Right now my plan is to create a debugfs script from the read-only e2fsck output (a rough sketch follows below), but if there is a better way, that would be good.
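      For reference, one rough way to harvest the inode numbers from a read-only e2fsck run (the device name is a placeholder; both twins of each pair will appear in the log, so the unlinked side still has to be identified, e.g. with debugfs ncheck, before anything is cleared):

        # read-only pass, keep the log for scripting
        e2fsck -fn /dev/mapper/ostXX > fsck_ro.log 2>&1
        # list every inode that e2fsck reports as involved in shared blocks
        grep -o 'inode #[0-9]*' fsck_ro.log | sort -u
        # check which of those inodes actually have a directory entry (inodes here are the example from above)
        debugfs -R "ncheck 17825793 84934657" /dev/mapper/ostXX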

      Thanks.


          Activity


            Kit Westneat added a comment:

            Hi Andreas,

            One of the targets just went RO:
            Oct 28 17:09:45 lfs-oss-2-6 kernel: LDISKFS-fs error (device dm-50): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 577 corrupted: 24074 blocks free in bitmap, 24075 - in gd

            It looks like I missed it when disabling targets. I had forgotten that I used the shared=ignore flag when cleaning it up, so the clean bill of health from e2fsck was an illusion.

            I've marked it deactivated on the MDT. Hopefully it can hold until Wednesday.
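            For reference, deactivating an OST on the MDS is normally done through lctl against the matching OSC device; a minimal sketch, with the device name only illustrative:

              # on the MDS: stop new object allocation on the affected OST
              lctl --device lfs2-OST0013-osc deactivate
              # confirm it now shows as IN (inactive)
              lctl dl | grep osc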


            Andreas Dilger added a comment:

            This seems reasonable to me, so long as the new ll_recover_lost_found_objs fixes the shared block problem to a large extent.

            It should be possible to do a full test run against a raw e2image file for each of the OSTs. This would reduce the risk of problems during the actual repair, give some confidence that the remaining problems will be repaired, and also minimize the system downtime, because the debugfs scripts can be generated while the system is still running.

            Loopback mount the raw image file, run "ll_recover_lost_found_objs -n" against it and unmount. Generate and run the debugfs script against the raw image file, then run "e2fsck -fy" on the image to see what is left. If all goes well, the debugfs script can be used on the real device.
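            A rough sketch of that dry run, with device, image, and mountpoint names as placeholders (the exact options taken by the patched ll_recover_lost_found_objs may differ):

              # capture a sparse raw metadata image of the OST
              e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img
              # loopback mount the image and do a read-only pass of the recovery tool
              mount -o loop /scratch/ost_lfs2_19.img /mnt/ost_test
              ll_recover_lost_found_objs -n -d /mnt/ost_test/O
              umount /mnt/ost_test
              # apply the candidate debugfs script (built from the duplicate list), then re-check the copy
              debugfs -w -f fix_dups.txt /scratch/ost_lfs2_19.img
              e2fsck -fy /scratch/ost_lfs2_19.img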


            Kit Westneat added a comment:

            Here's the list of corruption and a general plan of attack:
            ost_15 unattached inodes
            ost_19 multiply-claimed block (2780 inodes), unattached inodes
            ost_28 multiply-claimed blocks (52 inodes), unattached inodes
            ost_32 unattached inodes - will take multiple passes
            ost_34 multiply-claimed blocks (48 inodes), unattached inodes
            ost_36 multiply-claimed blocks (1792 inodes), unattached inodes
            ost_3 multiply-claimed blocks (376 inodes), unattached inodes
            ost_45 multiply-claimed blocks (362 inodes), unattached inodes

            Plan:
            1) using bind mount, run new ll_recover -n on OSTs with multiply-claimed blocks
            2) use list of duplicate files to create a debugfs script to clri and unlink each "evil twin" file (sketch below)
            3) take downtime to execute debugfs script
            4) [1 hour] run e2fsck -n on OSTs to make sure all multiply-claimed blocks are gone
            a) prepare script in advance to move files to lost+found if any are left
            5) [1 hour] run e2fsck -p on all OSTs, as well as ll_recover
            a) it's possible that there could be unknown issues at this stage
            6) [30 min] clri any multiply-claimed block files in l+f, delete all other files
            a) prepare script to nuke l+f
            7) [30 min] rerun e2fsck -p to verify that all OSTs are clean
            a) again, it's possible that there could be unknown issues at this stage

            So I am thinking 3 hours + 4 hours for unknown issues + 1 hour for startup/shutdown. What do you think of this plan/schedule?
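            For step 2, a rough sketch of what one entry of the generated debugfs script could look like; the inode number is just the example from the description, and entries already reconnected by an earlier e2fsck show up in lost+found under names like "#<inum>":

              # build the script: clri clears the duplicate inode, unlink drops any stale lost+found entry
              printf 'clri <17825793>\nunlink /lost+found/#17825793\n' > fix_dups.txt
              # apply it, ideally to an e2image copy first and only then to the real device
              debugfs -w -f fix_dups.txt /dev/mapper/ost_lfs2_19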


            Kit Westneat added a comment:

            Sorry, I was unclear about which disks: I meant that the system disks, where I was building the sparse file, are formatted as ext3. I got a non-sparse e2image that I am copying over to webspace. Nathan also set up a Lustre filesystem that I will use to dump a sparse image to.


            Andreas Dilger added a comment:

            NB - you can avoid the 2TB limit if you stripe the file more widely so that individual objects are below 2TB. If you are running an ext4-based ldiskfs (presumably yes) but on a filesystem that was formatted a while ago, you can use "tune2fs -O huge_file" to enable larger-than-2TB files also.
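            A quick sketch of both options (stripe count, directory, and device are placeholders):

              # stripe the dump file's directory widely so no single OST object exceeds 2TB
              lfs setstripe -c 8 /lustre/scratch/images
              # or, on an unmounted ext4-based ldiskfs target, allow >2TB files
              tune2fs -O huge_file /dev/sdX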


            Kit Westneat added a comment:

            Actually the non-sparse e2image works fine. It looks like the sparse image is having issues past 2TB. I guess the filesystems on this OSS are all ext3, so that would explain it.

            I'll create a dm snapshot to test the tool on.
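            If the OST LUN sits on LVM, one way to get a throwaway test copy is a snapshot (VG/LV names and snapshot size are placeholders):

              lvcreate -s -L 100G -n ost_lfs2_19_snap /dev/vg_oss/ost_lfs2_19
              # test against /dev/vg_oss/ost_lfs2_19_snap, then discard it
              lvremove /dev/vg_oss/ost_lfs2_19_snap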


            Kit Westneat added a comment:

            Hi Andreas,

            I'm getting an error when I try to run e2image; I've tried it on a couple of OSTs:
            [root@lfs-oss-2-4 ~]# /usr/bin/time e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img
            e2image 1.42.7.wc1 (12-Apr-2013)
            seek: Invalid argument
            Command exited with non-zero status 1

            I got an strace; would that be useful?

            Andreas Dilger added a comment:

            Patch at http://review.whamcloud.com/8061

            Andreas Dilger added a comment:

            Kit, I've pushed a patch for ll_recover_lost_found_objs which should report any inconsistent objects in the O/* directory tree as discussed above. It should be run with the "-n" option on the "O" directory (instead of "lost+found", as it usually does). This should report inodes whose FID xattr incorrectly records a different object ID than their name. It worked OK in my simple testing here, but I would strongly recommend running this on a test copy of the OST first. This is best tested against a sparse copy of one of the problematic OSTs, created with "e2image -r" and then mounted as a raw image with "-o loop".

            Please let me know how this works out.


            Andreas Dilger added a comment:

            I'd mistakenly looked at a partially-downloaded tarball, and didn't see most of the remaining problematic OSTs that are still disabled:

            23 IN osc lfs2-OST0022-osc lfs2-mdtlov_UUID 5
            25 IN osc lfs2-OST0003-osc lfs2-mdtlov_UUID 5
            27 IN osc lfs2-OST0013-osc lfs2-mdtlov_UUID 5
            34 IN osc lfs2-OST001c-osc lfs2-mdtlov_UUID 5
            35 IN osc lfs2-OST0024-osc lfs2-mdtlov_UUID 5
            47 IN osc lfs2-OST0026-osc lfs2-mdtlov_UUID 5
            50 IN osc lfs2-OST000f-osc lfs2-mdtlov_UUID 5
            

            The corruption appears to be a result of large chunks of the inode table being overwritten by other parts of the inode table. That means there are a large number of bad inodes that are exact copies of valid inodes. This results in objects in /O/{seq}/d*/nnnnn actually having an LMA FID or filter_fid xattr that references a different object ID than 'nnnnn'. Our plan moving forward is that I will work on enhancing ll_recover_lost_found_objs to detect and report this mismatch, so that running it on the /O directory will verify that the O/{seq}/d*/nnnnn object name maps to the same FID stored in the inode xattr.
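            Until the enhanced tool is ready, the FID-bearing xattrs on a suspect object can at least be dumped by hand for comparison against the object name; a minimal sketch, assuming the OST (or an image copy) is mounted at /mnt/ost_test, and noting that the exact xattr names (trusted.fid vs. trusted.lma) depend on the Lustre version:

              # dump the FID xattrs of one object (path is only an example)
              getfattr -n trusted.fid -e hex /mnt/ost_test/O/0/d5/12345
              getfattr -n trusted.lma -e hex /mnt/ost_test/O/0/d5/12345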


            Andreas Dilger added a comment:

            Looking at the most recent e2fsck logs, most of the filesystems look reasonably clean (it looks like LU-3542 is being hit on most of them), except for OST 38, which looks like it has a lot of problems. It seems OST 38 has corruption of the OST object directory O/0, and e2fsck is putting most of the objects into /lost+found. It makes sense to run ll_recover_lost_found_objs on that OST to rebuild the object directories and move the objects back into the O/0 hierarchy.
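            A sketch of the usual invocation for that, with mountpoint and device names as placeholders:

              # mount the OST as ldiskfs and let the tool move objects back out of lost+found
              mount -t ldiskfs /dev/mapper/ost_lfs2_38 /mnt/ost38
              ll_recover_lost_found_objs -v -d /mnt/ost38/lost+found
              umount /mnt/ost38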


            People

              Niu Yawei
              Oz Rentas
              Votes: 0
              Watchers: 8
