  Lustre / LU-4102

lots of multiply-claimed blocks in e2fsck

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.8
    • e2fsprogs 1.41.90.wc2
    • 3
    • 11017

    Description

      After a power loss, an older e2fsck (e2fsprogs 1.41.90.wc2) was run on the OSTs. It found tons of multiply-claimed blocks, including for the /O directory. Here's an example of one of the inodes:

      File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
        has 1 multiply-claimed block(s), shared with 1 file(s):
              /O (inode #84934657, mod time Wed Aug 15 19:02:25 2012)
      Clone multiply-claimed blocks? yes
      
      Inode 17825793 doesn't have an associated directory entry; it eventually gets put into lost+found.
      

      So the questions are:

      • how could this have happened? My slightly-informed-probably-wrong theory is that the journal got corrupted and it replayed some old inodes back into existence. I noticed there were a lot of patches dealing with journal checksums committed after 1.41.90.
      • what's the best way to deal with these? Cloning takes forever when you are talking about TB-sized files. I tested the shared=delete extended option, and it looks like it deletes both sides of the file. It would be nice if it just deleted the unlinked side. Right now my plan is to create a debugfs script from the read-only e2fsck output (a rough sketch follows below), but if there is a better way, that would be good.
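      For reference, one rough way to harvest the inode numbers from a read-only e2fsck run (the device name is a placeholder; both twins of each pair will appear in the log, so the unlinked side still has to be identified, e.g. with debugfs ncheck, before anything is cleared):

        # read-only pass, keep the log for scripting
        e2fsck -fn /dev/mapper/ostXX > fsck_ro.log 2>&1
        # list every inode that e2fsck reports as involved in shared blocks
        grep -o 'inode #[0-9]*' fsck_ro.log | sort -u
        # check which of those inodes actually have a directory entry (inodes here are the example from above)
        debugfs -R "ncheck 17825793 84934657" /dev/mapper/ostXX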

      Thanks.


          Activity


            Kit Westneat added a comment:

            Hi Andreas,

            One of the targets just went RO:
            Oct 28 17:09:45 lfs-oss-2-6 kernel: LDISKFS-fs error (device dm-50): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 577 corrupted: 24074 blocks free in bitmap, 24075 - in gd

            It looks like I missed it when disabling targets. I had forgotten that I used the shared=ignore flag when cleaning it up, so the clean bill of health from e2fsck was an illusion.

            I've marked it deactivated on the MDT. Hopefully it can hold until Wednesday.
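            For reference, deactivating an OST on the MDS is normally done through lctl against the matching OSC device; a minimal sketch, with the device name only illustrative:

              # on the MDS: stop new object allocation on the affected OST
              lctl --device lfs2-OST0013-osc deactivate
              # confirm it now shows as IN (inactive)
              lctl dl | grep osc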


            Andreas Dilger added a comment:

            This seems reasonable to me, so long as the new ll_recover_lost_found_objs fixes the shared block problem to a large extent.

            It should be possible to do a full test run against a raw e2image file for each of the OSTs. This would reduce the risk of problems during the actual repair, give some confidence that the remaining problems will be repaired, and also minimize the system downtime, because the debugfs scripts can be generated while the system is still running.

            Loopback mount the raw image file, run "ll_recover_lost_found_objs -n" against it and unmount. Generate and run the debugfs script against the raw image file, then run "e2fsck -fy" on the image to see what is left. If all goes well, the debugfs script can be used on the real device.
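            A rough sketch of that dry run, with device, image, and mountpoint names as placeholders (the exact options taken by the patched ll_recover_lost_found_objs may differ):

              # capture a sparse raw metadata image of the OST
              e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img
              # loopback mount the image and do a read-only pass of the recovery tool
              mount -o loop /scratch/ost_lfs2_19.img /mnt/ost_test
              ll_recover_lost_found_objs -n -d /mnt/ost_test/O
              umount /mnt/ost_test
              # apply the candidate debugfs script (built from the duplicate list), then re-check the copy
              debugfs -w -f fix_dups.txt /scratch/ost_lfs2_19.img
              e2fsck -fy /scratch/ost_lfs2_19.img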


            Kit Westneat added a comment:

            Here's the list of corruption and a general plan of attack:
            ost_15 unattached inodes
            ost_19 multiply-claimed block (2780 inodes), unattached inodes
            ost_28 multiply-claimed blocks (52 inodes), unattached inodes
            ost_32 unattached inodes - will take multiple passes
            ost_34 multiply-claimed blocks (48 inodes), unattached inodes
            ost_36 multiply-claimed blocks (1792 inodes), unattached inodes
            ost_3 multiply-claimed blocks (376 inodes), unattached inodes
            ost_45 multiply-claimed blocks (362 inodes), unattached inodes

            Plan:
            1) using bind mount, run new ll_recover -n on OSTs with multiply-claimed blocks
            2) use list of duplicate files to create a debugfs script to clri and unlink each "evil twin" file (sketch below)
            3) take downtime to execute debugfs script
            4) [1 hour] run e2fsck -n on OSTs to make sure all multiply-claimed blocks are gone
            a) prepare script in advance to move files to lost+found if any are left
            5) [1 hour] run e2fsck -p on all OSTs, as well as ll_recover
            a) it's possible that there could be unknown issues at this stage
            6) [30 min] clri any multiply-claimed block files in l+f, delete all other files
            a) prepare script to nuke l+f
            7) [30 min] rerun e2fsck -p to verify that all OSTs are clean
            a) again, it's possible that there could be unknown issues at this stage

            So I am thinking 3 hours + 4 hours for unknown issues + 1 hour for startup/shutdown. What do you think of this plan/schedule?
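            For step 2, a rough sketch of what one entry of the generated debugfs script could look like; the inode number is just the example from the description, and entries already reconnected by an earlier e2fsck show up in lost+found under names like "#<inum>":

              # build the script: clri clears the duplicate inode, unlink drops any stale lost+found entry
              printf 'clri <17825793>\nunlink /lost+found/#17825793\n' > fix_dups.txt
              # apply it, ideally to an e2image copy first and only then to the real device
              debugfs -w -f fix_dups.txt /dev/mapper/ost_lfs2_19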


            Kit Westneat added a comment:

            Sorry, I was unclear about which disks: I meant that the system disks, where I was building the sparse file, are formatted as ext3. I got a non-sparse e2image that I am copying over to webspace. Nathan also set up a Lustre filesystem that I will use to dump a sparse image to.


            Andreas Dilger added a comment:

            NB - you can avoid the 2TB limit if you stripe the file more widely so that individual objects are below 2TB. If you are running an ext4-based ldiskfs (presumably yes) but on a filesystem that was formatted a while ago, you can use "tune2fs -O huge_file" to enable larger-than-2TB files also.
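            A quick sketch of both options (stripe count, directory, and device are placeholders):

              # stripe the dump file's directory widely so no single OST object exceeds 2TB
              lfs setstripe -c 8 /lustre/scratch/images
              # or, on an unmounted ext4-based ldiskfs target, allow >2TB files
              tune2fs -O huge_file /dev/sdX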


            Kit Westneat added a comment:

            Actually the non-sparse e2image works fine. It looks like the sparse image is having issues past 2TB. I guess the filesystems on this OSS are all ext3, so that would explain it.

            I'll create a dm snapshot to test the tool on.
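            If the OST LUN sits on LVM, one way to get a throwaway test copy is a snapshot (VG/LV names and snapshot size are placeholders):

              lvcreate -s -L 100G -n ost_lfs2_19_snap /dev/vg_oss/ost_lfs2_19
              # test against /dev/vg_oss/ost_lfs2_19_snap, then discard it
              lvremove /dev/vg_oss/ost_lfs2_19_snap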


            Kit Westneat added a comment:

            Hi Andreas,

            I'm getting an error when I try to run e2image; I've tried it on a couple of OSTs:
            [root@lfs-oss-2-4 ~]# /usr/bin/time e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img
            e2image 1.42.7.wc1 (12-Apr-2013)
            seek: Invalid argument
            Command exited with non-zero status 1

            I got an strace; would that be useful?

            Andreas Dilger added a comment:

            Patch at http://review.whamcloud.com/8061

            Andreas Dilger added a comment:

            Kit, I've pushed a patch for ll_recover_lost_found_objs which should report any inconsistent objects in the O/* directory tree as discussed above. It should be run with the "-n" option on the "O" directory (instead of "lost+found", as it usually does). This should report inodes whose FID xattr incorrectly records a different object ID than their name. It worked OK in my simple testing here, but I would strongly recommend running this on a test copy of the OST first. This is best tested against a sparse copy of one of the problematic OSTs, created with "e2image -r" and then mounted as a raw image with "-o loop".

            Please let me know how this works out.


            Andreas Dilger added a comment:

            I'd mistakenly looked at a partially-downloaded tarball, and didn't see most of the remaining problematic OSTs that are still disabled:

            23 IN osc lfs2-OST0022-osc lfs2-mdtlov_UUID 5
            25 IN osc lfs2-OST0003-osc lfs2-mdtlov_UUID 5
            27 IN osc lfs2-OST0013-osc lfs2-mdtlov_UUID 5
            34 IN osc lfs2-OST001c-osc lfs2-mdtlov_UUID 5
            35 IN osc lfs2-OST0024-osc lfs2-mdtlov_UUID 5
            47 IN osc lfs2-OST0026-osc lfs2-mdtlov_UUID 5
            50 IN osc lfs2-OST000f-osc lfs2-mdtlov_UUID 5
            

            The corruption appears to be a result of large chunks of the inode table being overwritten by other parts of the inode table. That means there are a large number of bad inodes that are exact copies of valid inodes. This results in objects in /O/{seq}/d*/nnnnn actually having an LMA FID or filter_fid xattr that references a different object ID than 'nnnnn'. Our plan moving forward is that I will work on enhancing ll_recover_lost_found_objs to detect and report this mismatch, so that running it on the /O directory will verify that the O/{seq}/d*/nnnnn object name maps to the same FID stored in the inode xattr.
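            Until the enhanced tool is ready, the FID-bearing xattrs on a suspect object can at least be dumped by hand for comparison against the object name; a minimal sketch, assuming the OST (or an image copy) is mounted at /mnt/ost_test, and noting that the exact xattr names (trusted.fid vs. trusted.lma) depend on the Lustre version:

              # dump the FID xattrs of one object (path is only an example)
              getfattr -n trusted.fid -e hex /mnt/ost_test/O/0/d5/12345
              getfattr -n trusted.lma -e hex /mnt/ost_test/O/0/d5/12345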


            Andreas Dilger added a comment:

            Looking at the most recent e2fsck logs, most of the filesystems look reasonably clean (it looks like LU-3542 is being hit on most of them), except for OST 38, which looks like it has a lot of problems. It seems OST 38 has corruption of the OST object directory O/0, and e2fsck is putting most of the objects into /lost+found. It makes sense to run ll_recover_lost_found_objs on that OST to rebuild the object directories and move the objects back into the O/0 hierarchy.
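            A sketch of the usual invocation for that, with mountpoint and device names as placeholders:

              # mount the OST as ldiskfs and let the tool move objects back out of lost+found
              mount -t ldiskfs /dev/mapper/ost_lfs2_38 /mnt/ost38
              ll_recover_lost_found_objs -v -d /mnt/ost38/lost+found
              umount /mnt/ost38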


            People

              Niu Yawei
              Oz Rentas
              Votes: 0
              Watchers: 8
