
lots of multiply-claimed blocks in e2fsck

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.8
    • e2fsprogs 1.41.90.wc2
    • 3
    • 11017

    Description

      After a power loss, an older e2fsck (e2fsprogs 1.41.90.wc2) was run on the OSTs. It found tons of multiply-claimed blocks, including for the /O directory. Here's an example of one of the inodes:

      File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
        has 1 multiply-claimed block(s), shared with 1 file(s):
              /O (inode #84934657, mod time Wed Aug 15 19:02:25 2012)
      Clone multiply-claimed blocks? yes
      
Inode 17825793 doesn't have an associated directory entry, so it eventually gets put into lost+found.
      

      So the questions are:

      • how could this have happened? My slightly-informed-probably-wrong theory is that the journal got corrupted and it replayed some old inodes back into existence. I noticed there were a lot of patches dealing with journal checksums committed after 1.41.90.
• what's the best way to deal with these? Cloning takes forever when you are talking about TB-sized files. I tested the delete extended option, and it looks like it deletes both sides of the file. It would be nice if it just deleted the unlinked side. Right now my plan is to create a debugfs script from the read-only e2fsck output (see the sketch below), but if there is a better way, that would be good.
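
A rough sketch of that debugfs route (illustrative only, not a tested procedure; the inode number and device are just the example from this report, and the real inode list would come from the read-only e2fsck log):

    # Collect the duplicate/unlinked inode numbers from a read-only pass
    e2fsck -fn /dev/mapper/ost_lfs2_19 2>&1 | tee fsck-ro.log

    # Clear only the unlinked side of each duplicate, then let a full e2fsck
    # repair the block/inode bitmaps and counts
    for ino in 17825793; do
        debugfs -w -R "clri <$ino>" /dev/mapper/ost_lfs2_19
    done
    e2fsck -fy /dev/mapper/ost_lfs2_19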

      Thanks.


          Activity

            [LU-4102] lots of multiply-claimed blocks in e2fsck

            NB - you can avoid the 2TB limit if you stripe the file more widely so that individual objects are below 2TB. If you are running an ext4-based ldiskfs (presumably yes) but on a filesystem that was formatted a while ago, you can use "tune2fs -O huge_file" to enable larger-than-2TB files also.
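
For example (device name and file path are illustrative):

    # Enable >2TB objects on an existing ext4-based ldiskfs OST (with the OST unmounted)
    tune2fs -O huge_file /dev/mapper/ost_lfs2_19

    # Or stripe large files more widely from a client so each object stays under 2TB
    lfs setstripe -c -1 /lustre/lfs2/bigfile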

adilger Andreas Dilger added a comment

            Actually non-sparse e2image works fine. It looks like the sparse image is having issues past 2TB. I guess the FSes on this OSS are all ext3, so that would explain it.
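
For reference, the non-sparse form is the plain e2image metadata image (no -r), e.g. (paths illustrative):

    /usr/bin/time e2image /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.e2i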

            I'll create a dm snapshot to test the tool on.

kitwestneat Kit Westneat (Inactive) added a comment

            Hi Andreas,

I'm getting an error when I try to run e2image; I've tried it on a couple of OSTs:
            [root@lfs-oss-2-4 ~]# /usr/bin/time e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img
            e2image 1.42.7.wc1 (12-Apr-2013)
            seek: Invalid argument
            Command exited with non-zero status 1

            I got an strace, would that be useful?

kitwestneat Kit Westneat (Inactive) added a comment

            adilger Andreas Dilger added a comment - Patch at http://review.whamcloud.com/8061

Kit, I've pushed a patch for ll_recover_lost_found_objs which should report any inconsistent objects in the O/* directory tree as discussed above. It should be run with the "-n" option on the "O" directory (instead of "lost+found", as it usually is). This should report inodes whose FID xattr incorrectly records a different object ID. It worked OK in my simple testing here, but I would strongly recommend running this on a test copy of the OST first. This is best tested against a sparse copy of one of the problematic OSTs, made with "e2image -r" and then mounting the raw image with "-o loop".
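
A minimal sketch of that test procedure (device, image path and mount point are illustrative; -d is the tool's usual directory argument and -n is the dry-run/report mode from the patch):

    # Sparse raw copy of one problematic OST
    e2image -r /dev/mapper/ost_lfs2_19 /scratch/ost_lfs2_19.img

    # Mount the copy via loopback (needs an ldiskfs/ext4 module that understands the OST format)
    mount -o loop /scratch/ost_lfs2_19.img /mnt/ost_test

    # Report objects whose FID xattr disagrees with their O/{seq}/d*/nnnnn name
    ll_recover_lost_found_objs -n -d /mnt/ost_test/O
    umount /mnt/ost_test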

            Please let me know how this works out.

adilger Andreas Dilger added a comment

            I'd mistakenly looked at a partially-downloaded tarball, and didn't see most of the remaining problematic OSTs that are still disabled:

            23 IN osc lfs2-OST0022-osc lfs2-mdtlov_UUID 5
            25 IN osc lfs2-OST0003-osc lfs2-mdtlov_UUID 5
            27 IN osc lfs2-OST0013-osc lfs2-mdtlov_UUID 5
            34 IN osc lfs2-OST001c-osc lfs2-mdtlov_UUID 5
            35 IN osc lfs2-OST0024-osc lfs2-mdtlov_UUID 5
            47 IN osc lfs2-OST0026-osc lfs2-mdtlov_UUID 5
            50 IN osc lfs2-OST000f-osc lfs2-mdtlov_UUID 5
            

The corruption appears to be a result of large chunks of the inode table being overwritten by other parts of the inode table. That means there are a large number of bad inodes that are exact copies of valid inodes. This results in objects in O/{seq}/d*/nnnnn actually having an LMA FID or filter_fid xattr that references a different object ID than 'nnnnn'. Our plan moving forward is that I will work on enhancing ll_recover_lost_found_objs to detect and report this mismatch, so that running it on the /O directory will verify that the O/{seq}/d*/nnnnn object name maps to the same FID stored in the inode xattr.
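
One way to spot-check a single object by hand (path and object name are illustrative; the exact xattr name depends on the version, e.g. trusted.fid for the filter_fid or trusted.lma for the LMA):

    # On a mounted copy of the OST, dump the trusted.* xattrs of one object
    getfattr -m trusted -d -e hex /mnt/ost_test/O/0/d5/12345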

adilger Andreas Dilger added a comment

Looking at the most recent e2fsck logs, most of the filesystems look reasonably clean (it looks like LU-3542 is being hit on most of them), except OST 38, which looks like it has a lot of problems. It seems OST 38 has corruption of the OST object directory O/0 and is putting most of the objects into /lost+found. It makes sense to run ll_recover_lost_found_objs on that OST to rebuild the object directories and move the objects back out of /lost+found into the O/0 hierarchy.
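
The usual invocation for that step would be along these lines (device and mount point are illustrative; ideally run against a copy of the OST first):

    mount -t ldiskfs /dev/mapper/ost_lfs2_26 /mnt/ost38
    ll_recover_lost_found_objs -d /mnt/ost38/lost+found
    umount /mnt/ost38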

adilger Andreas Dilger added a comment

I don't think "use other data structure to record duplicate blocks / inodes" was ever mentioned. The data structures themselves are fine. However, in e2fsck pass 1 only a bitmap of in-use blocks is normally kept, and only if there are collisions in that bitmap (i.e. blocks shared by multiple users) do passes 1b/1c run to track the owning inode(s) of every block. That is done to reduce memory usage for block tracking significantly (by a factor of 32) during normal e2fsck runs. The one potential improvement that I mentioned was to record in the superblock or similar (or allow it to be specified on the e2fsck command line) that shared blocks exist, so that block owners are tracked already in pass 1 and pass 1b doesn't need to scan everything again. I'm not sure if that would be a significant improvement, just an idea I had.
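
To put the factor of 32 in perspective (illustrative numbers, not measurements from this system):

    # 8 TiB OST with 4 KiB blocks           -> 2^31 blocks (~2.1 billion)
    # pass 1 block bitmap, 1 bit per block  -> ~256 MiB
    # per-block owner tracking, ~32 bits    -> ~8 GiB   (hence the factor of 32)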

adilger Andreas Dilger added a comment

            Hi Andreas,

            I was wondering if you had thoughts on what changes to e2fsck would be most useful in resolving all the remaining corruption. It sounds like:

            • move inodes to l+f on shared=ignore
            • early delete bad inodes
            • use other data structure to record duplicate blocks / inodes
            • other?

Also, we were wondering what these messages meant:
            Oct 21 03:13:07 lfs-oss-2-8 kernel: LustreError: 22020:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8117d284f800 x1449180093007020/t0 o401->@NET_0x500000ab31078_UUID:17/18 lens 512/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0

            Thanks,
            Kit

kitwestneat Kit Westneat (Inactive) added a comment

            Oct 18 18:54:00 lfs-oss-2-4 kernel: LustreError: 20837:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lfs2-OST0013_UUID: lvbo_init failed for resource 63005876: rc -2
            Oct 18 18:55:16 lfs-oss-2-5 kernel: LustreError: 21077:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lfs2-OST0024_UUID: lvbo_init failed for resource 20657034: rc -2
            

            These are messages that are to be expected in OST corruption cases like this. It means there are objects referenced by a file on the MDT, but the objects no longer exist.

            Oct 18 19:01:28 lfs-oss-2-3 kernel: LustreError: 21149:0:(filter.c:1555:filter_destroy_internal()) destroying objid 62668287 ino 1575853 nlink 2 count 2
            Oct 18 19:01:28 lfs-oss-2-4 kernel: LustreError: 21062:0:(filter.c:1555:filter_destroy_internal()) destroying objid 62862184 ino 400196 nlink 2 count 2
            

This looks like an issue introduced by e2fsck linking files into lost+found or similar. OST objects should only ever have a single link. This is not in itself harmful (the object will still be deleted) and does not imply any further corruption. It may be that there is some space leaked in the filesystem that needs to be cleaned up later by running another e2fsck and/or deleting files from lost+found.

adilger Andreas Dilger added a comment

Going through the OST15 logs, it appears that a whole range of inodes, roughly 75000-130000, were completely overwritten by garbage (i.e. random timestamps, block counts, sizes, feature flags, etc). There is a feature we wrote, "inode badness", that should have detected this and erased those inodes completely, but it doesn't do so until pass 2 of e2fsck. I wonder if this mechanism was foiled by e2fsck being stopped early in the duplicate block pass 1b/1c, before it erased the inodes? Also, in hindsight it probably makes sense to ask to clear an inode as soon as its badness exceeds the threshold, because one of the main goals of the inode badness feature is to avoid duplicate block processing on totally corrupt inodes. There may also be some benefit in saving the inode badness in the inode on disk, in case e2fsck is restarted like this.

As for your patches - the skip-invalid-bitmap patch couldn't be used as-is. At the same time, pass 1 should be modifying only the in-memory bitmaps; it isn't until pass 5 that on-disk bitmaps are updated, so it does seem that something needs to be fixed in the extent processing.

The shared=ignore patch seems reasonable. It might make sense to have this also imply E2F_SHARED_LPF, since using those files would be dangerous. However, this would also modify the namespace in a way that would be difficult to undo later if one of the inodes was erased due to "badness" and was no longer sharing blocks.

adilger Andreas Dilger added a comment

            People

              niu Niu Yawei (Inactive)
              orentas Oz Rentas (Inactive)
Votes: 0
Watchers: 8
