lots of multiply-claimed blocks in e2fsck

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.8
    • e2fsprogs 1.41.90.wc2
    • 3
    • 11017

    Description

      After a power loss, an older e2fsck (e2fsprogs 1.41.90.wc2) was run on the OSTs. It found tons of multiply-claimed blocks, including for the /O directory. Here's an example of one of the inodes:

      File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
        has 1 multiply-claimed block(s), shared with 1 file(s):
              /O (inode #84934657, mod time Wed Aug 15 19:02:25 2012)
      Clone multiply-claimed blocks? yes
      
      Inode 17825793 doesn't have an associated directory entry, so it eventually gets put into lost+found.
      

      So the questions are:

      • how could this have happened? My slightly-informed-probably-wrong theory is that the journal got corrupted and it replayed some old inodes back into existence. I noticed there were a lot of patches dealing with journal checksums committed after 1.41.90.
      • what's the best way to deal with these? Cloning takes forever when you are talking about TB-sized files. I tested the delete extended option, and it looks like it deletes both sides of the file. It would be nice if it deleted only the unlinked side. Right now my plan is to create a debugfs script from the read-only e2fsck output, but if there is a better way, that would be good.
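One way to build that debugfs script from the read-only e2fsck output is sketched below. It is only a minimal sketch, assuming the report lines look like the example above and that a path printed as `...` marks the side with no directory entry. The function names and the default `stat` command are illustrative and not part of e2fsprogs.

```python
import re

# First line of a multiply-claimed report, as in the example above:
#   File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
FILE_RE = re.compile(r"^\s*File (?P<path>.+) \(inode #(?P<ino>\d+), mod time ")


def unlinked_dup_inodes(e2fsck_output):
    """Return inode numbers whose reported path is '...'.

    Assumption: '...' means e2fsck found no directory entry for the inode.
    """
    inodes = []
    for line in e2fsck_output.splitlines():
        m = FILE_RE.match(line)
        if m and m.group("path") == "...":
            inodes.append(int(m.group("ino")))
    return inodes


def debugfs_script(inodes, command="stat"):
    """Emit one debugfs command per inode, using the <inode> filespec form."""
    return "".join(f"{command} <{ino}>\n" for ino in inodes)


SAMPLE = """\
File ... (inode #17825793, mod time Wed Aug 15 19:02:25 2012)
  has 1 multiply-claimed block(s), shared with 1 file(s):
        /O (inode #84934657, mod time Wed Aug 15 19:02:25 2012)
"""

if __name__ == "__main__":
    print(debugfs_script(unlinked_dup_inodes(SAMPLE)), end="")
```

The generated text can be saved to a file and run with `debugfs -f <file> <device>`. Swapping `stat` for a destructive command such as `clri` is best left until each inode has been checked by hand.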

      Thanks.

          Activity

            [LU-4102] lots of multiply-claimed blocks in e2fsck

            I don't think "use other data structure to record duplicate blocks / inodes" was ever mentioned. The data structures themselves are fine. However, in e2fsck pass 1 there is normally only a bitmap kept of in-use blocks, and only if there are collisions in the bitmap (i.e. blocks shared by multiple users) does pass 1b/1c run to track the owning inode(s) of every block. That is done in order to reduce memory usage for block bitmap tracking significantly (by a factor of 32) during normal e2fsck runs. The one potential improvement that I mentioned was to track shared blocks in the superblock or similar (or allow it to be specified on the e2fsck command line), so that the block owners are tracked in pass 1 and pass 1b doesn't need to scan them again. I'm not sure if that would be a significant improvement, just an idea I had.
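The trade-off described here can be illustrated with a small sketch. It is not e2fsck's code: the function name and data layout are hypothetical, and the factor of 32 is read as one bit per block versus a 32-bit inode number per block.

```python
def find_shared_blocks(inode_blocks):
    """Two-step duplicate detection in the spirit of pass 1 and pass 1b.

    inode_blocks: dict mapping inode number -> list of block numbers.
    Returns a dict mapping each shared block -> list of owning inodes.
    """
    highest = max((b for blocks in inode_blocks.values() for b in blocks),
                  default=-1)
    in_use = bytearray((highest + 8) // 8)  # one bit per block
    shared = set()

    # "Pass 1": only set bits; a bit that is already set signals a collision.
    for blocks in inode_blocks.values():
        for b in blocks:
            byte, bit = divmod(b, 8)
            if in_use[byte] & (1 << bit):
                shared.add(b)
            else:
                in_use[byte] |= 1 << bit

    # "Pass 1b": rescan every inode, but only when a collision was seen,
    # and record owners only for the blocks known to be shared.
    owners = {}
    if shared:
        for ino, blocks in inode_blocks.items():
            for b in blocks:
                if b in shared:
                    owners.setdefault(b, []).append(ino)
    return owners
```

On a clean filesystem the second loop never runs, which is where the memory and time savings come from. The cost is the full rescan once a single collision shows up.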

            adilger Andreas Dilger added a comment

            Hi Andreas,

            I was wondering if you had thoughts on what changes to e2fsck would be most useful in resolving all the remaining corruption. It sounds like:

            • move inodes to l+f on shared=ignore
            • early delete bad inodes
            • use other data structure to record duplicate blocks / inodes
            • other?

            Also we were wondering what these messages meant:
            Oct 21 03:13:07 lfs-oss-2-8 kernel: LustreError: 22020:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8117d284f800 x1449180093007020/t0 o401->@NET_0x500000ab31078_UUID:17/18 lens 512/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0

            Thanks,
            Kit

            kitwestneat Kit Westneat (Inactive) added a comment

            Oct 18 18:54:00 lfs-oss-2-4 kernel: LustreError: 20837:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lfs2-OST0013_UUID: lvbo_init failed for resource 63005876: rc -2
            Oct 18 18:55:16 lfs-oss-2-5 kernel: LustreError: 21077:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-lfs2-OST0024_UUID: lvbo_init failed for resource 20657034: rc -2
            

            These are messages that are to be expected in OST corruption cases like this. It means there are objects referenced by a file on the MDT, but the objects no longer exist.
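Kernel return codes are negative errno values, so `rc -2` is `-ENOENT`, which matches the explanation that the object no longer exists. A minimal sketch for pulling the OST name and resource out of these lines follows. The regular expression is an assumption based only on the two sample lines above, and the function name is hypothetical.

```python
import errno
import re

# Pattern derived from the two sample log lines only; other Lustre versions
# may format this message differently.
LVBO_RE = re.compile(
    r"filter-(?P<ost>\S+)_UUID: lvbo_init failed for resource "
    r"(?P<res>\d+): rc (?P<rc>-?\d+)"
)


def parse_lvbo_failure(line):
    """Return OST name, resource number and symbolic errno, or None."""
    m = LVBO_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "ost": m.group("ost"),
        "resource": int(m.group("res")),
        "rc": rc,
        # Kernel code returns -errno, so negate before looking up the name.
        "errno": errno.errorcode.get(-rc, "unknown"),
    }


LINE = ("Oct 18 18:54:00 lfs-oss-2-4 kernel: LustreError: "
        "20837:0:(ldlm_resource.c:862:ldlm_resource_add()) "
        "filter-lfs2-OST0013_UUID: lvbo_init failed for resource 63005876: rc -2")

if __name__ == "__main__":
    print(parse_lvbo_failure(LINE))
```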

            Oct 18 19:01:28 lfs-oss-2-3 kernel: LustreError: 21149:0:(filter.c:1555:filter_destroy_internal()) destroying objid 62668287 ino 1575853 nlink 2 count 2
            Oct 18 19:01:28 lfs-oss-2-4 kernel: LustreError: 21062:0:(filter.c:1555:filter_destroy_internal()) destroying objid 62862184 ino 400196 nlink 2 count 2
            

            This looks like an issue introduced by e2fsck linking files into lost+found or similar. OST objects should only ever have a single link. This is not in itself harmful (the object will still be deleted) and does not imply any further corruption. It may be that there is some space leaked in the filesystem that needs to be cleaned up later by running another e2fsck and/or deleting files from lost+found.

            adilger Andreas Dilger added a comment

            Going through the OST15 logs, it appears that a whole set of inodes in the 75000-130000 range has been completely overwritten by garbage (i.e. random timestamps, block counts, sizes, feature flags, etc). There is a feature we wrote, "inode badness", that should have detected this and erased those inodes completely, but it doesn't do this until pass 2 of e2fsck. I wonder if this mechanism was foiled by e2fsck being stopped early in the duplicate block pass 1b/1c before it erased the inodes? Also, in hindsight it probably makes sense to ask to clear an inode as soon as its badness exceeds the threshold, because one of the main goals of the inode badness feature is to avoid duplicate block processing on totally corrupt inodes. There may also be some benefit of saving the inode badness in the inode on disk, in case e2fsck is restarted like this.
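The "clear as soon as badness exceeds the threshold" idea can be sketched as follows. This is a hypothetical illustration, not the e2fsprogs implementation: the class, the weights, and the threshold value used in the example are all assumptions.

```python
class BadnessTracker:
    """Accumulate per-inode badness and flag an inode once it exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.badness = {}     # inode number -> accumulated badness
        self.cleared = set()  # inodes already selected for clearing

    def report(self, ino, weight=1):
        """Record one problem; return True if the inode should be cleared now."""
        if ino in self.cleared:
            return True
        self.badness[ino] = self.badness.get(ino, 0) + weight
        if self.badness[ino] > self.threshold:
            self.cleared.add(ino)
            return True
        return False


def inodes_for_dup_scan(all_inodes, tracker):
    """Inodes still worth scanning for shared blocks: skip the cleared ones."""
    return [ino for ino in all_inodes if ino not in tracker.cleared]
```

Deciding at report time, rather than in a later pass, means an interrupted run has already excluded the garbage inodes from duplicate-block processing.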

            As for your patches: the skip-invalid-bitmap patch couldn't be used as-is. At the same time, pass 1 should be modifying only the in-memory bitmaps; it isn't until pass 5 that on-disk bitmaps are updated, so it does seem that something needs to be fixed in the extent processing.

            The shared=ignore patch seems reasonable. It might make sense to have this also imply E2F_SHARED_LPF, since using those files would be dangerous. However, this would also modify the namespace in a way that would be difficult to undo later if one of the inodes was erased due to "badness" and was no longer sharing blocks.

            adilger Andreas Dilger added a comment

            Andreas had mentioned on conference call that there were some OSS log messages we should watch out for. I didn't catch exactly what they were, so here are all "unusual" log messages since the last target came online.

            ndauchy Nathan Dauchy (Inactive) added a comment
            kitwestneat Kit Westneat (Inactive) added a comment - The latest RO e2fsck: [37MB] http://ftp.ddntsr.com/ftp/2013-10-19-lfs2_e2fsck_ro_check_2013-10-18.tgz

            patch to skip some problematic tests:

            diff -rup e2fsprogs-1.42.7.1.ddn1/e2fsck/pass1.c e2fsprogs-1.42.7.3.ddn3/e2fsck/pass1.c
            --- e2fsprogs-1.42.7.1.ddn1/e2fsck/pass1.c      2013-10-14 13:19:11.000000000 -0700
            +++ e2fsprogs-1.42.7.3.ddn3/e2fsck/pass1.c      2013-10-15 18:12:59.000000000 -0700
            @@ -2250,6 +2250,11 @@ report_problem:
                                    pctx->blk2 = extent.e_lblk;
                                    pctx->num = extent.e_len;
                                    if (fix_problem(ctx, problem, pctx)) {
            +                               if (ctx->invalid_bitmaps) {
            +                                       printf("WARNING: invalid bitmaps, unable "
            +                                               "to fix extents\n");
            +                                       goto next;
            +                               }
                                            e2fsck_read_bitmaps(ctx);
                                            pctx->errcode =
                                                    ext2fs_extent_delete(ehandle, 0);
            @@ -2489,9 +2494,14 @@ static void check_blocks(e2fsck_t ctx, s
                            if (extent_fs && (inode->i_flags & EXT4_EXTENTS_FL))
                                    check_blocks_extents(ctx, pctx, &pb);
                            else {
            +                       /*
                                    pctx->errcode = ext2fs_block_iterate3(fs, ino,
                                                            pb.is_dir ? BLOCK_FLAG_HOLE : 0,
                                                            block_buf, process_block, &pb);
            +                       */
            +                       printf("WARNING: inode %u not using extents,"
            +                               " skipping block check.\n", ino);
            +                       return;
                                    /*
                                     * We do not have uninitialized extents in non extent
                                     * files.
            
            kitwestneat Kit Westneat (Inactive) added a comment

            patch to add shared=ignore

            kitwestneat Kit Westneat (Inactive) added a comment

            http://ddntsr.com/ftp/2013-10-18-lfs2_e2fsck_prepare_lfsck_2013-10-17.tgz (41MB)

            The latest read-only from all the OSS, showing the duplicate blocks that still remain.

            kitwestneat Kit Westneat (Inactive) added a comment
            kitwestneat Kit Westneat (Inactive) added a comment - http://ddntsr.com/ftp/2013-10-18-lustre_ost15_logs2.tar.gz Too large to upload (30MB)

            I found a larger bzip file (>1M) and ran a bzip test on it; it seems clean.

            Ah interesting. I'll talk to our disk people to see if we can get FUA going.

            Are journal checksums enabled by default? I don't see it in the dumpe2fs output.

            I'm uploading a series of logs that show a progression of our activities on one of the osts (ost15). We ran into a couple issues. One is that some of the corrupted inodes looked like they were giant non-extent based files, meaning e2fsck would hang there trying to check all the blocks individually. Another is that when e2fsck hits an inode with invalid extents, it tries to load the bitmap to correct it, but if the bitmap is corrupted, it just dies.

            I added code to just skip checking non-extent based files and files with invalid extents and just print them so we could clear them out. That's the combined.cmd.
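The contents of combined.cmd are not shown in this ticket. As a rough sketch of how such a command file could be produced, the following assumes the e2fsck log contains the warning text printed by the patch posted in this ticket and emits one debugfs `clri` line per inode. The function name and the sample inode numbers are hypothetical.

```python
import re

# Warning text as printed by the patched check_blocks() in this ticket.
SKIP_RE = re.compile(
    r"WARNING: inode (\d+) not using extents, skipping block check\."
)


def clear_commands(log_text):
    """Return debugfs 'clri' commands for each skipped inode, without repeats."""
    seen = []
    for m in SKIP_RE.finditer(log_text):
        ino = int(m.group(1))
        if ino not in seen:
            seen.append(ino)
    return [f"clri <{ino}>" for ino in seen]


# Made-up sample log; the inode numbers are illustrative only.
SAMPLE_LOG = """\
Pass 1: Checking inodes, blocks, and sizes
WARNING: inode 75012 not using extents, skipping block check.
WARNING: inode 75012 not using extents, skipping block check.
WARNING: inode 129999 not using extents, skipping block check.
"""

if __name__ == "__main__":
    print("\n".join(clear_commands(SAMPLE_LOG)))
```

Since `clri` wipes the inode, the list is worth reviewing before it is fed to `debugfs -w -f`.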

            Anyway this should give you some idea of the level and types of corruption we're seeing.

            kitwestneat Kit Westneat (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              orentas Oz Rentas (Inactive)