Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2
    • Environment: lustre-2.1.0-13chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
      toss/chaos 5
      NetApp 22TB LUNs
    • 3
    • 2172

    Description

      We have been running IOR testing on hyperion with toss 5 and have seen ldiskfs corruption. Since I know you have access to hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub, and santricity can be run from there as well).

      I have set up a test filesystem called /p/ls1, created with large LUNs (22TB per LUN, 6 LUNs on each RBOD) on a NetApp. The MDS is on hyperion-agb25 and the two OSS nodes are hyperion-agb27 and hyperion-agb28. I had 10 clients writing I/O to the filesystem and would power cycle an OSS every hour to simulate a node crashing. After bringing the OSS back up I would run a full fsck to check for errors, bring Lustre up again, and continue the I/O load from the clients. We hit a bug where fsck reports corruption and Lustre will not mount.

      As a side note, I ran the same testing in parallel on the same hardware, but with a small 3TB LUN size, and did not hit this issue.
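      Editorial note: the size dependence reported here (22TB LUNs fail, 3TB LUNs do not) lines up with the limit of 32-bit block numbers. A quick sanity check, assuming the common 4 KiB ldiskfs block size (the ticket does not state the block size):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
          /* Assumption: 4 KiB filesystem blocks (a common ldiskfs default;
           * the ticket does not state the block size). */
          const uint64_t block_size = 4096;

          /* Largest byte offset addressable with a 32-bit block number. */
          const uint64_t limit = (1ULL << 32) * block_size;
          const uint64_t tib = 1ULL << 40;

          printf("32-bit block-number limit: %llu TiB\n",
                 (unsigned long long)(limit / tib));   /* prints 16 TiB */

          /* A 22TB LUN contains blocks above that limit; a 3TB LUN does
           * not (treating TB loosely as TiB for the comparison). */
          assert(22 * tib > limit);
          assert(3 * tib < limit);
          return 0;
      }
      ```

      Only the large-LUN filesystem has blocks whose numbers need more than 32 bits, which is consistent with the small-LUN run never hitting the problem.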

      zgrep Mounting ../conman.old/console.hyperion-agb27-20120115.gz

      2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 15:21:57 Mounting local filesystems: [ OK ]
      2012-01-14 15:22:03 Mounting other filesystems: [ OK ]
      2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
      2012-01-14 16:21:55 Mounting other filesystems: [ OK ]
      2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
      2012-01-14 17:21:59 Mounting other filesystems: [ OK ]
      2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
      2012-01-14 18:22:06 Mounting other filesystems: [ OK ]
      2012-01-14 19:21:56 Mounting local filesystems: [ OK ]
      2012-01-14 19:22:02 Mounting other filesystems: [ OK ]
      2012-01-14 20:21:52 Mounting local filesystems: [ OK ]
      2012-01-14 20:21:58 Mounting other filesystems: [ OK ]

      It appears that the corruption occurred back on 1/14, some time after 16:20.
      I state this based on the fact that the OSTs did not come back after the power cycle on 1/14 at 17:20.

      Also, the following fsck results surfaced only after that power cycle:

      2012-01-14 17:23:48 Group descriptor 0 checksum is invalid. FIXED.
      2012-01-14 17:23:48 Group descriptor 1 checksum is invalid. FIXED.
      2012-01-14 17:23:48 Group descriptor 2 checksum is invalid. FIXED.
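      For readers unfamiliar with the message: ext4/ldiskfs protects each block-group descriptor with a CRC-16 computed over the filesystem UUID, the group number, and the descriptor with its checksum field excluded, so "checksum is invalid" means the descriptor bytes no longer match what was checksummed. A minimal sketch of that verification, using a simplified descriptor layout rather than the real on-disk struct:

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* CRC-16 with the reflected polynomial 0xA001, the variant ext4
       * uses for group descriptor checksums. */
      static uint16_t crc16(uint16_t crc, const void *buf, size_t len) {
          const uint8_t *p = buf;
          while (len--) {
              crc ^= *p++;
              for (int i = 0; i < 8; i++)
                  crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : (crc >> 1);
          }
          return crc;
      }

      /* Simplified descriptor: only the checksum field needs to be
       * separate for this sketch. */
      struct gdesc {
          uint8_t  fields[30];
          uint16_t bg_checksum;
      };

      /* Checksum covers UUID + group number + descriptor minus the
       * checksum field itself. */
      static uint16_t gdesc_checksum(const uint8_t uuid[16], uint32_t group,
                                     const struct gdesc *g) {
          uint16_t crc = 0xFFFF;        /* ext4 seeds with ~0 */
          crc = crc16(crc, uuid, 16);
          crc = crc16(crc, &group, 4);  /* assumes a little-endian host */
          crc = crc16(crc, g->fields, sizeof g->fields);
          return crc;
      }

      int main(void) {
          uint8_t uuid[16] = {0};
          struct gdesc g;
          memset(g.fields, 0xAB, sizeof g.fields);

          g.bg_checksum = gdesc_checksum(uuid, 0, &g);
          assert(g.bg_checksum == gdesc_checksum(uuid, 0, &g));  /* clean */

          g.fields[5] ^= 0x01;  /* a single flipped bit on disk... */
          assert(g.bg_checksum != gdesc_checksum(uuid, 0, &g));  /* ...is caught */
          return 0;
      }
      ```

      The "FIXED" part means fsck recomputed and rewrote the checksums; the interesting question in this ticket is why the descriptors were stale in the first place.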

      Attachments

        1. after.tar
          73 kB
        2. before.tar
          4 kB
        3. full.sdd.1021.log.gz
          2.13 MB
        4. LU1015.log.gz
          5.30 MB
        5. sdb.20120305.5143.stats.gz
          9.34 MB
        6. sdb.20120305.5143.stats.post.gz
          9.34 MB
        7. sdb.20121805.1805.stats.gz
          9.44 MB
        8. sdb.20125805.5815.stats.gz
          9.44 MB
        9. sdb.20125805.5815.stats.post.gz
          9.44 MB
        10. sdb.fail.tar
          60 kB
        11. sdd.20120305.1910.stats.gz
          9.33 MB
        12. sdd.20120305.1910.stats.post.gz
          9.33 MB
        13. sdd.20120306.1204.stats.gz
          9.34 MB
        14. sdd.20120306.1204.stats.post.gz
          9.34 MB
        15. sdd.20120522.2011.e2fsck.gz
          0.2 kB
        16. sdd.20120522.2011.journal.gz
          0.0 kB
        17. sdd.20120522.2011.logdump.gz
          1.68 MB
        18. sdd.20120522.2011.stats.gz
          9.36 MB
        19. sdd.20120522.2011.stats.post.gz
          9.36 MB
        20. sdd.20120522.2323.e2fsck.gz
          0.3 kB
        21. sdd.20120522.2323.journal.gz
          0.0 kB
        22. sdd.20120522.2323.logdump.gz
          26 kB
        23. sdd.20120522.2323.stats.gz
          9.44 MB
        24. sdd.20120522.2323.stats.post.gz
          9.44 MB
        25. sdd.20125905.5946.stats.gz
          9.34 MB
        26. sdd.20125905.5946.stats.post.gz
          9.34 MB
        27. sdd.ext4.full.fsck.txt.gz
          2 kB
        28. sdd.fail.1.tar
          10 kB
        29. sdd.full.fsck.txt.gz
          59 kB

        Activity

          [LU-1015] ldiskfs corruption with large LUNs

          adilger Andreas Dilger added a comment:

          Problem is fixed in released e2fsprogs-1.42.3.wc1.

          adilger Andreas Dilger added a comment:

          I've been able to reproduce this bug in vanilla e2fsck, and the problem exists only for large extent-mapped files that are being truncated at the time of a crash.

          cliffw Cliff White (Inactive) added a comment:

          The error occurred on the second run. The system ran large-lun.sh successfully prior to this.

          adilger Andreas Dilger added a comment:

          Cliff, how many runs did it take to hit this error?

          I don't think this is related to the problem seen before. Truncating orphan inodes on recovery is normal behaviour when a file is in the middle of being truncated at crash time. It looks like this handling isn't tested very often and has a bug: the "Truncating orphaned inode" message means the inode should be truncated to size=0 bytes, but then e2fsck gets confused, detects that the file size is smaller than the allocated blocks, and resets the size to cover the allocated blocks. This should be filed and fixed separately.
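          Editorial note: the interaction described in that comment can be sketched in a few lines. The structs and function names below are hypothetical stand-ins for illustration, not the real e2fsck code:

          ```c
          #include <assert.h>
          #include <stdint.h>

          /* Hypothetical stand-ins for the inode state e2fsck tracks. */
          struct inode_state {
              uint64_t i_size;        /* logical file size in bytes */
              uint64_t alloc_blocks;  /* blocks still attached to the inode */
          };

          /* Orphan recovery: "Truncating orphaned inode ... size=0" means
           * the size is reset to zero. The bug described above is that the
           * blocks past the new EOF are not all released. */
          static void truncate_orphan(struct inode_state *ino) {
              ino->i_size = 0;
              /* blocks that should have been freed stay in alloc_blocks */
          }

          /* A later size check then sees i_size smaller than the allocated
           * space and "fixes" the size upward, undoing the truncate. */
          static int fix_size(struct inode_state *ino, uint64_t block_size) {
              uint64_t covered = ino->alloc_blocks * block_size;
              if (ino->i_size < covered) {
                  ino->i_size = covered;  /* "i_size is 0, should be ... FIXED." */
                  return 1;
              }
              return 0;
          }

          int main(void) {
              /* 4096 blocks of 4 KiB = 16777216 bytes, matching the fsck
               * output quoted elsewhere in this ticket. */
              struct inode_state ino = { .i_size = 16777216, .alloc_blocks = 4096 };

              truncate_orphan(&ino);
              assert(ino.i_size == 0);

              assert(fix_size(&ino, 4096) == 1);
              assert(ino.i_size == 16777216);  /* truncate silently undone */
              return 0;
          }
          ```

          The two steps are individually reasonable; the conflict only appears when orphan truncation leaves blocks behind for the size check to "repair" against.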

          cliffw Cliff White (Inactive) added a comment:

          The file is lu1015.060812.tar.gz on the FTP site.

          cliffw Cliff White (Inactive) added a comment:

          Running with the latest e2fsprogs, one error recovered; logs attached.

          /dev/vglu1015/lv1015_hi: catastrophic mode - not reading inode or group bitmaps
          lu1015-OST0000: recovering journal
          lu1015-OST0000: Truncating orphaned inode 78643270 (uid=0, gid=0, mode=0100666, size=0)
          lu1015-OST0000: Inode 78643270, i_size is 0, should be 16777216. FIXED.
          lu1015-OST0000: 119/182453760 files (1.7% non-contiguous), 57592447/3113852928 blocks

          adilger Andreas Dilger added a comment:

          e2fsprogs-1.42.3.wc1 (tag v1.42.3.wc1 in git) has been built and packages are available for testing:

          http://build.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el6/

          Cliff, could you please give this a test (even better, run it in a loop) and see if it resolves the problem?

          adilger Andreas Dilger added a comment:

          Bumping priority on this for tracking. It is a bug in e2fsprogs, not Lustre, but making it a blocker ensures it will get continuous attention.

          adilger Andreas Dilger added a comment:

          Cliff, over the weekend there was a posting on the linux-ext4 list with an e2fsck patch that may resolve this problem. It seems that the root of the problem is in e2fsck itself, not ldiskfs or ext4, but it is only seen if there are blocks in the journal to be recovered beyond 16TB, which is why it didn't show up regularly in testing.

          The posted patch is larger, since it also fixes some further 64-bit block-number problems on 32-bit systems, but the gist of the patch is below.

          From 3b693d0b03569795d04920a04a0a21e5f64ffedc Mon Sep 17 00:00:00 2001
          From: Theodore Ts'o <tytso@mit.edu>
          Date: Mon, 21 May 2012 21:30:45 -0400
          Subject: [PATCH] e2fsck: fix 64-bit journal support
          
          64-bit journal support was broken; we weren't using the high bits from
          the journal descriptor blocks in some cases!
          
          Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
          ---
          e2fsck/jfs_user.h |    4 ++--
          e2fsck/journal.c  |   33 +++++++++++++++++----------------
          e2fsck/recovery.c |   25 ++++++++++++-------------
          3 files changed, 31 insertions(+), 31 deletions(-)
          
          diff --git a/e2fsck/jfs_user.h b/e2fsck/jfs_user.h
          index 9e33306..92f8ae2 100644
          --- a/e2fsck/jfs_user.h
          +++ b/e2fsck/jfs_user.h
          @@ -18,7 +18,7 @@ struct buffer_head {
           	e2fsck_t	b_ctx;
           	io_channel 	b_io;
           	int	 	b_size;
          -	blk_t	 	b_blocknr;
          +	unsigned long long b_blocknr;
           	int	 	b_dirty;
           	int	 	b_uptodate;
           	int	 	b_err;
          diff --git a/e2fsck/recovery.c b/e2fsck/recovery.c
          index b669941..e94ef4e 100644
          --- a/e2fsck/recovery.c
          +++ b/e2fsck/recovery.c
          @@ -309,7 +309,6 @@ int journal_skip_recovery(journal_t *journal)
           	return err;
           }
           
          -#if 0
           static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
           {
           	unsigned long long block = be32_to_cpu(tag->t_blocknr);
          @@ -317,7 +316,6 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
           		block |= (__u64)be32_to_cpu(tag->t_blocknr_high) << 32;
           	return block;
           }
          -#endif
           
           /*
            * calc_chksums calculates the checksums for the blocks described in the
            * descriptor block.
          @@ -506,7 +504,8 @@ static int do_one_pass(journal_t *journal,
           				unsigned long blocknr;
           
           				J_ASSERT(obh != NULL);
          -				blocknr = be32_to_cpu(tag->t_blocknr);
          +				blocknr = read_tag_block(tag_bytes,
          +							 tag);
           
           				/* If the block has been
           				 * revoked, then we're all done
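          Editorial note: the effect of the dropped high word is easy to demonstrate in isolation. The tag struct and the byte order below are simplified stand-ins (host order, hypothetical layout), not the real JBD definitions:

          ```c
          #include <assert.h>
          #include <stdint.h>
          #include <stdio.h>

          /* Simplified journal block tag; the real JBD tag stores
           * big-endian fields, modeled in host order here for clarity. */
          typedef struct {
              uint32_t t_blocknr;       /* low 32 bits of the target block */
              uint32_t t_flags;
              uint32_t t_blocknr_high;  /* high 32 bits (64-bit journals) */
          } tag_t;

          /* Pre-patch behaviour: a 32-bit result drops the high word. */
          static uint32_t read_tag_block_buggy(const tag_t *tag) {
              return tag->t_blocknr;
          }

          /* Patched behaviour: merge in the high 32 bits when the journal
           * uses the larger 64-bit tag format (tag size > 8 bytes). */
          static uint64_t read_tag_block_fixed(int tag_bytes, const tag_t *tag) {
              uint64_t block = tag->t_blocknr;
              if (tag_bytes > 8)
                  block |= (uint64_t)tag->t_blocknr_high << 32;
              return block;
          }

          int main(void) {
              /* A block just past 2^32: with 4 KiB blocks this sits beyond
               * the 16 TiB mark, which only the large LUNs reach. */
              tag_t tag = { .t_blocknr = 100, .t_flags = 0, .t_blocknr_high = 1 };

              uint64_t right = read_tag_block_fixed(12, &tag);
              uint32_t wrong = read_tag_block_buggy(&tag);

              assert(right == ((1ULL << 32) | 100));
              assert(wrong == 100);  /* replay would rewrite block 100 instead */
              printf("fixed: %llu  buggy: %u\n", (unsigned long long)right, wrong);
              return 0;
          }
          ```

          With the truncated value, journal replay writes recovered data to a low-numbered block rather than the intended one past 16TB, which matches the group-descriptor corruption seen after the power cycles.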

          cliffw Cliff White (Inactive) added a comment:

          I have reformatted with ext4, running IOR locally, and have had one failure; results attached.

          People

            adilger Andreas Dilger
            cindyheer cindy heer (Inactive)
             Votes: 0
             Watchers: 12

            Dates

              Created:
              Updated:
              Resolved: