Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2
    • Environment: lustre-2.1.0-13chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
      toss/chaos 5
      NetApp 22TB LUNs
    • 3
    • 2172

    Description

      We have been running IOR testing on hyperion with toss 5 and have seen ldiskfs corruption. Since I know you have access to hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub, and santricity can be run from there as well).

      I have set up a test filesystem called /p/ls1, created with large LUNs (22TB per LUN, 6 LUNs on each RBOD) on a NetApp. The MDS is on hyperion-agb25 and the two OSS nodes are hyperion-agb27 and hyperion-agb28. I had 10 clients writing I/O to the filesystem and would power cycle an OSS every hour to simulate a node crashing. After bringing the OSS back up I would run a full fsck to check for errors, bring Lustre up again, and continue the I/O load from the clients. We hit a bug where fsck reports corruption and Lustre will not mount.

      As a side note, I ran the same testing in parallel on the same hardware, but with a small 3TB LUN size, and did not hit this issue.
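      Editorial note: the size dependence reported here (22TB LUNs fail, 3TB LUNs do not) lines up with the limit of 32-bit block numbers. A quick sanity check, assuming the common 4 KiB ldiskfs block size (the ticket does not state the block size):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
          /* Assumption: 4 KiB filesystem blocks (a common ldiskfs default;
           * the ticket does not state the block size). */
          const uint64_t block_size = 4096;

          /* Largest byte offset addressable with a 32-bit block number. */
          const uint64_t limit = (1ULL << 32) * block_size;
          const uint64_t tib = 1ULL << 40;

          printf("32-bit block-number limit: %llu TiB\n",
                 (unsigned long long)(limit / tib));   /* prints 16 TiB */

          /* A 22TB LUN contains blocks above that limit; a 3TB LUN does
           * not (treating TB loosely as TiB for the comparison). */
          assert(22 * tib > limit);
          assert(3 * tib < limit);
          return 0;
      }
      ```

      Only the large-LUN filesystem has blocks whose numbers need more than 32 bits, which is consistent with the small-LUN run never hitting the problem.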

      zgrep Mounting ../conman.old/console.hyperion-agb27-20120115.gz

      2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 15:21:57 Mounting local filesystems: [ OK ]
      2012-01-14 15:22:03 Mounting other filesystems: [ OK ]
      2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
      2012-01-14 16:21:55 Mounting other filesystems: [ OK ]
      2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
      2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
      2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
      2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
      2012-01-14 17:21:59 Mounting other filesystems: [ OK ]
      2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
      2012-01-14 18:22:06 Mounting other filesystems: [ OK ]
      2012-01-14 19:21:56 Mounting local filesystems: [ OK ]
      2012-01-14 19:22:02 Mounting other filesystems: [ OK ]
      2012-01-14 20:21:52 Mounting local filesystems: [ OK ]
      2012-01-14 20:21:58 Mounting other filesystems: [ OK ]

      It appears that the corruption occurred back on 1/14, some time after 16:20.
      I state this based on the fact that the OSTs did not come back after the power cycle on 1/14 at 17:20.

      Also, the following fsck results surfaced only after that power cycle:

      2012-01-14 17:23:48 Group descriptor 0 checksum is invalid. FIXED.
      2012-01-14 17:23:48 Group descriptor 1 checksum is invalid. FIXED.
      2012-01-14 17:23:48 Group descriptor 2 checksum is invalid. FIXED.
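      For readers unfamiliar with the message: ext4/ldiskfs protects each block-group descriptor with a CRC-16 computed over the filesystem UUID, the group number, and the descriptor with its checksum field excluded, so "checksum is invalid" means the descriptor bytes no longer match what was checksummed. A minimal sketch of that verification, using a simplified descriptor layout rather than the real on-disk struct:

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* CRC-16 with the reflected polynomial 0xA001, the variant ext4
       * uses for group descriptor checksums. */
      static uint16_t crc16(uint16_t crc, const void *buf, size_t len) {
          const uint8_t *p = buf;
          while (len--) {
              crc ^= *p++;
              for (int i = 0; i < 8; i++)
                  crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : (crc >> 1);
          }
          return crc;
      }

      /* Simplified descriptor: only the checksum field needs to be
       * separate for this sketch. */
      struct gdesc {
          uint8_t  fields[30];
          uint16_t bg_checksum;
      };

      /* Checksum covers UUID + group number + descriptor minus the
       * checksum field itself. */
      static uint16_t gdesc_checksum(const uint8_t uuid[16], uint32_t group,
                                     const struct gdesc *g) {
          uint16_t crc = 0xFFFF;        /* ext4 seeds with ~0 */
          crc = crc16(crc, uuid, 16);
          crc = crc16(crc, &group, 4);  /* assumes a little-endian host */
          crc = crc16(crc, g->fields, sizeof g->fields);
          return crc;
      }

      int main(void) {
          uint8_t uuid[16] = {0};
          struct gdesc g;
          memset(g.fields, 0xAB, sizeof g.fields);

          g.bg_checksum = gdesc_checksum(uuid, 0, &g);
          assert(g.bg_checksum == gdesc_checksum(uuid, 0, &g));  /* clean */

          g.fields[5] ^= 0x01;  /* a single flipped bit on disk... */
          assert(g.bg_checksum != gdesc_checksum(uuid, 0, &g));  /* ...is caught */
          return 0;
      }
      ```

      The "FIXED" part means fsck recomputed and rewrote the checksums; the interesting question in this ticket is why the descriptors were stale in the first place.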

      Attachments

        1. after.tar
          73 kB
        2. before.tar
          4 kB
        3. full.sdd.1021.log.gz
          2.13 MB
        4. LU1015.log.gz
          5.30 MB
        5. sdb.20120305.5143.stats.gz
          9.34 MB
        6. sdb.20120305.5143.stats.post.gz
          9.34 MB
        7. sdb.20121805.1805.stats.gz
          9.44 MB
        8. sdb.20125805.5815.stats.gz
          9.44 MB
        9. sdb.20125805.5815.stats.post.gz
          9.44 MB
        10. sdb.fail.tar
          60 kB
        11. sdd.20120305.1910.stats.gz
          9.33 MB
        12. sdd.20120305.1910.stats.post.gz
          9.33 MB
        13. sdd.20120306.1204.stats.gz
          9.34 MB
        14. sdd.20120306.1204.stats.post.gz
          9.34 MB
        15. sdd.20120522.2011.e2fsck.gz
          0.2 kB
        16. sdd.20120522.2011.journal.gz
          0.0 kB
        17. sdd.20120522.2011.logdump.gz
          1.68 MB
        18. sdd.20120522.2011.stats.gz
          9.36 MB
        19. sdd.20120522.2011.stats.post.gz
          9.36 MB
        20. sdd.20120522.2323.e2fsck.gz
          0.3 kB
        21. sdd.20120522.2323.journal.gz
          0.0 kB
        22. sdd.20120522.2323.logdump.gz
          26 kB
        23. sdd.20120522.2323.stats.gz
          9.44 MB
        24. sdd.20120522.2323.stats.post.gz
          9.44 MB
        25. sdd.20125905.5946.stats.gz
          9.34 MB
        26. sdd.20125905.5946.stats.post.gz
          9.34 MB
        27. sdd.ext4.full.fsck.txt.gz
          2 kB
        28. sdd.fail.1.tar
          10 kB
        29. sdd.full.fsck.txt.gz
          59 kB

        Activity

          [LU-1015] ldiskfs corruption with large LUNs

          adilger Andreas Dilger added a comment:

          Problem is fixed in released e2fsprogs-1.42.3.wc1.

          adilger Andreas Dilger added a comment:

          I've been able to reproduce this bug in vanilla e2fsck, and the problem exists only for large extent-mapped files that are being truncated at the time of a crash.

          cliffw Cliff White (Inactive) added a comment:

          The error occurred on the second run. The system ran large-lun.sh successfully prior to this.

          adilger Andreas Dilger added a comment:

          Cliff, how many runs did it take to hit this error?

          I don't think this is related to the problem seen before. Truncating orphan inodes on recovery is normal behaviour when a file is in the middle of being truncated at crash time. It looks like this handling isn't tested very often and has a bug: the "Truncating orphaned inode" message means the inode should be truncated to size=0 bytes, but then e2fsck gets confused, detects that the file size is smaller than the allocated blocks, and resets the size to cover the allocated blocks. This should be filed and fixed separately.
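          Editorial note: the interaction described in that comment can be sketched in a few lines. The structs and function names below are hypothetical stand-ins for illustration, not the real e2fsck code:

          ```c
          #include <assert.h>
          #include <stdint.h>

          /* Hypothetical stand-ins for the inode state e2fsck tracks. */
          struct inode_state {
              uint64_t i_size;        /* logical file size in bytes */
              uint64_t alloc_blocks;  /* blocks still attached to the inode */
          };

          /* Orphan recovery: "Truncating orphaned inode ... size=0" means
           * the size is reset to zero. The bug described above is that the
           * blocks past the new EOF are not all released. */
          static void truncate_orphan(struct inode_state *ino) {
              ino->i_size = 0;
              /* blocks that should have been freed stay in alloc_blocks */
          }

          /* A later size check then sees i_size smaller than the allocated
           * space and "fixes" the size upward, undoing the truncate. */
          static int fix_size(struct inode_state *ino, uint64_t block_size) {
              uint64_t covered = ino->alloc_blocks * block_size;
              if (ino->i_size < covered) {
                  ino->i_size = covered;  /* "i_size is 0, should be ... FIXED." */
                  return 1;
              }
              return 0;
          }

          int main(void) {
              /* 4096 blocks of 4 KiB = 16777216 bytes, matching the fsck
               * output quoted elsewhere in this ticket. */
              struct inode_state ino = { .i_size = 16777216, .alloc_blocks = 4096 };

              truncate_orphan(&ino);
              assert(ino.i_size == 0);

              assert(fix_size(&ino, 4096) == 1);
              assert(ino.i_size == 16777216);  /* truncate silently undone */
              return 0;
          }
          ```

          The two steps are individually reasonable; the conflict only appears when orphan truncation leaves blocks behind for the size check to "repair" against.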

          cliffw Cliff White (Inactive) added a comment:

          The file is lu1015.060812.tar.gz on the FTP site.

          cliffw Cliff White (Inactive) added a comment:

          Running with the latest e2fsprogs, one error recovered; logs attached.

          /dev/vglu1015/lv1015_hi: catastrophic mode - not reading inode or group bitmaps
          lu1015-OST0000: recovering journal
          lu1015-OST0000: Truncating orphaned inode 78643270 (uid=0, gid=0, mode=0100666, size=0)
          lu1015-OST0000: Inode 78643270, i_size is 0, should be 16777216. FIXED.
          lu1015-OST0000: 119/182453760 files (1.7% non-contiguous), 57592447/3113852928 blocks

          adilger Andreas Dilger added a comment:

          e2fsprogs-1.42.3.wc1 (tag v1.42.3.wc1 in git) has been built and packages are available for testing:

          http://build.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el6/

          Cliff, could you please give this a test (even better, run it in a loop) and see if it resolves the problem?

          adilger Andreas Dilger added a comment:

          Bumping priority on this for tracking. It is a bug in e2fsprogs, not Lustre, but making it a blocker ensures it will get continuous attention.

          adilger Andreas Dilger added a comment:

          Cliff, over the weekend there was a posting on the linux-ext4 list with an e2fsck patch that may resolve this problem. It seems that the root of the problem is in e2fsck itself, not ldiskfs or ext4, but it is only seen if there are blocks in the journal to be recovered beyond 16TB, which is why it didn't show up regularly in testing.

          The posted patch is larger, since it also fixes some further 64-bit block-number problems on 32-bit systems, but the gist of the patch is below.

          From 3b693d0b03569795d04920a04a0a21e5f64ffedc Mon Sep 17 00:00:00 2001
          From: Theodore Ts'o <tytso@mit.edu>
          Date: Mon, 21 May 2012 21:30:45 -0400
          Subject: [PATCH] e2fsck: fix 64-bit journal support
          
          64-bit journal support was broken; we weren't using the high bits from
          the journal descriptor blocks in some cases!
          
          Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
          ---
          e2fsck/jfs_user.h |    4 ++--
          e2fsck/journal.c  |   33 +++++++++++++++++----------------
          e2fsck/recovery.c |   25 ++++++++++++-------------
          3 files changed, 31 insertions(+), 31 deletions(-)
          
          diff --git a/e2fsck/jfs_user.h b/e2fsck/jfs_user.h
          index 9e33306..92f8ae2 100644
          --- a/e2fsck/jfs_user.h
          +++ b/e2fsck/jfs_user.h
          @@ -18,7 +18,7 @@ struct buffer_head {
           	e2fsck_t	b_ctx;
           	io_channel 	b_io;
           	int	 	b_size;
          -	blk_t	 	b_blocknr;
          +	unsigned long long b_blocknr;
           	int	 	b_dirty;
           	int	 	b_uptodate;
           	int	 	b_err;
          diff --git a/e2fsck/recovery.c b/e2fsck/recovery.c
          index b669941..e94ef4e 100644
          --- a/e2fsck/recovery.c
          +++ b/e2fsck/recovery.c
          @@ -309,7 +309,6 @@ int journal_skip_recovery(journal_t *journal)
           	return err;
           }
           
          -#if 0
           static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
           {
           	unsigned long long block = be32_to_cpu(tag->t_blocknr);
          @@ -317,7 +316,6 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
           		block |= (__u64)be32_to_cpu(tag->t_blocknr_high) << 32;
           	return block;
           }
          -#endif
           
           /*
            * calc_chksums calculates the checksums for the blocks described in the
            * descriptor block.
          @@ -506,7 +504,8 @@ static int do_one_pass(journal_t *journal,
           				unsigned long blocknr;
           
           				J_ASSERT(obh != NULL);
          -				blocknr = be32_to_cpu(tag->t_blocknr);
          +				blocknr = read_tag_block(tag_bytes,
          +							 tag);
           
           				/* If the block has been
           				 * revoked, then we're all done
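          Editorial note: the effect of the dropped high word is easy to demonstrate in isolation. The tag struct and the byte order below are simplified stand-ins (host order, hypothetical layout), not the real JBD definitions:

          ```c
          #include <assert.h>
          #include <stdint.h>
          #include <stdio.h>

          /* Simplified journal block tag; the real JBD tag stores
           * big-endian fields, modeled in host order here for clarity. */
          typedef struct {
              uint32_t t_blocknr;       /* low 32 bits of the target block */
              uint32_t t_flags;
              uint32_t t_blocknr_high;  /* high 32 bits (64-bit journals) */
          } tag_t;

          /* Pre-patch behaviour: a 32-bit result drops the high word. */
          static uint32_t read_tag_block_buggy(const tag_t *tag) {
              return tag->t_blocknr;
          }

          /* Patched behaviour: merge in the high 32 bits when the journal
           * uses the larger 64-bit tag format (tag size > 8 bytes). */
          static uint64_t read_tag_block_fixed(int tag_bytes, const tag_t *tag) {
              uint64_t block = tag->t_blocknr;
              if (tag_bytes > 8)
                  block |= (uint64_t)tag->t_blocknr_high << 32;
              return block;
          }

          int main(void) {
              /* A block just past 2^32: with 4 KiB blocks this sits beyond
               * the 16 TiB mark, which only the large LUNs reach. */
              tag_t tag = { .t_blocknr = 100, .t_flags = 0, .t_blocknr_high = 1 };

              uint64_t right = read_tag_block_fixed(12, &tag);
              uint32_t wrong = read_tag_block_buggy(&tag);

              assert(right == ((1ULL << 32) | 100));
              assert(wrong == 100);  /* replay would rewrite block 100 instead */
              printf("fixed: %llu  buggy: %u\n", (unsigned long long)right, wrong);
              return 0;
          }
          ```

          With the truncated value, journal replay writes recovered data to a low-numbered block rather than the intended one past 16TB, which matches the group-descriptor corruption seen after the power cycles.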

          cliffw Cliff White (Inactive) added a comment:

          I have reformatted with ext4, running IOR locally, and have had one failure; results attached.

          People

            adilger Andreas Dilger
            cindyheer cindy heer (Inactive)
             Votes: 0
             Watchers: 12

            Dates

              Created:
              Updated:
              Resolved: