[LU-1015] ldiskfs corruption with large LUNs Created: 18/Jan/12 Updated: 11/Jun/12 Resolved: 11/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | cindy heer (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ldiskfs, paj | ||
| Environment: |
lustre-2.1.0-13chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64 |
||
| Attachments: |
|
| Severity: | 3 |
| Epic: | metadata |
| Rank (Obsolete): | 2172 |
| Description |
|
We have been running IOR testing on Hyperion with TOSS 5 and have seen ldiskfs corruption. Since I know you have access to Hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub and SANtricity can be run from there as well). I have set up a test filesystem called /p/ls1 created with large LUNs (22TB per LUN, 6 LUNs on each RBOD) on a NetApp. The MDS is on hyperion-agb25 and the two OSS nodes are hyperion-agb27 and hyperion-agb28. I had 10 clients writing I/O to the filesystem and would power cycle an OSS every hour to simulate a node crash. After bringing the OSS back up I would run the full fsck to check for errors, bring Lustre up again, and continue the I/O load from the clients. We hit a bug where the fsck shows corruption and Lustre will not mount. As a side note, I was running the same testing in parallel on the same hardware, but with a small 3TB LUN size, and did not hit this issue.

zgrep Mounting ../conman.old/console.hyperion-agb27-20120115.gz
2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000

It appears that the corruption occurred back on Friday 1/14 after 16:20. Also, the following fsck result only surfaced after that power cycle:

2012-01-14 17:23:48 Group descriptor 0 checksum is invalid. FIXED. |
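A rough sketch of this hourly crash-and-check cycle (illustrative only: the BMC address, credentials, and fsck options are placeholders, and the actual test used the site's own power-control tooling; the device and mount point come from the console log above):

# Hedged sketch of the hourly crash/fsck/remount cycle described above.
# The client IOR load keeps running separately the whole time.
while true; do
    sleep 3600
    # Simulate an OSS crash with an unclean power cycle via the BMC
    ipmitool -I lanplus -H hyperion-agb27-bmc -U admin -P secret chassis power cycle
    sleep 600    # wait for the node to come back up
    # Run the full fsck, keep the output, then bring Lustre back up
    ssh hyperion-agb27 'fsck.ldiskfs -f -y /dev/dm-3' 2>&1 | tee fsck.$(date +%F.%H%M).log
    ssh hyperion-agb27 'mount -t lustre /dev/dm-3 /mnt/lustre/local/ls1-OST0000'
done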
| Comments |
| Comment by Peter Jones [ 20/Jan/12 ] |
|
Andreas, can you please comment on this one? Thanks, Peter |
| Comment by Andreas Dilger [ 20/Jan/12 ] |
|
I don't have Hyperion access myself, unfortunately.

Are there any ldiskfs errors on the OSS consoles between 16:23, when the OSTs last successfully mounted, and 17:21, when they apparently failed to mount? Did all of the OSTs fail in a similar manner, or just a single one? Are the group descriptor checksum error messages just the first of many (i.e. are all the group checksums invalid), or is it only for groups 0, 1, 2? Could the OST(s) be mounted after the e2fsck run fixed those checksum errors, or were there other errors/corruption that prevented the OST(s) from mounting?

It looks like you are running the RHEL6.2 (220) kernel, so this should be relatively up-to-date w.r.t. upstream ext4 patches, or at least would hopefully narrow down the number of upstream kernel patches to look at. |
| Comment by Cliff White (Inactive) [ 20/Jan/12 ] |
|
Syslog from 13:20 to ~17:30 2012-01-14 |
| Comment by Cliff White (Inactive) [ 20/Jan/12 ] |
|
I have access, so I took the liberty. On 2012-01-14 the first correctable pfsck errors appear at the 13:23 pfsck.

2012-01-14 17:23:47 ls1-OST0002: recovering journal |
| Comment by cindy heer (Inactive) [ 20/Jan/12 ] |
|
Thanks for looking around. I'm currently trying to build another test |
| Comment by Christopher Morrone [ 20/Jan/12 ] |
|
This corruption occurs as a result of unclean shutdowns of the OSS. No, there are no errors from ldiskfs before this occurs. The disk state is inconsistent after the unclean OSS shutdown, and if allowed to run without a full fsck it will result in ldiskfs panicking the node. A preen fsck (fsck -p, which is the default in our init scripts) will not necessarily catch this; if it fails to catch it and the node starts up, ldiskfs will panic the node later.

No, not all OSTs hit this. It is more-or-less random which OSTs will be corrupt after the unclean shutdown.

Yes, so far fsck will fix the problems and allow ldiskfs to run without error, but we haven't looked too hard to make sure there was no data loss. I think sometimes I did wind up with files in lost+found.

Cindy, if you aren't already, after powering off the nodes you'll want to avoid using the default "fsck.ldiskfs -p" and instead do "fsck.ldiskfs -f -n" to really find the errors, and to avoid fixing them before Whamcloud can look at them. |
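To make the difference concrete, a minimal sketch of the two checks (device path and log location are illustrative):

# Preen mode only makes lightweight fixes and can miss this corruption:
fsck.ldiskfs -p /dev/dm-3
# Forced read-only check: reports every inconsistency but changes nothing,
# preserving the broken state for Whamcloud to analyze.
fsck.ldiskfs -f -n /dev/dm-3 2>&1 | tee /root/fsck.ls1-OST0000.$(date +%F).log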
| Comment by Christopher Morrone [ 20/Jan/12 ] |
|
Actually, there WAS one time that I hosed an OST so badly that I gave up and just reformatted it (I was trying to make progress on 1) unclean power-off |
| Comment by cindy heer (Inactive) [ 30/Jan/12 ] |
|
I have restarted testing on the ls1 filesystem on Hyperion (large LUN testing with NetApp) with the fsck -n in place. I was never running fsck -p for this testing, but I did run the full fsck with -y in the past; that has been modified. I am also running the same testing on the ls3 filesystem on Hyperion (large LUNs with DDN hardware), which has been running for about a week and has not had any failures so far. I continue to run IORs against the filesystems while every hour I simulate an unclean shutdown on an OSS (by powering it off). |
| Comment by cindy heer (Inactive) [ 31/Jan/12 ] |
|
I think I hit a bit of corruption again on the large LUN with NetApp (still no corruption exhibited with the DDN). Here is the output after the OSS is powered off to simulate an OSS crash (I'm running a journal-replay fsck.ldiskfs -p and then a full fsck with the -n flag):

ldev fsck.ldiskfs -p %d
e-agb27: pfsck.ldiskfs /dev/dm-1 /dev/dm-2 /dev/dm-0 -- -f -n -v -t
e-agb27: fsck 1.41.90.1chaos (14-May-2011)
e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)
e-agb27: Pass 1: Checking inodes, blocks, and sizes
e-agb27: Pass 2: Checking directory structure
e-agb27: Pass 3: Checking directory connectivity
e-agb27: Pass 5: Checking group summary information
e-agb27: Pass 2: Checking directory structure
e-agb27: Pass 3: Checking directory connectivity
e-agb27:
e-agb27: Pass 1D: Reconciling multiply-claimed blocks
e-agb27:
e-agb27: (There are 8 inodes containing multiply-claimed blocks.)
e-agb27: Block bitmap differences: |
| Comment by cindy heer (Inactive) [ 31/Jan/12 ] |
|
I will leave the OSS (hyperion-agb27) down for further examination unless I hear otherwise. |
| Comment by Andreas Dilger [ 03/Feb/12 ] |
|
Cindy, thanks for posting the e2fsck output. Looking at the blocks that are duplicate-allocated:

e-agb27: Multiply-claimed block(s) in inode 1769473: 4747954688
e-agb27: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898
e-agb27: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017
e-agb27: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969

These are all beyond the 2^32 block limit (4747986944 = 0x11b008800), so it may be that this problem relates to overflow of 32-bit block numbers somewhere in the IO stack. If you have the logs from the previous e2fsck runs that showed corruption, it would be useful to know whether the type of corruption is always the same or not (attaching a few e2fsck logs would be useful). I have a suspicion based on this one log that it may relate to corruption of the block bitmap during the previous journal recovery or e2fsck (possibly due to 2^32-block truncation), which later causes the blocks to be allocated twice. Several things can be done to begin debugging this, possibly in parallel:
llverfs is a non-destructive test that will write files until the device is full and then read them back and verify the data contents. It is intended to catch errors in the IO path (filesystem, block layer, HBA, driver, controller) related to 32-bit address truncation, but it is not very fast (it may take a couple of days, depending on LUN size and IO rate). If this finds problems, there is a lower-level (filesystem-destructive) "llverdev" tool that will run against the underlying block device and do a full write/read/verify cycle on the device, excluding the filesystem. I don't have high hopes for this finding a problem, but it is useful to eliminate the chance that there are obvious bugs in the IO stack. We ran full llverdev and llverfs tests previously with a DDN SFA10kE + RHEL5 on a 128TB LUN without problems, but there may be problems with RHEL6, the driver, the controller, etc. in your environment.

The debugfs/e2fsck commands are intended to catch (or at least give us some chance to find post-facto) errors in the journal replay and/or e2fsck that are incorrectly marking blocks free, which are later being reallocated.

Also, what version of e2fsprogs is e2fsck-1.41.90-1chaos based on? I now recall after working on this bug that there may have been some 64-bit bugs fixed in e2fsck that could potentially be causing problems as well. |
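As an illustration of the kind of evidence-gathering described here, one possible hedged sequence; the device path and mount point are placeholders, the block/inode numbers come from the earlier log excerpt, and these are not necessarily the exact commands intended:

# Read-only check, keeping the full report for comparison with later runs
e2fsck -f -n /dev/dm-1 2>&1 | tee /root/e2fsck.$(date +%F).log
# Is a multiply-claimed block really marked in use in the block bitmap?
debugfs -R "testb 4747954688" /dev/dm-1
# Which inodes claim these blocks?
debugfs -R "icheck 4747954688 4747984896 4747986944" /dev/dm-1
# Dump one of the claiming inodes to see its extent tree
debugfs -R "stat <1769473>" /dev/dm-1
# Non-destructive write/read/verify through the filesystem (can take days)
llverfs -v /mnt/ls1-ost0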
| Comment by D. Marc Stearman (Inactive) [ 06/Feb/12 ] |
|
Andreas, thanks for the excellent analysis. We have been running this same test to isolate the extent of the corruption. We see this behavior on the 22TB luns on the NetApp hardware, but not on a smaller 3TB partition created on similar luns. We also have not seen it on a DDN SFA10K with 16TB luns, but are reconfiguring the DDN to have luns > 16TB to see if we can reproduce it on that hardware. Whamcloud has access to Hyperion, so please coordinate with the Hyperion team to reserve some hardware, and your folks can run the tests you describe above. |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
Setup with 4 OSTs:

/dev/sda 12T 39G 12T 1% /p/osta

Ran llverfs per Andreas on /dev/sdd - no issues. |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
Multiple files for size reason |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
Before run |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
small files |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
before run |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
after fail |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
after fail |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
Ran test on all disks, only sdd showed failure (so far) |
| Comment by Cliff White (Inactive) [ 05/Mar/12 ] |
|
Repeated test, sdb (2nd 21TB LUN) failed this time. Seems easy to repeat, awaiting further requests. |
| Comment by Andreas Dilger [ 06/Mar/12 ] |
|
Cliff, some questions:
Hitting three failures on the > 16TB LUNs is a fairly good indication that the problem is limited to > 16TB LUN support, and not to the NetApp itself (assuming the < 16TB LUNs are also on the NetApp).

One option would be to run IOR directly on the OSS node against one of the > 16TB LUNs mounted with "-t ldiskfs" instead of "-t lustre", then do a similar hard reset + e2fsck (and other debugging, which is hopefully in a script by now). This would let us see whether this is a problem in the lustre/obdfilter/fsfilt code or in the core ldiskfs code. If this local testing still fails with ldiskfs, then it would be useful to test with ext4 to determine whether the problem is in the base ext4 code.

Looking at the debug logs from the first test, it does appear that the corruption is of the block bitmap above the 16TB mark, allowing the writes to reallocate those blocks. It may be in ldiskfs or ext4, so starting the above testing would also help cut down the number of variables. |
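A hedged sketch of that local (no Lustre target) reproducer; the device name, mount point, and IOR parameters below are illustrative rather than the actual script:

# Mount one >16TB LUN directly as ldiskfs and drive it with local IOR
mkdir -p /mnt/ost_test
mount -t ldiskfs /dev/sdd /mnt/ost_test
mpirun -np 8 IOR -a POSIX -w -F -t 1m -b 4g -o /mnt/ost_test/ior_file &
# Once IO is under way, simulate the crash with an immediate unclean reset
echo b > /proc/sysrq-trigger
# After reboot, inspect the filesystem read-only before any repair
e2fsck -f -n /dev/sdd 2>&1 | tee /tmp/e2fsck.sdd.$(date +%F).log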
| Comment by Cliff White (Inactive) [ 06/Mar/12 ] |
|
Per LLNL: all disks are from NetApp “E5400 RBODs”. All 4 LUNs are attached to a single OSS node. Again, AFAIK these LUNs are all furnished by the same NetApp device. I will perform the local IOR test and report results. |
| Comment by Cliff White (Inactive) [ 06/Mar/12 ] |
|
Ran the test using IOR on a local ldiskfs mount. Failed. Results attached. |
| Comment by Cliff White (Inactive) [ 06/Mar/12 ] |
|
Failure with local ldiskfs mount. |
| Comment by Cliff White (Inactive) [ 06/Mar/12 ] |
|
Options for Lustre format:

mkfs.lustre --reformat --ost --fsname lu1015 --mgsnode=192.168.120.25@o2ib --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd$i & |
| Comment by Andreas Dilger [ 06/Mar/12 ] |
|
Cliff, is this running the LLNL ldiskfs RPM, or the ldiskfs from the Lustre tree? It would be good to run vanilla ext4 powerfail tests a couple of extra times to more positively verify that the bug is not present with ext4, since we know that it is definitely still there for ldiskfs. |
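For the ext4 side of that comparison, a hedged sketch (device and options are illustrative; a >16TB LUN needs an e2fsprogs new enough to create a 64-bit filesystem, so the options actually used may have differed):

# Format one of the >21TB LUNs as plain ext4 and repeat the local power-fail test
mkfs.ext4 -m 0 -O 64bit /dev/sdd
mkdir -p /mnt/ext4_test
mount -t ext4 /dev/sdd /mnt/ext4_test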
| Comment by Cliff White (Inactive) [ 06/Mar/12 ] |
|
rpm -qa | grep disk

I will continue running the powerfail test on ext4. |
| Comment by Cliff White (Inactive) [ 07/Mar/12 ] |
|
Somewhat of a head-scratcher at the moment. I wanted to be certain that only the large disks were seeing errors when formatted as ldiskfs, so last night I ran a test writing to all four disks mounted as ldiskfs in a loop (a short IOR iteration), creating a new file each loop. Today I will re-run the ext4 tests; so far no errors there. |
| Comment by Cliff White (Inactive) [ 08/Mar/12 ] |
|
I have continued running the local (ext4 and ldiskfs) tests, but have not had a failure in two days. |
| Comment by Andreas Dilger [ 09/Mar/12 ] |
|
I thought of another possible way to positively exclude the NetApp from the picture here: use LVM to create PVs and a volume group on the 22TB LUNs, something like (from memory, please check man pages):

pvcreate /dev/sdb /dev/sdd

Create a 32TB LV that is using only the first 16TB of these two LUNs:

lvcreate -n lvtest_lo -L 16T /dev/vgtest /dev/sdb

Create a 12TB LV that is using only the last 6TB of these two LUNs (the size may need to be massaged to consume the rest of the space on both /dev/sdb and /dev/sdd):

lvcreate -n lvtest_hi -L 6T /dev/vgtest /dev/sdb

What this will do is create "lvtest_lo" on storage space that is using only blocks of the NetApp below 16TB, while the ldiskfs filesystem on it is larger than 16TB. Conversely, "lvtest_hi" is smaller than 16TB, but is using blocks of the NetApp above the 16TB limit.

Run the original Lustre IOR test against all 4 LUNs. If "lvtest_lo" hits the problem, then it is positively caused by Lustre or ldiskfs or ext4. If "lvtest_hi" hits the problem, then it is positively a problem in the NetApp, because the ldiskfs filesystem is smaller than 16TB (which didn't hit any failure before). If it hits on /dev/sda or /dev/sdc then we are confused (it could be either again), but I hope not. |
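A hedged, more explicit version of that layout (VG/LV names and extent counts are illustrative; they assume 4MB extents and would need checking against the real PV sizes with pvdisplay):

pvcreate /dev/sdb /dev/sdd
vgcreate vgtest /dev/sdb /dev/sdd
# 16TB = 4194304 x 4MB extents: lvtest_lo spans only the LOW 16TB of each PV,
# giving a 32TB filesystem built entirely from NetApp blocks below 16TB.
lvcreate -n lvtest_lo -l 8388608 vgtest /dev/sdb:0-4194303 /dev/sdd:0-4194303
# lvtest_hi uses only extents ABOVE the 16TB mark of each PV (~6TB each on a
# 22TB LUN), giving a ~12TB filesystem that lives entirely above 16TB.
lvcreate -n lvtest_hi -l 3145728 vgtest /dev/sdb:4194304-5767167 /dev/sdd:4194304-5767167
lvs -o +devices vgtest    # confirm which physical extents each LV landed on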
| Comment by Cliff White (Inactive) [ 09/Mar/12 ] |
|
Here's the new setup for the OSS, will report. LV segment listing (truncated):

--- Segments ---
Logical extent 4194304 to 8388607:
--- Logical volume ---
--- Segments ---
Logical extent 1520436 to 3040871: |
| Comment by Cliff White (Inactive) [ 09/Mar/12 ] |
|
I am not certain this is going to uncover errors, as the performance through LVM is about 1/10 of the native performance. Any ideas for tuning this setup for better numbers? |
| Comment by Andreas Dilger [ 09/Mar/12 ] |
|
That is probably due to seeking between the LUNs and the LV stripes across them. I didn't think the test took that long to run; just start the testing and then reboot. If it needs to be faster there are less "good" tests that could be run. For example, run on only the hi or the lo LVs at one time, alternating, and then see which one fails. That would take twice as long, but not 10x as long. We would need to run several times to be confident that only one config is failing. |
| Comment by Cliff White (Inactive) [ 11/Mar/12 ] |
|
The current issue is that none of the configs are failing; I was running longer in hopes of generating a failure. But a failure has not occurred since last Tuesday on any configuration, so I am currently quite puzzled. There have been hardware changes during this; is it possible that this was a hardware issue and has been fixed by the LLNL controller changes? |
| Comment by Cliff White (Inactive) [ 11/Mar/12 ] |
|
To clarify, my concern is not the length of the tests. Rather, with 1/10 the IO rate we are no longer driving the hardware very hard; if this issue is related to the speed or volume of IO at the device level, we won't be able to replicate it. |
| Comment by Andreas Dilger [ 11/Mar/12 ] |
|
Running on either the hi or lo LVs at one time should get the IO rate to the one LUN back to the original level. If that does not return the symptoms again, then we need to go back to the full LUN testing without LVM to verify that the problem can still be hit. |
| Comment by Cliff White (Inactive) [ 22/May/12 ] |
|
Returned to the simpler setup; using straight ldiskfs I was able to re-create errors on a >21TB OST. Data attached. |
| Comment by Cliff White (Inactive) [ 22/May/12 ] |
|
I have reformatted with ext4, running IOR locally, and have had one failure; results attached. |
| Comment by Andreas Dilger [ 22/May/12 ] |
|
Cliff, over the weekend there was a posting on the linux-ext4 list with an e2fsck patch that may resolve this problem. It seems that the root of the problem is in e2fsck itself, not ldiskfs or ext4, but it is only seen if there are blocks in the journal to be recovered beyond 16TB, which is why it didn't show up regularly in testing. The posted patch is larger, since it also fixes some further 64-bit block number problems on 32-bit systems, but the gist of the patch is below.

From 3b693d0b03569795d04920a04a0a21e5f64ffedc Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Mon, 21 May 2012 21:30:45 -0400
Subject: [PATCH] e2fsck: fix 64-bit journal support

64-bit journal support was broken; we weren't using the high bits from
the journal descriptor blocks in some cases!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 e2fsck/jfs_user.h |  4 ++--
 e2fsck/journal.c  | 33 +++++++++++++++++----------------
 e2fsck/recovery.c | 25 ++++++++++++-------------
 3 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/e2fsck/jfs_user.h b/e2fsck/jfs_user.h
index 9e33306..92f8ae2 100644
--- a/e2fsck/jfs_user.h
+++ b/e2fsck/jfs_user.h
@@ -18,7 +18,7 @@ struct buffer_head {
 	e2fsck_t	b_ctx;
 	io_channel	b_io;
 	int		b_size;
-	blk_t		b_blocknr;
+	unsigned long long b_blocknr;
 	int		b_dirty;
 	int		b_uptodate;
 	int		b_err;
diff --git a/e2fsck/recovery.c b/e2fsck/recovery.c
index b669941..e94ef4e 100644
--- a/e2fsck/recovery.c
+++ b/e2fsck/recovery.c
@@ -309,7 +309,6 @@ int journal_skip_recovery(journal_t *journal)
 	return err;
 }
 
-#if 0
 static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
 {
 	unsigned long long block = be32_to_cpu(tag->t_blocknr);
@@ -317,7 +316,6 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
 		block |= (__u64)be32_to_cpu(tag->t_blocknr_high) << 32;
 	return block;
 }
-#endif
 /*
  * calc_chksums calculates the checksums for the blocks described in the
  * descriptor block.
 */
@@ -506,7 +504,8 @@ static int do_one_pass(journal_t *journal,
 					unsigned long blocknr;
 
 					J_ASSERT(obh != NULL);
-					blocknr = be32_to_cpu(tag->t_blocknr);
+					blocknr = read_tag_block(tag_bytes,
+								 tag);
 
 					/* If the block has been
 					 * revoked, then we're all done |
| Comment by Andreas Dilger [ 23/May/12 ] |
|
Bumping priority on this for tracking. It is a bug in e2fsprogs, not Lustre, but making it a blocker ensures it will get continuous attention. |
| Comment by Andreas Dilger [ 31/May/12 ] |
|
e2fsprogs-1.42.3.wc1 (tag v1.42.3.wc1 in git) has been built and packages are available for testing: http://build.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el6/ Cliff, could you please give this a test (even better to run it in a loop) and see if it resolves the problem? |
| Comment by Cliff White (Inactive) [ 08/Jun/12 ] |
|
Running with latest e2fsprogs, one error recovered; logs attached.

/dev/vglu1015/lv1015_hi: catastrophic mode - not reading inode or group bitmaps |
| Comment by Cliff White (Inactive) [ 08/Jun/12 ] |
|
file is lu1015.060812.tar.gz on the FTP site |
| Comment by Andreas Dilger [ 11/Jun/12 ] |
|
Cliff, how many runs did it take to hit this error? I don't think this is related to the problem seen before. Truncating orphan inodes on recovery is normal behaviour when a file is in the middle of being truncated at crash time. It looks like this handling isn't tested very often and has a bug: the "Truncating orphaned inode" message means the inode should be truncated to size=0 bytes, but e2fsck then gets confused, detects that the file size is smaller than the allocated blocks, and resets the size to cover the allocated blocks. This should be filed and fixed separately. |
| Comment by Cliff White (Inactive) [ 11/Jun/12 ] |
|
The error occurred on the second run. The system ran large-lun.sh successfully prior to this. |
| Comment by Andreas Dilger [ 11/Jun/12 ] |
|
I've been able to reproduce this bug in vanilla e2fsck, and the problem exists only for large extent-mapped files that are being truncated at the time of a crash. |
| Comment by Andreas Dilger [ 11/Jun/12 ] |
|
Problem is fixed in released e2fsprogs-1.42.3.wc1. |