[LU-1015] ldiskfs corruption with large LUNs Created: 18/Jan/12  Updated: 11/Jun/12  Resolved: 11/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: cindy heer (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: ldiskfs, paj
Environment:

lustre-2.1.0-13chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
toss/chaos 5
NetApp 22TB LUNs


Attachments: File LU1015.log.gz     File after.tar     File before.tar     File full.sdd.1021.log.gz     File sdb.20120305.5143.stats.gz     File sdb.20120305.5143.stats.post.gz     File sdb.20121805.1805.stats.gz     File sdb.20125805.5815.stats.gz     File sdb.20125805.5815.stats.post.gz     File sdb.fail.tar     File sdd.20120305.1910.stats.gz     File sdd.20120305.1910.stats.post.gz     File sdd.20120306.1204.stats.gz     File sdd.20120306.1204.stats.post.gz     File sdd.20120522.2011.e2fsck.gz     File sdd.20120522.2011.journal.gz     File sdd.20120522.2011.logdump.gz     File sdd.20120522.2011.stats.gz     File sdd.20120522.2011.stats.post.gz     File sdd.20120522.2323.e2fsck.gz     File sdd.20120522.2323.journal.gz     File sdd.20120522.2323.logdump.gz     File sdd.20120522.2323.stats.gz     File sdd.20120522.2323.stats.post.gz     File sdd.20125905.5946.stats.gz     File sdd.20125905.5946.stats.post.gz     File sdd.ext4.full.fsck.txt.gz     File sdd.fail.1.tar     File sdd.full.fsck.txt.gz    
Severity: 3
Epic: metadata
Rank (Obsolete): 2172

 Description   

We have been running IOR testing on hyperion with toss 5 and have seen ldiskfs corruption. Since I know you have access to hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub and santricity can be run from there as well). I have set up a test filesystem called /p/ls1 created with large LUNs (22TB per LUN, with 6 LUNs on each RBOD) on a NetApp. The MDS is on hyperion-agb25 and the two OSS nodes are hyperion-agb27 and hyperion-agb28. I had 10 clients writing I/O to the filesystem and would power cycle an OSS every hour to simulate a node crashing. After bringing the OSS back up I would run a full fsck to check for errors, bring up Lustre again, and continue the I/O load from the clients. We have now hit a case where fsck shows corruption and Lustre will not mount. As a side note, I was running the same testing in parallel with the same hardware, but with a small 3TB LUN size, and did not hit this issue.

zgrep Mounting ../conman.old/console.hyperion-agb27-20120115.gz

2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001
2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002
2012-01-14 15:21:57 Mounting local filesystems: [ OK ]
2012-01-14 15:22:03 Mounting other filesystems: [ OK ]
2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000
2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001
2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002
2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
2012-01-14 16:21:55 Mounting other filesystems: [ OK ]
2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
2012-01-14 17:21:59 Mounting other filesystems: [ OK ]
2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
2012-01-14 18:22:06 Mounting other filesystems: [ OK ]
2012-01-14 19:21:56 Mounting local filesystems: [ OK ]
2012-01-14 19:22:02 Mounting other filesystems: [ OK ]
2012-01-14 20:21:52 Mounting local filesystems: [ OK ]
2012-01-14 20:21:58 Mounting other filesystems: [ OK ]

It appears that the corruption occurred back on Saturday 1/14 after 16:20.
I base this on the fact that the OSTs did not make it back after the power cycle on 1/14 @ 17:20.

Also, the following fsck errors only surfaced after that power cycle:

2012-01-14 17:23:48 Group descriptor 0 checksum is invalid. FIXED.
2012-01-14 17:23:48 Group descriptor 1 checksum is invalid. FIXED.
2012-01-14 17:23:48 Group descriptor 2 checksum is invalid. FIXED.



 Comments   
Comment by Peter Jones [ 20/Jan/12 ]

Andreas

Can you please comment on this one?

Thanks

Peter

Comment by Andreas Dilger [ 20/Jan/12 ]

I don't have Hyperion access myself, unfortunately.

Are there any ldiskfs errors on the OSS consoles between 16:23, when the OSTs last successfully mounted, and 17:21, when they apparently failed to mount?

Did all of the OSTs fail in a similar manner, or just a single one?

Are the group descriptor checksum error messages just the first of many (i.e. are all the group checksums invalid), or is it only for groups 0, 1, 2?

Could the OST(s) be mounted after the e2fsck run fixed those checksum errors, or were there other errors/corruption that prevented the OST(s) from mounting?

It looks like you are running the RHEL6.2 (220) kernel, so this should be relatively up-to-date w.r.t. upstream ext4 patches, or at least would hopefully narrow down the number of upstream kernel patches to look at.

Comment by Cliff White (Inactive) [ 20/Jan/12 ]

Syslog from 13:20 to ~17:30 2012-01-14

Comment by Cliff White (Inactive) [ 20/Jan/12 ]

I have access, so I took the liberty.
I do see these errors throughout:
2012-01-14 04:22:21 Buffer I/O error on device sdj, logical block 1

On 2012-01-14 the first correctable pfsck errors appear at the 13:23 pfsck.
At the 14:23 pfsck correctable errors appear on two disks.
The 15:23 and 16:23 pfsck are clean.
At 17:23, things explode:

2012-01-14 17:23:47 ls1-OST0002: recovering journal
2012-01-14 17:23:48 fsck.ldiskfs: Group descriptors look bad... trying backup blocks...
2012-01-14 17:23:48 One or more block group descriptor checksums are invalid. Fix? yes
2012-01-14 17:23:48
Followed by a large mess of errors.
I don't see any odd error messages to account for this. Is it possible the device was dual-mounted?
Is the hardware healthy? The syslog from 13:23 to the failure is attached.

Comment by cindy heer (Inactive) [ 20/Jan/12 ]

Thanks for looking around. I'm currently trying to build another test with DDN hardware as a comparison. The device was not dual-mounted and the NetApp hardware exhibits no errors.

Comment by Christopher Morrone [ 20/Jan/12 ]

This corruption occurs as a result of unclean shutdowns of the OSS.

No, there are no errors from ldiskfs before this occurs. The disk state is inconsistent after the unclean OSS shutdown. If allowed to run without a full fsck, it will result in ldiskfs panicking the node.

A preen fsck (fsck -p, which is the default in our init scripts) will not necessarily catch this. If it fails to catch it and the node starts up, ldiskfs will panic the node later.

No, not all OSTs hit this. It is more-or-less random which OSTs will be corrupt after the unclean shutdown.

Yes, so far fsck will fix the problems and allow ldiskfs to run without error. But we haven't looked too hard to make sure there was no data loss. I think sometimes I did wind up with files in lost+found.

Cindy, if you aren't doing so already, after powering off the nodes you'll want to avoid using the default "fsck.ldiskfs -p" and instead run "fsck.ldiskfs -f -n" to really find the errors, and to avoid fixing them before Whamcloud can look at them.
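
For reference, a minimal way to do that across the OST devices (a hedged sketch; the device names and log location are illustrative only):

# Forced, read-only check of each OST device; with -n nothing is modified,
# so the on-disk state is preserved for later analysis.
for dev in /dev/dm-0 /dev/dm-1 /dev/dm-3; do
    fsck.ldiskfs -f -n "$dev" 2>&1 | tee "/tmp/$(basename "$dev").$(date +%Y%m%d).fsck"
done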

Comment by Christopher Morrone [ 20/Jan/12 ]

Actually, there WAS one time that I hosed an OST so badly that I gave up and just reformatted it (I was trying to make progress on LU-874 at the time). But I'm not sure of the steps involved there...it may have been something like:

1) unclean power-off
2) fsck -p (fix something, but not everything?)
3) start ldiskfs, run for a while
4) ldiskfs panics
5) reboot and manually run fsck -f -y, something not quite right...
6) give up and format

Comment by cindy heer (Inactive) [ 30/Jan/12 ]

I have restarted testing on the ls1 filesystem on hyperion (large LUN testing with NetApp) with the fsck -n in place. I was never running fsck -p for this testing, but I did run the full fsck with -y in the past; that has been changed. I am also running the same testing on the ls3 filesystem on hyperion (large LUNs with DDN hardware), which has been running for about a week and has not had any failures so far. I continue to run IORs against the filesystems while simulating an unclean shutdown of an OSS every hour (by powering it off).

Comment by cindy heer (Inactive) [ 31/Jan/12 ]

I think I hit a bit of corruption again on the large LUN with NetApp (still no corruption exhibited with the DDN). Here is the output after the OSS is powered off to simulate an OSS crash (I run a journal-replay fsck.ldiskfs -p and then a full fsck with the -n flag):

ldev fsck.ldiskfs -p %d
ls1-OST0001: ls1-OST0001: clean, 337/22888320 files, 26023621/5859409133 blocks
ls1-OST0000: ls1-OST0000 contains a file system with errors, check forced.
ls1-OST0002: ls1-OST0002: clean, 340/22888320 files, 24807109/5859409133 blocks
ls1-OST0000: ls1-OST0000: Duplicate or bad block in use!
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769473: 4747954688
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546689: 4747954688
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546690: 4747984896 4747984897 4747984898
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546691: 4747988992 4747990016 4747990017
ls1-OST0000: ls1-OST0000: Multiply-claimed block(s) in inode 18546692: 4747986944 4747987968 4747987969
ls1-OST0000: ls1-OST0000: (There are 8 inodes containing multiply-claimed blocks.)
ls1-OST0000:
ls1-OST0000: ls1-OST0000: File /lost+found/#1769473 (inode #1769473, mod time Tue Jan 31 09:39:00 2012)
ls1-OST0000: has 1 multiply-claimed block(s), shared with 1 file(s):
ls1-OST0000: ls1-OST0000: /CONFIGS (inode #18546689, mod time Tue Jan 31 09:39:00 2012)
ls1-OST0000: ls1-OST0000:
ls1-OST0000:
ls1-OST0000: ls1-OST0000: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
ls1-OST0000: (i.e., without -a or -p options)
ldev: Fatal: parallel command execution failed

e-agb27: pfsck.ldiskfs /dev/dm-1 /dev/dm-2 /dev/dm-0 -- -f -n -v -t

e-agb27: fsck 1.41.90.1chaos (14-May-2011)
e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)
e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)

e-agb27: fsck.ldiskfs 1.41.90.1chaos (14-May-2011)
e-agb27: Pass 1: Checking inodes, blocks, and sizes

e-agb27: Pass 1: Checking inodes, blocks, and sizes
e-agb27: Pass 1: Checking inodes, blocks, and sizes

e-agb27: Pass 2: Checking directory structure

e-agb27: Pass 3: Checking directory connectivity
e-agb27: Pass 4: Checking reference counts

e-agb27: Pass 5: Checking group summary information

e-agb27: Pass 2: Checking directory structure

e-agb27: Pass 3: Checking directory connectivity
e-agb27: Pass 4: Checking reference counts

e-agb27:
e-agb27: Running additional passes to resolve blocks claimed by more than one inode...
e-agb27: Pass 1B: Rescanning for multiply-claimed blocks
e-agb27: Multiply-claimed block(s) in inode 1769473: 4747954688
e-agb27: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898
e-agb27: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017
e-agb27: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969
e-agb27: Pass 5: Checking group summary information
e-agb27: Multiply-claimed block(s) in inode 18546689: 4747954688
e-agb27: Multiply-claimed block(s) in inode 18546690: 4747984896 4747984897 4747984898
e-agb27: Multiply-claimed block(s) in inode 18546691: 4747988992 4747990016 4747990017
e-agb27: Multiply-claimed block(s) in inode 18546692: 4747986944 4747987968 4747987969
e-agb27: Pass 1C: Scanning directories for inodes with multiply-claimed blocks

e-agb27: Pass 1D: Reconciling multiply-claimed blocks
e-agb27:
e-agb27: 337 inodes used (0.00%)
e-agb27: 131 non-contiguous files (38.9%)
e-agb27: 0 non-contiguous directories (0.0%)
e-agb27: # of inodes with ind/dind/tind blocks: 0/0/0
e-agb27: Extent depth histogram: 178/149
e-agb27: 26023621 blocks used (0.44%)
e-agb27: 0 bad blocks
e-agb27: 19 large files
e-agb27:
e-agb27: 291 regular files
e-agb27: 37 directories
e-agb27: 0 character device files
e-agb27: 0 block device files
e-agb27: 0 fifos
e-agb27: 0 links
e-agb27: 0 symbolic links (0 fast symbolic links)
e-agb27: 0 sockets
e-agb27: --------
e-agb27: 328 files
e-agb27: Memory used: 43816k/715264k (16113k/27704k), time: 7.44/ 6.65/ 0.75
e-agb27: I/O read: 29MB, write: 0MB, rate: 3.90MB/s

e-agb27:
e-agb27: 340 inodes used (0.00%)
e-agb27: 132 non-contiguous files (38.8%)
e-agb27: 0 non-contiguous directories (0.0%)
e-agb27: # of inodes with ind/dind/tind blocks: 0/0/0
e-agb27: Extent depth histogram: 180/150
e-agb27: 24807109 blocks used (0.42%)
e-agb27: 0 bad blocks
e-agb27: 18 large files
e-agb27:
e-agb27: 294 regular files
e-agb27: 37 directories
e-agb27: 0 character device files
e-agb27: 0 block device files
e-agb27: 0 fifos
e-agb27: 0 links
e-agb27: 0 symbolic links (0 fast symbolic links)
e-agb27: 0 sockets
e-agb27: --------
e-agb27: 331 files
e-agb27: Memory used: 43816k/715264k (16113k/27704k), time: 15.80/ 6.73/ 0.76
e-agb27: I/O read: 30MB, write: 0MB, rate: 1.90MB/s

e-agb27: (There are 8 inodes containing multiply-claimed blocks.)
e-agb27:
e-agb27: File /lost+found/#1769473 (inode #1769473, mod time Tue Jan 31 09:39:00 2012)
e-agb27: has 1 multiply-claimed block(s), shared with 1 file(s):
e-agb27: /CONFIGS (inode #18546689, mod time Tue Jan 31 09:39:00 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File /lost+found/#1769474 (inode #1769474, mod time Wed Jan 11 12:54:06 2012)
e-agb27: has 3 multiply-claimed block(s), shared with 1 file(s):
e-agb27: /???/mountdata (inode #18546690, mod time Wed Jan 11 12:54:06 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File ... (inode #1769475, mod time Tue Jan 31 09:39:00 2012)
e-agb27: has 3 multiply-claimed block(s), shared with 1 file(s):
e-agb27: /???/ls1-OST0000 (inode #18546691, mod time Tue Jan 31 09:39:00 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File /lost+found/#1769476 (inode #1769476, mod time Sat Jan 14 16:23:53 2012)
e-agb27: has 3 multiply-claimed block(s), shared with 1 file(s):
e-agb27: ... (inode #18546692, mod time Sat Jan 14 16:23:53 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File /CONFIGS (inode #18546689, mod time Tue Jan 31 09:39:00 2012)
e-agb27: has 1 multiply-claimed block(s), shared with 1 file(s):
e-agb27: /lost+found/#1769473 (inode #1769473, mod time Tue Jan 31 09:39:00 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File /???/mountdata (inode #18546690, mod time Wed Jan 11 12:54:06 2012)
e-agb27: has 3 multiply-claimed block(s), shared with 1 file(s):
e-agb27: /lost+found/#1769474 (inode #1769474, mod time Wed Jan 11 12:54:06 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File /???/ls1-OST0000 (inode #18546691, mod time Tue Jan 31 09:39:00 2012)
e-agb27: has 3 multiply-claimed block(s), shared with 1 file(s):
e-agb27: ... (inode #1769475, mod time Tue Jan 31 09:39:00 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: File ... (inode #18546692, mod time Sat Jan 14 16:23:53 2012)
e-agb27: has 3 multiply-claimed block(s), shared with 1 file(s):
e-agb27: /lost+found/#1769476 (inode #1769476, mod time Sat Jan 14 16:23:53 2012)
e-agb27: Clone multiply-claimed blocks? no
e-agb27:
e-agb27: Delete file? no
e-agb27:
e-agb27: Pass 2: Checking directory structure
e-agb27: Invalid inode number for '.' in directory inode 1769473.
e-agb27: Fix? no
e-agb27:
e-agb27: Pass 3: Checking directory connectivity
e-agb27: '..' in /lost+found/#1769473 (1769473) is / (2), should be /lost+found (11).
e-agb27: Fix? no
e-agb27:
e-agb27: Pass 4: Checking reference counts
e-agb27: Inode 2 ref count is 5, should be 6. Fix? no
e-agb27:
e-agb27: Inode 11 ref count is 3, should be 2. Fix? no
e-agb27:
e-agb27: Inode 1769473 ref count is 2, should be 1. Fix? no
e-agb27:
e-agb27: Inode 1769474 ref count is 2, should be 1. Fix? no
e-agb27:
e-agb27: Unattached inode 1769475
e-agb27: Connect to /lost+found? no
e-agb27:
e-agb27: Inode 18546689 ref count is 2, should be 3. Fix? no
e-agb27:
e-agb27: Inode 18546691 ref count is 1, should be 2. Fix? no
e-agb27:
e-agb27: Unattached inode 18546692
e-agb27: Connect to /lost+found? no
e-agb27:
e-agb27: Unattached zero-length inode 18546693. Clear? no
e-agb27:
e-agb27: Unattached inode 18546693
e-agb27: Connect to /lost+found? no
e-agb27:
e-agb27: Inode 18546695 ref count is 1, should be 2. Fix? no
e-agb27:
e-agb27: Pass 5: Checking group summary information

e-agb27: Block bitmap differences: (63856391) -(453017600--453017602) -(453018113--453018114) -(453018626--453018627) -453019136 -453019648 -(453020162--453020416) -(453020672--453020673) -453021187 -453021696 -(453022720--453022721)
e-agb27: Fix? no
e-agb27:
e-agb27: Free blocks count wrong for group #13825 (32768, counted=32498).
e-agb27: Fix? no
e-agb27:
e-agb27: Free blocks count wrong (5835336736, counted=5835336466).
e-agb27: Fix? no
e-agb27:
e-agb27: Inode bitmap differences: -1769477 -1769479
e-agb27: Fix? no
e-agb27:
e-agb27: Free inodes count wrong for group #13824 (125, counted=122).
e-agb27: Fix? no
e-agb27:
e-agb27: Free inodes count wrong (22887977, counted=22887974).
e-agb27: Fix? no
e-agb27:
e-agb27:
e-agb27: ls1-OST0000: ********** WARNING: Filesystem still has errors **********
e-agb27:
e-agb27:
e-agb27: 343 inodes used (0.00%)
e-agb27: 135 non-contiguous files (39.4%)
e-agb27: 0 non-contiguous directories (0.0%)
e-agb27: # of inodes with ind/dind/tind blocks: 0/0/0
e-agb27: Extent depth histogram: 184/150
e-agb27: 24072397 blocks used (0.41%)
e-agb27: 0 bad blocks
e-agb27: 18 large files
e-agb27:
e-agb27: 297 regular files
e-agb27: 38 directories
e-agb27: 0 character device files
e-agb27: 0 block device files
e-agb27: 0 fifos
e-agb27: 3 links
e-agb27: 0 symbolic links (0 fast symbolic links)
e-agb27: 0 sockets
e-agb27: --------
e-agb27: 335 files
e-agb27: Memory used: 46616k/1430528k (16114k/30503k), time: 142.55/129.36/ 1.75
e-agb27: I/O read: 721MB, write: 0MB, rate: 5.06MB/s
e-agb27: FAILED: pfsck.ldiskfs -- -f -n -v -t: 0

Comment by cindy heer (Inactive) [ 31/Jan/12 ]

I will leave the oss (hyerion-agb27) down for further examination unless I hear otherwise.

Comment by Andreas Dilger [ 03/Feb/12 ]

Cindy, thanks for posting the e2fsck output. Looking at the blocks that are duplicate allocated:

e-agb27: Multiply-claimed block(s) in inode 1769473: 4747954688
e-agb27: Multiply-claimed block(s) in inode 1769474: 4747984896 4747984897 4747984898
e-agb27: Multiply-claimed block(s) in inode 1769475: 4747988992 4747990016 4747990017
e-agb27: Multiply-claimed block(s) in inode 1769476: 4747986944 4747987968 4747987969

These are all beyond the 2^32 block limit (4747986944 = 0x11b008800), so it may be that this problem relates to overflow of 32-bit block numbers somewhere in the IO stack. If you have the logs from the previous e2fsck runs that showed corruption, it would be useful to know whether the type of corruption is always the same or not (attaching a few e2fsck logs would be useful). I have a suspicion based on this one log that it may relate to corruption of the block bitmap during the previous journal recovery or e2fsck (possibly due to 2^32-block truncation) which later causes the blocks to be allocated twice.
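
As a quick illustration, a shell arithmetic check of those block numbers against the 2^32 boundary (with 4KB blocks, 2^32 blocks corresponds to the 16TB mark):

for blk in 4747954688 4747984896 4747988992 4747986944; do
    # print each block number in hex and how far past the 2^32-block (16TB) line it sits
    printf '%u = 0x%x, %d blocks past 2^32\n' "$blk" "$blk" $(( blk - 2**32 ))
done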

Several things can be done to begin debugging this, possibly in parallel:

  • Verify IO integrity for > 16TB LUNs:
      • take a > 16TB OST out of service
      • mount it directly with ldiskfs
      • run "llverfs -v -l -c 32 {mountpoint}" against the mounted OST filesystem
      • run a full e2fsck to determine if there is any corruption present
  • Run the current test, but dump the journal contents before mounting or running e2fsck on the filesystem (a wrapper sketch appears at the end of this comment):
      • dumpe2fs $ostdev > $logdir/$ostdev.$(date +%Y%m%d).stats
      • debugfs -c -R "dump <8>" $ostdev > $logdir/$ostdev.$(date +%y%m%d).journal
      • debugfs -c -R "logdump -a" $ostdev > $logdir/$ostdev.$(date +%Y%m%d).logdump
      • e2fsck -fp $ostdev 2>&1 | tee $logdir/$ostdev.$(date +%Y%m%d).e2fsck
      • dumpe2fs $ostdev > $logdir/$ostdev.$(date +%Y%m%d).stats.post
  • Format some OSTs to be less than 16TB and try to reproduce the problem again on the NetApp hardware.

llverfs is a non-destructive test that will write files until the device is full and then read them back and verify the data contents. It is intended to catch errors in the IO path (filesystem, block layer, HBA, driver, controller) related to 32-bit address truncation, but it is not very fast (may take a couple of days, depending on LUN size and IO rate). If this finds problems, there is a lower-level (filesystem destructive) "llverdev" tool that will run against the underlying block device and do a full write/read/verify cycle on the device, excluding the filesystem. I don't have high hopes for this finding a problem, but it is useful to eliminate the chance that there are obvious bugs in the IO stack. We ran full llverdev and llverfs tests previously with a DDN SFA10kE + RHEL5 on a 128TB LUN without problems, but there may be problems with RHEL6, the driver, the controller, etc. in your environment.

The debugfs/e2fsck commands are intended to catch (or at least give us some chance to find post-facto) errors in the journal replay and/or e2fsck that are incorrectly marking blocks free, which are later being reallocated. Also, what version of e2fsprogs is e2fsck-1.41.90-1chaos based on? I now recall after working on this bug that there may have been some 64-bit bugs fixed in e2fsck that could potentially be causing problems as well.
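
For reference, a minimal wrapper for the capture steps listed above (a hedged sketch; $ostdev and $logdir are placeholders and the commands mirror the list as given):

#!/bin/bash
# Capture journal and filesystem state around the post-crash e2fsck run; run before remounting the OST.
ostdev=${1:?usage: $0 <ostdev> <logdir>}
logdir=${2:?usage: $0 <ostdev> <logdir>}
tag=$(basename "$ostdev").$(date +%Y%m%d)

dumpe2fs "$ostdev" > "$logdir/$tag.stats"                       # pre-recovery superblock/group stats
debugfs -c -R "dump <8>" "$ostdev" > "$logdir/$tag.journal"     # journal inode contents
debugfs -c -R "logdump -a" "$ostdev" > "$logdir/$tag.logdump"   # decoded journal transactions
e2fsck -fp "$ostdev" 2>&1 | tee "$logdir/$tag.e2fsck"           # journal replay + preen check
dumpe2fs "$ostdev" > "$logdir/$tag.stats.post"                  # post-recovery stats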

Comment by D. Marc Stearman (Inactive) [ 06/Feb/12 ]

Andreas, thanks for the excellent analysis. We have been running this same test to isolate the extent of the corruption. We see this behavior on the 22TB luns on the NetApp hardware, but not on a smaller 3TB partition created on similar luns. We also have not seen it on a DDN SFA10K with 16TB luns, but are reconfiguring the DDN to have luns > 16TB to see if we can reproduce it on that hardware.

Whamcloud has access to Hyperion, so please coordinate with the Hyperion team to reserve some hardware, and your folks can run the tests you describe above.

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

Setup with 4 OSTs

/dev/sda 12T 39G 12T 1% /p/osta
/dev/sdb 22T 39G 21T 1% /p/ostb
/dev/sdc 10T 39G 9.5T 1% /p/ostc
/dev/sdd 22T 39G 21T 1% /p/ostd

Ran llverfs per Andreas on /dev/sdd - no issues.
Mounted all as Lustre FS, ran IOR from 3 clients, power cycled e-agb26 in the middle of the run,
replicated the failure. Captured before/after data in a script per Andreas's instructions, attached.

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

Multiple files for size reason

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

Before run

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

small files

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

before run

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

after fail

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

after fail

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

Ran test on all disks, only sdd showed failure (so far)

Comment by Cliff White (Inactive) [ 05/Mar/12 ]

Repeated test, sdb (2nd 21TB LUN) failed this time. Seems easy to repeat, awaiting further requests.

Comment by Andreas Dilger [ 06/Mar/12 ]

Cliff, some questions:

  • presumably the 4 LUNs (12 TB, 22TB, 10TB, 22TB) are on the same node?
  • you are not doing failover between multiple OSS nodes, but just rebooting the node in-place?
  • how many times did you run the test, and how many failures? Two tests and two failures?
  • this is running on the NetApp hardware for both the under 16TB and over 16TB LUNs?

Hitting three failures on the > 16TB LUNs is a fairly good indication that the problem is limited to > 16TB LUN support, and not to the NetApp itself (assuming the < 16TB LUNs are also on the NetApp). One option would be to run IOR directly on the OSS node against one of the > 16TB LUNs mounted with "-t ldiskfs" instead of "-t lustre", then do a similar hard reset + e2fsck (and other debugging, which is hopefully in a script by now). This would let us see whether this is a problem in the lustre/obdfilter/fsfilt code or in the core ldiskfs code. If this local testing still fails with ldiskfs, then it would be useful to test with ext4 to determine if the problem is in the base ext4 code.

Looking at the debug logs from the first test, it does appear that the corruption is of the block bitmap above the 16TB mark, allowing writes to reallocate those blocks. It may be in ldiskfs or ext4, so starting the above testing would also help cut down the number of variables.
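
A hedged sketch of that local test (device name, mount point, and IOR parameters are illustrative only; the point is to write through ldiskfs with no Lustre layers, hard-reset the node, then check read-only):

modprobe ldiskfs
mount -t ldiskfs /dev/sdd /mnt/ostd_local
# drive local writes until the node is power-cycled externally
IOR -a POSIX -w -k -b 4g -t 1m -o /mnt/ostd_local/ior.data
# ...after the hard reset and reboot:
e2fsck -fn /dev/sdd 2>&1 | tee /tmp/sdd.$(date +%Y%m%d).e2fsck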

Comment by Cliff White (Inactive) [ 06/Mar/12 ]

Per LLNL: all disks are from NetApp "E5400 RBODs". All 4 LUNs are attached to a single OSS node.
I did not do failover, as there was nothing to fail over to; I performed a hard power cycle using powerman.
The test was run twice, and the debug data script was run prior to the first test. Two tests, two failures.

Again, afaik these LUNs are all furnished by the same NetApp device.

I will perform the local IOR test and report results.

Comment by Cliff White (Inactive) [ 06/Mar/12 ]

Ran the test using IOR on a local ldiskfs mount. Failed. Results attached.
Reformatted all LUNs as ext4 and re-ran the test twice - no failures. The second time, I ran IOR in a loop against all four drives (in sequence) for one hour prior to the power cycle.

Comment by Cliff White (Inactive) [ 06/Mar/12 ]

Failure with local ldiskfs mount.

Comment by Cliff White (Inactive) [ 06/Mar/12 ]

Options for Lustre format:

mkfs.lustre --reformat --ost --fsname lu1015 --mgsnode=192.168.120.25@o2ib --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd$i &

Comment by Andreas Dilger [ 06/Mar/12 ]

Cliff, is this running the LLNL ldiskfs RPM, or the ldiskfs from the Lustre tree? It would be good to run vanilla ext4 powerfail tests a couple of extra times to more positively verify that the bug is not present with ext4, since we know that it is definitely still there for ldiskfs.

Comment by Cliff White (Inactive) [ 06/Mar/12 ]

rpm -qa |grep disk
lustre-ldiskfs-3.3.0-2.6.32_220.4.2.el6_lustre.gddd1a7c.x86_64_g0203c14.x86_64
lustre-ldiskfs-debuginfo-3.3.0-2.6.32_220.4.2.el6_lustre.gddd1a7c.x86_64_g0203c14.x86_64

I will continue running the powerfail test on ext4.

Comment by Cliff White (Inactive) [ 07/Mar/12 ]

Somewhat of a head-scratcher at the moment. I wanted to be certain that only the large disks were seeing errors when formatted as ldiskfs, so last night I ran a test writing to all four disks mounted as ldiskfs in a loop (short IOR iterations), creating a new file each loop.
I let the test run for one hour before doing the power cycle; previously I had power cycled within 10 minutes of test start.
First reboot - no disks! I restarted ibsrp via /etc/init.d/ibsrp - no errors found.
Second reboot as a test: I restarted ibsrp by hand prior to running the fsck script, and again no errors were found.
Not sure what to make of this outcome.

Today I will re-run the ext4 tests; so far no errors there.

Comment by Cliff White (Inactive) [ 08/Mar/12 ]

I have continued running the local (ext4 and ldiskfs) tests, but have not had a failure in two days.
I am uncertain what, if anything, has changed.

Comment by Andreas Dilger [ 09/Mar/12 ]

I thought of another possible way to positively exclude the NetApp from the picture here:

Use LVM to create PVs and a volume group on the 22TB LUNs, something like (from memory, please check man pages):

pvcreate /dev/sdb /dev/sdd
vgcreate vgtest /dev/sdb /dev/sdd

Create a 32TB LV that is using only the first 16TB of these two LUNs:

lvcreate -n lvtest_lo -L 16T vgtest /dev/sdb
lvextend -L +16T /dev/vgtest/lvtest_lo /dev/sdd

Create a 12TB LV that is using only the last 6TB of these two LUNs (the size may need to be massaged to consume the rest of the space on both /dev/sdb and /dev/sdd):

lvcreate -n lvtest_hi -L 6T vgtest /dev/sdb
lvextend -L +6T /dev/vgtest/lvtest_hi /dev/sdd

This will create "lvtest_lo" on storage space that uses only NetApp blocks below the 16TB mark, while the ldiskfs filesystem on it is larger than 16TB. Conversely, "lvtest_hi" is smaller than 16TB, but uses NetApp blocks above the 16TB limit.

Run the original Lustre IOR test against all 4 LUNs. If "lvtest_lo" hits the problem, then it is positively caused by Lustre or ldiskfs or ext4. If "lvtest_hi" hits the problem, then it is positively a problem in the NetApp, because the ldiskfs filesystem is smaller than 16TB (which didn't hit any failure before). If it hits on /dev/sda or /dev/sdc then we are confused (it could be either again), but I hope not.
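
One way to confirm the extent placement after creating the LVs (hedged; VG/LV names as in the example above) is to check the segment maps:

# lvtest_lo should map only to physical extents in the first 16TB of each disk,
# lvtest_hi only to extents above that point.
lvdisplay -m /dev/vgtest/lvtest_lo /dev/vgtest/lvtest_hi
pvdisplay -m /dev/sdb /dev/sdd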

Comment by Cliff White (Inactive) [ 09/Mar/12 ]

Here's the new setup for OSS, will report.
— Logical volume —
LV Name /dev/vglu1015/lv1015_lo
VG Name vglu1015
LV UUID sgpofs-dO8q-nGd1-yNlj-Vgah-L1NT-cph17Y
LV Write Access read/write
LV Status available

  1. open 0
    LV Size 32.00 TiB
    Current LE 8388608
    Segments 2
    Allocation inherit
    Read ahead sectors auto
  • currently set to 256
    Block device 253:0

— Segments —
Logical extent 0 to 4194303:
Type linear
Physical volume /dev/sdb
Physical extents 0 to 4194303

Logical extent 4194304 to 8388607:
Type linear
Physical volume /dev/sdd
Physical extents 0 to 4194303

— Logical volume —
LV Name /dev/vglu1015/lv1015_hi
VG Name vglu1015
LV UUID qTWfgz-fr1M-Gss2-0WmE-EJfH-0is2-bTVKGZ
LV Write Access read/write
LV Status available

  1. open 0
    LV Size 11.60 TiB
    Current LE 3040872
    Segments 2
    Allocation inherit
    Read ahead sectors auto
  • currently set to 256
    Block device 253:1

— Segments —
Logical extent 0 to 1520435:
Type linear
Physical volume /dev/sdb
Physical extents 4194304 to 5714739

Logical extent 1520436 to 3040871:
Type linear
Physical volume /dev/sdd
Physical extents 4194304 to 5714739

Comment by Cliff White (Inactive) [ 09/Mar/12 ]

I am not certain this is going to uncover errors, as the performance through LVM is about 1/10 of the native performance. Any ideas for tuning this setup for better numbers?

Comment by Andreas Dilger [ 09/Mar/12 ]

That is probably due to seeking between the LUNs and the LV stripes across them. I didn't think the test took so long to run; just start testing and then reboot.

If it needs to be faster there are less "good" tests that could be run.

For example, run on only the hi or lo LVs at one time, alternating, and then see which one fails. That would take twice as long, but not 10x as long. We would need to run several times to be confident only one config is failing.

Comment by Cliff White (Inactive) [ 11/Mar/12 ]

The current issue is that none of the configs are failing; I was running longer in hopes of generating a failure. A failure has not occurred since last Tuesday on any configuration, so I am currently quite puzzled. There have been hardware changes during this period; is it possible that this was a hardware issue that has been fixed by the LLNL controller changes?

Comment by Cliff White (Inactive) [ 11/Mar/12 ]

To clarify, my concern is not the length of the tests. Rather, with 1/10 the IO rate, my concern is that we are no longer driving the hardware very hard; if this issue is related to the speed or volume of IO at the device level, we won't be able to replicate it.

Comment by Andreas Dilger [ 11/Mar/12 ]

Running on either the hi or lo LVs at one time should get the IO rate to the one LUN back to the original level. If that does not return the symptoms again, then we need to go back to the full LUN testing without LVM to verify that the problem can still be hit.

Comment by Cliff White (Inactive) [ 22/May/12 ]

Returned to the simpler setup; using straight ldiskfs I was able to re-create errors on the >21TB OST. Data attached.
Will now try with local IOR on ext4 to see if the problem can be isolated to the ldiskfs code.

Comment by Cliff White (Inactive) [ 22/May/12 ]

I have reformatted with ext4, running IOR locally, and have had one failure; results attached.

Comment by Andreas Dilger [ 22/May/12 ]

Cliff, over the weekend there was a posting on the linux-ext4 list with an e2fsck patch that may resolve this problem. It seems that the root of the problem is in e2fsck itself, not ldiskfs or ext4, but is only seen if there are blocks in the journal to be recovered beyond 16TB, which is why it didn't show up regularly in testing.

The posted patch is larger, since it also fixes some further 64-bit block number problems on 32-bit systems, but the gist of the patch is below.

From 3b693d0b03569795d04920a04a0a21e5f64ffedc Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Mon, 21 May 2012 21:30:45 -0400
Subject: [PATCH] e2fsck: fix 64-bit journal support

64-bit journal support was broken; we weren't using the high bits from
the journal descriptor blocks in some cases!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
e2fsck/jfs_user.h |    4 ++--
e2fsck/journal.c  |   33 +++++++++++++++++----------------
e2fsck/recovery.c |   25 ++++++++++++-------------
3 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/e2fsck/jfs_user.h b/e2fsck/jfs_user.h
index 9e33306..92f8ae2 100644
--- a/e2fsck/jfs_user.h
+++ b/e2fsck/jfs_user.h
@@ -18,7 +18,7 @@ struct buffer_head {
 	e2fsck_t	b_ctx;
 	io_channel 	b_io;
 	int	 	b_size;
-	blk_t	 	b_blocknr;
+	unsigned long long b_blocknr;
 	int	 	b_dirty;
 	int	 	b_uptodate;
 	int	 	b_err;
diff --git a/e2fsck/recovery.c b/e2fsck/recovery.c
index b669941..e94ef4e 100644
--- a/e2fsck/recovery.c
+++ b/e2fsck/recovery.c
@@ -309,7 +309,6 @@ int journal_skip_recovery(journal_t *journal)
 	return err;
 }
 
-#if 0
 static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
 {
 	unsigned long long block = be32_to_cpu(tag->t_blocknr);
@@ -317,7 +316,6 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
 		block |= (__u64)be32_to_cpu(tag->t_blocknr_high) << 32;
 	return block;
 }
-#endif
 
/*
 * calc_chksums calculates the checksums for the blocks described in the
 * descriptor block.
@@ -506,7 +504,8 @@ static int do_one_pass(journal_t *journal,
 					unsigned long blocknr;
 
 					J_ASSERT(obh != NULL);
-					blocknr = be32_to_cpu(tag->t_blocknr);
+					blocknr = read_tag_block(tag_bytes,
+								 tag);
 
 					/* If the block has been
 					 * revoked, then we're all done
Comment by Andreas Dilger [ 23/May/12 ]

Bumping priority on this for tracking. It is a bug in e2fsprogs, not Lustre, but making it a blocker ensures it will get continuous attention.

Comment by Andreas Dilger [ 31/May/12 ]

e2fsprogs-1.42.3.wc1 (tag v1.42.3.wc1 in git) has been built and packages are available for testing:

http://build.whamcloud.com/job/e2fsprogs-master/arch=x86_64,distro=el6/

Cliff, could you please give this a test (even better to run it in a loop) and see if it resolves the problem?
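
Before re-running, it may be worth confirming the new e2fsprogs is actually in use on the OSS (a trivial, hedged check; the expected version string comes from the build referenced above):

rpm -q e2fsprogs    # expect 1.42.3.wc1
e2fsck -V           # prints the e2fsck version and the ext2fs library version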

Comment by Cliff White (Inactive) [ 08/Jun/12 ]

Running with the latest e2fsprogs, one error was recovered; logs attached.

/dev/vglu1015/lv1015_hi: catastrophic mode - not reading inode or group bitmaps
lu1015-OST0000: recovering journal
lu1015-OST0000: Truncating orphaned inode 78643270 (uid=0, gid=0, mode=0100666, size=0)
lu1015-OST0000: Inode 78643270, i_size is 0, should be 16777216. FIXED.
lu1015-OST0000: 119/182453760 files (1.7% non-contiguous), 57592447/3113852928 blocks

Comment by Cliff White (Inactive) [ 08/Jun/12 ]

The file is lu1015.060812.tar.gz on the FTP site.

Comment by Andreas Dilger [ 11/Jun/12 ]

Cliff, how many runs did it take to hit this error?

I don't think this is related to the problem seen before. Truncating orphan inodes on recovery is normal behaviour when a file is in the middle of being truncated at crash time. It looks like this handling isn't tested very often and has a bug: the "Truncating orphaned inode" message means the inode should be truncated to size=0 bytes, but e2fsck then gets confused, detects that the file size is smaller than the allocated blocks, and resets the size to cover the allocated blocks. This should be filed and fixed separately.

Comment by Cliff White (Inactive) [ 11/Jun/12 ]

The error occurred on the second run. The system ran large-lun.sh successfully prior to this.

Comment by Andreas Dilger [ 11/Jun/12 ]

I've been able to reproduce this bug in vanilla e2fsck, and the problem exists only for large extent-mapped files that are being truncated at the time of a crash.

Comment by Andreas Dilger [ 11/Jun/12 ]

Problem is fixed in released e2fsprogs-1.42.3.wc1.
