[LU-3668] ldiskfs_check_descriptors: Block bitmap for group not in group Created: 30/Jul/13  Updated: 29/Mar/14  Resolved: 29/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Karl W Schulz (Inactive) Assignee: Andreas Dilger
Resolution: Won't Fix Votes: 0
Labels: patch
Environment:

Stampede: CentOS6

OSSes running the Whamcloud 2.1.6 distribution:

  • kernel-2.6.32-358.11.1.el6_lustre.x86_64
  • lustre-2.1.6-2.6.32_358.11.1.el6_lustre.x86_64.x86_64
  • lustre-ldiskfs-3.3.0-2.6.32_358.11.1.el6_lustre.x86_64.x86_64
  • e2fsprogs-1.42.7.wc1-7.el6.x86_64

Attachments: File md14_dumpe2fs.tar.gz    
Issue Links:
Related
is related to LU-14 live replacement of OST Resolved
Severity: 3
Rank (Obsolete): 9453

 Description   

Our $SCRATCH file system is down and we are unable to mount one OST because its group descriptors are reported as corrupted.

Symptoms:

(1) cannot mount as a normal Lustre filesystem
(2) cannot mount as ldiskfs either
(3) e2fsck reports an alarming number of issues

Scenario:

The OST is a RAID6 (8+2) config with external journals. At 18:06 yesterday, MD raid detected a disk error, evicted the failed disk, and started rebuilding the device with a hot spare. Before the rebuild finished, ldiskfs reported the error below and the device went read-only.

Jul 29 22:16:40 oss28 kernel: [547129.288298] LDISKFS-fs error (device md14): ldiskfs_lookup: deleted inode referenced: 2463495
Jul 29 22:16:40 oss28 kernel: [547129.298723] Aborting journal on device md24.
Jul 29 22:16:40 oss28 kernel: [547129.304211] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013176 commit error: 2
Jul 29 22:16:40 oss28 kernel: [547129.316134] LustreError: 17212:0:(obd.h:1615:obd_transno_commit_cb()) scratch-OST0124: transno 176013175 commit error: 2
Jul 29 22:16:40 oss28 kernel: [547129.316136] LDISKFS-fs error (device md14): ldiskfs_journal_start_sb: Detected aborted journal
Jul 29 22:16:40 oss28 kernel: [547129.316139] LDISKFS-fs (md14): Remounting filesystem read-only

The host was rebooted at 6am and we have been unable to mount the OST since. We would appreciate suggestions on the best approach (e2fsck, journal rebuilding, etc.) to recover this OST.

I will follow up with output from e2fsck -f -n, which is running now (attempting to use a backup superblock). Typical entries look as follows:

e2fsck 1.42.7.wc1 (12-Apr-2013)
Inode table for group 3536 is not in group. (block 103079215118)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 3538 is not in group. (block 107524506255360)
Relocate? no

Inode bitmap for group 3538 is not in group. (block 18446612162378989568)
Relocate? no

Inode table for group 3539 is not in group. (block 3439182177370112)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 3541 is not in group. (block 138784755704397824)
Relocate? no

Inode table for group 3542 is not in group. (block 7138029487521792000)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 3544 is not in group. (block 180388626432)
Relocate? no

Inode table for group 3545 is not in group. (block 25769803776)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 3547 is not in group. (block 346054104973312)
Relocate? no

Inode 503 has compression flag set on filesystem without compression support.
Clear? no

Inode 503 has INDEX_FL flag set but is not a directory.
Clear HTree index? no

HTREE directory inode 503 has an invalid root node.
Clear HTree index? no

HTREE directory inode 503 has an unsupported hash version (40)
Clear HTree index? no

HTREE directory inode 503 uses an incompatible htree root node flag.
Clear HTree index? no

HTREE directory inode 503 has a tree depth (16) which is too big
Clear HTree index? no

Inode 503, i_blocks is 842359139, should be 0. Fix? no

Inode 504 is in use, but has dtime set. Fix? no

Inode 504 has imagic flag set. Clear? no

Inode 504 has a extra size (25649) which is invalid
Fix? no

Inode 504 has INDEX_FL flag set but is not a directory.
Clear HTree index? no

Inode 562 has INDEX_FL flag set but is not a directory.
Clear HTree index? no

HTREE directory inode 562 has an invalid root node.
Clear HTree index? no

HTREE directory inode 562 has an unsupported hash version (51)
Clear HTree index? no

HTREE directory inode 562 has a tree depth (59) which is too big
Clear HTree index? no

Inode 562, i_blocks is 828596838, should be 0. Fix? no

Inode 563 is in use, but has dtime set. Fix? no

Inode 563 has imagic flag set. Clear? no

Inode 563 has a extra size (12387) which is invalid
Fix? no

Block #623050609 (3039575950) causes file to be too big. IGNORED.
Block #623050610 (3038656474) causes file to be too big. IGNORED.
Block #623050611 (3037435566) causes file to be too big. IGNORED.
Block #623050612 (3035215768) causes file to be too big. IGNORED.
Block #623050613 (3031785159) causes file to be too big. IGNORED.
Block #623050614 (3027736066) causes file to be too big. IGNORED.
Block #623050615 (3019627313) causes file to be too big. IGNORED.
Block #623050616 (2970766533) causes file to be too big. IGNORED.
Block #623050617 (871157932) causes file to be too big. IGNORED.
Block #623050618 (879167937) causes file to be too big. IGNORED.
Block #623050619 (883249763) causes file to be too big. IGNORED.
Block #623050620 (885943218) causes file to be too big. IGNORED.
Too many illegal blocks in inode 1618.
Clear inode? no

Suppress messages? no



 Comments   
Comment by Andreas Dilger [ 30/Jul/13 ]

Have you tried running e2fsck with a backup group descriptor table, something like:

e2fsck -fn -b 32768 -B 4096 /dev/md14

For the -b argument, valid values include 32768, 98304, 163840, 229376, 294912, 819200, ... (32768 * (3,5,7)^n). If all of these report corrupt group descriptors then it is likely that the MD RAID rebuild has somehow built the parity of the disk incorrectly. If each of the descriptors reports different errors, it might be possible to combine them manually to get a full set of valid descriptors.
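As a concrete illustration, a minimal shell sketch for trying each backup in turn (assuming the 4 KB block size and /dev/md14 device from this report, and capturing the output so the errors reported against each backup can be compared):

for sb in 32768 98304 163840 229376 294912 819200 884736; do
    echo "=== e2fsck with backup superblock $sb ==="
    e2fsck -fn -b $sb -B 4096 /dev/md14
done 2>&1 | tee /tmp/e2fsck-backups.log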

Could you please also provide the output of dumpe2fs [-b 32768 -B 4096] -h /dev/md14, particularly if this is consistent across different -b values.

Comment by Andreas Dilger [ 30/Jul/13 ]

Note that it is also possible to mount the filesystem on the clients and deactivate this OST on the clients + MDS using:

lctl --device {device} deactivate

where {device} is either the device name (e.g. $fsname-OST0000-osc-MDT0000 on the MDS) or the device number as reported by lctl dl. Note that the device number will be different on the clients than on the MDS.

Access to existing files with objects on this OST will return EIO, but new files will not be allocated on it. This is typically only practical if the program's input can be read from a different filesystem.
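As a concrete sketch on the MDS (OST index 0x124 is taken from the console log above; confirm the exact OSC device name with lctl dl before running this):

lctl dl | grep OST0124                                # note the OSC device name/number for this OST
lctl --device scratch-OST0124-osc-MDT0000 deactivate

The same deactivation should then be repeated on each client, using the client-side device name that lctl dl reports there.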

Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ]

Yes, based on a post you made previously, we also tried -b values of 32768, 98304, 163840, 229376, 294912, 819200, and 884736. For values smaller than 884736, the first message we saw from fsck was of the form "block bitmap for group <x> is not in group". The snippet of e2fsck output pasted above is with -b 884736, and although the bad block bitmap is not the first error detected, it occurs shortly thereafter.

Here is the top of a standard fsck: e2fsck -f -n /dev/md14

# head -50 /tmp/fsck.log
e2fsck 1.42.7.wc1 (12-Apr-2013)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Block bitmap for group 2508 is not in group. (block 261993005056)
Relocate? no

Inode table for group 2536 is not in group. (block 261993005056)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 2546 is not in group. (block 3456555320082432)
Relocate? no

Inode bitmap for group 2546 is not in group. (block 18446612162378989568)
Relocate? no

Inode table for group 2547 is not in group. (block 3487607933632512)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 2549 is not in group. (block 10222520243247382528)
Relocate? no

Inode table for group 2550 is not in group. (block 9007199254740992)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 2552 is not in group. (block 30064771072)
Relocate? no

Inode table for group 2553 is not in group. (block 13108240187392)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 2555 is not in group. (block 1960356217880576)
Relocate? no

Inode bitmap for group 2555 is not in group. (block 18446612140904153088)
Relocate? no

Inode table for group 2556 is not in group. (block 3456551025115136)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

Block bitmap for group 2558 is not in group. (block 1051959948897943552)
Relocate? no

Inode table for group 2559 is not in group. (block 17592186044416)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no

I stopped at -b 884736, but will try higher values just in case. I will also upload the requested dumpe2fs output here shortly.

Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ]

Output of dumpe2fs with -B 4096 and alternate values for -b.

Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ]

Just documenting that there does not appear to be any appreciable improvement using alternative superblocks; it always shows "Block bitmap for group <x> is not in group".

I tried the following superblock values:

Primary superblock at 0, Group descriptors at 1-2795
Backup superblock at 32768, Group descriptors at 32769-35563
Backup superblock at 98304, Group descriptors at 98305-101099
Backup superblock at 163840, Group descriptors at 163841-166635
Backup superblock at 229376, Group descriptors at 229377-232171
Backup superblock at 294912, Group descriptors at 294913-297707
Backup superblock at 819200, Group descriptors at 819201-821995
Backup superblock at 884736, Group descriptors at 884737-887531
Backup superblock at 1605632, Group descriptors at 1605633-1608427
Backup superblock at 2654208, Group descriptors at 2654209-2657003
Backup superblock at 4096000, Group descriptors at 4096001-4098795
Backup superblock at 7962624, Group descriptors at 7962625-7965419
Backup superblock at 11239424, Group descriptors at 11239425-11242219
Backup superblock at 20480000, Group descriptors at 20480001-20482795
Backup superblock at 23887872, Group descriptors at 23887873-23890667
Backup superblock at 71663616, Group descriptors at 71663617-71666411
Backup superblock at 78675968, Group descriptors at 78675969-78678763
Backup superblock at 102400000, Group descriptors at 102400001-102402795
Backup superblock at 214990848, Group descriptors at 214990849-214993643
Backup superblock at 512000000, Group descriptors at 512000001-512002795
Backup superblock at 550731776, Group descriptors at 550731777-550734571
Backup superblock at 644972544, Group descriptors at 644972545-644975339
Backup superblock at 1934917632, Group descriptors at 1934917633-1934920427

If I go to the next value of -b 2560000000 it states that the superblock cannot be read.

Comment by Andreas Dilger [ 30/Jul/13 ]

In theory there should still be backup superblocks + group descriptors at 3855122432 and 5804752896, which are within the 5860530816-block filesystem.

That said, at this point I'm concerned that the whole OST has somehow been corrupted by improper RAID parity reconstruction or similar. Corruption in all of the group descriptor copies, spread across the whole filesystem, implies that even if we were able to manually rebuild the descriptor table from the good blocks in the various backups, the file data is likely to be equally corrupted.

In your most recent e2fsck output (9:16 am) it appears that, for the primary group descriptor table, descriptor block #39 (filesystem block 40) is corrupt: with 64-byte descriptors, groups 2508 through 2559 all fall into descriptor block 39 (2508 * 64 / 4096 = 2559 * 64 / 4096 = 39), plus 1 for the offset of the first GDT block in the filesystem. It would be possible to restore this one block from a backup descriptor block (e.g. 39 + 32769 = 32808), something like:

dd if=/dev/md14 of=/dev/md14 bs=4096 count=1 skip=32808 seek=40 conv=notrunc

This is only really practical to do if there are only one or two corrupt group descriptor blocks. It isn't clear to me if the above error messages are just a snippet of huge swaths of corruption in each group, or if there is only a single bad block in the ~2800 or so group descriptor blocks. In the latter case, there is some hope that the filesystem could at least be partially recovered. If there are many bad group descriptors in every backup it is likely there is an equal amount of corruption of the file data.
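If attempting this, it would be prudent to save copies of the blocks involved first, so the change can be compared and undone if needed; a minimal sketch using the same example block numbers (the output file names are arbitrary):

dd if=/dev/md14 of=/tmp/gdt-block-40.orig bs=4096 count=1 skip=40        # save the corrupt primary descriptor block
dd if=/dev/md14 of=/tmp/gdt-block-32808.bak bs=4096 count=1 skip=32808   # save the backup copy it would be restored from
cmp -l /tmp/gdt-block-40.orig /tmp/gdt-block-32808.bak | wc -l           # count how many bytes actually differ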

Comment by Tommy Minyard [ 30/Jul/13 ]

Thanks for the additional information, Andreas. If possible, could we set up a conference call this afternoon to discuss some options (I think Peter may have been trying to get this organized even though he is on vacation)? At this point, would it be better to go back to the RAID-6 device and try to start from there? We know which disk was the last one added. We can stop the array, start it in read-only mode without the last disk added, and see what e2fsck reports at that point.

Comment by James Nunez (Inactive) [ 30/Jul/13 ]

Tommy

We're looking into the problem and formulating next steps.

Comment by Andreas Dilger [ 30/Jul/13 ]

It might be possible to pull the new disk and run in degraded mode, to see if this allows the filesystem data to be read correctly. It may also be that the MD RAID rebuild has already written bad data to the parity blocks by this point; I'm not sure. At this point that is the only thing I can think of that is likely to allow this OST to be recovered.

Comment by Andreas Dilger [ 30/Jul/13 ]

I would also recommend deactivating this OST on the MDS so that the MDS does not try to modify it if (hopefully) it can be accessed and mounted with Lustre again. That would avoid allocating new objects on the OST and give us some time to figure out what to do next.

Comment by Tommy Minyard [ 30/Jul/13 ]

The OST is currently deactivated on the MDS; that was one of the first things we did this morning after finding the problem. I have also deactivated it on all client nodes in the cluster to prevent user tasks from hanging when trying to access a file that resides on that OST. I will talk with Karl and we will start testing with read-only assembly of the array to see if we can get it recovered.
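For reference, a rough sketch of the degraded, read-only assembly described here (the member device list is a placeholder; the real members must be taken from /proc/mdstat or mdadm --detail, omitting the freshly rebuilt spare):

mdadm --stop /dev/md14
mdadm --assemble --readonly --run /dev/md14 /dev/sd[a-i]1   # assemble 9 of 10 members, leaving out the new spare
e2fsck -fn /dev/md14                                        # read-only check of the degraded array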

Comment by Andreas Dilger [ 30/Jul/13 ]

This is the process to modify the /CONFIGS/mountdata file copied from OST0001 for OST0002, on my MythTV Lustre filesystem named "myth". I verified at the end that the generated "md2.bin" file was binary identical to the one that exists on OST0002 already.

# mount -t ldiskfs /dev/vgmyth/lvmythost1 /mnt/tmp # mount other OST as ldiskfs
# xxd /mnt/tmp/CONFIGS/mountdata > /tmp/md1.asc    # save mountdata for reference
# xxd /mnt/tmp/CONFIGS/mountdata > /tmp/md2.asc    # save another one for editing
# vi /tmp/md2.asc                                  # edit 0001 to 0002 in 3 places
# xxd -r /tmp/md2.asc > /tmp/md2.bin               # convert modified one to binary
# xxd /tmp/md2.bin > /tmp/md2.asc2                 # convert back to ASCII to verify
# diff -u /tmp/md1.asc /tmp/md2.asc2               # compare original and modified
--- /tmp/md1.asc  2013-07-30 15:40:12.201994814 -0600
+++ /tmp/md2.asc  2013-07-30 15:40:48.775245386 -0600
@@ -1,14 +1,14 @@
 0000000: 0100 d01d 0000 0000 0000 0000 0000 0000  ................
-0000010: 0300 0000 0200 0000 0100 0000 0100 0000  ................
+0000010: 0300 0000 0200 0000 0200 0000 0100 0000  ................
 0000020: 6d79 7468 0065 0000 0000 0000 0000 0000  myth.e..........
 0000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-0000060: 6d79 7468 2d4f 5354 3030 3031 0000 0000  myth-OST0001....
+0000060: 6d79 7468 2d4f 5354 3030 3032 0000 0000  myth-OST0002....
 0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-00000a0: 6d79 7468 2d4f 5354 3030 3031 5f55 5549  myth-OST0001_UUI
+00000a0: 6d79 7468 2d4f 5354 3030 3032 5f55 5549  myth-OST0002_UUI
 00000b0: 4400 0000 0000 0000 0000 0000 0000 0000  D...............
 00000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Peter, it might make sense to allow a mkfs.lustre formatting option to clear the LDD_F_VIRGIN flag so that this binary editing dance isn't needed, and the "new" OST will not try to register with the MGS.

Comment by Tommy Minyard [ 30/Jul/13 ]

One quick update: we stopped the array and restarted it without the spare drive that was added last night, so it is currently running with 9 of 10 drives. At this point the e2fsck output looks much better than before (see below). One question from our side: should we just let e2fsck use the default superblock, or should we specify one with the -b option? Also, should we be concerned about any of the errors e2fsck has reported so far? Most look like no major issue, except maybe the first one about the resize inode not being valid. The current e2fsck is not making any changes. Our plan now is to let this run and see how many errors it finds and, if it's not too bad, rerun it with the -p option to make some repairs. We will still need to add the 10th drive back in and let the array rebuild at some point, but right now we just want to make sure we have a valid MD array that will mount without error.

[root@oss28.stampede]# e2fsck -fn -B 4096 /dev/md14
e2fsck 1.42.7.wc1 (12-Apr-2013)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Resize inode not valid. Recreate? no

Pass 1: Checking inodes, blocks, and sizes
Inode 11468804 has an invalid extent node (blk 2936017803, lblk 393)
Clear? no

Inode 11468804, i_blocks is 8264, should be 5024. Fix? no

Inode 11534337 has an invalid extent node (blk 2952816317, lblk 764)
Clear? no

Inode 11534337, i_size is 4292608, should be 3129344. Fix? no

Inode 11534337, i_blocks is 8408, should be 6128. Fix? no

Inode 13092415 has an invalid extent node (blk 3523217944, lblk 0)
Clear? no

Inode 13092415, i_blocks is 544, should be 0. Fix? no

Inode 14291200 has an invalid extent node (blk 3526886078, lblk 0)
Clear? no

Inode 14291200, i_blocks is 2056, should be 0. Fix? no

Comment by Andreas Dilger [ 30/Jul/13 ]

It looks like e2fsck is already trying one of the backup group descriptor tables and is able to find a backup that doesn't have any problems, so I would just let it proceed with the one it finds. If the first reported problem is at inode 11468804, that is at least half-way through the filesystem (at 128 inodes per group, per the previous dumpe2fs output), and inode 14291200 is about 60% of the way through, so I suspect e2fsck should be able to recover the majority of the filesystem reasonably well.

It does make sense to allow e2fsck to progress for a while to verify that it isn't finding massive corruption later on, but from the snippet here it looks much better than before.

Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ]

Yes, it looks to be using -b 32768, as we can duplicate the results if we specify this value. Trying an actual fix now with e2fsck... fingers crossed.

Comment by Karl W Schulz (Inactive) [ 31/Jul/13 ]

Update: the number of e2fsck issues observed was more substantial when we ran with fixes enabled than in the previous run with "-n". However, it did complete after about 2 hours and allowed us to mount via ldiskfs. The mountdata and LAST_ID files looked reasonable and we were subsequently able to mount as a Lustre filesystem. We do have a small percentage of files in lost+found, and are going to leave this OST inactive on the MDS until the next maintenance, but it looks like we were able to recover the majority of the data in this case. Thanks for the help and suggestions today. We definitely have not seen anything quite like this before and are rebuilding the raidset with an alternate drive.

Comment by Andreas Dilger [ 31/Jul/13 ]

It is possible to recover the files in lost+found using the "ll_recover_lost_found_objs" tool. This will move the OST objects from lost+found back to their proper location, /O/0/d{objid % 32}/{objid}, using the information stored in the "fid" xattr of each inode. Any objects that are zero-length have likely never been accessed and could be deleted. This needs to be done with the OST mounted as ldiskfs (it will eventually be done automatically when the LFSCK Phase 2 project is completed).
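A rough sketch of that procedure (the mount point is a placeholder; check the installed tool's usage before running):

mount -t ldiskfs /dev/md14 /mnt/ost                   # mount the recovered OST as ldiskfs
ll_recover_lost_found_objs -v -d /mnt/ost/lost+found  # move objects back to O/0/d{objid % 32}/{objid}
umount /mnt/ost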

Comment by Andreas Dilger [ 31/Jul/13 ]

While I don't think any of the problems seen here relate to Lustre specifically, I'm going to leave this ticket open for implementation of "mkfs.lustre --replace --index=N", which just avoids setting the LDD_F_VIRGIN flag when formatting a new OST.
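For illustration, roughly what formatting a replacement OST with the proposed option might look like once it is implemented (the fsname, device, and index are taken from this ticket, while the MGS NID is a placeholder; until then, the mountdata-editing procedure above is the workaround):

mkfs.lustre --ost --replace --index=292 --fsname=scratch \
            --mgsnode=<MGS NID> /dev/md14    # index 292 == 0x124, i.e. OST0124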

Comment by Tommy Minyard [ 02/Aug/13 ]

Thanks for the information on the recovery of lost+found files; that is not something we have usually done in the past. We have a maintenance scheduled for next week and are planning to attempt the recovery at that time. One question has come up from a few users we contacted regarding their "lost" files: some have already copied the files we had planned to recover back from the tape library, and they want to make sure the ll_recover_lost_found_objs recovery will not overwrite the new files. How will the Lustre recovery behave if the previous file has been replaced with a new copy? My suspicion is that it will not overwrite the new file, but I just wanted to get your thoughts on this scenario.

Comment by Andreas Dilger [ 03/Aug/13 ]

This will only recover the OST objects from lost+found and will not touch any of the data or metadata. If the old file was deleted and restored, it will have a different MDS inode with different objects, and there will be no impact from running ll_recover_lost_found_objs. If, for some reason, there are objects with the same objid already in O/0/d*/ (e.g. some kind of manual recovery of OST objects was done), then the old objects will be left in lost+found.

Comment by Andreas Dilger [ 09/Aug/13 ]

Closing this as "Not a Bug" since the problem ended up being in the MD RAID/parity rebuilding.

I'm submitting a patch for "mkfs.lustre --replace" under LU-14, which is an existing ticket for OST replacement (which thankfully was not needed in this case).

Comment by Tommy Minyard [ 10/Aug/13 ]

Andreas, I know you just closed this yesterday, but we have now had the SAME exact sequence of events happen on a second OSS when it had a drive fail and the automatic spare rebuild started. Note that we upgraded to the 2.6.32-358.11.1 kernel provided in the Intel distribution for the Lustre 2.1.6 release on July 23rd, and now, after two drive failures, we have had the exact same sequence of events. We had plenty of drive failures and automatic rebuilds with our previous 2.1.5 install and its corresponding kernel (not sure of the version offhand). We have not seen other reports of this yet in our searching, but this has to be more than a coincidence. For now we have disabled automatic spare rebuilding on all OSSes.

Now the bad news: even after following our previous procedure of removing the drive that was most recently added and rebuilt onto, we cannot get e2fsck to complete on the RAID device. It ran overnight last night and grew to 27GB in memory, but it had not written anything to the screen for almost 12 hours before we gave up on it and restarted (it took only a few hours on the other array). The restarted run has fixed a few more errors but is steadily growing in memory usage again. From what we have found searching around, an fsck that uses a lot of memory typically points to pretty severe filesystem corruption. Any thoughts on this situation or suggestions for us to try?

Thanks,
Tommy

Comment by Peter Jones [ 10/Aug/13 ]

Tommy

This is very strange. I have reopened the ticket so we can continue to track this until we have a clearer picture about what is going on.

Peter

Comment by Tommy Minyard [ 10/Aug/13 ]

Thanks, Peter. The restarted fsck is still running and consuming more than 20GB of memory right now. It has made a few more repairs since my earlier comment, so it is still progressing. One thing to add: we know that rebuilds can be successful, as we did one manually with the last OST that suffered the RAID-6 corruption. In the two cases where a rebuild has caused problems so far, the two primary differences are that the rebuilds were kicked off automatically and the OST was still active on the MDS, allowing new files to be written to it.

I've been digging around looking for reported Linux RAID-6 issues, and there is a note of a rebuild issue on one page I found, but it indicated the problem had been fixed in 2.6.32 and later kernels.

Comment by Tommy Minyard [ 12/Aug/13 ]

So we have not had any success in getting fsck to run to completion on the corrupted OST. We let e2fsck run on the OSS until it ran out of memory, consuming 55GB, but it did not appear to be making much progress on the repairs. We are currently out of ideas for repair; if you have any further suggestions, please let us know ASAP. We can mount the OST as ldiskfs, but it looks like there is NO data actually on the filesystem under this mount point.

At this point, we think there will not be any way to recover the data, so we are working on the procedure to recreate the OST from scratch. Karl took some notes on how to put the replacement OST back where the old one was, and we'll follow the instructions in LU-14.

Comment by Peter Jones [ 12/Aug/13 ]

Tommy

Andreas is out this week but I did manage to connect with him to see whether he had any suggestions and he suggested trying to rebuild 2.1.6 against the older kernel to see whether that has any effect on this behaviour. I'm continuing to talk to other engineers with expertise in this area to see if there are any other thoughts.

Regards

Peter

Comment by Karl W Schulz (Inactive) [ 13/Aug/13 ]

We did take a quick stab at building the v2.1.6 release against our older kernel (and even against the 2.6.32-279.19.1.el6_lustre version that was supported with v2.1.5), but it looks like the newer ldiskfs patches for the RHEL6 series hit some conflicts, which prevents a build out of the box.

Consequently, we've decided to roll the servers back to v2.1.5 and the previous production OSS kernel (2.6.32-279.5.2.el6).

Comment by Andreas Dilger [ 20/Aug/13 ]

As Murphy would have it, you hit this problem again the day after I went on vacation for a week.

I looked through the Lustre 2.1.5->2.1.6 changes, and while there are some changes in the ldiskfs patches, these are mostly in patch context and not functional changes. I couldn't see any other significant changes to the OST code either.

On the kernel side, there were some changes to ext4 that appear related to freezing the filesystem for suspend/resume, and better block accounting for delayed allocation (which Lustre doesn't use). There is another change to optimize extent handling for fallocate() (which Lustre also doesn't use), but I can't see how it would relate to MD device failure/rebuilding. I'm not sure what changes went into the MD RAID code.

Do you still have logs for the e2fsck runs from the second OST failure? I can't imagine why it would be consuming so much memory, unless there was some strange corruption that e2fsck isn't expecting. It shouldn't be using more than about 5-6GB of RAM in the worst case. If you have logs it might be possible to reverse-engineer what type of corruption was seen. Presumably, you weren't able to recover anything from this OST in the end? Nothing in lost+found?

Comment by John Fuchs-Chesney (Inactive) [ 12/Mar/14 ]

Karl and Tommy,
Reading through this I think you may have moved on from this problem.
Do you want us to keep this issue open, or may we mark it as resolved?
Many thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 29/Mar/14 ]

Customer has apparently moved on from this issue.
~ jfc.
