[LU-3668] ldiskfs_check_descriptors: Block bitmap for group not in group Created: 30/Jul/13 Updated: 29/Mar/14 Resolved: 29/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Karl W Schulz (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch |
| Environment: |
Stampede: CentOS 6 OSSs running the Whamcloud 2.1.6 distribution:
|
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9453 |
| Description |
|
Our $SCRATCH file system is down and we are unable to mount an OST due to corrupted group descriptors being reported.

Symptoms: (1) cannot mount as normal lustre fs

Scenario: The OST is a RAID6 (8+2) config with external journals. At 18:06 yesterday, MD RAID detected a disk error, evicted the failed disk, and started rebuilding the device with a hot spare. Before the rebuild finished, ldiskfs reported the error below and the device went read-only.

Jul 29 22:16:40 oss28 kernel: [547129.288298] LDISKFS-fs error (device md14): ld

The host was rebooted at 6am and we have been unable to mount since. Would appreciate some suggestions on the best approach (e2fsck, journal rebuilding, etc.) to try and recover this OST. I will follow up with output from e2fsck -f -n, which is running now (attempting to use a backup superblock). Typical entries look as follows:

e2fsck 1.42.7.wc1 (12-Apr-2013)
Block bitmap for group 3538 is not in group. (block 107524506255360)
Inode bitmap for group 3538 is not in group. (block 18446612162378989568)
Inode table for group 3539 is not in group. (block 3439182177370112)
Block bitmap for group 3541 is not in group. (block 138784755704397824)
Inode table for group 3542 is not in group. (block 7138029487521792000)
Block bitmap for group 3544 is not in group. (block 180388626432)
Inode table for group 3545 is not in group. (block 25769803776)
Block bitmap for group 3547 is not in group. (block 346054104973312)
Inode 503 has compression flag set on filesystem without compression support.
Inode 503 has INDEX_FL flag set but is not a directory.
HTREE directory inode 503 has an invalid root node.
HTREE directory inode 503 has an unsupported hash version (40)
HTREE directory inode 503 uses an incompatible htree root node flag.
HTREE directory inode 503 has a tree depth (16) which is too big
Inode 503, i_blocks is 842359139, should be 0.  Fix? no
Inode 504 is in use, but has dtime set.  Fix? no
Inode 504 has imagic flag set.  Clear? no
Inode 504 has a extra size (25649) which is invalid
Inode 504 has INDEX_FL flag set but is not a directory.
Inode 562 has INDEX_FL flag set but is not a directory.
HTREE directory inode 562 has an invalid root node.
HTREE directory inode 562 has an unsupported hash version (51)
HTREE directory inode 562 has a tree depth (59) which is too big
Inode 562, i_blocks is 828596838, should be 0.  Fix? no
Inode 563 is in use, but has dtime set.  Fix? no
Inode 563 has imagic flag set.  Clear? no
Inode 563 has a extra size (12387) which is invalid
Block #623050609 (3039575950) causes file to be too big.  IGNORED.
Suppress messages? no |
| Comments |
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
Have you tried running e2fsck with a backup group descriptor table, something like:

e2fsck -fn -b 32768 -B 4096 /dev/md14

For the -b argument, valid values include 32768, 98304, 163840, 229376, 294912, 819200, ... (32768 * (3,5,7)^n). If all of these report corrupt group descriptors then it is likely that the MD RAID rebuild has somehow built the parity of the disk incorrectly. If each of the descriptors reports different errors, it might be possible to combine them manually to get a full set of valid descriptors.

Could you please also provide the output of dumpe2fs [-b 32768 -B 4096] -h /dev/md14, particularly if this is consistent across different -b values. |
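A minimal sketch of iterating over those backup superblock locations in read-only mode (the block numbers assume a 4096-byte block size, as in the command above; /dev/md14 is the device from this ticket):

for sb in 32768 98304 163840 229376 294912 819200 884736; do
    echo "=== e2fsck -fn -b $sb ==="
    e2fsck -fn -b $sb -B 4096 /dev/md14 2>&1 | head -20   # read-only, first few errors only
done

Comparing the first errors reported for each value shows whether every backup descriptor table is equally corrupt or whether some backups are still intact.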
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
Note that it is also possible to mount the filesystem on the clients and deactivate this OST on the clients + MDS using:

lctl --device {device} deactivate

Where {device} is either the device name (e.g. $fsname-OST0000-osc-MDT0000 on the MDS) or the device number as reported by lctl dl. Note that the device number will be different on the clients than on the MDS. Access to existing files using this OST will return EIO, but new files will not use it. This is typically only practical if the program input can be read from a different filesystem. |
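A hypothetical example of the lookup-and-deactivate sequence on the MDS (the filesystem name, OST index, and device number below are placeholders, not taken from this ticket):

lctl dl | grep osc                                     # find the OSC device name/number for the affected OST
lctl --device scratch-OST001c-osc-MDT0000 deactivate   # by name, or equivalently: lctl --device 14 deactivate

The same deactivation is then repeated on each client, using that client's device name or number from its own lctl dl output.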
| Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ] |
|
Yes, based on a post you made previously, we also tried -b values of 32768, 98304, 163840, 229376, 294912, 819200, and 884736. For values smaller than 884736, the first message we saw from fsck is of the form "block bitmap for group <x> is not in group". The snippet of e2fsck output pasted above is with -b 884736, and although the bad block bitmap is not the first error detected, it occurs shortly thereafter. Here is the top of a standard fsck:

e2fsck -f -n /dev/md14
Inode table for group 2536 is not in group. (block 261993005056)
Block bitmap for group 2546 is not in group. (block 3456555320082432)
Inode bitmap for group 2546 is not in group. (block 18446612162378989568)
Inode table for group 2547 is not in group. (block 3487607933632512)
Block bitmap for group 2549 is not in group. (block 10222520243247382528)
Inode table for group 2550 is not in group. (block 9007199254740992)
Block bitmap for group 2552 is not in group. (block 30064771072)
Inode table for group 2553 is not in group. (block 13108240187392)
Block bitmap for group 2555 is not in group. (block 1960356217880576)
Inode bitmap for group 2555 is not in group. (block 18446612140904153088)
Inode table for group 2556 is not in group. (block 3456551025115136)
Block bitmap for group 2558 is not in group. (block 1051959948897943552)
Inode table for group 2559 is not in group. (block 17592186044416)

I stopped at -b 884736, but will try higher values just in case. Also, I will upload the requested dumpe2fs output here shortly. |
| Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ] |
|
Output of dumpe2fs with -B 4096 and alternate values for -b. |
| Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ] |
|
Just documenting that there does not appear to be any appreciable improvement using alternative superblocks; it always shows "Block bitmap for group <x> is not in group". I tried the following superblock values:

Primary superblock at 0, Group descriptors at 1-2795

If I go to the next value of -b (2560000000) it states that the superblock cannot be read. |
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
In theory there should still be backup superblocks + group descriptors at 3855122432 and 5804752896, which are within the 5860530816-block filesystem.

That said, at this point I'm concerned that the whole OST has been corrupted somehow by improper RAID parity reconstruction or similar. Corruption in all of the group descriptors, spread across the whole filesystem, implies that even if we were able to manually rebuild the descriptor table from the good blocks in the various backups, the data itself is likely to be equally corrupted.

In your most recent e2fsck output (9:16 am) it appears that, for the primary group descriptor table, descriptor block #39 (filesystem block 40) is corrupt: the bad groups all map to the same descriptor block (2508 * 64 / 4096 = 2559 * 64 / 4096 = 39), and adding 1 for the offset of the first GDT block after the superblock gives filesystem block 40. It would be possible to restore this one block from a backup descriptor block (e.g. 39 + 32769 = 32808), something like:

dd if=/dev/md14 of=/dev/md14 bs=4096 count=1 skip=32808 seek=40 conv=notrunc

This is only really practical if there are only one or two corrupt group descriptor blocks. It isn't clear to me whether the above error messages are just a snippet of huge swaths of corruption in each group, or whether there is only a single bad block in the ~2800 or so group descriptor blocks. In the latter case, there is some hope that the filesystem could at least be partially recovered. If there are many bad group descriptors in every backup, it is likely there is an equal amount of corruption of the file data. |
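A small sketch restating that arithmetic, assuming the standard 4096-byte blocks and 64-byte group descriptors (i.e. 64 descriptors per GDT block); the group number is just the one from the example above:

GROUP=2508
GD_PER_BLK=$((4096 / 64))            # 64 group descriptors per 4 KB block
GDT_BLK=$((GROUP / GD_PER_BLK))      # -> 39; groups 2496-2559 share this descriptor block
PRIMARY=$((1 + GDT_BLK))             # primary GDT starts at block 1 -> filesystem block 40
BACKUP=$((32768 + 1 + GDT_BLK))      # backup GDT follows the superblock at 32768 -> block 32808
dd if=/dev/md14 of=/dev/md14 bs=4096 count=1 skip=$BACKUP seek=$PRIMARY conv=notrunc

As noted above, copying a backup descriptor block over the primary like this only makes sense once it is clear that just one or two descriptor blocks are bad.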
| Comment by Tommy Minyard [ 30/Jul/13 ] |
|
Thanks for the additional information, Andreas. If possible, could we set up a con-call this afternoon and discuss some options (I think Peter may have been trying to get this organized even though he is on vacation)? At this point, would it be better to go back to the RAID-6 device and try to start from there? We know which disk was the last one added. We can stop the array, start it in read-only mode without the last disk added and see what the array says at that time with e2fsck. |
| Comment by James Nunez (Inactive) [ 30/Jul/13 ] |
|
Tommy,

We're looking into the problem and formulating next steps. |
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
It might be possible to pull the new disk and run in degraded mode, to see if this allows the filesystem data to be read correctly. It may also be that the MD RAID rebuild has written bad data to the parity blocks by this point; I'm not sure. At this point that is the only thing I can think of that is likely to be able to recover this OST. |
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
I would also recommend deactivating this OST on the MDS so that the MDS does not try to modify it if (hopefully) it can be accessed and mounted with Lustre again. That would avoid allocating new objects on the OST and give us some time to figure out what to do next. |
| Comment by Tommy Minyard [ 30/Jul/13 ] |
|
The OST is currently deactivated on the MDS; that was one of the first things we did this morning after finding the problem. I have also deactivated it on all client nodes in the cluster to prevent user tasks from hanging when trying to access a file that resides on that OST. I will talk with Karl and we will start testing with read-only assembly of the array to see if we can get it recovered. |
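A hypothetical sketch of such a read-only, degraded assembly (the member device names are placeholders, not taken from this ticket; the newly added spare is simply left out of the list):

mdadm --stop /dev/md14
mdadm --assemble --readonly /dev/md14 /dev/sd[b-j]1    # 9 of the 10 members; may need --run/--force to start degraded
cat /proc/mdstat                                       # confirm the array is up, degraded, and read-only
e2fsck -fn -B 4096 /dev/md14                           # read-only check, makes no changes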
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
This is the process to modify the /CONFIGS/mountdata file copied from OST0001 for OST0002, on my MythTV Lustre filesystem named "myth". I verified at the end that the generated "md2.bin" file was binary identical to the one that already exists on OST0002.

# mount -t ldiskfs /dev/vgmyth/lvmythost1 /mnt/tmp   # mount other OST as ldiskfs
# xxd /mnt/tmp/CONFIGS/mountdata > /tmp/md1.asc      # save mountdata for reference
# xxd /mnt/tmp/CONFIGS/mountdata > /tmp/md2.asc      # save another copy for editing
# vi /tmp/md2.asc                                    # edit 0001 to 0002 in 3 places
# xxd -r /tmp/md2.asc > /tmp/md2.bin                 # convert modified one to binary
# xxd /tmp/md2.bin > /tmp/md2.asc2                   # convert back to ASCII to verify
# diff -u /tmp/md1.asc /tmp/md2.asc2                 # compare original and modified
--- /tmp/md1.asc        2013-07-30 15:40:12.201994814 -0600
+++ /tmp/md2.asc        2013-07-30 15:40:48.775245386 -0600
@@ -1,14 +1,14 @@
 0000000: 0100 d01d 0000 0000 0000 0000 0000 0000  ................
-0000010: 0300 0000 0200 0000 0100 0000 0100 0000  ................
+0000010: 0300 0000 0200 0000 0200 0000 0100 0000  ................
 0000020: 6d79 7468 0065 0000 0000 0000 0000 0000  myth.e..........
 0000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-0000060: 6d79 7468 2d4f 5354 3030 3031 0000 0000  myth-OST0001....
+0000060: 6d79 7468 2d4f 5354 3030 3032 0000 0000  myth-OST0002....
 0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-00000a0: 6d79 7468 2d4f 5354 3030 3031 5f55 5549  myth-OST0001_UUI
+00000a0: 6d79 7468 2d4f 5354 3030 3032 5f55 5549  myth-OST0002_UUI
 00000b0: 4400 0000 0000 0000 0000 0000 0000 0000  D...............
 00000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Peter, it might make sense to allow a mkfs.lustre formatting option to clear the LDD_F_VIRGIN flag so that this binary editing dance isn't needed and the "new" OST will not try to register with the MGS. |
| Comment by Tommy Minyard [ 30/Jul/13 ] |
|
One quick update: we stopped the array and restarted it without the spare drive that was added last night (running with 9 out of the 10 drives currently). At this point, the e2fsck output looks much better than before (see below). One question from our side: should we just let e2fsck use the default superblock, or should we specify one with the -b option? Also, should we be concerned about any of the errors that e2fsck reported initially? Most look like no major issue, except maybe the first one about the resize inode not being valid. The current e2fsck is not making any changes. Our plan now is to let this run, see how many errors it finds, and if not too bad, rerun it with the -p option to make some repairs. We will still need to add back in the 10th drive and let the array rebuild at some point, but right now we just want to make sure we have a valid MD array that will mount without error.

[root@oss28.stampede]# e2fsck -fn -B 4096 /dev/md14
Pass 1: Checking inodes, blocks, and sizes
Inode 11468804, i_blocks is 8264, should be 5024.  Fix? no
Inode 11534337 has an invalid extent node (blk 2952816317, lblk 764)
Inode 11534337, i_size is 4292608, should be 3129344.  Fix? no
Inode 11534337, i_blocks is 8408, should be 6128.  Fix? no
Inode 13092415 has an invalid extent node (blk 3523217944, lblk 0)
Inode 13092415, i_blocks is 544, should be 0.  Fix? no
Inode 14291200 has an invalid extent node (blk 3526886078, lblk 0)
Inode 14291200, i_blocks is 2056, should be 0.  Fix? no |
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
It looks like e2fsck is already trying one of the backup group descriptors and is able to find a backup that doesn't have any problems, so I would just let it proceed with the one it finds. If the first reported problem is at inode 11468804, that is at least half-way through the filesystem (at 128 inodes per group, per the previous dumpe2fs output), and inode 14291200 is about 60% of the way through, so I suspect e2fsck should be able to recover the majority of the filesystem reasonably well. It does make sense to let e2fsck run for a while to verify that it isn't finding massive corruption later on, but from the snippet here it looks much better than before. |
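Rough arithmetic behind that estimate, using numbers already in this ticket (5860530816 blocks, 128 inodes per group) and assuming the standard 32768 blocks per group implied by the backup superblock spacing:

echo $((5860530816 / 32768))   # ~178850 block groups in the filesystem
echo $((11468804 / 128))       # group 89600, about 50% of the way through
echo $((14291200 / 128))       # group 111650, about 62% of the way through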
| Comment by Karl W Schulz (Inactive) [ 30/Jul/13 ] |
|
Yes, it looks to be using -b 32768 as we can duplicate the results if we specify this value. Trying an actual fix now with e2fsck.....fingers crossed. |
| Comment by Karl W Schulz (Inactive) [ 31/Jul/13 ] |
|
Update: the number of actual e2fsck issues observed was more substantial when we ran with fixes enabled compared to the previous run with "-n". However, it did complete after about 2 hours and allowed us to mount via ldiskfs. The mountdata and LAST_ID files looked reasonable and we were subsequently able to mount as a Lustre fs. We do have a small percentage of files in lost+found, and are going to leave this OST inactive on the MDS until the next maintenance, but it looks like we were able to recover the majority of the data in this case. Thanks for the help and suggestions today. We definitely have not seen anything quite like this before and are rebuilding the raidset with an alternate drive. |
| Comment by Andreas Dilger [ 31/Jul/13 ] |
|
It is possible to recover the files in lost+found using the "ll_recover_lost_found_objs" tool. This will move the OST objects from lost+found to their proper location /O/0/d{objid % 32}/{objid}, using the information stored in the "fid" xattr of each inode. Any objects that are zero-length have likely never been accessed and could be deleted. This needs to be done with the OST mounted as ldiskfs (it will eventually be done automatically when the LFSCK Phase 2 project is completed). |
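For illustration only (the object ID is made up and the mount point is a placeholder), the layout described above places each object in a subdirectory chosen by objid modulo 32 under the ldiskfs mount point:

OBJID=1234567
echo "O/0/d$((OBJID % 32))/$OBJID"    # -> O/0/d7/1234567

# with the OST mounted as ldiskfs, the recovery itself would be something like:
mount -t ldiskfs /dev/md14 /mnt/ost
ll_recover_lost_found_objs -d /mnt/ost/lost+found

The -d argument is the lost+found directory to scan; check the tool's man page in your release for the exact options.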
| Comment by Andreas Dilger [ 31/Jul/13 ] |
|
While I don't think any of the problems seen here relate to Lustre specifically, I'm going to leave this ticket open for implementation of "mkfs.lustre --replace --index=N", which would just avoid setting the LDD_F_VIRGIN flag when formatting a new OST. |
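For clarity, the intended usage would be something along these lines (illustrative only: the option did not exist at the time of this ticket, and the fsname, index, MGS NID, and device are placeholders):

mkfs.lustre --ost --replace --index=28 --fsname=scratch \
    --mgsnode=192.168.1.10@o2ib /dev/md14

i.e. format a replacement OST that reuses the old index but does not try to register with the MGS as a brand-new target.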
| Comment by Tommy Minyard [ 02/Aug/13 ] |
|
Thanks for the information on recovering the lost+found files; this is not something we have typically done in the past. We have a maintenance scheduled for next week and plan to attempt the recovery at that time. One question has come up from a few users we contacted regarding their "lost" files: some have copied the files we had planned to recover back from the tape library, and they want to make sure the ll_recover_lost_found_objs recovery will not overwrite the new files. How will the Lustre recovery behave if the previous file has been replaced with a new copy? My suspicion is that it will not overwrite the new file, but I just wanted to get your thoughts on this scenario. |
| Comment by Andreas Dilger [ 03/Aug/13 ] |
|
This will only recover the OST objects from lost+found and not touch any of the data or metadata. If the old file was deleted and restored, it will have a different MDS inode with different objects, and there will be no impact from running ll_recover_lost_found_objs. If, for some reason, there are already objects with the same objid in O/0/d*/ (e.g. some kind of manual recovery of OST objects was done), then the old objects will be left in lost+found. |
| Comment by Andreas Dilger [ 09/Aug/13 ] |
|
Closing this as "Not a Bug" since the problem ended up being in the MD RAID/parity rebuilding. I'm submitting a patch for "mkfs.lustre --replace" under |
| Comment by Tommy Minyard [ 10/Aug/13 ] |
|
Andreas, I know you just closed this yesterday, but now we have had the SAME exact sequence of events happen on a second OSS when it had a drive fail and the automatic spare rebuild started. Note that we upgraded to the 2.6.32-358.11.1 kernel provided in the Intel distribution for the Lustre 2.1.6 release on July 23rd, and now, after two drive failures, we have had the exact same sequence of events. We had plenty of drive failures and automatic rebuilds with our previous 2.1.5 release and its corresponding kernel (not sure of the version offhand). We have not seen other reports of this yet in our searching, but this has to be more than a coincidence. For now we have disabled automatic spare rebuilding on all OSSs.

Now the bad news: even after following our previous procedure of removing the drive that was most recently added and rebuilt, we cannot get e2fsck to complete on the RAID device. It ran overnight last night and grew to 27GB in memory, but had not written anything to the screen for almost 12 hours before we gave up on it and restarted (it took only a few hours on the other array). The restart has fixed a few more errors but is steadily growing in memory usage again. From what we have found searching around, if fsck is using a lot of memory, that typically points to pretty severe filesystem corruption. Any thoughts on this situation or suggestions for us to try?

Thanks, |
| Comment by Peter Jones [ 10/Aug/13 ] |
|
Tommy,

This is very strange. I have reopened the ticket so we can continue to track this until we have a clearer picture of what is going on.

Peter |
| Comment by Tommy Minyard [ 10/Aug/13 ] |
|
Thanks Peter. The restarted fsck is still running and consuming more than 20GB of memory right now. It has made a few more repairs since my earlier comment, so it is still progressing. One thing to add: we know that rebuilds can be successful, as we did one manually with the last OST that suffered the RAID-6 corruption. In the two cases where a rebuild has caused problems so far, the two main differences are that the rebuilds were kicked off automatically and the OST was still active on the MDS, allowing new files to be written to it. I've been digging around looking for reported Linux RAID-6 issues; there is a note of a rebuild issue on one page I found, but it indicated the problem had been fixed in 2.6.32 and later kernels. |
| Comment by Tommy Minyard [ 12/Aug/13 ] |
|
So we have not had any success getting fsck to run to completion on the corrupted OST. We let e2fsck run on the OSS until it ran out of memory, consuming 55GB, and it did not appear to be making much progress on the repairs. We are currently out of ideas for repair; if you have any further suggestions please let us know ASAP. We can mount the OST as ldiskfs, but it looks like there is NO data actually on the filesystem under this mount point. At this point, we think there will not be any way to recover the data, so we are working on the procedure to recreate the OST from scratch. Karl took some notes on how to replace the OST back in where it was previously and we'll follow the instructions in |
| Comment by Peter Jones [ 12/Aug/13 ] |
|
Tommy,

Andreas is out this week, but I did manage to connect with him to see whether he had any suggestions, and he suggested trying to rebuild 2.1.6 against the older kernel to see whether that has any effect on this behaviour. I'm continuing to talk to other engineers with expertise in this area to see if there are any other thoughts.

Regards

Peter |
| Comment by Karl W Schulz (Inactive) [ 13/Aug/13 ] |
|
We did take a quick stab at building the 2.1.6 release against our older kernel (and even the 2.6.32-279.19.1.el6_lustre version that was supported with 2.1.5), but it looks like the newer ldiskfs patches for the RHEL6 series hit some conflicts which prevent a build out of the box. Consequently, we've decided to roll the servers back to 2.1.5 and the previous production OSS kernel (2.6.32-279.5.2.el6). |
| Comment by Andreas Dilger [ 20/Aug/13 ] |
|
As Murphy would have it, you hit this problem again the day after I went on vacation for a week.

I looked through the Lustre 2.1.5->2.1.6 changes, and while there are some changes in the ldiskfs patches, these are mostly in patch context and not functional changes. I couldn't see any other significant changes to the OST code either. On the kernel side, there were some changes to ext4 that appear related to freezing the filesystem for suspend/resume, and better block accounting for delayed allocation (which Lustre doesn't use). There is another change to optimize extent handling for fallocate() (which Lustre also doesn't use), but I can't see how it would relate to MD device failure/rebuilding. I'm not sure what changes went into the MD RAID code.

Do you still have logs for the e2fsck runs from the second OST failure? I can't imagine why it would be consuming so much memory, unless there was some strange corruption that e2fsck isn't expecting. It shouldn't be using more than about 5-6GB of RAM in the worst case. If you have logs it might be possible to reverse-engineer what type of corruption was seen.

Presumably, you weren't able to recover anything from this OST in the end? Nothing in lost+found? |
| Comment by John Fuchs-Chesney (Inactive) [ 12/Mar/14 ] |
|
Karl and Tommy, |
| Comment by John Fuchs-Chesney (Inactive) [ 29/Mar/14 ] |
|
Customer has apparently moved on from this issue. |