[LU-269] LDISKFS-fs error (device md41): ldiskfs_check_descriptors: Block bitmap for group 0 not in group Created: 03/May/11 Updated: 28/Jun/11 Resolved: 09/May/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Dan Ferber (Inactive) | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | RHEL 5.5 and Lustre 1.8.0.1 |
| Severity: | 2 |
| Rank (Obsolete): | 10544 |
| Description |
|
MDS1/2 are running in an active/active cluster configuration: the MGS resides on MDS1 and the MDT resides on MDS2. Is this an OK way to do things or not? In the original architecture of the system Oracle stated this was supported, but the customer just found out from Tyian over at Oracle that this is not a supported configuration for the MDS devices.

As of last Thursday the customer could see all raid devices from the OS, but for some reason OST11 simply would not become available. That issue went away with their "bare metal" reboot of the system on Friday morning.

What started then, however, we have yet to fix: OST15 (/dev/md41), resident on OSS4, is no longer mounting on the OSS at the operating system level. The raid device can be assembled, but Lustre will not mount it.

Customer contact for questions is tyler.s.wiegers@lmco.com |
| Comments |
| Comment by Tyler Wiegers (Inactive) [ 03/May/11 ] |
|
Environment should be RHEL 5.3, not 5.5.

We did a clean system startup, cleared messages, and systematically assembled the bitmaps, mounted them, assembled the raid devices, and mounted them for each OST. There were no errors for this OST until attempting to mount the raid device. All other OSTs mounted successfully.

The unique data point for this OST is that its raid device (md41) is missing a disk. The disk was reported as unknown after our CAM/firmware upgrades yesterday, so we replaced it, but we did not re-insert it into the raid. Would that situation cause the errors that we currently see?

The log output is the following:

oss3# mount -t lustre /dev/md41 /mnt/lustre_ost15

/var/log/messages output (dates/times trimmed, I had to re-type this from hard copy):

LDISKFS-fs error (device md41): ldiskfs_check_descriptors: Block bitmap for group 0 not in group (block 134217728)!

Thanks!
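For reference, the state of the underlying md array can be confirmed before attempting the Lustre mount. A minimal sketch, assuming only the md41 device name quoted in the log above (the member disk names are not given in this ticket):

  cat /proc/mdstat           # summary of all md arrays: active/degraded state, rebuild progress
  mdadm --detail /dev/md41   # per-array detail: raid level, device states, failed and spare members

A degraded but assembled array would normally still allow the ldiskfs mount, so the bitmap error above suggests filesystem-level damage rather than the missing member alone. |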
| Comment by Cliff White (Inactive) [ 03/May/11 ] |
|
Well, a missing disk without a spare would definitely mess up the raid, I would think. Was there data on the missing spindle? Has that been recovered?

After the md side is healthy, you should run 'fsck -fn' on md41 and see what that reports. Assuming the md41 device is restored, 'fsck -fy' may fix the bitmap issue, but run '-fn' first. If there are other errors beyond the bitmap, you should attach the results here, but if you only
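As a sketch of that sequence (device name taken from this ticket; the target must be unmounted, and Lustre's patched e2fsprogs is assumed to be installed so the ldiskfs features are recognized):

  fsck -fn /dev/md41   # forced, read-only pass: report problems, answer "no" to every repair
  fsck -fy /dev/md41   # only after reviewing the -fn report: re-run, answering "yes" to repairs |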
| Comment by Cliff White (Inactive) [ 03/May/11 ] |
|
Also, your first question was unrelated - if the MGT and MDT are separate partitions, it is okay to have one node active for the MGS and the other active for the MDS in a failover pair - the MGS is really, really lightweight, so after client mount the MGS node should be more or less idle.
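As an illustration of that layout (a sketch only; the device paths and mount points below are hypothetical, and each target would also have been formatted with the peer node as its failover node):

  mds1# mount -t lustre /dev/<mgt_device> /mnt/mgs   # MGS normally active on MDS1
  mds2# mount -t lustre /dev/<mdt_device> /mnt/mdt   # MDT normally active on MDS2 |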
| Comment by Tyler Wiegers (Inactive) [ 03/May/11 ] |
|
We inserted the disk back into the raid; it is currently rebuilding. Trying to mount the OST while the disk is rebuilding gives the same error. We've been able to mount OSTs while disks are rebuilding in the past, so the core issue doesn't look like it's resolved. No data should have been lost: we are running an 8+2 raid 6 device, so we can run with 8 of 10 disks without any data loss.
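For reference, re-adding a replaced member and watching the rebuild would look roughly like the sketch below (the member disk name is hypothetical, as it is not given in this ticket):

  mdadm /dev/md41 --add /dev/<replaced_disk>   # hypothetical member name; starts the raid6 recovery
  cat /proc/mdstat                             # shows resync/recovery progress for md41 |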
| Comment by Cliff White (Inactive) [ 03/May/11 ] |
|
Okay, after the rebuild, please run 'fsck -fn' |
| Comment by Johann Lombardi (Inactive) [ 04/May/11 ] |
|
> We inserted the disk back into the raid, it is currently rebuilding. Trying to mount the OST while the disk is rebuilding gives the same error.

Based on the following comment: could you please tell us what version of the mptsas driver you use?

> No data should have been lost, we are running an 8+2 raid 6 device, so we can run with 8/10 disks without any data loss.

Right.
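A couple of ways to report the mptsas driver version on RHEL 5, as a sketch (the second form only works while the module is loaded and exports a version):

  modinfo mptsas | grep -i version   # version of the installed driver module
  cat /sys/module/mptsas/version     # version of the currently loaded module, if exported |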
| Comment by Tyler Wiegers (Inactive) [ 04/May/11 ] |
|
This disk finished rebuilding. Once rebuilt, we attempted to run an e2fsck on the disks, which failed due to the MMP flag being set. We cleared the flag using tune2fs (ref

Once the e2fsck was complete, we were able to successfully mount this OST, which is extremely good news. There were multiple recovered files in lost+found which we will be attempting to recover.

We are in the process of running e2fsck on all of our OSTs. Once complete we are planning a complete power down of all OSSs, MDSs, and disk arrays in order to do a fresh, clean startup. I will update no later than tomorrow with, hopefully, a problem-resolved statement.
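For reference, the MMP clear and check described above would look roughly like the sketch below, assuming Lustre's patched e2fsprogs (which provides the MMP feature); some versions need an extra -f to tune2fs if MMP is marked as in use, and the target must not be mounted anywhere:

  tune2fs -O ^mmp /dev/md41   # clear the multi-mount-protection feature flag
  e2fsck -fy /dev/md41        # forced check, answering yes to repairs
  tune2fs -O mmp /dev/md41    # re-enable MMP once the check is clean |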
| Comment by Cliff White (Inactive) [ 05/May/11 ] |
|
Great, thanks for keeping us updated – |
| Comment by Peter Jones [ 09/May/11 ] |
|
Rob Baker of LMCO has confirmed that the critical situation is over and production is stable. Residual issues will be tracked under a new ticket in the future. |