[LU-497] DDN failure - Now can't find a valid superblock Created: 08/Jul/11 Updated: 26/Oct/11 Resolved: 08/Jul/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Critical |
| Reporter: | Joe Mervini | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
chaos version 4.4-2 on Dell R710 servers connected via IB to DDN S2A9900. |
||
| Rank (Obsolete): | 10499 |
| Description |
|
A tray on one of our 99K enclosures biffed last night, causing the OSS to panic. When we got things more or less back in order we attempted fscks on all the LUNs associated with that server and succeeded on all but one. When I attempt to run the fsck the system complains about fsck.ext4 not being found. When I run fsck.ldiskfs on the trouble LUN I get the following: [root@aoss11 ~]# fsck.ldiskfs /dev/sdg When I go to the alternate superblocks (only three get listed) I get the same error. The odd thing is that if I do a tunefs.lustre on the device it gives me all the information on the OST. If I try to run dumpe2fs it spits out some of the disk info and then just waits. I can break out of the command, but even if I run the command on one of the good LUNs I get the same results. I don't know how to try to find any additional superblocks. This is a production file system, so we are obviously down and critical. Any assistance would be greatly appreciated. |
| Comments |
| Comment by Joe Mervini [ 08/Jul/11 ] |
|
Quick follow-up: so there's no confusion, we used fsck.ldiskfs on all the other devices successfully. |
| Comment by Peter Jones [ 08/Jul/11 ] |
|
Joe, I am looking for an engineer to help you with this issue. Can I just confirm the version of Lustre code that you are running? Is it really Lustre 1.8.6-wc1, or is it the Lustre 1.8.5 + patches bundled with the latest Chaos release? Thanks, Peter |
| Comment by Joe Mervini [ 08/Jul/11 ] |
|
It is the Lustre 1.8.5 + patches bundled with the latest Chaos release. |
| Comment by Andreas Dilger [ 08/Jul/11 ] |
|
Older versions of e2fsck have some issues like this with the MMP block being left in a state where it reports e2fsck is still being run. Those problems have been fixed with newer e2fsck releases. In order to clear this flag in the MMP block you need to run: tune2fs -f -E clear_mmp /dev/sdg and then run e2fsck as normal. I would separately recommend upgrading e2fsprogs to 1.41.12.ora2, which contains several MMP fixes. |
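The two steps in the comment above could be wrapped in a small shell helper along the following lines. This is only a sketch under stated assumptions: the function names `run` and `clear_mmp_then_fsck` and the `DRY_RUN` switch are hypothetical, and the read-only `e2fsck -fn` pass is one assumed reading of "run e2fsck as normal". By default the script only prints the commands and does not touch the device.

```shell
#!/bin/sh
# Hypothetical helper: print (or, with DRY_RUN=0, execute) the MMP recovery steps.
# Only the tune2fs invocation comes from the comment above; the read-only
# e2fsck pass is an assumption about what "as normal" means here.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # dry run: show the command only
    else
        "$@"                 # real run: execute it
    fi
}

clear_mmp_then_fsck() {
    dev="$1"
    # Clear the stale MMP state that claims e2fsck is still running.
    run tune2fs -f -E clear_mmp "$dev" || return 1
    # Forced, read-only check; -n answers "no" to every prompt.
    run e2fsck -fn "$dev"
}
```

Calling `clear_mmp_then_fsck /dev/sdg` with the default `DRY_RUN=1` just lists the two commands, so they can be reviewed before anything is run against a production LUN.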
| Comment by Joe Mervini [ 08/Jul/11 ] |
|
Andreas, thank you so much for the help. tune2fs -f -E clear_mmp /dev/sdg indeed cleared the way to run fsck. The delay in feedback was because we allowed one of the failed drives in the array to complete its rebuild before doing anything else. After that, we got real anal and shut down the servers attached to the controller pair and restarted them. We then ran fsck with -n to see what got reported, then with -yDf. Both checks combined took more than 2 hours to complete. We then ran ll_recover_lost_found_objs on the LUN mounted as ldiskfs, which restored all objects in lost+found. We then brought the whole file system back online and everything is back to normal. Thanks again for the quick response. |
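The recovery order described in this comment can be sketched as a shell helper. Several details are assumptions and not taken from the ticket: `e2fsck` as the checker (overridable through the hypothetical `FSCK` variable, for example with `fsck.ldiskfs`), the scratch mount point `/mnt/ost_recover`, the final unmount, and the names `run` and `recover_ost`. With the default `DRY_RUN=1` it only prints the commands.

```shell
#!/bin/sh
# Sketch of the recovery order described above; prints commands unless DRY_RUN=0.
# Assumptions: e2fsck as the checker, /mnt/ost_recover as a scratch mount point.
DRY_RUN="${DRY_RUN:-1}"
FSCK="${FSCK:-e2fsck}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # dry run: show the command only
    else
        "$@"                 # real run: execute it
    fi
}

recover_ost() {
    dev="$1"
    mnt="${2:-/mnt/ost_recover}"
    # 1. Read-only pass to see what would be reported.
    run "$FSCK" -n "$dev"
    # 2. Repair pass: answer yes (-y), optimize directories (-D), force (-f).
    run "$FSCK" -yDf "$dev"
    # 3. Mount the OST as plain ldiskfs and move objects out of lost+found.
    run mount -t ldiskfs "$dev" "$mnt" || return 1
    run ll_recover_lost_found_objs -d "$mnt/lost+found"
    # 4. Unmount again before bringing the file system back (assumed step).
    run umount "$mnt"
}
```

Running `recover_ost /dev/sdg` in dry-run mode shows the five commands in order, which makes it easy to confirm the device and mount point before executing anything.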
| Comment by Peter Jones [ 08/Jul/11 ] |
|
Joe, glad to hear that normal service has been resumed. Peter |