[LU-497] DDN failure - Now can't find a valid superblock Created: 08/Jul/11  Updated: 26/Oct/11  Resolved: 08/Jul/11

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Task
Priority: Critical
Reporter: Joe Mervini
Assignee: Andreas Dilger
Resolution: Fixed
Votes: 0
Labels: None
Environment:

CHAOS version 4.4-2 on Dell R710 servers connected via IB to a DDN S2A9900.


Rank (Obsolete): 10499

 Description   

A tray on one of our 99K enclosures biffed last night causing the OSS to panic. When we got things more or less back in order we attempted fscks on all the LUNs associated with that server and succeeded on all but one.

When I attempt to run the fsck the system complains about fsck.ext4 not being found. When I run fsck.ldiskfs on the trouble LUN I get the following:

[root@aoss11 ~]# fsck.ldiskfs /dev/sdg
fsck-sdg[7235]: running (null)
fsck-sdg[7235]: fsck.ldiskfs 1.41.10.sun2-4chaos (23-Jun-2010)
fsck-sdg[7235]: fsck.ldiskfs: MMP: fsck being run while trying to open /dev/sdg
fsck-sdg[7235]:
fsck-sdg[7235]: The superblock could not be read or does not describe a correct ext2
fsck-sdg[7235]: filesystem. If the device is valid and it really contains an ext2
fsck-sdg[7235]: filesystem (and not swap or ufs or something else), then the superblock
fsck-sdg[7235]: is corrupt, and you might try running e2fsck with an alternate superblock:
fsck-sdg[7235]: e2fsck -b 32768 <device>
fsck-sdg[7235]:
fsck-sdg[7235]: exit code 8 (operational error)

When I go to the alternate superblocks (only three get listed) I get the same error.

The odd thing is that if I run tunefs.lustre on the device, it gives me all the information on the OST.

If I try to run dumpe2fs it spits out some of the disk info then just waits. I can break out of the command but even if I run the command on one of the good LUNs I get the same results. I don't know how to try to find any additional superblocks.

This is a production file system so we are obviously down and critical. Any assistance would be greatly appreciated.



 Comments   
Comment by Joe Mervini [ 08/Jul/11 ]

Quick follow-up: so there's no confusion, we used fsck.ldiskfs on all the other devices successfully.

Comment by Peter Jones [ 08/Jul/11 ]

Joe

I am looking for an engineer to help you with this issue. Can I just confirm the version of Lustre code that you are running? Is it really Lustre 1.8.6-wc1, or is it the Lustre 1.8.5 + patches bundled with the latest Chaos release?

Thanks

Peter

Comment by Joe Mervini [ 08/Jul/11 ]

It is the Lustre 1.8.5 + patches bundled with the latest Chaos release.

Comment by Andreas Dilger [ 08/Jul/11 ]

Older versions of e2fsck have some issues like this with the MMP block being left in a state where it reports e2fsck is still being run. Those problems have been fixed with newer e2fsck releases.

In order to clear this flag in the MMP block you need to run:

    tune2fs -f -E clear_mmp /dev/sdg

and then run e2fsck as normal. I would separately recommend upgrading e2fsprogs to 1.41.12.ora2, which contains several MMP fixes.
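The sequence above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name `mmp_clear_plan` and the dry-run approach are illustrative and not part of this ticket, and the read-only `e2fsck -fn` pass is one possible reading of "run e2fsck as normal". The helper prints the commands instead of running them, since clearing the MMP block is only safe when the device is certainly not mounted and not being checked on any node.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): print, but do not run, the
# commands for clearing a stale MMP flag and then checking the filesystem.
mmp_clear_plan() {
    dev="${1:-}"
    if [ -z "$dev" ]; then
        echo "usage: mmp_clear_plan <device>" >&2
        return 1
    fi
    # Reset the MMP block. Only safe if no node has the device mounted
    # and no other e2fsck is running against it.
    echo "tune2fs -f -E clear_mmp $dev"
    # Assumed first step afterwards: a forced, read-only check that
    # reports problems without changing anything.
    echo "e2fsck -fn $dev"
}

mmp_clear_plan /dev/sdg
```

Reviewing the printed commands before running them by hand keeps the destructive step under operator control.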

Comment by Joe Mervini [ 08/Jul/11 ]

Andreas,

Thank you so much for the help. tune2fs -f -E clear_mmp /dev/sdg indeed cleared the way to run fsck. The delay in feedback was because we allowed one of the failed drives in the array to complete its rebuild before doing anything else. After that, we got real anal and shut down the servers attached to the controller pair and restarted them.

We then ran fsck with -n to see what got reported, then with -yDf. Together the two checks took more than 2 hours to complete. We then ran ll_recover_lost_found_objs on the LUN mounted as ldiskfs, which restored all objects in lost+found.
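For reference, the check-and-recover steps just described might look roughly like the following. This is a sketch, not the exact commands that were run: the function name `ost_recover_plan`, the mount point `/mnt/ost`, and the final `umount` are assumptions for illustration. The helper only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): print, but do not run, the
# check-and-recover sequence for one OST device.
ost_recover_plan() {
    dev="${1:-}"
    mnt="${2:-/mnt/ost}"   # assumed mount point, not from the ticket
    if [ -z "$dev" ]; then
        echo "usage: ost_recover_plan <device> [mountpoint]" >&2
        return 1
    fi
    # Read-only pass: report problems without fixing anything.
    echo "fsck.ldiskfs -n $dev"
    # Forced repair pass: answer yes to all prompts, optimize directories.
    echo "fsck.ldiskfs -yDf $dev"
    # Mount the OST backing filesystem directly as ldiskfs.
    echo "mount -t ldiskfs $dev $mnt"
    # Move objects from lost+found back to their proper locations.
    echo "ll_recover_lost_found_objs -d $mnt/lost+found"
    # Assumed cleanup step before restarting Lustre on the OST.
    echo "umount $mnt"
}

ost_recover_plan /dev/sdg
```

Running the read-only pass first shows how much damage the repair pass will have to deal with before anything on disk is modified.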

We then brought the whole file system back online and everything is back to normal.

Thanks again for the quick response.

Comment by Peter Jones [ 08/Jul/11 ]

Joe

Glad to hear that normal service has been resumed.

Peter
