[LU-1910] OSS kernel panics after upgrade Created: 12/Sep/12  Updated: 08/Mar/14  Resolved: 08/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Hellen (Inactive) Assignee: Oleg Drokin
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Sun Fire x4540 server, 48 internal 1TB disks, Lustre-patched kernel (kernel-2.6.18-308.4.1.el5), Lustre 1.8.8


Attachments: File oss06_messages    
Severity: 3
Rank (Obsolete): 10643

 Description   

Since our recent upgrade to 1.8.8, we've been experiencing problems with the md subsystem. Our OSTs are constructed as 8+2 RAID6 metadevices using the mdadm utility.
Every Sunday morning, cron.weekly runs the raid.check script and starts re-syncing the arrays. If the check hits a medium error, the md subsystem hangs; for example, "cat /proc/mdstat" hangs. The load on the server immediately starts climbing until the server becomes unusable and we have to reboot the OSS.
What could be causing this, and should we be running raid.check on the OST metadevices?
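For reference, the weekly check is normally just a scrub driven through the md sysfs interface; a minimal sketch of what it does per array, assuming a hypothetical /dev/md0 (exact script names and paths vary by distribution):

    # Start a verify-only scrub of md0 (what the weekly raid check kicks off)
    echo check > /sys/block/md0/md/sync_action
    # Watch progress and per-array state
    cat /proc/mdstat
    # Abort a scrub that is still running
    echo idle > /sys/block/md0/md/sync_action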



 Comments   
Comment by Peter Jones [ 12/Sep/12 ]

Oleg will help with this one

Comment by Oleg Drokin [ 12/Sep/12 ]

It looks like you have hit two related problems at once.
Problem #1 - your disk at 5:0:0:0 (/dev/sdao, if we believe dmesg) has gone bad; there is a clear read error in the log at the start of it all.
Problem #2 - due to a controller bug, a driver bug, or a combination of both, the controller wedges on the error and cannot access anything anymore, so you see a system-wide hang. Sadly this is not all that infrequent; I have numerous semi-high-end nodes under my control with the same problem (the disk controller hangs on disk errors, though not with all kernels).
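As an aside, one way to confirm which block device actually sits behind SCSI address 5:0:0:0 and to query the drive itself; /dev/sdao is an assumed name here, and the tools (lsscsi, smartmontools) may need to be installed:

    # Map the SCSI address to a block device (either form, depending on kernel)
    lsscsi 5:0:0:0
    ls /sys/class/scsi_device/5:0:0:0/device/ | grep block
    # Check the drive's own error counters and health status
    smartctl -a /dev/sdao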

I guess the most important thing for you right now is to swap out the bad drive and rebuild your RAID. Having set the bad disk aside, you can plug it into a test node, check whether there is any kernel that does not hang the controller on a read error, and contact the driver maintainers/Red Hat with that information.
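A minimal sketch of that swap-and-rebuild, assuming the affected array is /dev/md5 and the failed member is /dev/sdao (substitute the names shown by /proc/mdstat and dmesg on this OSS):

    # Mark the failing member as failed and pull it out of the array
    mdadm /dev/md5 --fail /dev/sdao --remove /dev/sdao
    # Physically swap in the replacement drive, then add it back
    mdadm /dev/md5 --add /dev/sdao
    # The RAID6 rebuild progress shows up in /proc/mdstat
    cat /proc/mdstat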

Comment by Hellen (Inactive) [ 12/Sep/12 ]

Thanks for your response. In the meantime, would you recommend we disable the cron.weekly raid.check? So far, the hangs only occur if a disk error is discovered during the check.

Comment by Oleg Drokin [ 12/Sep/12 ]

Well, sure, temporarily disabling the raid check on that particular array is a good idea until you can replace/fix the bad drive.
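One way to do that temporarily, assuming the check is driven by the mdadm cron script (the path below is an assumption; adjust it to match this installation):

    # Skip the weekly scrub until the drive is replaced
    chmod -x /etc/cron.weekly/99-raid-check
    # Re-enable it once the array is healthy again
    chmod +x /etc/cron.weekly/99-raid-check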

Just be aware that this does not fix anything; it only papers over the real issue. You have not hit the bad block in normal use either because you have not yet accessed the file stored there, or because it sits in unused space; basically, you have been lucky. If you try to read the entire disk, you will still hit it.
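To illustrate the point, a full read of the member disk will still trip over the bad sector (and on this system may itself wedge the controller, per problem #2); a read-only sketch with /dev/sdao as the assumed device:

    # Read the entire disk, continuing past errors; medium errors appear in dmesg
    dd if=/dev/sdao of=/dev/null bs=1M conv=noerror
    # Alternatively, a read-only surface scan
    badblocks -sv /dev/sdao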

Seeing as it is just a read error, it might be a case of "bitrot", where a spot on the disk develops a read error because some bits change and the CRC no longer matches, even though the underlying track is still good. Those you can usually fix by simply writing over the bad location. The easiest way is to kick the bad drive out of the array, wipe its RAID superblock, and re-add it; the reconstruction process then rewrites the entire disk, including the problematic area. I have multiple drives that were "healed" by this process.
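A sketch of that kick-out-and-rewrite sequence, again with /dev/md5 and /dev/sdao as assumed names:

    # Fail and remove the suspect member, then wipe its md superblock
    mdadm /dev/md5 --fail /dev/sdao --remove /dev/sdao
    mdadm --zero-superblock /dev/sdao
    # Re-add it; the rebuild rewrites every sector, including the problem area
    mdadm /dev/md5 --add /dev/sdao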

But if you actually care about high availability, if I were you I would pull the disk, replace it with a spare, and then experiment with the controller drivers on a test box to see whether there is a more stable version. Otherwise, the next time some other disk goes bad you will get the controller/driver freeze again, which is not a good thing for availability.

Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ]

Solution and guidance provided to customer.
No need to keep this ticket unresolved any longer.
~ jfc.
