[LU-1910] OSS kernel panics after upgrade Created: 12/Sep/12 Updated: 08/Mar/14 Resolved: 08/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Hellen (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: | Sun Fire x4540 server, 48 internal 1TB disks, Lustre-patched kernel (kernel-2.6.18-308.4.1.el5), Lustre 1.8.8 |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 10643 |
| Description |
|
Since our recent upgrade to 1.8.8, we've been experiencing problems with the md subsystem. Our OSTs are constructed as 8+2 RAID6 metadevices using the mdadm utility. |
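For reference, an 8+2 RAID6 metadevice of that shape would typically be assembled and formatted as an OST along these lines. This is a minimal sketch; the md device, member names, fsname, and MGS NID below are illustrative, not taken from this system: |

    # Hypothetical: 10 members at RAID level 6 = 8 data + 2 parity disks.
    mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]
    # Format the metadevice as a Lustre OST (fsname and MGS NID are placeholders).
    mkfs.lustre --ost --fsname=lustre --mgsnode=mgs@tcp0 /dev/md0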
| Comments |
| Comment by Peter Jones [ 12/Sep/12 ] |
|
Oleg will help with this one |
| Comment by Oleg Drokin [ 12/Sep/12 ] |
|
It looks like you have hit two related problems at once. I guess the most important thing for you right now is to swap out the bad drive and rebuild your RAID. Once you have the bad disk separated, you can plug it into a test node, check whether there is any kernel that does not hang the controller on a read error, and contact the driver maintainers / Red Hat with that information. |
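Identifying the failed member before pulling it usually comes down to something like the following; a sketch only, with hypothetical array and disk names: |

    # Overall md state; a failed member shows up flagged (F).
    cat /proc/mdstat
    # Per-array detail, including which slot is faulty.
    mdadm --detail /dev/md0
    # SMART data from the suspect disk itself.
    smartctl -a /dev/sdq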
| Comment by Hellen (Inactive) [ 12/Sep/12 ] |
|
Thanks for your response. In the meantime, would you recommend we disable the cron.weekly raid check? So far, the hangs only occur if a disk error is discovered during the check. |
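Concretely, I assume that would mean something like the following on our EL5 boxes; the cron script path is the one the stock mdadm package installs there, so adjust if yours differs, and the md device name is a placeholder: |

    # Abort a check that is already running on the array.
    echo idle > /sys/block/md0/md/sync_action
    # Disable the weekly scrub until the bad drive is dealt with.
    chmod -x /etc/cron.weekly/99-raid-check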
| Comment by Oleg Drokin [ 12/Sep/12 ] |
|
Well, sure, temporarily disabling the raid check on that particular array is a good idea until you can replace or fix the bad drive. Just be aware that it does not fix anything; it is papering over the real issue, which you have not hit yet either because you have not accessed the file stored there or because the bad block is in unused space. Basically, you have been lucky: if you read the entire disk, you will still hit it.

Seeing as it is just a read error, it might be a case of "bitrot", where a sector develops a read error because some bits changed and the CRC no longer matches, even though the underlying medium is still good. You can usually fix those by just writing over the bad location. The easiest way is to kick the bad drive out of the array, wipe its RAID superblock, and re-add it, so that the reconstruction process writes over the entire disk, including the problematic area; see the sketch below. I have multiple drives that were "healed" by this process.

But if you actually care about high availability, I would pull the disk, replace it with a spare, and then experiment with the controller drivers on a test box to see whether there is a more stable version. Otherwise, the next time some other disk goes bad, you will get a controller/driver freeze again, which is not a good thing for availability. |
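That kick-out/re-add sequence looks roughly like this with mdadm; a sketch only, with placeholder device names: |

    # Kick the suspect member out of the array.
    mdadm /dev/md0 --fail /dev/sdq --remove /dev/sdq
    # Wipe its md superblock so it re-adds as a brand-new member.
    mdadm --zero-superblock /dev/sdq
    # Re-add it; reconstruction rewrites every sector, including the bad spot.
    mdadm /dev/md0 --add /dev/sdq
    # Watch the rebuild progress.
    cat /proc/mdstat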
| Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ] |
|
Solution and guidance provided to customer. |