[LU-1950] oss high load, dead ll_ost_io Created: 16/Sep/12 Updated: 15/Mar/14 Resolved: 15/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Hellen (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Sun Fire X4540, kernel ver 2.6.18-308.4.1 |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 4022 |
| Description |
|
We have been experiencing a problem on our OSS servers. The server load shoots up to as high as 500 with dead IO threads, and the OSTs become unusable until the OSS is rebooted. This has happened 5 times on 4 different OSS servers in the last 48 hours. I've attached the messages file from the server that was most recently affected. |
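While the load is high, diagnostics along these lines can be captured (a sketch using standard tools; the thread name matches the ll_ost_io threads in the title):

    # list threads stuck in uninterruptible sleep (D state)
    ps -eLo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
    # system load and per-device IO backlog
    uptime; iostat -x 5 3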
| Comments |
| Comment by Cliff White (Inactive) [ 16/Sep/12 ] |
|
Can you get a dump of all stacks with SysRq on the OSS while this is happening? If so, please attach it. Are there any indications of IO hardware problems? |
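One way to capture such a dump (a sketch, assuming kernel messages reach the console or /var/log/messages; the paths are the standard procfs ones):

    # make sure SysRq is enabled
    echo 1 > /proc/sys/kernel/sysrq
    # dump all task stacks into the kernel log
    echo t > /proc/sysrq-trigger
    # save the result for attaching to the ticket
    dmesg > /tmp/sysrq-t-$(hostname).txt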
| Comment by Oleg Drokin [ 16/Sep/12 ] |
|
The logs indicate that the back-end IO devices are extremely slow to process requests, more than 10 thousand seconds for a single request. Small-IO workloads are particularly troublesome because, to the disks, they look like a stream of small random requests, and rotating disks handle such workloads badly: they spend most of their time moving the write arm around (seeking) instead of writing. A more or less healthy workload for rotating media is about one request in flight per spindle. Unfortunately there is no way in Lustre 1.8 to limit it like that (nor is 2.x capable of this yet, though the NRS feature will eventually allow it). Sadly, Lustre cannot write to the media any faster than the media can accept, and this needs to be taken into account when looking at the IO patterns your applications generate. |
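One way to see what the IO pattern actually looks like on the OSS (a sketch; brw_stats is the standard obdfilter histogram in 1.8, and OST names are site-specific):

    # per-OST histograms of IO sizes, fragmentation and IOs in flight
    lctl get_param obdfilter.*.brw_stats
    # block-layer view of queue depth and service times
    iostat -x 5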
| Comment by Hellen (Inactive) [ 16/Sep/12 ] |
|
Will try to get a dump of all stacks with SysRq the next time it happens. We have 672 clients; I haven't noticed anything running with a different IO footprint, but I'll keep looking. Most of the activity is writes, so I'll enable async journals and see if that helps. There are some users with a lot of small random IO, so I will also decrease the number of IO threads to see if that helps too. |
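For reference, a sketch of how these two changes are typically made on a 1.8 OSS (parameter names assumed from the standard 1.8 obdfilter/ost tunables; verify against your release before applying):

    # enable asynchronous journal commits (0 = async, 1 = sync)
    lctl set_param obdfilter.*.sync_journal=0
    # cap the OST IO thread count via the ost module option,
    # e.g. add to /etc/modprobe.conf and restart the OSS services:
    #   options ost oss_num_threads=256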
| Comment by Oleg Drokin [ 17/Sep/12 ] |
|
While a controller problem could be one of the possible causes, did you actually change the controller driver recently? If so, definitely try rolling back to the previous version, or otherwise try changing it. |
| Comment by Hellen (Inactive) [ 17/Sep/12 ] |
|
Yes, we upgraded the OS and kernel recently. The controller driver in use now is a later version than what we had before the upgrade. |
| Comment by Hellen (Inactive) [ 18/Sep/12 ] |
|
Please see the attached console messages with a dump of all stacks. This is from the last server affected, a few minutes ago. |
| Comment by Hellen (Inactive) [ 18/Sep/12 ] |
|
SysRq dump of all stacks |
| Comment by Oleg Drokin [ 18/Sep/12 ] |
|
I see in the new log that requests are "only" taking ~2600 seconds now instead of 10000 in the previous one. Did you reduce the number of threads or make other changes? One more option to try is disabling the read-only cache. It is enabled by default, so it is not something that changed during your upgrade, but there is a known problem that sometimes manifests as the cache on the OSS taking too much memory, forcing cached metadata out and resulting in a lot of small reads all the time. If your workload changed, that might have been triggered as well, so you may want to disable the read cache, at least temporarily, to see if it helps: lctl set_param -n obdfilter.*.read_cache_enable=0 Please let us know if this has any effect. |
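A related sketch (assuming the standard 1.8 obdfilter tunables, where writethrough_cache_enable is the companion setting on the write path):

    # disable the OSS read cache on all OSTs
    lctl set_param -n obdfilter.*.read_cache_enable=0
    # optionally disable the write-through cache as well
    lctl set_param -n obdfilter.*.writethrough_cache_enable=0
    # confirm the current values
    lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable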
| Comment by Hellen (Inactive) [ 18/Sep/12 ] |
|
sysrq-t |
| Comment by Hellen (Inactive) [ 18/Sep/12 ] |
|
cat /proc/fs/jbd2/md*/history |
| Comment by Hellen (Inactive) [ 21/Sep/12 ] |
|
We've reduced IO threads from 512 to 256, read_cache and writethrough_cache are disabled, and we've enabled async journals. We also upgraded the mptbase drivers from the Red Hat default and are now using an Oracle-flavored driver. |
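For the record, a sketch of verifying that the new settings are in effect (same tunables as above; the thread count is the number of ll_ost_io kernel threads):

    lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable obdfilter.*.sync_journal
    # number of OST IO threads actually running
    ps -e | grep -c ll_ost_io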
| Comment by Hellen (Inactive) [ 21/Sep/12 ] |
|
jbd2 stats from all md devices |
| Comment by Hellen (Inactive) [ 21/Sep/12 ] |
|
messages file |
| Comment by Oleg Drokin [ 21/Sep/12 ] |
|
I only see blocking in the block layer (likely in the disk driver); perhaps Johann will have some more comments. Any chance you can try reverting the mpt driver to the one used before your recent upgrade, when you did not see these problems? |
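A sketch of recording the driver versions involved before and after any revert (module names assumed from the mptbase/mptsas stack referenced in the logs):

    # versions of the currently loaded Fusion MPT modules
    modinfo mptbase mptsas | egrep 'filename|^version'
    # confirm which mpt modules are loaded
    lsmod | grep -i mpt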
| Comment by Hellen (Inactive) [ 21/Sep/12 ] |
|
Another server was affected, but this one did not recover like the last one and we eventually had to reboot it. The problem started between 21:15 and 21:30 (Sep 21). I'll attach the jbd stats, messages, and console messages with sysrq-t output. |
| Comment by Hellen (Inactive) [ 26/Sep/12 ] |
|
Hi, have you managed to take a look at the sysrq-t output? |
| Comment by Oleg Drokin [ 26/Sep/12 ] |
|
What I find a bit bothersome is that some mpt threads have this on the stack: [<ffffffff880e4dfb>] :mptbase:mpt_fault_reset_work+0x0/0x276 The fault reset is seemingly broken in your mpt driver (though that does not explain how everything comes back alive once you stop Lustre; perhaps Lustre abandons the stuck IO and only appears alive, even though the disks remain inaccessible?). Could you try compiling the older mpt driver from the older kernel against the newer kernel, and run it that way with the newer Lustre? |
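A rough sketch of such an out-of-tree build against the running kernel (the source directory is illustrative; it would come from the pre-upgrade kernel's drivers/message/fusion tree):

    # build the older fusion/mpt source against the current kernel headers
    make -C /lib/modules/$(uname -r)/build M=/path/to/old/fusion modules
    # install the resulting modules and refresh module dependencies
    make -C /lib/modules/$(uname -r)/build M=/path/to/old/fusion modules_install
    depmod -a
    # then unload the stock mpt modules and load the rebuilt ones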
| Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ] |
|
Hellen, |
| Comment by John Fuchs-Chesney (Inactive) [ 15/Mar/14 ] |
|
Looks like we will not make any further progress on this issue. |