[LU-1950] oss high load, dead ll_ost_io Created: 16/Sep/12  Updated: 15/Mar/14  Resolved: 15/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Hellen (Inactive) Assignee: Oleg Drokin
Resolution: Incomplete Votes: 0
Labels: None
Environment:

Sun Fire X4540, kernel ver 2.6.18-308.4.1


Attachments: File jbd_history     File lustre_log     File o1_console     File oss01-jbd2.tar     File oss01-lustre-log     File oss01-messages     File oss06_console     File oss07_messages     File oss2-jbd-stats.tar     File oss2-messages     File sysrq-t    
Severity: 2
Rank (Obsolete): 4022

 Description   

We have been experiencing a problem on our OSS servers. The server load shoots up to as high as 500 with dead IO threads, and the OSTs become unusable until the OSS is rebooted. This has happened 5 times on 4 different OSS servers in the last 48 hours. I've attached the messages file from the server that was last affected.



 Comments   
Comment by Cliff White (Inactive) [ 16/Sep/12 ]

Can you get a dump of all stacks with SysRq on the OSS when this is happening? Please attach it if so. Any indications of IO hardware problems?
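
For reference, one way to capture such a dump (a sketch; the output file path is just an example):

echo 1 > /proc/sys/kernel/sysrq      # make sure SysRq is enabled
echo t > /proc/sysrq-trigger         # dump all task stacks to the kernel log
dmesg > /tmp/sysrq-t.txt             # save the dump for attaching here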

Comment by Oleg Drokin [ 16/Sep/12 ]

The logs indicate the back-end IO devices are extremely slow to process requests, taking more than 10 thousand seconds to process a single request.
Typical reasons include RAID arrays going bad/degraded and lots of small IO workloads.
I see this is happening on several nodes, so I wonder if you started to run something recently with a very different IO footprint than before?
In particular, some application that does a lot of small IO (either reads or writes), or some application that does a lot of unlinks from a lot of clients (how many clients do you have?)

In particular, small IO workloads are pretty bothersome because to the disks they look like a stream of small random requests, and rotating disks hate workloads where they mostly do nothing but move the drive arm around (seeking) instead of writing.

A more or less healthy workload for rotating media is about 1 request in flight per spindle. Unfortunately there's no way in Lustre 1.8 to limit it like that (neither is 2.x capable of this yet, though the NRS feature will eventually allow it).
If a lot of random small IO is an important workload for you, it might make sense to decrease the number of IO threads on the OSS from 500 to roughly the total number of spindles you have on that OSS plus 5.
Also, if you have mostly write activity going on, enabling async journals might help (lctl set_param -n obdfilter.${ost_svc}.sync_journal=0 for every OST in the system).
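
A minimal sketch of both changes, assuming the standard Lustre 1.8 tunables (the thread count of 53 is purely illustrative, for roughly 48 spindles):

# persistent IO thread limit, set via the ost module options in /etc/modprobe.conf:
#   options ost oss_num_threads=53
# enable async journals on every OST on this OSS:
lctl set_param -n obdfilter.*.sync_journal=0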

Sadly, Lustre is unable to write to the media any faster than the media is capable of, and this needs to be taken into account when looking at the various IO patterns your applications generate.

Comment by Hellen (Inactive) [ 16/Sep/12 ]

Will try to get a dump of all stacks with SysRq the next time it happens.

We have 672 clients. I haven't noticed anything running with a different IO footprint, but I'll keep looking.
We do have a possible controller/driver bug (LU-1910) and are planning to upgrade the mptsas drivers ASAP; could a controller problem be a possible cause of this?

Most of the activity going on is write activity, so I'll enable async journals and see if that helps. There are some users with a lot of random small IO, so I will also decrease the number of IO threads to see if that helps too.

Comment by Oleg Drokin [ 17/Sep/12 ]

While a controller problem could be one of the possible causes, did you actually change the controller driver recently? If so, definitely try to roll back to the previous version, or otherwise try changing it.

Comment by Hellen (Inactive) [ 17/Sep/12 ]

Yes, we upgraded the OS and kernel recently. The controller driver in use now is a later version than what we had before the upgrade.

Comment by Hellen (Inactive) [ 18/Sep/12 ]

Please see the attached console messages with a dump of all stacks. This is from the last server affected, a few minutes ago.

Comment by Hellen (Inactive) [ 18/Sep/12 ]

SysRq dump of all stacks

Comment by Oleg Drokin [ 18/Sep/12 ]

I see in the new log that requests are "only" taking ~2600 seconds now instead of 10000 in the previous one. Did you reduce the number of threads or make other changes?

Also, one more option for you to try is disabling the read-only cache. It is enabled by default, so it's not something that changed during your upgrade, but there is a known problem that sometimes manifests as the cache on the OSS taking too much memory and forcing cached metadata out, resulting in a lot of small reads all the time. If your workload changed, that might have been triggered too, so you might want to try disabling the ROC, at least temporarily, to see if that helps:

lctl set_param -n obdfilter.*.read_cache_enable=0
lctl set_param -n obdfilter.*.writethrough_cache_enable=0

Please let us know if this has any effect.
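
A quick way to confirm the current settings before and after the change (assuming the same obdfilter parameter names as above):

lctl get_param obdfilter.*.read_cache_enable
lctl get_param obdfilter.*.writethrough_cache_enable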

Comment by Hellen (Inactive) [ 18/Sep/12 ]

sysrq-t

Comment by Hellen (Inactive) [ 18/Sep/12 ]

cat /proc/fs/jbd2/md*/history

Comment by Hellen (Inactive) [ 21/Sep/12 ]

We've reduced the IO threads from 512 to 256, read_cache and writethrough_cache are disabled, and we've enabled async journals. We also upgraded the mptbase drivers from the Red Hat default and are now using an Oracle-flavored driver.
It's been about 2 days since we made the changes. This morning the load on one of the OSS servers shot up to over 255. There were no errors in the messages log prior to the load shooting up. The load stayed up for about an hour and clients could not reach the OSTs on the server during this time. The load eventually came down and everything worked normally again. I've attached the messages file from the server affected (check from Sep 21 06:10:5; the load went from 8 to 255 between 05:45 and 06:15) and the jbd2 history from the md devices - I've been collecting these every 30 minutes.
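
The collection is essentially just a periodic dump of the jbd2 proc files; a minimal sketch (the exact script and output location are not shown here):

# run every 30 minutes, e.g. from cron:
cat /proc/fs/jbd2/md*/history > /var/tmp/jbd2-history.$(date +%Y%m%d-%H%M)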

Comment by Hellen (Inactive) [ 21/Sep/12 ]

jbd2 stats from all md devices

Comment by Hellen (Inactive) [ 21/Sep/12 ]

messages file

Comment by Oleg Drokin [ 21/Sep/12 ]

I only see blocking in the block layer (likely in the disk driver); possibly Johann will have some more comments.

Any chance you can try reverting the mpt driver to the one used before your recent upgrade, when you did not see such problems?

Comment by Hellen (Inactive) [ 21/Sep/12 ]

Another server was affected, but this one did not recover like the last one and we eventually had to reboot it. The problem started between 21:15 and 21:30 (Sep 21). I'll attach jbd stats, messages, and console messages with sysrq-t output.

Comment by Hellen (Inactive) [ 26/Sep/12 ]

Hi, have you managed to have a look at the sysrq-t output?
We've been reverting every affected server back to the old kernel (with Lustre 1.8.5).

Comment by Oleg Drokin [ 26/Sep/12 ]

What I find a bit bothersome is that in some mpt threads there's this on the stack:

 [<ffffffff880e4dfb>] :mptbase:mpt_fault_reset_work+0x0/0x276

The reset is seemingly broken in your mpt driver (though that does not explain how everything comes back alive once you stop Lustre, I guess. Other than perhaps Lustre abandons the stuck IO and only seems alive then, even though the disks remain inaccessible?)

I wonder if you can try compiling the older mpt driver from the older kernel against the newer kernel and try it that way with the newer Lustre?
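
A rough sketch of what that build could look like, assuming the older mpt fusion source is unpacked in a local directory (the path is hypothetical, and the old source may need small fixes to build against the newer kernel):

cd /usr/src/mptfusion-old            # hypothetical location of the older driver source
make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
# then back up the in-tree mptbase/mptsas modules, install the rebuilt ones,
# and rebuild the initrd before rebooting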

Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ]

Hellen,
Is there anything further we can do to help you on this issue, or can we mark it as resolved?
Many thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 15/Mar/14 ]

Looks like we will not make any further progress on this issue.
