[LU-7677] sudden severe slowdown for some OSTs in the file system Created: 17/Jan/16  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Frederik Ferner (Inactive) Assignee: Oleg Drokin
Resolution: Incomplete Votes: 0
Labels: None
Environment:

RHEL 6 clients and servers; clients on Lustre 2.7.1, OSSes on 2.7.0, active MDS on 2.7.1 + patches for changelog overflow. About 2 weeks ago the active MDS was failed over to the standby, which had been upgraded previously.
Patches applied on the MDS on top of 2.7.1 are:

0e98253 LU-6556 obdclass: re-allow catalog to wrap around
7f86741 LU-6634 llog: destroy plain llog if init fails
4616323 LU-6634 llog: add llog_trans_destroy()

OSTs on DDN SFA10K


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Over the last day we have had reports from our users that the Lustre file system is unusable due to very slow write speeds.

While investigating the issue, we discovered that a number of OSTs are showing write throughput of <1MB/s while the rest are showing the expected >400MB/s. Initially only a very small number of OSTs were affected, but the issue now seems to affect an increasing number of OSTs.

We have recovered the bulk of the performance for our users by disabling the affected OSTs on the MDT, which allows them to continue operating for now. This is clearly not a situation we want to maintain for longer than necessary, but it lets us keep the system up while we investigate the issue.
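For reference, deactivating an affected OST on the MDS looks roughly like the following (a sketch only; the fsname "lustre01" and the OST index are placeholders, not our real names):

# on the active MDS: stop allocation of new objects on the affected OST
lctl --device lustre01-OST0007-osc-MDT0000 deactivate
# list devices to confirm the device for that OST is no longer active
lctl dl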

There does not seem to be any pattern to which OSTs are affected; AFAICT they are distributed over all OSS nodes, and every OSS also has perfectly working OSTs.

There is nothing obvious in the logs on the clients, OSSes, or MDS, and no obvious fault on the SFA10K.

Currently our users are fairly OK with the situation, but starting tomorrow (UK time) we will need to investigate this with high priority.

The only Lustre-related log entries we have been able to find appear on all OSS nodes and are similar to the one below, repeating roughly every 10 minutes.

Jan 17 15:10:48 cs04r-sc-oss03-03 kernel: Lustre: 27537:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453043393/real 1453043393]  req@ffff8805c377f9c0 x1521165154263304/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1453043448 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 17 15:10:48 cs04r-sc-oss03-03 kernel: Lustre: 27537:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 8 previous similar messages

10.144.144.1@o2ib is the MDS which was active until we initiated a failover to the second MDS after upgrading that one to 2.7.1+patches. We have so far not been able to identify the cause of this error.

Any suggestions on how to debug this (preferably, at least initially, without further impact on the production file system) would be much appreciated.



 Comments   
Comment by Frederik Ferner (Inactive) [ 18/Jan/16 ]

FWIW, we have now confirmed that all affected OSTs are using the same controller on the SFA10K and all OSTs on that controller are affected. DDN are looking into this.

Comment by Peter Jones [ 18/Jan/16 ]

ok Frederik. Let us know if there is anything else you need from us

Comment by Frederik Ferner (Inactive) [ 01/Feb/16 ]

Peter, All,

Could we re-open this ticket, please? It looks like we need to investigate this further.

We have replaced the controller, the problem went away but came back with the same symptoms after about a week.

As I see it, we are now looking at investigating individual OSTs, preferably at a low level, eliminating as many of the Lustre layers as possible, and ideally without affecting the data. Any suggestions on how we can debug this would be appreciated. (We've got a maintenance window tomorrow where we'll try to do as much as we can.)

Frederik

Comment by John Fuchs-Chesney (Inactive) [ 01/Feb/16 ]

By request of Frederik.
~ jfc.

Comment by Oleg Drokin [ 01/Feb/16 ]

So if you want to take Lustre out of the picture completely, you can test the raw block device with tools like dd. That should be nondestructive, at least for reads.
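For example, something like this (the device path is a placeholder for whatever backs the slow OST); iflag=direct keeps the page cache out of the measurement:

# streaming read straight from the block device backing the OST, bypassing Lustre
dd if=/dev/mapper/ost_lun07 of=/dev/null bs=1M count=4096 iflag=direct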

You can also mount the underlying Lustre backend as ldiskfs and then read/write some files with a performance-measuring tool; that should be nondestructive too, just remember to delete the test files afterwards.
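Roughly like this, assuming the OST is backed by a placeholder device /dev/mapper/ost_lun07 and is not mounted as Lustre at the same time:

mkdir -p /mnt/ost07_ldiskfs
mount -t ldiskfs /dev/mapper/ost_lun07 /mnt/ost07_ldiskfs
# streaming write and read with direct I/O, then remove the test file
dd if=/dev/zero of=/mnt/ost07_ldiskfs/perftest.tmp bs=1M count=4096 oflag=direct
dd if=/mnt/ost07_ldiskfs/perftest.tmp of=/dev/null bs=1M iflag=direct
rm -f /mnt/ost07_ldiskfs/perftest.tmp
umount /mnt/ost07_ldiskfs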

And you can use the obdfilter-survey tool to read and write via the Lustre layers but with no network involved. I believe that is also nondestructive, as it does not actually reformat anything; you just need to remember to remove the objects it used for IO afterwards.
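If I remember the lustre-iokit usage correctly, the local disk case is driven by environment variables, roughly like this (the target name is a placeholder; check the lustre-iokit documentation for your version for the exact parameters):

# the "disk" case exercises the OST through the Lustre OBD stack with no network;
# remember to clean up the objects it used for IO afterwards, as noted above
nobjhi=2 thrhi=16 size=8192 case=disk targets="lustre01-OST0007" obdfilter-survey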

Comment by Frederik Ferner (Inactive) [ 04/Feb/16 ]

Oleg,

thanks for the suggestions.

We have been able to confirm that if an OST is slow, the slowness is also observed when reading from the raw block device with dd. (I've not tried any of the others...)

We have now also managed to trigger slow throughput on an OST by running a script that randomly writes 4k blocks in a file. We run this script in 4 instances on a single client, each instance writing its own ~400GB file, with all files on the same OST. When doing this, with pretty much nothing else going on on the file system, the throughput drops from 600MB/s or more to <10MB/s when testing with a simple dd to another file on that OST from a different node. Is that expected?
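The script is essentially a loop along these lines (a simplified sketch, not our exact script; the path, OST index and iteration count are illustrative):

#!/bin/bash
# overwrite one random 4k-aligned block inside a ~400GB file, repeatedly
FILE=/mnt/lustre01/test/randwrite.$1        # one file per instance
BLOCKS=$((400 * 1024 * 1024 / 4))           # number of 4k blocks in ~400GB
lfs setstripe -c 1 -i 7 $FILE               # pin the file to a single OST (index is a placeholder)
for i in $(seq 1 1000000); do
    dd if=/dev/urandom of=$FILE bs=4k count=1 \
       seek=$((RANDOM * RANDOM % BLOCKS)) conv=notrunc 2>/dev/null
done

Each instance gets its own file name via the argument ($1).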

Frederik

Comment by Oleg Drokin [ 20/Feb/16 ]

Well, typically if you write a lot of small random IO to disks, they become really slow due to all the seeking involved, so no real surprise there.

The surprising part is that you are doing this on a DDN device; I had read that they were smart about this and guaranteed a floor rate of IO even in the case of a stream of small random IO.

If you can demonstrate this with no Lustre in the picture: mount an OST volume as ldiskfs (or format a spare LUN as ext4 if you prefer), run 4 (or 4*8 = 32, to account for multiple RPCs in flight in the Lustre case) of those "random stream of 4k writes per file" jobs in DIRECTIO mode (the directio is important, to force the data to go to disk right away and match what Lustre does), and then do a streaming write (also in directio mode, in 1M chunks). If that streaming write is slow, I think you need to talk to DDN; they should be able to help you see whether there is some sort of configuration error or some other problem going on.
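A rough sketch of that test, with placeholder paths and sizes, on the ldiskfs (or ext4) mount:

# 32 concurrent random 4k writers in direct-IO mode
MNT=/mnt/ost07_ldiskfs
for n in $(seq 1 32); do
    ( for i in $(seq 1 100000); do
          dd if=/dev/zero of=$MNT/rand.$n bs=4k count=1 \
             seek=$((RANDOM * RANDOM % 1048576)) conv=notrunc oflag=direct 2>/dev/null
      done ) &
done
# while those run, check whether a streaming direct-IO write in 1M chunks is still fast
dd if=/dev/zero of=$MNT/stream.tmp bs=1M count=4096 oflag=direct

The oflag=direct on every dd is what forces each write to the device immediately, matching the DIRECTIO requirement above.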
