Details
-
Bug
-
Resolution: Incomplete
-
Major
-
None
-
Lustre 2.7.0
-
None
-
RHEL 6 clients and servers; clients on Lustre 2.7.1, OSSes on 2.7.0, active MDS on 2.7.1 + patches for changelog overflow. About 2 weeks ago the active MDS was failed over to the standby, which had been upgraded previously.
Patches applied on the MDS on top of 2.7.1 are:
0e98253 LU-6556 obdclass: re-allow catalog to wrap around
7f86741 LU-6634 llog: destroy plain llog if init fails
4616323 LU-6634 llog: add llog_trans_destroy()
OSTs on DDN SFA10K
-
3
Description
Over the last day we have had reports from our users that the Lustre file system is unusable due to very slow write speeds.
While investigating the issue, we discovered that a number of OSTs are showing write throughput of <1MB/s while the rest are showing the expected >400MB/s. Initially only a very small number of OSTs were affected, but the issue now seems to affect an increasing number of OSTs.
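(For anyone wanting to reproduce the per-OST numbers: timing a streaming write to a single-striped file, along the lines below, gives comparable figures. The mount point and OST index are just placeholders.)
# Create a file striped over one specific OST (index 4 is only an example)
# and time a streaming write; oflag=direct bypasses the client page cache
# so the result mostly reflects the OST itself.
lfs setstripe -c 1 -i 4 /mnt/lustre/ost4_testfile
dd if=/dev/zero of=/mnt/lustre/ost4_testfile bs=1M count=4096 oflag=direct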
We have recovered the bulk of the performance for our users by disabling the affected OSTs on the MDT, allowing them to continue operating for now. This is clearly not a situation we want to keep for longer than necessary, but it allows us to keep the system up while we investigate the issue.
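For reference, the kind of command used to take an OST out of allocation on the MDS looks like the following; the fsname "lustre" and the OST index are placeholders, not our real names.
# On the active MDS: find the OSC/OSP device for the affected OST and
# deactivate it so the MDT stops allocating new objects there.
lctl dl | grep OST0004
lctl --device lustre-OST0004-osc-MDT0000 deactivate
# Once the OST is healthy again it can be re-enabled with:
lctl --device lustre-OST0004-osc-MDT0000 activate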
There does not seem to be any pattern to which OSTs are affected; AFAICT they are distributed over all OSS nodes, and every OSS also has perfectly working OSTs.
There is nothing obvious in the logs on the clients, OSSes or MDS, and no obvious fault on the SFA10K.
Currently our users are fairly OK with the situation, but starting tomorrow (UK time) we will need to investigate this with high priority.
The only Lustre-related log entries we have been able to find are on all OSS nodes and are similar to the one below, repeating roughly every 10 minutes.
Jan 17 15:10:48 cs04r-sc-oss03-03 kernel: Lustre: 27537:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453043393/real 1453043393] req@ffff8805c377f9c0 x1521165154263304/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1453043448 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 17 15:10:48 cs04r-sc-oss03-03 kernel: Lustre: 27537:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
10.144.144.1@o2ib is the MDS which was active until we initiated a failover to the second MDS after upgrading that node to 2.7.1+patches; we have so far not been able to identify the cause of this error.
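For what it is worth, the only checks we can think of that are safe to run from an OSS without touching production are along these lines (the nid is the one from the log above; exact parameter names may differ slightly on 2.7):
# Basic LNET reachability of the old MGS nid, and the state of the MGC
# import on the OSS that is reporting the timeouts.
lctl ping 10.144.144.1@o2ib
lctl dl | grep -i mgc
lctl get_param mgc.*.import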
Any suggestions on how to debug this (at least initially, preferably without further impact on the production file system) would be much appreciated.
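If it helps, we can also collect read-only per-OST statistics on the OSSes without further impact, e.g. to compare a slow OST against a healthy one on the same node (the OST name below is a placeholder):
# Per-OST I/O size histograms and OSS I/O service stats; purely read-only,
# so safe to gather on a production OSS.
lctl get_param obdfilter.lustre-OST0004.brw_stats
lctl get_param ost.OSS.ost_io.stats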