LU-7677: sudden severe slowdown for some OSTs in the file system


Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major
    • Affects Version/s: Lustre 2.7.0
    • Severity: 3

    Description

      Over the last day we have had reports from our users that the Lustre file system is unusable due to very slow write speeds.

      While investigating the issue, we discovered that a number of OSTs are showing write throughput of <1MB/s while the rest are showing the expected >400MB/s. Initially only a very small number of OSTs were affected, but the issue now seems to affect an increasing number of OSTs.
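
      (A per-OST write test can be run along these lines, by pinning a test file to a single OST and writing with direct I/O; the mount point and OST index below are placeholders, not our actual values.)

      lfs setstripe -c 1 -i 4 /mnt/lustre/ost-test/ost0004.dat    # create a file striped over OST0004 only
      dd if=/dev/zero of=/mnt/lustre/ost-test/ost0004.dat bs=1M count=1024 oflag=direct
      lfs getstripe /mnt/lustre/ost-test/ost0004.dat              # confirm the file really sits on the intended OST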

      We have recovered the bulk of the performance for our users by disabling the affected OSTs on the MDT, allowing our users to continue operating for now. This is clearly not a situation we want to stay in for longer than necessary, but it allows us to keep the system up while we investigate the issue.
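
      (For reference, marking an OST inactive for new allocations is typically done on the MDS with lctl, roughly as below; the fsname "lustre" and the OST index are placeholders. Existing objects on a deactivated OST remain readable.)

      lctl dl | grep OST0004                                   # find the MDT-side device name for the OST
      lctl --device lustre-OST0004-osc-MDT0000 deactivate      # stop allocating new objects on this OST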

      There does not seem to be any pattern to which OSTs are affected; AFAICT they are distributed over all OSS nodes, and every OSS also has perfectly working OSTs.

      There is nothing obvious in the logs on the clients, OSS nodes or MDS, and no obvious fault on the SFA10K.

      Currently our users are fairly OK with the situation, but starting tomorrow (UK time) we will need to investigate this with high priority.

      The only Lustre-related log entries we have been able to find are on all OSS nodes and are similar to the one below, repeating roughly every 10 minutes.

      Jan 17 15:10:48 cs04r-sc-oss03-03 kernel: Lustre: 27537:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453043393/real 1453043393]  req@ffff8805c377f9c0 x1521165154263304/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1453043448 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Jan 17 15:10:48 cs04r-sc-oss03-03 kernel: Lustre: 27537:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      

      10.144.144.1@o2ib is the MDS which was active until we initiated a failover to the second MDS after upgrading that one to 2.7.1+patches. We have so far not been able to identify the cause of this error.
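
      (A low-impact follow-up on these messages, run from an OSS, would be to check LNet reachability of that NID and the state of the MGC import, e.g. something like the following.)

      lctl ping 10.144.144.1@o2ib      # basic LNet reachability of the old MDS/MGS NID from this OSS
      lctl get_param mgc.*.import      # which MGS NID the MGC is currently trying, and its connection state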

      Any suggestions on how to debug this (at least initially, preferably without further impact on the production file system) would be much appreciated.
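
      (For reference, data that can be gathered without further impact on production includes the OSS-side brw_stats for an affected versus a healthy OST on the same node, and the client-side osc rpc_stats; the OST index below is a placeholder.)

      lctl get_param obdfilter.*OST0004*.brw_stats    # on an OSS: bulk I/O size and I/O time histograms for one OST
      lctl get_param osc.*.rpc_stats                  # on a client: pages-per-RPC and RPCs-in-flight histograms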

          People

            Assignee: Oleg Drokin (green)
            Reporter: Frederik Ferner (ferner) (Inactive)
            Votes: 0
            Watchers: 4
