Lustre / LU-5561

Lustre random reads: 80% performance loss from 1.8 to 2.6

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • Lustre 2.6.0
    • None
    • 3
    • 15511

    Description

      In a random read benchmark with a moderate amount of data, reading from disk (i.e., cache cold), we see an 80-90% performance loss going from 1.8 to 2.6. We have not tested 2.4/2.5. (We've tried both Cray's 1.8.6 client on SLES and Intel's 1.8.9 client on CentOS and had similar results.)

      This is the IOR command line to read the file - the file is single striped:
      IOR -E -F -e -g -b 2964m -t 19k -k -C -Q 17 -r -z -v

      The IOR command runs a single task that reads 2.89 GB in 19 KB chunks, at entirely random offsets.
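      As a quick sanity check of those sizes, here is a minimal sketch that works out how many 19 KB transfers the read phase issues. It assumes IOR's `m` and `k` suffixes are binary multiples (MiB and KiB); the helper name `transfer_count` is illustrative, not part of IOR.

```python
# Sizes taken from the IOR command line: -b 2964m (block size), -t 19k (transfer size).
# Assumption: the 'm' and 'k' suffixes are binary multiples (MiB, KiB).
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def transfer_count(block_bytes: int, transfer_bytes: int) -> int:
    """Number of whole transfers needed to cover one block."""
    return block_bytes // transfer_bytes


block_bytes = 2964 * MIB
transfer_bytes = 19 * KIB

n = transfer_count(block_bytes, transfer_bytes)
remainder = block_bytes % transfer_bytes

print(f"file size : {block_bytes / GIB:.2f} GiB")
print(f"transfers : {n}")
print(f"remainder : {remainder} bytes")
```

      Under that assumption the block size divides evenly into roughly 160k transfers, each at a random offset.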

      After writing the file out (same command with -w) and then dropping server & client caches (echo 3 > /proc/sys/vm/drop_caches), I saw the following read rates on 1.8.9. (These numbers are from a virtual cluster; numbers on real hardware were similar in terms of % change.)

      33.8 MB/s
      22.5 MB/s
      20.3 MB/s
      22.9 MB/s

      And these read rates on 2.6:
      3.48 MB/s
      3.57 MB/s
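      To put those runs on one scale, the following sketch computes the percentage loss from the rates listed above. The helper name `percent_loss` and the choice of comparing means, plus the narrowest and widest pairings of runs, are illustrative assumptions rather than part of the original measurement.

```python
from statistics import mean

# Read rates in MB/s, copied from the runs listed above.
rates_1_8_9 = [33.8, 22.5, 20.3, 22.9]
rates_2_6 = [3.48, 3.57]


def percent_loss(old: float, new: float) -> float:
    """Percentage drop in throughput going from `old` to `new`."""
    return 100.0 * (1.0 - new / old)


# Loss between the mean rates, plus the narrowest and widest pairings of runs.
mean_loss = percent_loss(mean(rates_1_8_9), mean(rates_2_6))
smallest_gap = percent_loss(min(rates_1_8_9), max(rates_2_6))
largest_gap = percent_loss(max(rates_1_8_9), min(rates_2_6))

print(f"mean loss    : {mean_loss:.1f}%")
print(f"smallest gap : {smallest_gap:.1f}%")
print(f"largest gap  : {largest_gap:.1f}%")
```

      Every pairing of a 1.8.9 run with a 2.6 run falls between roughly 82% and 90%, consistent with the 80-90% figure quoted in the description.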

      Server is running 2.6; we saw very similar numbers with a 2.1 server, so it doesn't seem to be related to the server version.

      I'll be attaching brw_stats for the OST where the file was located, for one run each of 1.8 and 2.6. The main thing I noted was that the 1.8 client seemed to do many more large reads, and many, many fewer reads overall.

      The 1.8 client read a total of ~4040 MB (estimated from brw_stats), but did it in ~11k total RPCs/disk I/O ops.

      The 2.6 client read a total of ~4060 MB (again, estimated from brw_stats). It used roughly 140k total RPCs/disk I/O ops: more than ten times as many I/O requests as the 1.8 client. The distribution of I/O sizes, unsurprisingly, skews much more towards small I/Os.
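      The brw_stats estimates above translate into an average I/O size per request. This is a rough sketch using only the approximate totals quoted here; it assumes "MB" means MiB and treats "11k" and "140k" as 11,000 and 140,000, and the helper name `avg_kib_per_rpc` is illustrative.

```python
# Approximate totals quoted above (estimated from brw_stats).
# Assumptions: "MB" is treated as MiB, and "11k"/"140k" as 11,000/140,000.
TOTAL_MB_1_8, RPCS_1_8 = 4040, 11_000
TOTAL_MB_2_6, RPCS_2_6 = 4060, 140_000


def avg_kib_per_rpc(total_mb: float, rpcs: int) -> float:
    """Average amount of data moved per RPC, in KiB."""
    return total_mb * 1024 / rpcs


avg_1_8 = avg_kib_per_rpc(TOTAL_MB_1_8, RPCS_1_8)
avg_2_6 = avg_kib_per_rpc(TOTAL_MB_2_6, RPCS_2_6)
rpc_ratio = RPCS_2_6 / RPCS_1_8

print(f"1.8 client : {avg_1_8:.0f} KiB per RPC")
print(f"2.6 client : {avg_2_6:.0f} KiB per RPC")
print(f"RPC ratio  : {rpc_ratio:.1f}x")
```

      On these estimates the 1.8 client averages several hundred KiB per request while the 2.6 client averages a few tens of KiB, which matches the shift towards small I/Os seen in the size distribution.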

      Thoughts? I know random IO is generally not a good use of Lustre, but some codes do it, and this change in performance from 1.8 to 2.6 is kind of staggering.

            People

              wc-triage WC Triage
              paf Patrick Farrell (Inactive)