  1. Lustre
  2. LU-5561

Lustre random reads: 80% performance loss from 1.8 to 2.6


    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: Lustre 2.6.0
    • Fix Version/s: None


      In a random read benchmark with a moderate amount of data, reading from disk (i.e., cache cold), we see an 80-90% performance loss going from 1.8 to 2.6. We have not tested 2.4/2.5. (We tried both Cray's 1.8.6 client on SLES and Intel's 1.8.9 client on CentOS, with similar results.)

      This is the IOR command line to read the file - the file is single striped:
      IOR -E -F -e -g -b 2964m -t 19k -k -C -Q 17 -r -z -v

      The IOR command is a single task reading 2.89 GB in 19KB chunks, entirely randomly.
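As a quick sanity check on those figures, the block and transfer sizes from the command line can be turned into a read count. This is only a sketch: it assumes IOR's `m` and `k` suffixes mean MiB and KiB, and the variable names are illustrative.

```python
# Values taken from the IOR flags above: -b 2964m (block size), -t 19k (transfer size).
# Assumption: the "m" and "k" suffixes are binary units (MiB, KiB).
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

block_size = 2964 * MIB
transfer_size = 19 * KIB

# Number of individual read() calls needed to cover the block once.
transfers, remainder = divmod(block_size, transfer_size)
total_gib = block_size / GIB

print(f"reads per pass: {transfers} (leftover bytes: {remainder})")
print(f"total data:     {total_gib:.2f} GiB")
```

Under that assumption the block divides evenly into 159,744 reads of 19 KiB each, and the total matches the 2.89 GB quoted above.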

      After writing the file out (same command with -w) and then dropping server and client caches (echo 3 > /proc/sys/vm/drop_caches), I saw the following read rates on 1.8.9 (numbers are from a virtual cluster; results on real hardware were similar in terms of % change):

      33.8 MB/s
      22.5 MB/s
      20.3 MB/s
      22.9 MB/s

      And these read rates on 2.6:
      3.48 MB/s
      3.57 MB/s
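For reference, the loss implied by the rates listed above can be computed directly. The sketch below simply averages each set of runs; with so few samples (four and two) the result is indicative only.

```python
from statistics import mean

# Read rates in MB/s, copied from the runs listed above.
rates_1_8_9 = [33.8, 22.5, 20.3, 22.9]
rates_2_6 = [3.48, 3.57]

mean_old = mean(rates_1_8_9)
mean_new = mean(rates_2_6)

# Fraction of throughput lost going from the 1.8.9 client to the 2.6 client.
loss = 1 - mean_new / mean_old

print(f"1.8.9 mean: {mean_old:.2f} MB/s")
print(f"2.6 mean:   {mean_new:.2f} MB/s")
print(f"loss:       {loss:.1%}")
```

The averages come out near 24.9 MB/s and 3.5 MB/s, a loss of roughly 86%, which falls inside the 80-90% range reported.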

      Server is running 2.6; we saw very similar numbers with a 2.1 server, so it doesn't seem to be related to the server version.

      I'll be attaching brw_stats for the OST where the file was located, for one run each of 1.8 and 2.6. The main thing I noted was that the 1.8 client did many more large reads, and far fewer reads overall.

      The 1.8 client read a total of (approximately - estimated from brw_stats) ~4040 MB, but did it in ~11k total RPCs/disk I/O ops.

      The 2.6 client read a total (again, estimated from brw_stats) of ~4060 MB. It used roughly 140k total RPCs/disk I/O ops. That is more than ten times as many I/O requests as the 1.8 client issued. The distribution of I/O sizes, unsurprisingly, skews much more towards small I/Os.
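To make the difference concrete, the estimated totals can be converted into an average size per RPC. This is a rough sketch built only on the approximate brw_stats figures quoted above; it treats 1 MB as 1024 KB, and the dictionary layout is illustrative.

```python
# Approximate figures estimated from brw_stats, as quoted above:
# client version -> (total MB read, total RPCs / disk I/O ops)
runs = {
    "1.8": (4040, 11_000),
    "2.6": (4060, 140_000),
}

# Average payload per RPC in KB (assumption: 1 MB = 1024 KB).
kb_per_rpc = {
    client: total_mb * 1024 / rpcs
    for client, (total_mb, rpcs) in runs.items()
}

rpc_ratio = runs["2.6"][1] / runs["1.8"][1]

for client, size in kb_per_rpc.items():
    print(f"{client}: ~{size:.0f} KB per RPC")
print(f"2.6 issues ~{rpc_ratio:.1f}x as many RPCs as 1.8")
```

On these estimates the 1.8 client averages roughly 376 KB per RPC against about 30 KB for 2.6, with about 12.7 times as many requests. An average that close to the 19 KB transfer size would be consistent with the 2.6 client issuing roughly one small RPC per application read, though the averages alone do not establish that.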

      Thoughts? I know random IO is generally not a good use of Lustre, but some codes do it, and this change in performance from 1.8 to 2.6 is kind of staggering.


              • Assignee:
                wc-triage WC Triage
                paf Patrick Farrell (Inactive)