Lustre / LU-1056

Single-client, single-thread and single-file is limited at 1.5GB/s

Details

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
    • 8742

    Description

      A few savvy Lustre veterans from various organizations, including NRL, reported that they all observed a 1.5GB/s cap on a single client running a single thread against a single file.

      "... curious about what might be limiting a Lustre client's (on QDR IB) single-file to single process performance to only be 1.4 to 1.5 GB/s even when a full QDR IB fabric is in play!"

      Since the Lustre architecture imposes no such limit, it is worthwhile to investigate the root cause of the 1.5GB/s cap. Understanding and improving high-rate single-threaded sequential IO would help Lustre compete with QFS, CXFS, and StorNext.

      Initial Analysis:

      The potential limitations are:

      • Single-thread IO does not push/pull enough throughput to/from Lustre
      • The Lustre client does not handle single-thread IO efficiently enough

      We can use a simple experiment to verify whether single-thread IO can push/pull enough throughput to/from Lustre. Since all IO on Lustre first goes through the VFS layer, the same as on any other file system, the single-thread IO limit without Lustre involved can be roughly estimated by writing to and reading from a RAM FS.

      [root@client-31 ~]# mkdir /mnt/ramdisk
      [root@client-31 ~]# mount -t ramfs none -o rw,size=10240M,mode=755 /mnt/ramdisk

      Read
      [root@client-31 ~]# dd of=/dev/zero if=/mnt/ramdisk/bigfile bs=1M count=10240
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 2.47548 s, 4.3 GB/s

      Write
      [root@client-31 ~]# dd if=/dev/zero of=/mnt/ramdisk/bigfile bs=1M count=10240
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 5.45022 s, 2.0 GB/s

      This experiment shows that single-thread IO can read and write data well beyond 1.5GB/s.

      After some discussions with Nasf: for asynchronous IO, the ack is sent back to the client process before the RPCs reach the OSTs. So the limitation more likely hides in the code path that copies striped data into the OSC caches.
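      The early-ack behavior for buffered writes can be seen on any local filesystem, not just Lustre: a plain dd returns as soon as the pages are in cache, while conv=fsync makes dd wait for the flush. A minimal sketch (the path and sizes below are arbitrary, chosen only for illustration):

```shell
# Buffered write: dd's reported time mostly reflects the memory copy into
# the page cache, because the ack comes back before data reaches storage.
F=/tmp/lu1056_demo
dd if=/dev/zero of=$F bs=1M count=64

# Same write with conv=fsync: dd now waits for the data to be flushed,
# so the reported throughput includes the storage path as well.
dd if=/dev/zero of=$F bs=1M count=64 conv=fsync

SIZE=$(stat -c %s "$F")   # 64 MiB = 67108864 bytes
rm -f "$F"
```

      Comparing the two reported rates shows how much of the "throughput" seen by a buffered writer is really just the copy into cache.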

      Please note that:

      1. The 1.5GB/s limit applies to both read and write.

      "Even if data is in the OSS memory (but not the client) I only see more consistent throughput but not higher. So it seems like a client limit from the implementation (somewhere in the code path). If data is in the client's cache then we can see 3-5 GB/s but that's just reading pages from memory and ll_readpage is never called. Because ll_file_read->ll_file_aio_read->generic_file_aio_read->do_generic_file_read never calls the readpage function for the given address_space if the call to find_get_page found it in cache (the radix tree)."

      2. All of our higher IO rates make use of read-ahead or write-behind, so the IO is all asynchronous.
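      Since the asynchronous path is governed by the client's read-ahead and dirty-cache tunables, it is worth recording their values alongside any benchmark. A hedged inspection fragment (requires a Lustre client mount; these are the standard llite/osc parameter names):

```shell
# Client read-ahead settings (read path):
lctl get_param llite.*.max_read_ahead_mb
lctl get_param llite.*.max_read_ahead_per_file_mb

# Write-behind (dirty cache) and RPC concurrency settings (write path):
lctl get_param osc.*.max_dirty_mb
lctl get_param osc.*.max_rpcs_in_flight
```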

      Please also refer to http://groups.google.com/group/lustre-discuss-list/msg/30ed1fde6ab6e62d


          Activity


            This is being worked on in LU-8964.

            adilger Andreas Dilger added a comment

            Hi Jeremy,

            How many stripes does that single file have? I guess the bottleneck would be on data copying so CPU usage info will be helpful. You can also try direct IO to see if there is any improvement.

            jay Jinshan Xiong (Inactive) added a comment
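            The checks suggested above can be scripted. A hedged sketch, assuming a Lustre client mount at /mnt/lustre and a test file bigfile (both placeholder names):

```shell
# How many OST stripes does the file have?
lfs getstripe /mnt/lustre/bigfile

# Direct IO read: bypasses the client page cache, so the kernel-to-user
# data copy is avoided and any copy-bound limit should shift.
dd if=/mnt/lustre/bigfile of=/dev/null bs=1M iflag=direct

# Per-thread CPU usage during a buffered read, to spot a saturated thread:
dd if=/mnt/lustre/bigfile of=/dev/null bs=1M &
top -b -H -d 1 -n 5 | grep -E 'ptlrpcd|dd'
wait
```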

            I don't have any issues with that being in 2.x, though we have never made it there to start any testing.

            The ptlrpcd thread was not a bottleneck, if I remember correctly. While I have certainly seen it be a bottleneck with many threads, in the single-threaded IO case it didn't seem to be the problem.

            All the IO I tested was buffered; I seldom use O_DIRECT. Checksums were disabled (compiled with --disable-checksum), and debugging should have been insignificant since it was at the default settings. I don't know if there would have been any issues with the DLM locking, since it's the part of Lustre I'm probably least familiar with.

            Possibly as soon as a couple of weeks from now I might be able to provide some numbers for Lustre 2.2 servers and clients.

            jfilizetti Jeremy Filizetti added a comment

            Zhiqi,
            there is no development of features or enhancements against Lustre 1.8 today, so any development (including performance enhancements like this) would need to be done against Lustre 2.2+ clients. Also, since the Lustre 2.x client IO code is significantly different from the 1.8 client IO code, any optimizations done against 1.8 would not necessarily even be portable or useful for 2.x.

            One thing that is of interest from 1.8.x is to measure CPU usage of the ptlrpcd thread, to see if this is peaking at 100% and causing the throughput limitation that is being seen. This has been observed previously when data checksums were enabled, but may also be the case at 1.5GB/s when checksums are disabled.

            The other factor is whether this IO is buffered or O_DIRECT. If it is buffered IO, then the overhead of copying data from userspace to the kernel is significant, and ANY extra CPU usage (debugging, Lustre checksums, DLM locking, etc) will reduce the IO performance. This would show up in "oprofile" data as copy_from_user() or similar.

            Was the Lustre debugging disabled on the client for these tests?

            Is it possible to get a benchmark run with Lustre 2.2 clients and Lustre 2.2 or 2.1 servers? The 2.2 client has the multi-threaded ptlrpcd support, which should significantly improve throughput if ptlrpcd is the limiting factor. However, the 2.x CLIO code hasn't been tuned as much as the 1.8 code was, so there may be other challenges with getting more than 1.5GB/s with 2.x, but we need a starting point of reference.

            adilger Andreas Dilger added a comment
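            The measurements asked for above can be gathered roughly as follows. A hedged sketch, assuming a Lustre client mount at /mnt/lustre (placeholder) and the classic opcontrol-based oprofile interface; the vmlinux path varies by distribution:

```shell
# Watch ptlrpcd per-thread CPU while a single-stream buffered write runs;
# a thread pinned near 100% would explain a client-side throughput cap.
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=16384 &
top -b -H -d 1 -n 5 | grep ptlrpcd
wait

# Profile the same write to see whether copy_from_user() (the
# userspace-to-kernel data copy) dominates the samples.
opcontrol --start --vmlinux=/boot/vmlinux-$(uname -r)
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=16384
opcontrol --stop
opreport --symbols | head -20
```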

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: zhiqi Zhiqi Tao (Inactive)
              Votes: 0
              Watchers: 14
