[LU-1056] Single-client, single-thread and single-file is limited at 1.5GB/s Created: 30/Jan/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Zhiqi Tao (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: NRL, llnl
Environment:

Tested on many environments between 4-8 OSS, striped across 1-30 OSTs (typically 1 MB stripes), 5-30 OSTs per OSS, Lustre 1.8.2, 1.8.4, 1.8.5, 1.8.6-wc, and RAM clients varying from 4 - 144 GB (same with the OSS).

Lustre was compiled with --disable-checksum. The complete Lustre tuneables are:

options ko2iblnd peer_credits=128 peer_credits_hiw=0 map_on_demand=31 credits=256 concurrent_sends=256 ntx=512 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1


Issue Links:
Duplicate
duplicates LU-8964 use parallel I/O to improve performan... Resolved
Related
is related to LU-6658 single stream write performance impro... Resolved
Rank (Obsolete): 8742

 Description   

A few savvy Lustre veterans from various organizations, including NRL, reported that they all observed a 1.5 GB/s cap on a single client with a single thread operating on a single file.

"... curious about what might be limiting a Lustre client's (on QDR IB) single-file to single process performance to only be 1.4 to 1.5 GB/s even when a full QDR IB fabric is in play!"

Since the Lustre architecture does not impose such a limit, it is worthwhile to investigate the root cause of the 1.5 GB/s cap. Understanding and improving high-rate single-threaded sequential IO would help Lustre compete with QFS, CXFS, and StorNext.

Initial Analysis:

The potential limitation could be one of the following:

  • A single IO thread does not push/pull enough throughput to/from Lustre
  • The Lustre client does not handle single-threaded IO efficiently enough

We can use a simple experiment to verify whether a single IO thread can push/pull enough throughput to/from Lustre. Since IO on Lustre first passes through the VFS layer, just as it does on any other file system, the single-thread IO limit without Lustre involved can be roughly estimated by writing to and reading from a RAM file system.

[root@client-31 ~]# mkdir /mnt/ramdisk
[root@client-31 ~]# mount -t ramfs none -o rw,size=10240M,mode=755 /mnt/ramdisk

Read
[root@client-31 ~]# dd of=/dev/zero if=/mnt/ramdisk/bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 2.47548 s, 4.3 GB/s

Write
[root@client-31 ~]# dd if=/dev/zero of=/mnt/ramdisk/bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.45022 s, 2.0 GB/s

This experiment shows that a single IO thread can write and read data at rates beyond 1.5 GB/s.

Had some discussions with Nasf: for asynchronous IOs, the ack is sent back to the client process before the RPCs reach the OSTs, so the limitation more likely hides in the code path that copies striped data into the OSC caches.
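A quick way to confirm that writes are issued asynchronously and aggregated into full-size RPCs is to look at the per-OSC RPC statistics around the transfer. This is only a sketch; it assumes lctl get_param/set_param are available (on 1.8 the same data lives under /proc/fs/lustre/osc/*/rpc_stats) and the mount point /mnt/lustre is illustrative:

lctl set_param osc.*.rpc_stats=0     # writing to rpc_stats clears the counters
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=10240
lctl get_param osc.*.rpc_stats       # histograms of pages per RPC and RPCs in flight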

Please note that:

1. The 1.5 GB/s limit applies to both reads and writes:

"Even if data is in the OSS memory (but not the client) I only see more consistent throughput but not higher. So it seems like a client limit from the implementation (somewhere in the code path). If data is in the client's cache then we can see 3-5 GB/s but that's just reading pages from memory and ll_readpage is never called. Because ll_file_read->ll_file_aio_read->generic_file_aio_read->do_generic_file_read never calls the readpage function for the given address_space if the call to find_get_page found it in cache (the radix tree)."

2. All of our higher IO rates make use of read-ahead or write-behind, so the IO is all asynchronous (see the tunables sketch below).

Please also refer to http://groups.google.com/group/lustre-discuss-list/msg/30ed1fde6ab6e62d
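For reference, these are the client-side tunables that govern the read-ahead and write-behind (dirty cache) behaviour mentioned in note 2. Parameter names are from the llite/osc proc interface; availability and defaults may differ between 1.8 and 2.x:

lctl get_param llite.*.max_read_ahead_mb            # total client read-ahead budget
lctl get_param llite.*.max_read_ahead_per_file_mb   # per-file read-ahead budget
lctl get_param osc.*.max_dirty_mb                   # per-OSC write-behind (dirty) cache
lctl get_param osc.*.max_rpcs_in_flight             # concurrent RPCs per OSC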



 Comments   
Comment by Andreas Dilger [ 31/Jan/12 ]

Zhiqi,
there is no development of features or enhancements against Lustre 1.8 today, so any development (including performance enhancements like this) would need to be done against Lustre 2.2+ clients. Also, since the Lustre 2.x client IO code is significantly different than the 1.8 client IO code, any optimizations done against 1.8 would not necessarily even be portable or useful for 2.x.

One thing that is of interest from 1.8.x is to measure CPU usage of the ptlrpcd thread, to see if this is peaking at 100% and causing the throughput limitation that is being seen. This has been observed previously when data checksums were enabled, but may also be the case at 1.5GB/s when checksums are disabled.
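For example, a rough way to check whether ptlrpcd is pegged during the transfer (the thread-name filter and tools are assumptions; any per-thread CPU view will do):

top -b -H -n 1 | grep ptlrpcd        # one-shot per-thread CPU snapshot
pidstat -t 1 10 | grep ptlrpcd       # per-thread CPU sampled over 10 seconds (sysstat package)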

The other factor is whether this IO is buffered or O_DIRECT? If it is buffered IO, then the overhead of copying data from userspace to the kernel is significant, and ANY extra CPU usage (debugging, Lustre checksums, DLM locking, etc) will reduce the IO performance. This would show up in "oprofile" data as copy_from_user() or similar.
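A minimal profiling sketch, assuming perf is available on the client (oprofile works as well if a vmlinux with symbols is supplied); the goal is just to see how much CPU time lands in copy_from_user() and friends:

perf record -a -g -- dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=10240
perf report | head -40               # look for copy_from_user()/copy_user_* near the top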

Was the Lustre debugging disabled on the client for these tests?
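For completeness, a sketch of how the debug mask can be checked and temporarily cleared on the client (assuming the mask is exposed via lctl; it also lives in /proc/sys/lnet/debug):

lctl get_param debug                 # note the current debug mask
lctl set_param debug=0               # disable Lustre debug logging for the benchmark
# run the benchmark, then restore the saved mask with: lctl set_param debug="<saved mask>"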

Is it possible to get a benchmark run with Lustre 2.2 clients and Lustre 2.2 or 2.1 servers? The 2.2 client has the multi-threaded ptlrpcd support, which should significantly improve throughput if ptlrpcd is the limiting factor. However, the 2.x CLIO code hasn't been tuned as much as the 1.8 code was, so there may be other challenges with getting more than 1.5GB/s with 2.x, but we need a starting point of reference.

Comment by Jeremy Filizetti [ 31/Jan/12 ]

I don't have any issues with that being in 2.x though we have never made it there to start any testing.

The ptlrpcd thread was not a bottleneck, if I remember correctly. While I have certainly seen it be a bottleneck with many threads, in the single-threaded IO case it didn't seem to be the problem.

All the IO I tested was buffered; I seldom use O_DIRECT. Checksums were disabled (compiled with --disable-checksum), and debugging should have been insignificant since it was left at the default settings. I don't know if there would have been any issues with the DLM locking, since it's the part of Lustre I'm probably least familiar with.

Possibly as soon as a couple of weeks from now I might be able to provide some numbers for Lustre 2.2 servers and clients.

Comment by Jinshan Xiong (Inactive) [ 01/Feb/12 ]

Hi Jeremy,

How many stripes does that single file have? I guess the bottleneck would be data copying, so CPU usage info will be helpful. You can also try direct IO to see if there is any improvement.
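For example (the file name is illustrative; lfs getstripe and dd's oflag=direct are standard tools):

lfs getstripe /mnt/lustre/bigfile                                       # show stripe count, stripe size and OST layout
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=10240 oflag=direct   # same transfer with O_DIRECT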

Comment by Andreas Dilger [ 29/May/17 ]

This is being worked on in LU-8964.
