Lustre / LU-1056

Single-client, single-thread and single-file is limited at 1.5GB/s

Details

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
    • 8742

    Description

      A few savvy Lustre veterans from various organizations, including NRL, reported that they all observed a 1.5GB/s cap on a single client running a single thread against a single file.

      "... curious about what might be limiting a Lustre client's (on QDR IB) single-file to single process performance to only be 1.4 to 1.5 GB/s even when a full QDR IB fabric is in play!"

      Since the Lustre architecture imposes no such limit, it is worthwhile to investigate the root cause of the 1.5GB/s cap. Understanding and improving high-rate single-threaded sequential IO would help Lustre compete with QFS, CXFS, and StorNext.

      Initial Analysis:

      The potential limitations are:

      • Single-thread IO does not push/pull enough throughput to/from Lustre
      • The Lustre client does not handle single-thread IO efficiently enough

      We can use a simple experiment to verify whether single-thread IO can push/pull enough throughput to/from Lustre. Since all IO on Lustre first goes through the VFS layer, the same as on any other file system, the single-thread IO limit without Lustre involved can be roughly estimated by writing to and reading from a RAM FS.

      [root@client-31 ~]# mkdir /mnt/ramdisk
      [root@client-31 ~]# mount -t ramfs none -o rw,size=10240M,mode=755 /mnt/ramdisk

      Read
      [root@client-31 ~]# dd of=/dev/zero if=/mnt/ramdisk/bigfile bs=1M count=10240
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 2.47548 s, 4.3 GB/s

      Write
      [root@client-31 ~]# dd if=/dev/zero of=/mnt/ramdisk/bigfile bs=1M count=10240
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 5.45022 s, 2.0 GB/s

      This experiment shows that single-thread IO can read and write data well beyond 1.5GB/s.

      After some discussions with Nasf: for asynchronous IO, the ack is sent back to the client process before the RPCs reach the OSTs. So the limitation more likely hides in the code path that copies striped data into the OSC caches.
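      The early-ack behavior for buffered writes can be seen on any local filesystem, not just Lustre: a plain dd returns as soon as the pages are in cache, while conv=fsync makes dd wait for the flush. A minimal sketch (the path and sizes below are arbitrary, chosen only for illustration):

```shell
# Buffered write: dd's reported time mostly reflects the memory copy into
# the page cache, because the ack comes back before data reaches storage.
F=/tmp/lu1056_demo
dd if=/dev/zero of=$F bs=1M count=64

# Same write with conv=fsync: dd now waits for the data to be flushed,
# so the reported throughput includes the storage path as well.
dd if=/dev/zero of=$F bs=1M count=64 conv=fsync

SIZE=$(stat -c %s "$F")   # 64 MiB = 67108864 bytes
rm -f "$F"
```

      Comparing the two reported rates shows how much of the "throughput" seen by a buffered writer is really just the copy into cache.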

      Please note that:

      1. The 1.5GB/s limit applies to both read and write.

      "Even if data is in the OSS memory (but not the client) I only see more consistent throughput but not higher. So it seems like a client limit from the implementation (somewhere in the code path). If data is in the client's cache then we can see 3-5 GB/s but that's just reading pages from memory and ll_readpage is never called. Because ll_file_read->ll_file_aio_read->generic_file_aio_read->do_generic_file_read never calls the readpage function for the given address_space if the call to find_get_page found it in cache (the radix tree)."

      2. All of our higher IO rates make use of read-ahead or write-behind, so the IO is all asynchronous.
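      Since the asynchronous path is governed by the client's read-ahead and dirty-cache tunables, it is worth recording their values alongside any benchmark. A hedged inspection fragment (requires a Lustre client mount; these are the standard llite/osc parameter names):

```shell
# Client read-ahead settings (read path):
lctl get_param llite.*.max_read_ahead_mb
lctl get_param llite.*.max_read_ahead_per_file_mb

# Write-behind (dirty cache) and RPC concurrency settings (write path):
lctl get_param osc.*.max_dirty_mb
lctl get_param osc.*.max_rpcs_in_flight
```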

      Please also refer to http://groups.google.com/group/lustre-discuss-list/msg/30ed1fde6ab6e62d


          Activity


            This is being worked on in LU-8964.

            adilger Andreas Dilger added a comment

            Hi Jeremy,

            How many stripes does that single file have? I guess the bottleneck would be on data copying so CPU usage info will be helpful. You can also try direct IO to see if there is any improvement.

            jay Jinshan Xiong (Inactive) added a comment
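            The checks suggested above can be scripted. A hedged sketch, assuming a Lustre client mount at /mnt/lustre and a test file bigfile (both placeholder names):

```shell
# How many OST stripes does the file have?
lfs getstripe /mnt/lustre/bigfile

# Direct IO read: bypasses the client page cache, so the kernel-to-user
# data copy is avoided and any copy-bound limit should shift.
dd if=/mnt/lustre/bigfile of=/dev/null bs=1M iflag=direct

# Per-thread CPU usage during a buffered read, to spot a saturated thread:
dd if=/mnt/lustre/bigfile of=/dev/null bs=1M &
top -b -H -d 1 -n 5 | grep -E 'ptlrpcd|dd'
wait
```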

            I don't have any issues with that being in 2.x, though we have never made it there to start any testing.

            The ptlrpcd thread was not a bottleneck, if I remember correctly. While I have certainly seen it be a bottleneck with many threads, in the single-threaded IO case it didn't seem to be the problem.

            All the IO I tested was buffered; I seldom use O_DIRECT. Checksums were disabled (compiled with --disable-checksum), and debugging should have been insignificant since it was at the default settings. I don't know if there would have been any issues with the DLM locking, since it's the part of Lustre I'm probably least familiar with.

            Possibly as soon as a couple of weeks from now I might be able to provide some numbers for Lustre 2.2 servers and clients.

            jfilizetti Jeremy Filizetti added a comment

            Zhiqi,
            there is no development of features or enhancements against Lustre 1.8 today, so any development (including performance enhancements like this) would need to be done against Lustre 2.2+ clients. Also, since the Lustre 2.x client IO code is significantly different from the 1.8 client IO code, any optimizations done against 1.8 would not necessarily even be portable or useful for 2.x.

            One thing that is of interest from 1.8.x is to measure CPU usage of the ptlrpcd thread, to see if this is peaking at 100% and causing the throughput limitation that is being seen. This has been observed previously when data checksums were enabled, but may also be the case at 1.5GB/s when checksums are disabled.

            The other factor is whether this IO is buffered or O_DIRECT. If it is buffered IO, then the overhead of copying data from userspace to the kernel is significant, and ANY extra CPU usage (debugging, Lustre checksums, DLM locking, etc) will reduce the IO performance. This would show up in "oprofile" data as copy_from_user() or similar.

            Was the Lustre debugging disabled on the client for these tests?

            Is it possible to get a benchmark run with Lustre 2.2 clients and Lustre 2.2 or 2.1 servers? The 2.2 client has the multi-threaded ptlrpcd support, which should significantly improve throughput if ptlrpcd is the limiting factor. However, the 2.x CLIO code hasn't been tuned as much as the 1.8 code was, so there may be other challenges with getting more than 1.5GB/s with 2.x, but we need a starting point of reference.

            adilger Andreas Dilger added a comment
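            The measurements asked for above can be gathered roughly as follows. A hedged sketch, assuming a Lustre client mount at /mnt/lustre (placeholder) and the classic opcontrol-based oprofile interface; the vmlinux path varies by distribution:

```shell
# Watch ptlrpcd per-thread CPU while a single-stream buffered write runs;
# a thread pinned near 100% would explain a client-side throughput cap.
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=16384 &
top -b -H -d 1 -n 5 | grep ptlrpcd
wait

# Profile the same write to see whether copy_from_user() (the
# userspace-to-kernel data copy) dominates the samples.
opcontrol --start --vmlinux=/boot/vmlinux-$(uname -r)
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=16384
opcontrol --stop
opreport --symbols | head -20
```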

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: zhiqi Zhiqi Tao (Inactive)
              Votes: 0
              Watchers: 14
