[LU-8709] parallel asynchronous readahead Created: 14/Oct/16  Updated: 10/Feb/21  Resolved: 10/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Li Xi
Resolution: Duplicate Votes: 0
Labels: perf_optimization

Attachments: PNG File Lustre-SingleThread.png    
Issue Links:
Blocker
is blocked by LU-8726 Do fake read page on OST to help read... Resolved
Duplicate
is duplicated by LU-6 Update the readahead logic in lustre 2.0 Resolved
Related
is related to LU-8413 sanity test_101f fails with 'misses t... Resolved
is related to LU-8964 use parallel I/O to improve performan... Resolved
is related to LU-11416 Improve readahead for random read of ... Open
is related to LU-12043 improve Lustre single thread read per... Resolved

 Description   

Tracker for parallel async readahead improvement from DDN, as described in http://www.eofs.eu/_media/events/lad16/19_parallel_readahead_framework_li_xi.pdf

Note that it would be very desirable in the single-thread case if the copy_to_user() were also handled in parallel, since this is a major CPU overhead on many-core systems; if it can be parallelized it may increase the peak read performance.

As for lockahead integration with readahead, I agree that this is possible, but it is only useful if the client doesn't get full-file extent locks. It would also be interesting if the write code detected sequential or strided writes and did lockahead at write time.
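
For anyone reproducing these experiments, the client-side readahead limits and hit/miss statistics are exposed through lctl; the parameter names below are from recent llite versions and may differ slightly between releases:

# lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_per_file_mb
# lctl set_param llite.*.read_ahead_stats=0
# lctl get_param llite.*.read_ahead_stats

The first command shows the global and per-file readahead windows, the second clears the counters before a run, and the third reports readahead hits and misses afterwards.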



 Comments   
Comment by Patrick Farrell (Inactive) [ 01/Nov/16 ]

About lockahead/readahead integration. There's not much use for lockahead in the "read" case, since locks can overlap. At least, I've never come up with a realistic scenario where it's relevant/helpful.

Assuming it's not totally invalid to tie this stuff together, we could possibly do something positive by recognizing strided writing. There are some complexities there. For example, we'd need to make a blocking lockahead lock request for at least the first one, to cancel the full-file locks the clients are exchanging (normal lockahead locks must be non-blocking, so they can be requested many at a time). There's also some danger around the different clients not being coordinated when they go into lockahead mode.

I can think of how to probably make it work, but I'm not totally convinced it's worth the effort.

Comment by Gerrit Updater [ 03/Nov/16 ]

Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/23552
Subject: LU-8709 llite: implement parallel asynchronous readahead
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5629bfc1afbcf8913d738118f9085d3e5901f831

Comment by Li Xi (Inactive) [ 03/Nov/16 ]

The pushed patch is definitely not the final version, and I am going to test and optimize it further. I really need your review and feedback during this process. Thank you in advance!

Comment by Dmitry Eremin (Inactive) [ 09/Dec/16 ]

Can you explain in more detail what benchmarks you used for testing? It's not clear to me why we need a new readahead algorithm instead of using the standard one from the Linux kernel. I just made it asynchronous and parallelized it for Lustre's needs. This allowed me to significantly increase read performance without implementing a new algorithm.


Comment by Andreas Dilger [ 29/Mar/17 ]

Closing as a duplicate of LU-8964.

Comment by Li Xi (Inactive) [ 17/Apr/18 ]

I am reopening this ticket because the patch in LU-8964 doesn't really improve performance as much as we can get from this patch.

Comment by Shuichi Ihara (Inactive) [ 25/Apr/18 ]

Here are performance results comparing the patches for LU-8709 and LU-8964.
Unfortunately, I don't see any performance improvement from the LU-8964 patch; it's even worse than without the patch.
In fact, LU-8709 (patch http://review.whamcloud.com/23552) takes a different approach and gives a 2.3x performance improvement over the unpatched baseline.

# pdsh -w oss[01-06],mds[11-12],dcr-vm[1-4],c[01-32] 'echo 3 > /proc/sys/vm/drop_caches '
# lfs setstripe -S 16m -c -1 /scratch1/out
# mpirun -np 1 /work/tools/bin/IOR -w -k -t 1m -b 256g -e -F -vv -o /scratch1/out/file

# pdsh -w oss[01-06],mds[11-12],dcr-vm[1-4],c[01-32] 'echo 3 > /proc/sys/vm/drop_caches'
# mpirun -np 1 /work/tools/bin/IOR -r -k -E -t 1m -b 256g -e -F -vv -o /scratch1/out/file
Branch                                      Single Thread Read Performance (MB/sec)
b2_10                                       1,591
b2_10+LU-8709                               3,774
master                                      1,793
master+LU-8964                              1,000
master+LU-8964 (pio=1)                      1,594
master+LU-8964 (cpu_npartitions=10,pio=0)   1,000
master+LU-8964 (cpu_npartitions=10,pio=1)   1,820
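
For reference, the pio values in the rows above correspond to the tunable added by the LU-8964 patch, assuming it is exposed as an llite parameter as in that patch's recent patch sets, while cpu_npartitions is the standard libcfs module option set at module load time:

# lctl set_param llite.*.pio=1

Toggling pio=0/pio=1 between runs switches the parallel I/O path off and on for the same mounted client.
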
Comment by Shuichi Ihara (Inactive) [ 25/Apr/18 ]

We are investigating further what's going on with b2_10+LU-8709. Attached is a capture of client performance every 1 second while IOR is running.
When the read starts on the client, performance is quite good (~5.5GB/sec), but once memory usage gets close to max_cached_mb, performance becomes very unstable (e.g. sometimes we see more than 5GB/sec, but sometimes less than 3GB/sec).
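
The cache limit involved here is the llite max_cached_mb parameter; to check whether the instability correlates with the cache filling up, one can read the limit and, as an experiment, raise it well above the file size (the value below is only an example):

# lctl get_param llite.*.max_cached_mb
# lctl set_param llite.*.max_cached_mb=65536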

Comment by Dmitry Eremin (Inactive) [ 25/Apr/18 ]

The LU-8964 patch was not running at full speed because you used a small 1MB transfer buffer (-t 1m). As I mentioned before, in the common case the transfer buffer is split by stripe size and transferred in parallel.

Which version of the LU-8964 patch was used? In Patch Set 56 I significantly improved the algorithm; it should read ahead more aggressively.

Comment by Shuichi Ihara (Inactive) [ 25/Apr/18 ]

The LU-8964 patch was not running at full speed because you used a small 1MB transfer buffer (-t 1m). As I mentioned before, in the common case the transfer buffer is split by stripe size and transferred in parallel.

In fact, 1m is not small. And after the patch, it should keep the same performance for all operations as before the patch, unless there is some trade-off.

Which version of the LU-8964 patch was used? In Patch Set 56 I significantly improved the algorithm; it should read ahead more aggressively.

Sure, I can try your latest patch, but please tell us exactly what configuration you prefer (e.g. pio, number of partitions, RA size, etc.).

Comment by Patrick Farrell (Inactive) [ 25/Apr/18 ]

Yeah, 1M is probably more common than any size larger than it. We should be very concerned with improving 1 MiB I/O. (It's one thing I would like to change about the PIO code - that it is all implemented on stripe boundaries. 8 MiB stripes are very common these days, and we could benefit from parallelizing I/O smaller than 8 MiB.)
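
For comparison with the 16 MiB stripes used in the runs above, an 8 MiB wide-striped layout like the one mentioned here would be created on a hypothetical test directory as follows:

# lfs setstripe -S 8m -c -1 /scratch1/out8m
# lfs getstripe /scratch1/out8m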

Comment by Dmitry Eremin (Inactive) [ 26/Apr/18 ]

Hmm. This is a good point; we should pay more attention to the 1MB buffer size. But splitting it into small chunks doesn't provide a benefit because of the additional overhead of parallelization. Unfortunately, we have many restrictions in the current Lustre I/O pipeline that prevent better parallelization and asynchrony.

I discovered that I got the best performance by splitting the CPUs into several partitions so that each partition contains 2 hardware cores (with their hyper-threads). For example, if you have a machine with 8 hardware cores and 2 hyper-threads per core (16 logical CPUs), the best configuration is to split it into 4 partitions. If you have several NUMA nodes, it is good to split each node in the same way.

I think an 8MB transfer buffer is large enough to see the benefits from the PIO code.
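
On the 8-core/16-thread example above, a minimal sketch of that configuration would be the libcfs option below (the modprobe file name is arbitrary, and the option only takes effect after the Lustre client modules are reloaded), combined with an 8MB IOR transfer size adapted from the command line used in the earlier runs:

# echo "options libcfs cpu_npartitions=4" > /etc/modprobe.d/libcfs.conf
# mpirun -np 1 /work/tools/bin/IOR -r -k -E -t 8m -b 256g -e -F -vv -o /scratch1/out/file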

Comment by Lukasz Flis [ 15/May/19 ]

We would like to give this patch a try in a mixed-workload environment on our test filesystem. Is it possible to get this patch for current b2_10 (2.10.7) ?

Comment by Andreas Dilger [ 15/May/19 ]

Lukasz, the patch on this ticket is not currently being developed. The patch on LU-12043 is the one that will be landing in 2.13.

Comment by Lukasz Flis [ 15/May/19 ]

Andreas, thank you for the update; LU-12043 looks very promising.
I had a quick look at the changes and they are all on the client side - can we assume that a 2.13 lustre-client may have better performance on 2.10 servers?

Comment by Andreas Dilger [ 15/May/19 ]

Correct, with the caveat that the LU-12043 patch is mainly improving single-threaded read performance. If there are many threads on the client the performance may not be very different, so it depends on your workload.
