[LU-8709] parallel asynchronous readahead Created: 14/Oct/16 Updated: 10/Feb/21 Resolved: 10/Feb/21
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Li Xi |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | perf_optimization |
| Description |
|
Tracker for the parallel async readahead improvement from DDN, as described in http://www.eofs.eu/_media/events/lad16/19_parallel_readahead_framework_li_xi.pdf

Note that in the single-thread case it would be very desirable if the copy_to_user() were also handled in parallel, as this is a major CPU overhead on many-core systems, and parallelizing it may increase the peak read performance.

As for lockahead integration with readahead, I agree that this is possible, but it is only useful if the client doesn't get full-file extent locks. It would also be interesting if the write code detected sequential or strided writes and did lockahead at write time. |
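For context, the baseline client readahead that this framework would extend is observed and tuned through the standard llite parameters; a minimal illustration using stock Lustre client tunables (read-only queries, no assumptions beyond standard parameter names):

# lctl get_param llite.*.max_read_ahead_mb
# lctl get_param llite.*.max_read_ahead_per_file_mb
# lctl get_param llite.*.max_read_ahead_whole_mb
# lctl get_param llite.*.read_ahead_stats

The read_ahead_stats hit/miss counters are the easiest way to confirm whether any new readahead algorithm actually engages on a given workload.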
| Comments |
| Comment by Patrick Farrell (Inactive) [ 01/Nov/16 ] |
|
About lockahead/readahead integration: there's not much use for lockahead in the "read" case, since read locks can overlap. At least, I've never come up with a realistic scenario where it's relevant/helpful. Assuming it's not totally invalid to tie this stuff together, we could possibly do something positive by recognizing strided writing. There are some complexities there. For example, we'd need to make a blocking lockahead lock request for at least the first one, to cancel the full-file locks the clients are exchanging (normal lockahead locks must be non-blocking, so they can be requested many at a time). There's also some danger around the different clients not being coordinated when they go into lockahead mode. I can think of how to probably make it work, but I'm not totally convinced it's worth the effort. |
| Comment by Gerrit Updater [ 03/Nov/16 ] |
|
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/23552 |
| Comment by Li Xi (Inactive) [ 03/Nov/16 ] |
|
The pushed patch is definitely not the final version; I am going to test and optimize it further. I really need your review and feedback during this process. Thank you in advance! |
| Comment by Dmitry Eremin (Inactive) [ 09/Dec/16 ] |
|
Can you explain in more detail what benchmarks you used for testing? It's not clear to me why we need a new readahead algorithm instead of using the standard one from the Linux kernel. I just made the standard readahead asynchronous and parallelized it for Lustre's needs. That allowed me to increase read performance significantly without implementing a new algorithm.
|
| Comment by Andreas Dilger [ 29/Mar/17 ] |
|
Closing as a duplicate of |
| Comment by Li Xi (Inactive) [ 17/Apr/18 ] |
|
I am reopening this ticket because the patch in LU-8960 doesn't really improve performance as much as we can get from this patch. |
| Comment by Shuichi Ihara (Inactive) [ 25/Apr/18 ] |
|
Here are the performance results comparing the patches. Test procedure:
# pdsh -w oss[01-06],mds[11-12],dcr-vm[1-4],c[01-32] 'echo 3 > /proc/sys/vm/drop_caches'
# lfs setstripe -S 16m -c -1 /scratch1/out
# mpirun -np 1 /work/tools/bin/IOR -w -k -t 1m -b 256g -e -F -vv -o /scratch1/out/file
# pdsh -w oss[01-06],mds[11-12],dcr-vm[1-4],c[01-32] 'echo 3 > /proc/sys/vm/drop_caches'
# mpirun -np 1 /work/tools/bin/IOR -r -k -E -t 1m -b 256g -e -F -vv -o /scratch1/out/file |
| Comment by Shuichi Ihara (Inactive) [ 25/Apr/18 ] |
|
We are investigating further what's going on with b2_10+. |
| Comment by Dmitry Eremin (Inactive) [ 25/Apr/18 ] |
|
The patch … Which version of the patch did you test? |
| Comment by Shuichi Ihara (Inactive) [ 25/Apr/18 ] |
In fact, 1m is not small.
Sure, I can try your latest patch, but please tell us exactly what configuration (e.g. pio, number of partitions, RA size, etc.) you prefer. |
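For concreteness, the configuration knobs being asked about would presumably be set along these lines (a sketch only: the pio switch name is assumed from the LU-8964 parallel I/O series, and the readahead sizes are arbitrary example values to sweep, not recommendations):

# lctl set_param llite.*.pio=1
# lctl set_param llite.*.max_read_ahead_mb=256
# lctl set_param llite.*.max_read_ahead_per_file_mb=256

The CPU partition count is a module-load-time setting rather than an lctl parameter; see the libcfs example further down.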
| Comment by Patrick Farrell (Inactive) [ 25/Apr/18 ] |
|
Yeah, 1 MiB is probably more common than any size larger than it; we should be very concerned with improving 1 MiB I/O. (It's one thing I would like to change about the PIO code: that it is all implemented on stripe boundaries. 8 MiB stripes are very common these days, and we could benefit from parallelizing I/O smaller than 8 MiB.) |
| Comment by Dmitry Eremin (Inactive) [ 26/Apr/18 ] |
|
Hmm. This is a good point, to pay more attention to 1 MiB buffers. But splitting into small chunks doesn't provide a benefit, because of the additional overhead of parallelization. Unfortunately we have many restrictions in the current Lustre I/O pipeline which prevent better parallelization and asynchrony. I found that I had the best performance when splitting the CPUs into several partitions such that each partition holds 2 HW cores (with their hyperthreads). For example, if you have a machine with 8 HW cores and 2 hyperthreads per core (16 logical CPUs), the best configuration is to split it into 4 partitions. If you have several NUMA nodes, it would be good to split each node in the same way. I think an 8 MiB transfer buffer is good enough to see the benefits from the PIO code. |
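A sketch of how such a partitioning is usually expressed with the standard libcfs CPT module options (the pattern below is illustrative for the 8-core/16-thread example above, not a tested configuration):

# cat /etc/modprobe.d/lustre.conf
options libcfs cpu_npartitions=4

Or, spelling the mapping out explicitly (format: CPT-index[logical CPU list]):

options libcfs cpu_pattern="0[0-3] 1[4-7] 2[8-11] 3[12-15]"

If the kernel numbers hyperthread siblings by interleaving cores, the CPU lists would need adjusting so that each partition really holds 2 physical cores.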
| Comment by Lukasz Flis [ 15/May/19 ] |
|
We would like to give this patch a try in a mixed-workload environment on our test filesystem. Is it possible to get this patch for the current b2_10 (2.10.7)? |
| Comment by Andreas Dilger [ 15/May/19 ] |
|
Lukasz, the patch on this ticket is not currently being developed. The patch on |
| Comment by Lukasz Flis [ 15/May/19 ] |
|
Andreas, thank you for the update, |
| Comment by Andreas Dilger [ 15/May/19 ] |
|
Correct, with the caveat that |