Details

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.0
    • 9223372036854775807

    Description

      Tracker for parallel async readahead improvement from DDN, as described in http://www.eofs.eu/_media/events/lad16/19_parallel_readahead_framework_li_xi.pdf

      Note that in the single-thread case it would be very desirable for the copy_to_user to also be handled in parallel; this is a major CPU overhead on many-core systems, and if it can be parallelized it may increase the peak read performance.

      As for lockahead integration with readahead, I agree that it is possible to do this, but it is only useful if the client doesn't get full-file extent locks. It would also be interesting if the write code detected sequential or strided writes and did lockahead at write time.
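
      As background for the discussion below, the serial readahead that this framework would parallelize is already controlled by a few llite tunables. The commands are standard lctl usage, but the values shown are only illustrative examples, not recommendations:

      # lctl get_param llite.*.read_ahead_stats              # hit/miss statistics for the current readahead window
      # lctl set_param llite.*.max_read_ahead_mb=256         # total readahead memory per client mount
      # lctl set_param llite.*.max_read_ahead_per_file_mb=64 # readahead window allowed per file
      # lctl set_param llite.*.max_read_ahead_whole_mb=2     # files up to this size are read in their entirety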

      Attachments

        Issue Links

          Activity

            [LU-8709] parallel asynchronous readahead

            adilger Andreas Dilger added a comment -

            Correct, with the caveat that the LU-12043 patch mainly improves single-threaded read performance. If there are many threads on the client the performance may not be very different, so it depends on your workload.

            lflis Lukasz Flis added a comment -

            Andreas, thank you for the update; LU-12043 looks very promising.
            I had a quick look at the changes and they are all on the client side. Can we assume that a 2.13 lustre-client may have better performance on 2.10 servers?


            adilger Andreas Dilger added a comment -

            Lukasz, the patch on this ticket is not currently being developed. The patch on LU-12043 is the one that will land in 2.13.

            lflis Lukasz Flis added a comment -

            We would like to give this patch a try in a mixed-workload environment on our test filesystem. Is it possible to get this patch for the current b2_10 (2.10.7)?


            dmiter Dmitry Eremin (Inactive) added a comment -

            Hmm. This is a good point to pay more attention to the 1 MB buffer size. But splitting into small chunks doesn't provide a benefit because of the additional overhead of parallelization. Unfortunately we have many restrictions in the current Lustre I/O pipeline which prevent better parallelization and asynchrony.

            I discovered that I had the best performance when splitting the CPUs into several partitions so that each partition contains 2 HW cores (with their hyperthreads). For example, if you have a machine with 8 HW cores with 2 hyperthreads per core (16 logical CPUs), the best configuration is to split it into 4 partitions. If you have several NUMA nodes it would be good to split each of those nodes in the same way.

            I think an 8 MB transfer buffer is good enough to see the benefits from the PIO code.

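            (For reference, a minimal sketch of the partitioning described above, assuming an 8-core/16-thread client and the standard libcfs module options; the exact CPU numbering depends on the machine's topology, so the pattern line is only an example.)

            # cat /etc/modprobe.d/lustre.conf
            options libcfs cpu_npartitions=4
            # or pin the partitions explicitly, 2 HW cores (4 hyperthreads) per CPT:
            # options libcfs cpu_pattern="0[0-3] 1[4-7] 2[8-11] 3[12-15]"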

            paf Patrick Farrell (Inactive) added a comment -

            Yeah, 1M is probably more common than any size larger than it. We should be very concerned with improving 1 MiB I/O. (It's one thing I would like to change about the PIO code: it is all implemented on stripe boundaries. 8 MiB stripes are very common these days, and we could benefit from parallelizing I/O smaller than 8 MiB.)


            ihara Shuichi Ihara (Inactive) added a comment -

            > The patch LU-8964 was not used at full speed because you used a small 1 MB transfer buffer (-t 1m). As I mentioned before, in the common case the transfer buffer is split by stripe size and transferred in parallel.

            In fact, 1m is not small. And, after the patch, it should keep the same performance for all operations as before the patch, unless there is some trade-off.

            > Which version of the LU-8964 patch was used? In Patch Set 56 I significantly improved the algorithm. It should read ahead more aggressively.

            Sure, I can try your latest patch, but please tell us exactly which configuration (e.g. pio, number of partitions, RA size, etc.) you prefer.

            dmiter Dmitry Eremin (Inactive) added a comment - edited

            The patch LU-8964 was not used at full speed because you used a small 1 MB transfer buffer (-t 1m). As I mentioned before, in the common case the transfer buffer is split by stripe size and transferred in parallel.

            Which version of the LU-8964 patch was used? In Patch Set 56 I significantly improved the algorithm. It should read ahead more aggressively.

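            (To make the stripe-boundary point concrete: with the 16 MB stripes used in the test below (-S 16m), a 1 MiB transfer (-t 1m) sits entirely inside a single stripe, so there is no boundary for the LU-8964 code to split on. A hypothetical rerun with a transfer larger than the stripe size, as sketched below, is the kind of read that could be split into per-stripe chunks and issued in parallel; only the -t value differs from the command actually used in the benchmark.)

            # mpirun -np 1 /work/tools/bin/IOR -r -k -E -t 64m -b 256g -e -F -vv -o /scratch1/out/file   # each 64 MiB transfer spans four 16 MiB stripes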

            ihara Shuichi Ihara (Inactive) added a comment -

            We are investigating further what's going on with b2_10+LU-8709. Attached is a capture of client performance every 1 sec while IOR is running.
            When the read started on the client, the performance was pretty good (getting ~5.5 GB/sec), but once memory usage got close to max_cache_mb, performance became very unstable (e.g. sometimes we see more than 5 GB/sec, but sometimes less than 3 GB/sec).

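            (The max_cache_mb ceiling mentioned above corresponds to the llite max_cached_mb tunable; a quick way to watch how close the client page cache is to the limit during the run, and to raise it for an experiment, is sketched below. The 65536 value is only an example.)

            # lctl get_param llite.*.max_cached_mb          # shows the limit and the amount currently in use
            # lctl set_param llite.*.max_cached_mb=65536    # example: allow up to 64 GiB of cached file data
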
            ihara Shuichi Ihara (Inactive) added a comment - edited

            Here are performance results comparing the patches from LU-8709 and LU-8964.
            Unfortunately, I don't see any performance improvement from the LU-8964 patch, and it's even worse than without the patch.
            In fact, LU-8709 (patch http://review.whamcloud.com/23552) is a different approach and gives a 2.3x performance improvement.

            # pdsh -w oss[01-06],mds[11-12],dcr-vm[1-4],c[01-32] 'echo 3 > /proc/sys/vm/drop_caches '
            # lfs setstripe -S 16m -c -1 /scratch1/out
            # mpirun -np 1 /work/tools/bin/IOR -w -k -t 1m -b 256g -e -F -vv -o /scratch1/out/file
            
            # pdsh -w oss[01-06],mds[11-12],dcr-vm[1-4],c[01-32] 'echo 3 > /proc/sys/vm/drop_caches'
            # mpirun -np 1 /work/tools/bin/IOR -r -k -E -t 1m -b 256g -e -F -vv -o /scratch1/out/file
            
            Branch                                       Single-Thread Read Performance (MB/sec)
            b2_10                                        1,591
            b2_10+LU-8709                                3,774
            master                                       1,793
            master+LU-8964                               1,000
            master+LU-8964 (pio=1)                       1,594
            master+LU-8964 (cpu_npartitions=10, pio=0)   1,000
            master+LU-8964 (cpu_npartitions=10, pio=1)   1,820
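
            (For anyone reproducing the table: the pio and cpu_npartitions settings in the row labels refer to the llite switch added by the LU-8964 patch and the libcfs module option; the exact lines below are a best-guess reconstruction rather than a copy of the test configuration.)

            # lctl set_param llite.*.pio=1                  # enable the parallel I/O path from the LU-8964 patch
            # cat /etc/modprobe.d/lustre.conf
            options libcfs cpu_npartitions=10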

            People

              Assignee: lixi_wc Li Xi
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 17

              Dates

                Created:
                Updated:
                Resolved: