[LU-4257] parallel dds are slower than serial dds

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.0
    • None
    • 3
    • 11618

    Description

      Sanger has an interesting test in which they read the same file from 20 processes. They first run the reads in parallel and then run them serially (after flushing the cache). Their expected result is that the serial and parallel runs should take about the same amount of time. What they see, however, is that the parallel reads are about 50% slower than the serial reads:

      client1# cat readfile.sh
      #!/bin/sh
      
      dd if=/lustre/scratch110/sanger/jb23/test/delete bs=4M of=/dev/null
      
      client1# for i in `seq -w 1 20 `
      do
        (time $LOC/readfile.sh )  > $LOC/results/${i}_out 2>&1 &
      done
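
      # (Not part of the original report: the serial comparison run is not shown,
      # but presumably it looks something like the sketch below -- flush the page
      # cache once, then run the same script one process at a time, writing the
      # *_serial result files referenced further down.)
      client1# echo 3 > /proc/sys/vm/drop_caches
      client1# for i in `seq -w 1 20`
      do
        (time $LOC/readfile.sh ) > $LOC/results/${i}_out_serial 2>&1
      done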
      

      In parallel

      01_out:real 3m36.228s
      02_out:real 3m36.227s
      03_out:real 3m36.226s
      04_out:real 3m36.224s
      05_out:real 3m36.224s
      06_out:real 3m36.224s
      07_out:real 3m36.222s
      08_out:real 3m36.221s
      09_out:real 3m36.228s
      10_out:real 3m36.222s
      11_out:real 3m36.220s
      12_out:real 3m36.220s
      13_out:real 3m36.228s
      14_out:real 3m36.219s
      15_out:real 3m36.217s
      16_out:real 3m36.218s
      17_out:real 3m36.214s
      18_out:real 3m36.214s
      19_out:real 3m36.211s
      20_out:real 3m36.212s

      A serial read (I expect all of the time to be in the first read):

      grep -i real *_serial
      01_out_serial:real 2m31.372s
      02_out_serial:real 0m1.190s
      03_out_serial:real 0m0.654s
      04_out_serial:real 0m0.562s
      05_out_serial:real 0m0.574s
      06_out_serial:real 0m0.570s
      07_out_serial:real 0m0.574s
      08_out_serial:real 0m0.461s
      09_out_serial:real 0m0.456s
      10_out_serial:real 0m0.462s
      11_out_serial:real 0m0.475s
      12_out_serial:real 0m0.473s
      13_out_serial:real 0m0.582s
      14_out_serial:real 0m0.580s
      15_out_serial:real 0m0.569s
      16_out_serial:real 0m0.679s
      17_out_serial:real 0m0.565s
      18_out_serial:real 0m0.573s
      19_out_serial:real 0m0.579s
      20_out_serial:real 0m0.472s

      And trying the same experiment with NFS:

      Serial access:

      root@farm3-head4:~/tmp/test/results# grep -i real *
      results/01_out_serial:real 0m19.923s
      results/02_out_serial:real 0m1.373s
      results/03_out_serial:real 0m1.237s
      results/04_out_serial:real 0m1.276s
      results/05_out_serial:real 0m1.289s
      results/06_out_serial:real 0m1.297s
      results/07_out_serial:real 0m1.265s
      results/08_out_serial:real 0m1.278s
      results/09_out_serial:real 0m1.224s
      results/10_out_serial:real 0m1.225s
      results/11_out_serial:real 0m1.221s
      ...

      So the question is:
      Why is access slower when we read the file in parallel and it is not in the cache?

      Is there some lock contention going on with multiple readers? Or is the Lustre client sending multiple RPCs for the same data, even though there is already an outstanding request? They have tried this on 1.8.x clients as well as 2.5.0.
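
      One way to narrow this down on the client (not part of the original report; a sketch that assumes the standard per-OSC and llite statistics exposed through lctl) is to clear the RPC, readahead and lock counters, run the parallel dds, and then check whether extra read RPCs are being sent for the same data or the DLM lock count is unusually high:

      # clear client-side counters before the run
      lctl set_param osc.*.rpc_stats=clear
      lctl set_param llite.*.read_ahead_stats=clear

      # ... run the 20 parallel dds here ...

      # histogram of read/write RPC sizes and counts per OSC
      lctl get_param osc.*.rpc_stats
      # readahead hit/miss counters in the llite layer
      lctl get_param llite.*.read_ahead_stats
      # number of DLM locks held per namespace (one per OST/MDT)
      lctl get_param ldlm.namespaces.*.lock_count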

      Thanks.

      Attachments

        1. debug_file.out.gz
          0.2 kB
        2. io.png
          75 kB
        3. lu-4257.tar.gz
          0.2 kB
        4. lustre_1.8.9
          850 kB
        5. lustre_2.5
          798 kB
        6. readfile.sh
          0.4 kB
        7. test.sh
          2 kB

        Issue Links

          Activity


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20647/
            Subject: LU-4257 test: Correct error_ignore message
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5f03bf91e68e925149f2331a44d1e4ad858b8006

            adilger Andreas Dilger added a comment -

            This patch introduced an intermittent test failure LU-8248 in sanity.sh test_248.

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20255/
            Subject: LU-4257 llite: fast read implementation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 172048eaefa834e310e6a0fa37e506579f4079df

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20256/
            Subject: LU-4257 llite: fix up iov_iter implementation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1101120d3258509fa74f952cd8664bfdc17bd97d

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20254/
            Subject: LU-4257 obdclass: Get rid of cl_env hash table
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 45332712783a4756bf5930d6bd5f697bbc27acdb

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/20574
            Subject: LU-4257 clio: replace semaphore with mutex
            Project: fs/lustre-release
            Branch: b2_4
            Current Patch Set: 1
            Commit: 3e1cbe0b81eaee6e509c825455669c89df157915

            jay Jinshan Xiong (Inactive) added a comment (edited) -

            I came up with a solution for this issue, and the initial test results are exciting.

            The test case I used is based on the one shared by Sanger; please see below:

            #!/bin/bash

            nr_cpus=$(grep -c ^processor /proc/cpuinfo)
            mkdir -p results

            for CACHE in no yes; do
            	for BS in 4k 1M; do
            		echo "===== cache: $CACHE, block size: $BS ====="
            		# warm the page cache with a full read, or drop it, depending on the mode
            		[ "$CACHE" = "yes" ] && { dd if=/mnt/lustre/testfile bs=1M of=/dev/null > /dev/null 2>&1; }
            		[ "$CACHE" = "no" ] && { echo 3 > /proc/sys/vm/drop_caches; }

            		echo -n "      single read: "
            		dd if=/mnt/lustre/testfile bs=$BS of=/dev/null 2>&1 | grep copied | awk -F, '{print $3}'

            		[ "$CACHE" = "no" ] && { echo 3 > /proc/sys/vm/drop_caches; }

            		echo -n "      parallel read: "
            		# one dd per CPU, all reading the same file
            		for i in `seq -w 1 ${nr_cpus}`; do
            			dd if=/mnt/lustre/testfile bs=$BS of=/dev/null > results/${i}_out 2>&1 &
            		done
            		wait
            		# report the first reader's throughput (seq -w zero-pads the name when nr_cpus >= 10)
            		grep copied results/$(seq -w 1 ${nr_cpus} | head -n1)_out | awk -F, '{print $3}'
            	done
            done

            The test file is 2G in size so that it fits in memory for the cache-enabled testing. I applied the patches and compared the test results w/ and w/o the patches.
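
            (The 2 GB test file itself is presumably created ahead of time with something along these lines; the path matches the script above:)

            # create a 2 GB file on the Lustre mount used by the test script
            dd if=/dev/urandom of=/mnt/lustre/testfile bs=1M count=2048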

            The result w/ my patches:

            ===== cache: no, block size: 4k =====
                  single read:  1.2 GB/s
                  parallel read:  576 MB/s
            ===== cache: no, block size: 1M =====
                  single read:  1.4 GB/s
                  parallel read:  566 MB/s
            ===== cache: yes, block size: 4k =====
                  single read:  3.8 GB/s
                  parallel read:  1.8 GB/s
            ===== cache: yes, block size: 1M =====
                  single read:  6.4 GB/s
                  parallel read:  1.3 GB/s
            

            The test w/o my patches:

            ===== cache: no, block size: 4k =====
                  single read:  257 MB/s
                  parallel read:  148 MB/s
            ===== cache: no, block size: 1M =====
                  single read:  1.1 GB/s
                  parallel read:  420 MB/s
            ===== cache: yes, block size: 4k =====
                  single read:  361 MB/s
                  parallel read:  147 MB/s
            ===== cache: yes, block size: 1M =====
                  single read:  5.8 GB/s
                  parallel read:  1.3 GB/s
            

            The small-IO performance improved significantly. I'm still doing some fine tuning of the patches, and I will release them as soon as I can so that you can do some evaluation.
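
            For anyone evaluating the patches once they land: the fast read path should be controllable per client through an llite tunable. A minimal sketch, assuming the parameter is exposed as llite.*.fast_read (as in the fast read patch above):

            # check whether fast read is available/enabled on this client
            lctl get_param llite.*.fast_read

            # disable it temporarily to compare against the old read path
            lctl set_param llite.*.fast_read=0

            # re-enable it
            lctl set_param llite.*.fast_read=1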


            manish Manish Patel (Inactive) added a comment -

            Hi Jinshan,

            Any updates on the request made by James? Based on the above comment, we are looking to try out the new patch for the performance issues.

            Thank you,
            Manish

            james beal James Beal added a comment -

            While we wait, I found the following interesting.

            http://lwn.net/Articles/590243/

            Performance-oriented patches should, of course, always be accompanied by benchmark results. In this case, Waiman included a set of AIM7 benchmark results with his patch set (which did not include the pending-bit optimization). Some workloads regressed a little, but others shows improvements of 1-2% — a good result for a low-level locking improvement. The disk benchmark runs, however, improved by as much as 116%; that benchmark suffers from especially strong contention for locks in the virtual filesystem layer and ext4 filesystem code.


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 29
