Lustre / LU-744

Single client's performance degradation on 2.1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0, Lustre 2.3.0
    • Labels: None
    • Severity: 3
    • Rank: 4018

    Description

      During performance testing on Lustre 2.1, I saw a performance degradation on a single client.
      Here are the IOR results on a single client with 2.1, and with Lustre 1.8.6.80 for comparison.
      I ran IOR (IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file) on the single client with 1, 2, 4, and 8 processes.
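      (A minimal reproduction sketch, assuming Open MPI's mpirun as the launcher and IOR on the PATH; only the IOR arguments above come from this ticket:)

          # Run the same IOR command line at 1, 2, 4, and 8 processes.
          # mpirun as the launcher is an assumption; adjust to your MPI stack.
          for np in 1 2 4 8; do
              mpirun -np $np IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file
          done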

      Write (MiB/sec)
      Processes   v1.8.6.80   v2.1
      1            446.25      411.43
      2            808.53      761.30
      4           1484.18     1151.41
      8           1967.42     1172.06

      Read (MiB/sec)
      Processes   v1.8.6.80   v2.1
      1            823.90      595.71
      2           1449.49     1071.76
      4           2502.49     1517.79
      8           3133.43     1746.30

      Both versions were tested on the same infrastructure (hardware and network); checksums were disabled on the client in both runs.
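
      (For reference, a sketch of how client checksums are typically disabled; the standard osc.*.checksums tunable is assumed, as the ticket does not show the exact command used:)

          # Disable client-side data checksums on all OSC devices.
          lctl set_param osc.*.checksums=0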

      Attachments

        1. 2.4 Single Client 3May2013.xlsx
          34 kB
        2. 574.1.pdf
          169 kB
        3. ior-256gb.tar.gz
          32 kB
        4. ior-32gb.tar.gz
          24 kB
        5. lu744-20120909.tar.gz
          883 kB
        6. lu744-20120915.tar.gz
          874 kB
        7. lu744-20120915-02.tar.gz
          1.02 MB
        8. lu744-20121111.tar.gz
          849 kB
        9. lu744-20121113.tar.gz
          846 kB
        10. lu744-20121117.tar.gz
          2.45 MB
        11. lu744-20130104.tar.gz
          915 kB
        12. lu744-20130104-02.tar.gz
          26 kB
        13. lu744-dls-20121113.tar.gz
          10 kB
        14. orig-collectl.out
          81 kB
        15. orig-ior.out
          2 kB
        16. orig-opreport-l.out
          146 kB
        17. patched-collectl.out
          34 kB
        18. patched-ior.out
          2 kB
        19. patched-opreport-l.out
          137 kB
        20. single-client-performance.xlsx
          42 kB
        21. stats-1.8.zip
          14 kB
        22. stats-2.1.zip
          64 kB
        23. test2-various-version.zip
          264 kB
        24. test-patchset-2.zip
          147 kB

        Issue Links

          Activity

            Ihara, could you please extract out the performance numbers for this patch and the previous ones in a small table like was done for the previous tests?

            adilger Andreas Dilger added a comment

            Hi Ihara, what's the performance of b1_8 again on the same platform?

            jay Jinshan Xiong (Inactive) added a comment

            CPU is still a bottleneck. The write speed dropped after the OSC LRU cache stepped in and immediately drove CPU usage to 100%. Let me see if I can optimize it.
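
            (A hedged monitoring sketch, not from the ticket: llite.*.max_cached_mb is the standard client cache-limit parameter, though its output format varies by Lustre version:)

                # Watch client cache occupancy alongside CPU while the benchmark
                # runs, to correlate LRU cache growth with CPU saturation.
                watch -n1 'lctl get_param llite.*.max_cached_mb; top -bn1 | head -4'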

            jay Jinshan Xiong (Inactive) added a comment - edited

            It might help with interpreting the opreport data if the -p option is used. According to the opreport man page:

                   --image-path / -p [paths]
                          Comma-separated list of additional paths to search for binaries.  This is needed to find modules in kernels 2.6 and upwards.
            

            Without it, external module symbols don't get resolved:

            samples  %        image name               app name                 symbol name
            6340482  25.2096  obdclass                 obdclass                 /obdclass
            3473020  13.8087  osc                      osc                      /osc
            1972900   7.8442  lustre                   lustre                   /lustre
            1374077   5.4633  vmlinux                  vmlinux                  copy_user_generic_string
            842569    3.3500  lov                      lov                      /lov
            551880    2.1943  libcfs                   libcfs                   /libcfs
            

            Although the opreport-alwdg-p_lustre.out file seems to have all the useful bits.
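
            (For example, something like the following; the module search path is an assumption based on a typical layout, not taken from this ticket:)

                # Point opreport at the kernel module tree so Lustre module
                # symbols resolve (adjust the path to your installation).
                opreport -l -p /lib/modules/$(uname -r) > opreport-l.out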

            prakash Prakash Surya (Inactive) added a comment

            Jinshan,

            I just tested http://review.whamcloud.com/4943

            The attachment includes all results and oprofile output.
            It looks clearly better than the previous numbers, but I wonder if we could get even better performance, since we sometimes hit 5.6 GB/sec (see collectl.out); I'd like to keep throughput around those numbers.

            ihara Shuichi Ihara (Inactive) added a comment

            My next patch will be to remove the top cache of cl_page.

            jay Jinshan Xiong (Inactive) added a comment

            There is a new patch for performance tuning at http://review.whamcloud.com/4943. Please give it a try.
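
            (A sketch of how a Gerrit change such as 4943 is typically fetched for testing; the refs/changes path follows Gerrit's convention, and the patchset number is an assumption:)

                # Gerrit refs are refs/changes/<last 2 digits>/<change>/<patchset>;
                # patchset 1 here is an assumption.
                git fetch http://review.whamcloud.com/lustre refs/changes/43/4943/1
                git cherry-pick FETCH_HEAD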

            jay Jinshan Xiong (Inactive) added a comment

            Hi Ihara, this is because the CPU is still under contention, so the performance dropped when the housekeeping work started. Can you please run the benchmark one more time with patches 4519, 4472, and 4617? This should help a little bit.

            jay Jinshan Xiong (Inactive) added a comment

            Jinshan, Frederik, When using the LU-2139 patches on the client but not on the server, it is normal to see the IO pause/stall as you are seeing. I'm not sure if this is what's happening here, but what can happen is:

            1. Client performs IO
            2. Client receives completion callback for bulk RPC
            3. Bulk pages now clean but "unstable" (uncommitted on OST)
            4. NR_UNSTABLE_NFS incremented for each unstable page (due to http://review.whamcloud.com/4245)
            5. NR_UNSTABLE_NFS grows larger than (background_thresh + dirty_thresh)/2 (see the sketch after this list)
            6. Kernel stalls IO waiting for NR_UNSTABLE_NFS to decrease (via kernel function: balance_dirty_pages)
            7. Client receives Lustre ping sometime in future (around 20 seconds later?), updating last_committed
            8. Bulk pages now "stable" on client and can be reclaimed, lowering NR_UNSTABLE_NFS
            9. Go back to step 1.
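
            (A rough sketch of estimating that stall threshold from /proc, assuming the ratio-based vm tunables are in effect; the kernel computes the real thresholds against dirtyable memory, so this MemTotal-based figure is only an approximation:)

                # Approximate (background_thresh + dirty_thresh)/2 in kB.
                mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
                bg_kb=$(( mem_kb * $(cat /proc/sys/vm/dirty_background_ratio) / 100 ))
                dirty_kb=$(( mem_kb * $(cat /proc/sys/vm/dirty_ratio) / 100 ))
                echo "unstable-page stall threshold ~ $(( (bg_kb + dirty_kb) / 2 )) kB"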

            Reading the above comments, it looks like the LU-2139 patches are working as intended (avoiding OOMs at the cost of performance). Although I admit the performance is terrible when you hit the NR_UNSTABLE_NFS limit and the kernel halts all IO (but it is better than OOM, IMO). To improve on this, http://review.whamcloud.com/4375 needs to be applied to both clients and servers. This will allow the server to proactively commit bulk pages as they come in, hopefully preventing the client from exhausting its memory with unstable pages and avoiding the "stall" in balance_dirty_pages. With it applied to the server, I'd expect NR_UNSTABLE_NFS to remain "low", and the 4GB file speeds to reflect the 1GB speeds.

            Please keep in mind, the LU-2139 patches are all experimental and subject to change.

            On the client, with the LU-2139 patches applied, you might find it interesting to watch lctl get_param llite.*.unstable_stats and cat /proc/meminfo | grep NFS_Unstable as the test is running.

            For example:

            $ watch -n0.1 'lctl get_param llite.*.unstable_stats'
            $ watch -n0.1 'cat /proc/meminfo | grep NFS_Unstable'
            

            Those will give you an idea of the number of unstable pages the client has at a given time. If that value gets "high" (the exact value depends on your dirty limits, but probably around 1/4 of RAM), then what I detailed above is most likely the cause of the bad performance.

            prakash Prakash Surya (Inactive) added a comment - edited

            Jinshan,

            Yes, I upgraded the MPI library a couple of weeks ago. I also found a hardware problem and fixed it. Now mca_btl_sm_component_progress consumes less CPU; it's still high compared to the previous library, though...

            This attachment includes three test results:

            1. master without any patches
            2. master + 4519 (2nd patch) + 4472 (2nd patch)
            3. master + 4519 (2nd patch) + 4472 (2nd patch), running MPI with pthreads instead of shared memory (see the sketch below)
            

            The patches reduce CPU consumption and improve the performance, but performance still drops once the client has no free memory.
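
            (For reference, a hedged sketch of taking Open MPI's shared-memory BTL out of the picture, which may be related to what test 3 did; the MCA flag is standard Open MPI syntax but was not shown in the ticket:)

                # Exclude the shared-memory BTL so ranks fall back to tcp/self
                # (assumption: Open MPI launcher; IOR arguments from the ticket).
                mpirun -np 8 --mca btl ^sm IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file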

            ihara Shuichi Ihara (Inactive) added a comment

            Hi Ihara, I saw significant CPU usage in the libraries mca_btl_sm.so (11.7%) and libopen-pal.so.0.0.0 (4.7%), but in the performance data shown on Sep 5 they consumed only 0.13% and 0.05%. They are Open MPI libraries. Did you do any upgrade on these libraries?

            Anyway, I revised patch 4519 and restored 4472 to remove the memory stalls; please apply them in your next benchmark. However, we have to figure out why the Open MPI libraries consumed so much CPU before we can see the performance improvement.

            jay Jinshan Xiong (Inactive) added a comment

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 1
              Watchers: 35

              Dates

                Created:
                Updated:
                Resolved: