
[LU-3321] 2.x single thread/process throughput degraded from 1.8

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: Tested on 2.3.64 and 1.8.9 clients with 4 OSS x 3 x 32 GB OST ramdisks
    • Severity: 3
    • 8259

    Description

      Single thread/process throughput on tag 2.3.64 is degraded from 1.8.9, and significantly degraded once the client hits its caching limit (llite.*.max_cached_mb). The attached graph shows lnet stats sampled every second for a single dd writing two 64 GB files, followed by dropping caches and reading the same two files back (the full sequence is sketched after the timings below). The tests were not run simultaneously, but the graph plots them from the same starting point. It also takes a significant amount of time to drop the cache on 2.3.64.

      Lustre 2.3.64
      Write (dd if=/dev/zero of=testfile bs=1M)
      68719476736 bytes (69 GB) copied, 110.459 s, 622 MB/s
      68719476736 bytes (69 GB) copied, 147.935 s, 465 MB/s

      Drop caches (echo 1 > /proc/sys/vm/drop_caches)
      real 0m43.075s

      Read (dd if=testfile of=/dev/null bs=1M)
      68719476736 bytes (69 GB) copied, 99.2963 s, 692 MB/s
      68719476736 bytes (69 GB) copied, 142.611 s, 482 MB/s

      Lustre 1.8.9
      Write (dd if=/dev/zero of=testfile bs=1M)
      68719476736 bytes (69 GB) copied, 63.3077 s, 1.1 GB/s
      68719476736 bytes (69 GB) copied, 67.4487 s, 1.0 GB/s

      Drop caches (echo 1 > /proc/sys/vm/drop_caches)
      real 0m9.189s

      Read (dd if=testfile of=/dev/null bs=1M)
      68719476736 bytes (69 GB) copied, 46.4591 s, 1.5 GB/s
      68719476736 bytes (69 GB) copied, 52.3635 s, 1.3 GB/s
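
      For reference, a minimal sketch of the test sequence described above. The block counts, file paths, and the LNet stats location are assumptions, not the reporter's exact script:

      # Sample LNet counters once per second in the background
      # (/proc/sys/lnet/stats is an assumed location for the counters).
      while true; do
          echo "$(date +%s) $(cat /proc/sys/lnet/stats)"
          sleep 1
      done > lnet_stats.log &
      SAMPLER=$!

      # Write two 64 GB files with a single dd process each.
      dd if=/dev/zero of=/mnt/lustre/testfile1 bs=1M count=65536
      dd if=/dev/zero of=/mnt/lustre/testfile2 bs=1M count=65536

      # Drop the client page cache and time how long it takes.
      time sh -c 'echo 1 > /proc/sys/vm/drop_caches'

      # Read the same two files back.
      dd if=/mnt/lustre/testfile1 of=/dev/null bs=1M
      dd if=/mnt/lustre/testfile2 of=/dev/null bs=1M

      # Stop the sampler.
      kill $SAMPLER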

      Attachments

        1. cpustat.scr (0.5 kB)
        2. dd_throughput_comparison_with_change_5446.png (7 kB)
        3. dd_throughput_comparison.png (6 kB)
        4. lu-3321-singlethreadperf.tgz (391 kB)
        5. lu-3321-singlethreadperf2.tgz (564 kB)
        6. mcm8_wcd.png (9 kB)
        7. perf3.png (103 kB)

        Issue Links

          Activity

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-2946
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-2622
            green Oleg Drokin made changes -
            Link New: This issue is related to LU-7912
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LDEV-25
            pjones Peter Jones made changes -
            Link New: This issue is related to LU-4786
            jay Jinshan Xiong (Inactive) added a comment -

            [root@c01 ~]# lscpu 
            Architecture:          x86_64
            CPU op-mode(s):        32-bit, 64-bit
            Byte Order:            Little Endian
            CPU(s):                8
            On-line CPU(s) list:   0-7
            Thread(s) per core:    2
            Core(s) per socket:    4
            Socket(s):             1
            NUMA node(s):          1
            Vendor ID:             GenuineIntel
            CPU family:            6
            Model:                 44
            Stepping:              2
            CPU MHz:               1600.000
            BogoMIPS:              4800.65
            Virtualization:        VT-x
            L1d cache:             32K
            L1i cache:             32K
            L2 cache:              256K
            L3 cache:              12288K
            NUMA node0 CPU(s):     0-7
            
            [root@c01 ~]# cat /proc/cpuinfo 
            processor	: 0
            vendor_id	: GenuineIntel
            cpu family	: 6
            model		: 44
            model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
            stepping	: 2
            cpu MHz		: 1600.000
            cache size	: 12288 KB
            physical id	: 0
            siblings	: 8
            core id		: 0
            cpu cores	: 4
            apicid		: 0
            initial apicid	: 0
            fpu		: yes
            fpu_exception	: yes
            cpuid level	: 11
            wp		: yes
            flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
            bogomips	: 4800.65
            clflush size	: 64
            cache_alignment	: 64
            address sizes	: 40 bits physical, 48 bits virtual
            power management:
            
            pichong Gregoire Pichon made changes -
            Attachment New: cpustat.scr [ 14763 ]

            pichong Gregoire Pichon added a comment -

            Jinshan,

            Attached is the script that generates the CPU usage graphs with gnuplot. The file "filename" contains the data, where each line has the following format:
            time user system idle iowait

            This can be obtained with the vmstat command for global CPU usage, or from the /proc/stat file for per-CPU usage.
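
            A minimal sketch of one way to produce such a file (this is not the attached cpustat.scr; the output path and the use of the aggregate "cpu" line are assumptions):

            # Sample the aggregate CPU counters from /proc/stat once per second and
            # emit "time user system idle iowait" lines. The counters are cumulative
            # jiffies, so deltas between consecutive samples give the usage plotted
            # in the graphs; nice time is ignored in this simplified sketch.
            while true; do
                read -r _ user nice system idle iowait _ < /proc/stat
                echo "$(date +%s) $user $system $idle $iowait"
                sleep 1
            done > filename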

            What model of CPU is present on the OpenSFS cluster?

            paf Patrick Farrell (Inactive) added a comment -

            Just as a favor to anyone else interested, this is the complete list of patches landed against LU-3321:
            http://review.whamcloud.com/#/c/7888
            http://review.whamcloud.com/#/c/7890
            http://review.whamcloud.com/#/c/7891
            http://review.whamcloud.com/#/c/7892
            http://review.whamcloud.com/#/c/8174
            http://review.whamcloud.com/#/c/7893
            http://review.whamcloud.com/#/c/7894
            http://review.whamcloud.com/#/c/7895
            http://review.whamcloud.com/#/c/8523

            7889, listed in Jinshan's earlier list of patches, was abandoned.

            jay Jinshan Xiong (Inactive) added a comment -

            The CPU stats clearly show that CPU usage is around 80% for a single-thread, single-stripe write, which is why you see only a slight performance improvement with a multi-striped file. CLIO is still CPU intensive, and your CPU can only drive ~900 MB/s of I/O on the client side. For comparison, the CPU on the OpenSFS cluster can drive ~1.2 GB/s.
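
            As an illustration only (the stripe counts and paths here are hypothetical, not the reporter's actual test), the single-stripe vs. striped comparison can be made like this:

            # Single-stripe file: one OST, so the single dd stream is bound by client CPU.
            lfs setstripe -c 1 /mnt/lustre/onestripe
            dd if=/dev/zero of=/mnt/lustre/onestripe bs=1M count=65536

            # Four-stripe file: the same single dd spreads RPCs over four OSTs,
            # which gives only a slight gain while the client CPU remains the bottleneck.
            lfs setstripe -c 4 /mnt/lustre/fourstripe
            dd if=/dev/zero of=/mnt/lustre/fourstripe bs=1M count=65536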

            Can you please provide the test script you are using to collect the data and generate the diagram, so that I can reproduce this on the OpenSFS cluster?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: jfilizetti Jeremy Filizetti
              Votes: 0
              Watchers: 23
