[LU-3321] 2.x single thread/process throughput degraded from 1.8 Created: 13/May/13  Updated: 03/May/17  Resolved: 06/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Jeremy Filizetti Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: HB
Environment:

Tested on 2.3.64 and 1.8.9 clients with 4 OSS x 3 x 32 GB OST ramdisks


Attachments: File cpustat.scr     PNG File dd_throughput_comparison.png     PNG File dd_throughput_comparison_with_change_5446.png     File lu-3321-singlethreadperf.tgz     File lu-3321-singlethreadperf2.tgz     PNG File mcm8_wcd.png     PNG File perf3.png    
Issue Links:
Related
is related to LU-744 Single client's performance degradati... Resolved
is related to LU-2139 Tracking unstable pages Resolved
is related to LU-2032 small random read i/o performance reg... Open
is related to LU-4201 Test failure sanityn test_51b: file s... Resolved
is related to LU-2622 All CPUs spinning on cl_envs_guard lo... Resolved
is related to LU-4786 Apparent denial of service from clien... Resolved
is related to LU-7912 Stale comment in osc_page_transfer_add Resolved
Epic/Theme: Performance
Severity: 3
Rank (Obsolete): 8259

 Description   

Single thread/process throughput on tag 2.3.64 is degraded from 1.8.9, and significantly degraded once the client hits its caching limit (llite.*.max_cached_mb). The attached graph shows LNET stats sampled every second for a single dd writing two 64 GB files, followed by dropping the cache and reading the same two files. The tests were not run simultaneously, but the graph aligns them to the same starting point. It also takes a significant amount of time to drop the cache on 2.3.64.

Lustre 2.3.64
Write (dd if=/dev/zero of=testfile bs=1M)
68719476736 bytes (69 GB) copied, 110.459 s, 622 MB/s
68719476736 bytes (69 GB) copied, 147.935 s, 465 MB/s

Drop caches (echo 1 > /proc/sys/vm/drop_caches)
real 0m43.075s

Read (dd if=testfile of=/dev/null bs=1M)
68719476736 bytes (69 GB) copied, 99.2963 s, 692 MB/s
68719476736 bytes (69 GB) copied, 142.611 s, 482 MB/s

Lustre 1.8.9
Write (dd if=/dev/zero of=testfile bs=1M)
68719476736 bytes (69 GB) copied, 63.3077 s, 1.1 GB/s
68719476736 bytes (69 GB) copied, 67.4487 s, 1.0 GB/s

Drop caches (echo 1 > /proc/sys/vm/drop_caches)
real 0m9.189s

Read (dd if=testfile of=/dev/null bs=1M)
68719476736 bytes (69 GB) copied, 46.4591 s, 1.5 GB/s
68719476736 bytes (69 GB) copied, 52.3635 s, 1.3 GB/s
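
(For reference, the per-second LNET sampling behind the attached graph can be reproduced with a loop along these lines; this is a minimal sketch that assumes the counters are read from /proc/sys/lnet/stats, not the exact collection script used here:)

# sample the raw LNET counters once per second, timestamped
while true; do
    echo "$(date +%s) $(cat /proc/sys/lnet/stats)"
    sleep 1
done > lnet_stats.log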



 Comments   
Comment by Peter Jones [ 13/May/13 ]

Jeremy

Could you please repeat these tests with 2.1.5 and 2.3 to see whether the behaviour is consistent with 2.3.65?

Also, could you please check whether the patches in gerrit for LU-2139 help improve this situation?

Thanks

Peter

Comment by Jeremy Filizetti [ 15/May/13 ]

Based on some feedback from Jinshan I retried this test with http://review.whamcloud.com/#change,5446 from LU-2622. This seems to give roughly a 50% gain in performance when the cache limit is hit, but it is still quite a drop from 1.8.9. I also moved to 2.3.65 (from 2.3.64), but verified that without the patch it behaved the same as 2.3.64. When I get a chance I will try to test again with 2.3 and the LU-2139 patches.

Here are the times from this last test (and graph):
1.8.9
Write test
68719476736 bytes (69 GB) copied, 60.8574 s, 1.1 GB/s
68719476736 bytes (69 GB) copied, 70.3939 s, 976 MB/s
Dropping cache
real 0m12.435s
Read test
68719476736 bytes (69 GB) copied, 48.445 s, 1.4 GB/s
68719476736 bytes (69 GB) copied, 47.7765 s, 1.4 GB/s

2.3.65 + http://review.whamcloud.com/#change,5446
Write test
68719476736 bytes (69 GB) copied, 87.1735 s, 788 MB/s
68719476736 bytes (69 GB) copied, 98.4901 s, 698 MB/s
Dropping cache
real 0m16.063s
Read test
68719476736 bytes (69 GB) copied, 77.9799 s, 881 MB/s
68719476736 bytes (69 GB) copied, 93.733 s, 733 MB/s

Comment by Peter Jones [ 21/Jun/13 ]

Jinshan

Will this situation be improved with the CLIO simplification work that has been proposed?

Peter

Comment by Jinshan Xiong (Inactive) [ 21/Jun/13 ]

Hi Peter,

Yes, I think so.

Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ]

Please check patch for master at:

http://review.whamcloud.com/7888
http://review.whamcloud.com/7889
http://review.whamcloud.com/7890
http://review.whamcloud.com/7891
http://review.whamcloud.com/7892
http://review.whamcloud.com/7893
http://review.whamcloud.com/7894
http://review.whamcloud.com/7895

and patch for b2_4 at:

http://review.whamcloud.com/7896
http://review.whamcloud.com/7897
http://review.whamcloud.com/7898
http://review.whamcloud.com/7899
http://review.whamcloud.com/7900
http://review.whamcloud.com/7901
http://review.whamcloud.com/7902
http://review.whamcloud.com/7903

Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ]

Let me share a test result from Jeremy which showed performance improvement.

Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ]

And the test I did in our lab

Comment by Erich Focht [ 06/Nov/13 ]

Jinshan, a quick question: are the results you uploaded on Oct. 11 measured on a striped file? How many stripes? Were these multiple threads on one striped file, or each thread with its own file?

I've seen very poor performance with striped files on Lustre 2.x, and no increase in performance when increasing the number of stripes; I just want to make sure whether your measurements cover "my" use case or not.

Thanks,
Erich

Comment by Jinshan Xiong (Inactive) [ 07/Nov/13 ]

Hi Erich,

I didn't use multi-striped files, and in the multi-thread tests each thread wrote its own file.

Did you use my performance patch to do the test?

I assume you're benchmarking a multi-striped file with a single thread. In my experience, if your OSTs are fast, striping a file across multiple OSTs won't improve performance, because the bottleneck is the client CPU. You can take a look at the OSC rpc_stats with lctl get_param osc.*.rpc_stats; if the rpcs_in_flight values are low, it implies the client can't generate data fast enough to saturate the OSTs.
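
For example, the OSC RPC statistics mentioned above can be inspected on the client as follows (illustrative only; the exact rpc_stats layout varies between releases):

client# lctl get_param osc.*.rpc_stats
# Check the "read RPCs in flight" / "write RPCs in flight" counters and the
# "rpcs in flight" histogram: if most samples sit at 0 or 1 RPC in flight,
# the client is not generating data fast enough to keep the OSTs busy.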

Comment by Christopher Morrone [ 09/Nov/13 ]

We are rather disappointed by the revert of commit 93fe562. While I can understand not liking the performance impact, basic functionality trumps performance. On nodes with high processor counts, the lustre client thrashes to the point of being unusable without that patch.

I think this ticket is now a blocker for 2.6, because we can't operate with the tree in its current state.

Comment by Jinshan Xiong (Inactive) [ 09/Nov/13 ]

No worries, we already have a patch for a per-CPU cl_env cache. It turns out that the overhead of allocating a cl_env is really high, so caching cl_env is necessary; we just need a smarter way to cache them.

Comment by Jodi Levi (Inactive) [ 26/Dec/13 ]

Once http://review.whamcloud.com/#/c/8523/ lands, this ticket can be closed.

Comment by John Fuchs-Chesney (Inactive) [ 15/Jan/14 ]

Can anyone tell me the status of the final patch for this issue? It looks like some recent testing has been successful, but I don't know the other tools well enough to know whether the patch is ready to land, or has even already landed. Thanks.

Comment by Patrick Farrell (Inactive) [ 15/Jan/14 ]

John - There are a few components to the review/landing process.

The patch is built automatically by Jenkins, which contributes a +1 if it builds correctly (that's already true for http://review.whamcloud.com/#/c/8523/).

Maloo then runs several sets of tests; currently I think that's three. Once all the test sets have passed, Maloo contributes a +1. It looks like two of the three test sets have completed for the patch in question.

Separately, human reviewers contribute code reviews. A positive review gives a +1. The general standard is two +1s before landing a patch.

Then, finally, someone in the gatekeeper role - I believe that's currently Oleg Drokin and Andreas Dilger - approves the patch, which appears as +2. Then the patch is cherry-picked on to master (also by a gatekeeper).

Once the patch has been cherry-picked, it has landed.

This one is almost ready to land. The tests need to complete, then it should be approved and cherry-picked quickly. (Since this ticket is a blocker for 2.6, it'll definitely be in before that release.)

Comment by Jinshan Xiong (Inactive) [ 16/Jan/14 ]

The Maloo test failed due to an unrelated bug. I will retrigger the test.

Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ]

Phew, the last patch has landed.

Comment by Robin Humble [ 06/Feb/14 ]

Does this also fix LU-2032?

Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ]

Hi Robin,

No, they are not related.

Comment by Gregoire Pichon [ 21/Feb/14 ]

Could you provide the details of the configuration where you made your measurements (client node sockets, memory size, network interface, max_cached_mb setting, other client tuning, OSS node and OST storage, file striping, IO size, RPC size)?

Does the dd test include a final fsync of data to storage?

Do you have performance results with a version of Lustre after all the patches have landed?

thanks.

Comment by Jinshan Xiong (Inactive) [ 06/Mar/14 ]

Hi Pichon,

Now that you're asking, I assume the performance numbers didn't meet your expectations, so I performed the test again with the latest master to make sure everything is fine. If that is the case, please collect statistics on your node and I will take a look.

I just ran the performance test again on OpenSFS nodes with the following hardware configuration:

Client nodes:

[root@c01 lustre]# free
             total       used       free     shared    buffers     cached
Mem:      32870020   26477056    6392964          0     147936   21561448
-/+ buffers/cache:    4767672   28102348
Swap:     16506872          0   16506872
[root@c01 lustre]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              4800.10
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-7
[root@c01 lustre]# lspci |grep InfiniBand
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

So the client node has 32 GB of memory and 4 cores with 2 threads per core. The network is InfiniBand with 40 Gb/s throughput.

The server node is another client node with patch http://review.whamcloud.com/5164 applied. I used a ramdisk as the OST because we don't have a fast disk array; Jeremy saw a real performance improvement on their real disk storage. I disabled writethrough_cache_enable on the OST to avoid consuming too much memory caching data.
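
(For reference, the OSS-side cache settings referred to here and later in this ticket are normally toggled with lctl on the server, along these lines; shown as an illustration, not necessarily the exact commands used in this run:)

oss# lctl set_param obdfilter.*.writethrough_cache_enable=0
oss# lctl set_param obdfilter.*.read_cache_enable=0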

Here is the test result:

[root@c01 lustre]# dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=40960
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 39.5263 s, 1.1 GB/s
[root@c01 lustre]# lfs getstripe /mnt/lustre/testfile 
/mnt/lustre/testfile
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	             2	          0x2	             0

I didn't do any tuning on the client node, not even disabling checksums. Here is a snapshot of `collectl -scml':

[root@c01 ~]# collectl -scml
waiting for 1 second sample...
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  20  20  5500  26162  18G 144M  10G   9G   1G  45M       0      0  1110016   1084
  24  24  5513  23691  17G 144M  11G  10G   1G  45M       0      0  1025024   1001
  20  20  5657  26083  15G 144M  12G  11G   2G  45M       0      0  1112064   1086
  21  21  5434  25963  14G 144M  13G  12G   2G  45M       0      0  1110016   1084
  20  20  5690  26326  13G 144M  14G  13G   2G  45M       0      0  1104896   1079
  21  21  5646  26094  11G 144M  15G  14G   2G  45M       0      0  1105920   1080
  21  21  5466  24678  10G 144M  16G  15G   3G  45M       0      0  1046528   1022
  20  20  5634  25563   9G 144M  17G  16G   3G  45M       0      0  1097728   1072
  20  20  5818  26008   8G 144M  18G  17G   3G  45M       0      0  1111040   1085
  20  20  5673  26467   6G 144M  20G  18G   3G  45M       0      0  1104896   1079
  24  24  6346  25027   6G 144M  20G  19G   4G  45M       0      0  1060864   1036
  33  32  7162  21258   6G 144M  20G  19G   4G  45M       0      0   960512    938
  28  28  7021  22865   6G 144M  20G  19G   4G  45M       0      0  1042432   1018
  28  28  7177  23890   6G 144M  20G  19G   4G  45M       0      0  1039360   1015
  28  28  7326  24888   6G 144M  20G  19G   4G  45M       0      0  1090560   1065
  28  28  7465  24162   6G 144M  20G  19G   4G  45M       0      0  1029120   1005
  31  31  7382  22865   6G 144M  20G  19G   4G  45M       0      0   980992    958
  28  28  7263  24392   6G 144M  20G  19G   4G  45M       0      0  1075200   1050
  28  28  7278  24312   6G 144M  20G  19G   4G  45M       0      0  1080320   1055
  28  28  7252  25150   6G 144M  20G  19G   4G  45M       0      0  1059840   1035
  28  28  7241  25082   6G 144M  20G  19G   4G  45M       0      0  1076224   1051
  33  32  7343  22373   6G 144M  20G  19G   4G  45M       0      0   966656    944
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  28  28  7340  24704   6G 144M  20G  19G   4G  45M       0      0  1091584   1066
  27  27  7212  24694   6G 144M  20G  19G   4G  45M       0      0  1055744   1031
  28  28  7191  24909   6G 144M  20G  19G   4G  45M       0      0  1073152   1048
  28  28  7257  25058   6G 144M  20G  19G   4G  45M       0      0  1037312   1013
  33  33  7435  22787   6G 144M  20G  19G   4G  45M       0      0   988160    965
  28  28  6961  23635   6G 144M  20G  19G   4G  45M       0      0  1044480   1020
  27  27  7129  24866   6G 144M  20G  19G   4G  45M       0      0  1045504   1021
  28  27  7024  24380   6G 144M  20G  19G   4G  45M       0      0  1053666   1029
  28  28  7058  24489   6G 144M  20G  19G   4G  45M       0      0  1041426   1017
  33  33  7234  22235   6G 144M  20G  19G   4G  45M       0      0   970752    948
  27  27  7127  24555   6G 144M  20G  19G   4G  45M       0      0  1067008   1042
  28  28  7189  24215   6G 144M  20G  19G   4G  45M       0      0  1082368   1057
  28  28  7201  24734   6G 144M  20G  19G   4G  45M       0      0  1064960   1040
  27  27  7046  24564   6G 144M  20G  19G   4G  44M       0      0  1040384   1016
   0   0    67    110   6G 144M  20G  19G   4G  44M       0      0        0      0
   0   0    63    113   6G 144M  20G  19G   4G  44M       0      0        0      0

Does the dd test include a final fsync of data to storage?

No, I didn't, but I don't think this affects the result, because I ran dd with a 1M block size, so the dirty data is sent out immediately.
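
(If a final flush were wanted, a common way to include it in the dd run is conv=fsync; this is just an illustration, not what was run above:)

[root@c01 lustre]# dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=40960 conv=fsync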

Comment by Gregoire Pichon [ 16/Apr/14 ]

Hi Jinshan,

Here are the results of the performance measurements I have done.

Configuration
Client is a node with 2 Ivy Bridge sockets (24 cores, 2.7 GHz), 32 GB memory, 1 FDR InfiniBand adapter.
OSS is a node with 2 Sandy Bridge sockets (16 cores, 2.2 GHz), 32 GB memory, 1 FDR InfiniBand adapter, with 5 OST devices from a disk array and 1 OST ramdisk device.
Each disk array OST reaches 900 MiB/s write and 1100 MiB/s read with obdfilter-survey.
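
(For reference, that per-OST baseline is typically obtained with an obdfilter-survey invocation along these lines; the variable values here are illustrative, not the exact ones used:)

oss# nobjhi=2 thrhi=16 size=32768 case=disk sh obdfilter-survey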

Two Lustre versions have been tested: 2.5.57 and 1.8.8-wc1
OSS cache is disabled (writethrough_cache_enable=0 and read_cache_enable=0)

Benchmark
IOR with following options:
api=POSIX
filePerProc=1
blockSize=64G
transferSize=1M
numTasks=1
fsync=1
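
(As a rough command-line equivalent of the options above, assuming a standard IOR build launched with a single MPI task; not necessarily the exact invocation used:)

client$ mpirun -np 1 ior -a POSIX -F -e -b 64g -t 1m -o /mnt/lustre/ior_testfile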

Server and client system cache is cleared before each write test and read test
Tests are repeated 3 times, average value is computed.
Tests are run as a standard user.

Results
With the disk array OSTs, the best results are achieved with a stripe count of 3.

Lustre version     Write       Read
Lustre 2.5.57      886 MiB/s   1020 MiB/s
Lustre 1.8.8-wc1   823 MiB/s   1135 MiB/s

The write performance is below the 1 GiB/s target that was my goal. Do you think this is a level of performance we could achieve? What tuning would you recommend? I will attach monitoring data for one of the Lustre 2.5.57 runs.

For comparison, here are the results with the ramdisk OST:

Lustre version     Write       Read
Lustre 2.5.57      856 MiB/s   941 MiB/s
Lustre 1.8.8-wc1   919 MiB/s   1300 MiB/s

Various tunings have been tested but give no improvement: IOR transferSize, llite max_cached_mb, enabling the OSS cache.

What makes a significant difference with Lustre 2.5.57 is the write performance when the test is run as the root user: it reaches 926 MiB/s (+4.5% compared to a standard user). Should I open a separate ticket to track this difference?

Greg.

Comment by Gregoire Pichon [ 16/Apr/14 ]

Attached is the monitoring data from one of the Lustre 2.5.57 runs, as standard user.

Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ]

As for the difference between root and a normal user, I guess this is due to the quota check. Even if you have not enabled quota for this user, the client still checks whether quota is enabled for this specific user on each IO.
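
(As an aside, whether quota is actually enforced for a given user can be checked from the client with something like the following; <username> is a placeholder and the command is illustrative only:)

client$ lfs quota -u <username> /mnt/lustre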

What write speed do you get with 1 stripe? Please be sure to disable the debug log with `lctl set_param debug=0' on the client side. Also please monitor the CPU usage on the client side when writing a single-striped file and a multi-striped file specifically.

In general, if CPU is the bottleneck on the client side, adding more stripes won't improve IO speed.

Comment by Gregoire Pichon [ 17/Apr/14 ]

I agree that quota is the potential culprit. But why is it so expensive to check whether the user has quota enabled? Even if it's done for each IO, a single thread has no concurrency problem.

The performance with a stripe count of 1 is 720 MiB/s write and 896 MiB/s read (Lustre 2.5.57).

Attached is the monitoring data for a run with a stripe count of 1 (lu-3321-singlethreadperf2.tgz).

All runs have been launched with /proc/sys/lnet/debug set to 0 on both client and server.

Monitoring data for client CPU usage is in the client/mo85/cpustat directory:

  • cpustat-global and cpustat-global.png give overall CPU usage
  • cpustat-cpu0 and cpustat-cpu0.png give CPU0 usage (IOR is bound to that core).
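
(For reference, binding the single IOR task to CPU0 can be done with taskset under the MPI launcher, for example as below; launcher and binding options vary, so this is only a sketch:)

client$ mpirun -np 1 taskset -c 0 ior -a POSIX -F -e -b 64g -t 1m -o /mnt/lustre/ior_testfile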

Since IOR performance is close to OST device raw performance, using multiple stripes might help exceed this limit.

Comment by Jinshan Xiong (Inactive) [ 21/Apr/14 ]

The CPU stats clearly show that CPU usage is around 80% for a single-thread, single-stripe write; this is why you see only a slight performance improvement with a multi-striped file. CLIO is still CPU intensive, and your CPU can only drive ~900 MB/s of IO on the client side. As a comparison, the CPU on the OpenSFS cluster can drive ~1.2 GB/s.

Can you please provide the test script you're using to collect the data and generate the diagram, so that I can reproduce this on the OpenSFS cluster?

Comment by Patrick Farrell (Inactive) [ 21/Apr/14 ]

For the benefit of anyone else interested, here is the complete list of patches landed against LU-3321:

http://review.whamcloud.com/#/c/7888
http://review.whamcloud.com/#/c/7890
http://review.whamcloud.com/#/c/7891
http://review.whamcloud.com/#/c/7892
http://review.whamcloud.com/#/c/8174
http://review.whamcloud.com/#/c/7893
http://review.whamcloud.com/#/c/7894
http://review.whamcloud.com/#/c/7895
http://review.whamcloud.com/#/c/8523

7889, listed in Jinshan's earlier list of patches, was abandoned.

Comment by Gregoire Pichon [ 22/Apr/14 ]

Jinshan,

Attached is the script that generates the CPU usage graphs with gnuplot. The file "filename" contains the data, where each line has the following format:
time user system idle iowait

This can be obtained with the vmstat command for global CPU usage, or from the /proc/stat file for per-CPU usage.
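
(For reference, a minimal sketch of how such a per-CPU data file could be produced from /proc/stat; the values are raw cumulative jiffies, so differences between samples would be converted to percentages in post-processing:)

# append "time user system idle iowait" for cpu0 once per second
# (the user column here includes nice time)
while true; do
    awk -v t=$(date +%s) '$1 == "cpu0" { print t, $2+$3, $4, $5, $6 }' /proc/stat
    sleep 1
done >> cpustat-cpu0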

What model of CPU is present on the OpenSFS cluster?

Comment by Jinshan Xiong (Inactive) [ 22/Apr/14 ]
[root@c01 ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              4800.65
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-7

[root@c01 ~]# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
cpu MHz		: 1600.000
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.65
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: