[LU-3321] 2.x single thread/process throughput degraded from 1.8 Created: 13/May/13 Updated: 03/May/17 Resolved: 06/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jeremy Filizetti | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: |
Tested on 2.3.64 and 1.8.9 clients with 4 OSS x 3 - 32 GB OST ramdisks |
||
| Attachments: |
|
| Issue Links: |
|
| Epic/Theme: | Performance |
| Severity: | 3 |
| Rank (Obsolete): | 8259 |
| Description |
|
Single thread/process throughput on tag 2.3.64 is degraded from 1.8.9, and significantly degraded once the client hits its caching limit (llite.*.max_cached_mb). The attached graph shows lnet stats sampled every second for a single dd writing two 64 GB files, followed by dropping the cache and reading the same two files. The tests were not done simultaneously, but the graph has them starting from the same point. It also takes a significant amount of time to drop the cache on 2.3.64.

Lustre 2.3.64: Drop caches (echo 1 > /proc/sys/vm/drop_caches); Read (dd if=testfile of=/dev/null bs=1M)
Lustre 1.8.9: Drop caches (echo 1 > /proc/sys/vm/drop_caches); Read (dd if=testfile of=/dev/null bs=1M) |
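For reference, a minimal sketch of the write/drop/read sequence described above (mount point, file names and sizes are illustrative assumptions):

# Check the client cache limit that triggers the degradation
lctl get_param llite.*.max_cached_mb

# Write two 64 GB files with a single dd stream each
dd if=/dev/zero of=/mnt/lustre/testfile1 bs=1M count=65536
dd if=/dev/zero of=/mnt/lustre/testfile2 bs=1M count=65536

# Drop the client page cache, then read the same files back
echo 1 > /proc/sys/vm/drop_caches
dd if=/mnt/lustre/testfile1 of=/dev/null bs=1M
dd if=/mnt/lustre/testfile2 of=/dev/null bs=1M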
| Comments |
| Comment by Peter Jones [ 13/May/13 ] | ||||||||||||||||||
|
Jeremy

Could you please repeat these tests with 2.1.5 and 2.3 to see whether the behaviour is consistent with 2.3.65? Also, could you please check whether the patches in gerrit for

Thanks

Peter |
| Comment by Jeremy Filizetti [ 15/May/13 ] | ||||||||||||||||||
|
Based on some feedback from Jinshan I retried this test with http://review.whamcloud.com/#change,5446 from

Here are the times from this last test (and graph): 2.3.65 + http://review.whamcloud.com/#change,5446 |
| Comment by Peter Jones [ 21/Jun/13 ] | ||||||||||||||||||
|
Jinshan Will this situation be improved with the CLIO simplification work that has been proposed? Peter | ||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 21/Jun/13 ] | ||||||||||||||||||
|
Hi Peter, Yes, I think so. | ||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ] | ||||||||||||||||||
|
Please check patch for master at: http://review.whamcloud.com/7888 and patch for b2_4 at: http://review.whamcloud.com/7896 | ||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ] | ||||||||||||||||||
|
Let me share a test result from Jeremy which showed performance improvement. | ||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ] | ||||||||||||||||||
|
And the test I did in our lab | ||||||||||||||||||
| Comment by Erich Focht [ 06/Nov/13 ] | ||||||||||||||||||
|
Jinshan, a quick question: are the results you've uploaded on Oct. 11 measured on a striped file? How many stripes? Were these multiple threads on one striped file, or each thread with its own file? I've seen very poor performance with striped files and no increase in performance when increasing the number of stripes with Lustre 2.x; I just want to make sure whether your measurements cover "my" use case or not. Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 07/Nov/13 ] | ||||||||||||||||||
|
Hi Erich,

I didn't use striped files, and in the multi-thread testing each thread wrote its own file. Did you use my performance patch for the test?

I assume you're benchmarking a multi-striped file with a single thread. In my experience, if your OSTs are fast, striping a file across multiple OSTs won't improve performance, because the bottleneck is on the client CPU. You can take a look at the OSC rpc_stats (lctl get_param osc.*.rpc_stats); if the value of rpcs_in_flight is low, it implies the client can't generate data fast enough to saturate the OSTs. |
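For example, the check could look like this (the dd parameters are illustrative; the rpc_stats parameter is the one named above):

# While a single-threaded write is running, sample the OSC RPC statistics
dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=4096 &
lctl get_param osc.*.rpc_stats
# In the "rpcs in flight" histogram of the output, samples concentrated at
# 0 or 1 indicate the client cannot generate data fast enough to keep RPCs queued.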
| Comment by Christopher Morrone [ 09/Nov/13 ] | ||||||||||||||||||
|
We are rather disappointed by the revert of commit 93fe562. While I can understand not liking the performance impact, basic functionality trumps performance. On nodes with high processor counts, the lustre client thrashes to the point of being unusable without that patch. I think this ticket is now a blocker for 2.6, because we can't operate with the tree in its current state. | ||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 09/Nov/13 ] | ||||||||||||||||||
|
No worries, we already have a patch for a per-CPU cl_env cache. It turns out that the overhead of allocating a cl_env is really high, so caching cl_env is necessary; we just need a smarter way to cache them. |
| Comment by Jodi Levi (Inactive) [ 26/Dec/13 ] | ||||||||||||||||||
|
Once http://review.whamcloud.com/#/c/8523/ lands, this ticket can be closed. |
| Comment by John Fuchs-Chesney (Inactive) [ 15/Jan/14 ] | ||||||||||||||||||
|
Can anyone tell me about the status of the final patch for this issue? Looks like some recent testing has been successful, but I don't know the other tools well enough to know if the patch is ready to land, or even already landed. Thanks. | ||||||||||||||||||
| Comment by Patrick Farrell (Inactive) [ 15/Jan/14 ] | ||||||||||||||||||
|
John - There are a few components to the review/landing process. The patch is built automatically by Jenkins, which contributes a +1 if it builds correctly (that's already true for http://review.whamcloud.com/#/c/8523/). Maloo then runs several sets of tests, currently I think that's three. Once all the sets of tests have passed, Maloo contributes a +1. It looks like two of the three test sets have completed for the patch in question. Separately, human reviewers contribute code reviews. A positive review gives a +1. The general standard is two +1s before landing a patch. Then, finally, someone in the gatekeeper role - I believe that's currently Oleg Drokin and Andreas Dilger - approves the patch, which appears as +2. Then the patch is cherry-picked on to master (also by a gatekeeper). Once the patch has been cherry-picked, it has landed. This one is almost ready to land. The tests need to complete, then it should be approved and cherry-picked quickly. (Since this ticket is a blocker for 2.6, it'll definitely be in before that release.) | ||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 16/Jan/14 ] | ||||||||||||||||||
|
The Maloo test failed due to an unrelated bug. I will re-trigger the test. |
| Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ] | ||||||||||||||||||
|
Phew, the last patch has landed. |
| Comment by Robin Humble [ 06/Feb/14 ] | ||||||||||||||||||
|
Does this also fix LU-2032? |
| Comment by Jinshan Xiong (Inactive) [ 06/Feb/14 ] | ||||||||||||||||||
|
Hi Robin, No, they are not related. | ||||||||||||||||||
| Comment by Gregoire Pichon [ 21/Feb/14 ] | ||||||||||||||||||
|
Could you provide the details of the configuration where you made your measurements (client node socket, memory size, network interface, max_cached_mb setting, other client tuning, OSS node and OST storage, file striping, IO size, RPC size)? Does the dd test include a final fsync of data to storage? Do you have performance results with a version of Lustre after all the patches have landed? Thanks. |
| Comment by Jinshan Xiong (Inactive) [ 06/Mar/14 ] | ||||||||||||||||||
|
Hi Pichon,

Now that you're asking, I assume the performance numbers didn't reach your expectations, so I performed the test again with the latest master to make sure everything is fine. If that is the case, please collect statistics on your node and I will take a look.

I just performed the performance testing again on the OpenSFS nodes with the following hardware configuration:

Client nodes:

[root@c01 lustre]# free
total used free shared buffers cached
Mem: 32870020 26477056 6392964 0 147936 21561448
-/+ buffers/cache: 4767672 28102348
Swap: 16506872 0 16506872
[root@c01 lustre]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 1600.000
BogoMIPS: 4800.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-7
[root@c01 lustre]# lspci |grep InfiniBand
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
So the client node has 32 GB of memory and 4 cores with 2 threads per core. The network is InfiniBand with 40 Gb/s throughput. The server node is another client node with patch http://review.whamcloud.com/5164 applied. I used a ramdisk as the OST because we don't have a fast disk array; Jeremy saw a real performance improvement on their real disk storage. I disabled writethrough_cache_enable on the OST to avoid consuming too much memory caching data.

Here is the test result:

[root@c01 lustre]# dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=40960
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 39.5263 s, 1.1 GB/s
[root@c01 lustre]# lfs getstripe /mnt/lustre/testfile
/mnt/lustre/testfile
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
        obdidx           objid          objid            group
             0                2            0x2                0

I didn't do any tuning on the client node, not even disabling checksums. Also, a snapshot of `collectl -scml':

[root@c01 ~]# collectl -scml
waiting for 1 second sample...
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead Reads  KBWrite Writes
  20  20  5500  26162  18G 144M  10G   9G   1G  45M       0     0  1110016   1084
  24  24  5513  23691  17G 144M  11G  10G   1G  45M       0     0  1025024   1001
  20  20  5657  26083  15G 144M  12G  11G   2G  45M       0     0  1112064   1086
  21  21  5434  25963  14G 144M  13G  12G   2G  45M       0     0  1110016   1084
  20  20  5690  26326  13G 144M  14G  13G   2G  45M       0     0  1104896   1079
  21  21  5646  26094  11G 144M  15G  14G   2G  45M       0     0  1105920   1080
  21  21  5466  24678  10G 144M  16G  15G   3G  45M       0     0  1046528   1022
  20  20  5634  25563   9G 144M  17G  16G   3G  45M       0     0  1097728   1072
  20  20  5818  26008   8G 144M  18G  17G   3G  45M       0     0  1111040   1085
  20  20  5673  26467   6G 144M  20G  18G   3G  45M       0     0  1104896   1079
  24  24  6346  25027   6G 144M  20G  19G   4G  45M       0     0  1060864   1036
  33  32  7162  21258   6G 144M  20G  19G   4G  45M       0     0   960512    938
  28  28  7021  22865   6G 144M  20G  19G   4G  45M       0     0  1042432   1018
  28  28  7177  23890   6G 144M  20G  19G   4G  45M       0     0  1039360   1015
  28  28  7326  24888   6G 144M  20G  19G   4G  45M       0     0  1090560   1065
  28  28  7465  24162   6G 144M  20G  19G   4G  45M       0     0  1029120   1005
  31  31  7382  22865   6G 144M  20G  19G   4G  45M       0     0   980992    958
  28  28  7263  24392   6G 144M  20G  19G   4G  45M       0     0  1075200   1050
  28  28  7278  24312   6G 144M  20G  19G   4G  45M       0     0  1080320   1055
  28  28  7252  25150   6G 144M  20G  19G   4G  45M       0     0  1059840   1035
  28  28  7241  25082   6G 144M  20G  19G   4G  45M       0     0  1076224   1051
  33  32  7343  22373   6G 144M  20G  19G   4G  45M       0     0   966656    944
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead Reads  KBWrite Writes
  28  28  7340  24704   6G 144M  20G  19G   4G  45M       0     0  1091584   1066
  27  27  7212  24694   6G 144M  20G  19G   4G  45M       0     0  1055744   1031
  28  28  7191  24909   6G 144M  20G  19G   4G  45M       0     0  1073152   1048
  28  28  7257  25058   6G 144M  20G  19G   4G  45M       0     0  1037312   1013
  33  33  7435  22787   6G 144M  20G  19G   4G  45M       0     0   988160    965
  28  28  6961  23635   6G 144M  20G  19G   4G  45M       0     0  1044480   1020
  27  27  7129  24866   6G 144M  20G  19G   4G  45M       0     0  1045504   1021
  28  27  7024  24380   6G 144M  20G  19G   4G  45M       0     0  1053666   1029
  28  28  7058  24489   6G 144M  20G  19G   4G  45M       0     0  1041426   1017
  33  33  7234  22235   6G 144M  20G  19G   4G  45M       0     0   970752    948
  27  27  7127  24555   6G 144M  20G  19G   4G  45M       0     0  1067008   1042
  28  28  7189  24215   6G 144M  20G  19G   4G  45M       0     0  1082368   1057
  28  28  7201  24734   6G 144M  20G  19G   4G  45M       0     0  1064960   1040
  27  27  7046  24564   6G 144M  20G  19G   4G  44M       0     0  1040384   1016
   0   0    67    110   6G 144M  20G  19G   4G  44M       0     0        0      0
   0   0    63    113   6G 144M  20G  19G   4G  44M       0     0        0      0
No, I didn't include a final fsync, but I don't think this will affect the result, because I ran dd with a 1M block size, so the dirty data is sent out immediately. |
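For anyone reproducing this setup, a hedged sketch of the two tunables mentioned above (the obdfilter prefix assumes an ldiskfs OSS; checksums were left at their default in the run above):

# On the OSS: disable the server-side write-through cache for all local OSTs
lctl set_param obdfilter.*.writethrough_cache_enable=0

# On the client: checksums could be disabled for comparison (not done here)
lctl set_param osc.*.checksums=0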
| Comment by Gregoire Pichon [ 16/Apr/14 ] | ||||||||||||||||||
|
Hi Jinshan,

Here are the results of the performance measurements I have done.

Configuration

Two Lustre versions have been tested: 2.5.57 and 1.8.8-wc1

Benchmark

Server and client system cache is cleared before each write test and read test.

Results
The write performance is under the 1 GiB/s target which was my goal. Do you think this is a performance we could achieve? What tuning would you recommend?

I will provide monitoring data in attachment for one of the Lustre 2.5.57 runs.

As an element of comparison, the results with the ram device OST:
Various tunings have been tested but give no improvement:

What makes a significant difference with Lustre 2.5.57 is the write performance when the test is run as the root user, since it reaches 926 MiB/s (+4.5% compared to a standard user). Should I open a separate ticket to track this difference?

Greg. |
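As an illustration of this kind of single-thread benchmark (not the actual configuration, which is in the attachment; transfer size, block size and path are assumptions), a single-process IOR run could look like:

# Single-process sequential write then read, with an fsync at the end of the write phase
mpirun -np 1 ior -w -r -e -t 1m -b 64g -o /mnt/lustre/ior_testfile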
| Comment by Gregoire Pichon [ 16/Apr/14 ] | ||||||||||||||||||
|
In attachment: the monitoring data from one of the Lustre 2.5.57 runs, as a standard user. |
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ] | ||||||||||||||||||
|
For the difference between root and a normal user, I guess this is due to the quota check. Even though you may not have enabled quota for this normal user, it still checks whether quota is enabled for this specific user on each IO.

What write speed can you get with 1 stripe? Please be sure to disable the debug log with `lctl set_param debug=0' on the client side. Also, please monitor the CPU usage on the client side when writing a single-striped file and a multi-striped file. In general, if CPU is the bottleneck on the client side, adding more stripes won't improve IO speed. |
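A minimal sketch of that measurement (file names and sizes are illustrative):

# On the client: turn off Lustre debug logging before measuring
lctl set_param debug=0
# Sample CPU usage once per second in the background while the write runs
vmstat 1 > cpu_single_stripe.log &
VMSTAT_PID=$!
dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=40960
kill $VMSTAT_PID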
| Comment by Gregoire Pichon [ 17/Apr/14 ] | ||||||||||||||||||
|
I agree with quota being the potential culprit. But why is it so expensive to check whether a user has quota enabled? Even if it is done for each IO, a single thread has no concurrency problem.

The performance with a stripe count of 1 is: 720 MiB/s write and 896 MiB/s read (Lustre 2.5.57). In attachment: monitoring data for a run with a stripe count of 1 (lu-3321-singlethreadperf2.tgz). All runs have been launched with /proc/sys/lnet/debug set to 0 on both client and server. Monitoring data for client CPU usage is in the client/mo85/cpustat directory:
Since IOR performance is close to OST device raw performance, using multiple stripes might help exceed this limit. | ||||||||||||||||||
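For completeness, a hedged sketch of comparing stripe counts (mount point, file names and sizes are illustrative):

# Pre-create files with different stripe counts, then write each with a single dd stream
lfs setstripe -c 1 /mnt/lustre/test_1stripe
lfs setstripe -c 4 /mnt/lustre/test_4stripe
dd if=/dev/zero of=/mnt/lustre/test_1stripe bs=1M count=40960
dd if=/dev/zero of=/mnt/lustre/test_4stripe bs=1M count=40960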
| Comment by Jinshan Xiong (Inactive) [ 21/Apr/14 ] | ||||||||||||||||||
|
The CPU stats clearly show that CPU usage is around 80% for a single-thread, single-stripe write; this is why you see a slight performance improvement with a multi-striped file. CLIO is still CPU intensive, and your CPU can only drive ~900 MB/s of IO on the client side. As a comparison, the CPU on the OpenSFS cluster can drive ~1.2 GB/s.

Can you please provide me the test script you're using to collect data and generate the diagrams, so that I can reproduce this on the OpenSFS cluster? |
| Comment by Patrick Farrell (Inactive) [ 21/Apr/14 ] | ||||||||||||||||||
|
Just as a favor to anyone else interested, this is a complete list of patches landed against http://review.whamcloud.com/#/c/7888

7889, listed in Jinshan's earlier list of patches, was abandoned. |
| Comment by Gregoire Pichon [ 22/Apr/14 ] | ||||||||||||||||||
|
Jinshan,

In attachment: the script that generates the CPU usage graphs with gnuplot. The file "filename" contains the data, where each line has the following format:

This can be obtained with the vmstat command for global CPU usage, or from the /proc/stat file for per-CPU usage.

What model of CPU is present on the OpenSFS cluster? |
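Not the attached script, but as an illustration: assuming "filename" holds one sample per line in the form "<time> <user%> <system%>", such a graph could be produced with gnuplot like this:

# Plot user and system CPU usage over time with gnuplot
gnuplot <<'EOF'
set terminal png size 800,400
set output "cpu_usage.png"
set xlabel "time (s)"
set ylabel "CPU usage (%)"
set yrange [0:100]
plot "filename" using 1:2 with lines title "user", \
     "filename" using 1:3 with lines title "system"
EOF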
| Comment by Jinshan Xiong (Inactive) [ 22/Apr/14 ] | ||||||||||||||||||
[root@c01 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              4800.65
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-7
[root@c01 ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping        : 2
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4800.65
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: |