[LU-3178] single stream write to striped file is slow with 2.X clients Created: 16/Apr/13 Updated: 26/Sep/14 Resolved: 15/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Erich Focht | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | client | ||
| Environment: |
Four OSS building blocks, each with 2 servers and a NEC SNA460 (aka NetApp E5560) RAID providing 8 OSTs. CentOS 6.3. Servers run Lustre 2.1.5. Clients tested with Lustre 1.8.8, 2.1.5, and 2.3.63. |
||
| Attachments: |
|
| Severity: | 3 |
| Epic: | client, performance |
| Rank (Obsolete): | 7747 |
| Description |
|
Single-stream writes to a striped file perform poorly with Lustre 2.* clients. The discussion started on the hpdd-discuss mailing list (Subject: "striped write performance in lustre 2.X?"). With Lustre 1.8.* (servers and clients) it was rather easy to exceed 1 GB/s with a file striped over 4-8 OSTs. With Lustre 2.1.5 servers the Lustre 1.8.8 client is again the fastest option (though not as fast as it used to be). Comparing a Lustre 1.8.8 client run against a 2.3.63-ish client that includes some tunings:

1.8.8 client:
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gbx bs=4M count=$((256/4*1000))
268435456000 bytes (268 GB) copied, 349.921 s, 767 MB/s

2.3.63 client:
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gby bs=4M count=$((256/4*1000))
268435456000 bytes (268 GB) copied, 689.333 s, 389 MB/s

1.8.8 client rpc_stats (part):
                         read                      write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 11594 18 18
1: 0 0 0 | 13130 20 38
2: 0 0 0 | 16592 25 64
3: 0 0 0 | 16543 25 90
4: 0 0 0 | 3762 5 96
5: 0 0 0 | 1598 2 98
6: 0 0 0 | 338 0 99
7: 0 0 0 | 401 0 99
8: 0 0 0 | 45 0 99
9: 0 0 0 | 1 0 100
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 64000 99 99
1: 0 0 0 | 0 0 99
2: 0 0 0 | 0 0 99
4: 0 0 0 | 0 0 99
8: 0 0 0 | 0 0 99
16: 0 0 0 | 0 0 99
32: 0 0 0 | 0 0 99
64: 0 0 0 | 0 0 99
128: 0 0 0 | 4 0 100
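(For reference, these histograms come from the per-OSC rpc_stats files; a minimal collection sketch, assuming that writing 0 clears the counters as with other Lustre stats files, and using an illustrative test file name:)

# lctl set_param osc.*.rpc_stats=0     # clear the per-OSC RPC histograms before the run
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/testfile bs=4M count=64000
# lctl get_param osc.*.rpc_stats       # dump the 'rpcs in flight' and 'offset' histograms afterwards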
1.8.8 client perf top:
2151.00 9.3% ll_ra_read_init [lustre]
1781.00 7.7% copy_user_generic_string [kernel.kallsyms]
1734.00 7.5% _spin_lock [kernel.kallsyms]
1300.00 5.6% lov_putref [lov]
781.00 3.4% radix_tree_tag_clear [kernel.kallsyms]
720.00 3.1% _spin_lock_irqsave [kernel.kallsyms]
667.00 2.9% intel_idle [kernel.kallsyms]
607.00 2.6% osc_destroy [osc]
554.00 2.4% __clear_user [kernel.kallsyms]
551.00 2.4% refresh_entry [lvfs]
451.00 2.0% kiblnd_ni_fini_pools [ko2iblnd]
399.00 1.7% radix_tree_tag_set [kernel.kallsyms]
356.00 1.5% __percpu_counter_add [kernel.kallsyms]
352.00 1.5% lov_queue_group_io [lov]
340.00 1.5% mark_page_accessed [kernel.kallsyms]
338.00 1.5% _spin_lock_irq [kernel.kallsyms]
307.00 1.3% test_clear_page_writeback [kernel.kallsyms]
306.00 1.3% lov_stripe_unlock [lov]
302.00 1.3% _spin_unlock_irqrestore [kernel.kallsyms]
280.00 1.2% osc_update_grant [osc]
272.00 1.2% __dec_zone_state [kernel.kallsyms]
221.00 1.0% lop_makes_rpc [osc]
2.3.63 client rpc_stats:
                         read                      write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
1: 0 0 0 | 16030 25 25
2: 0 0 0 | 16516 25 50
3: 0 0 0 | 23525 36 87
4: 0 0 0 | 7909 12 99
5: 0 0 0 | 7 0 99
6: 0 0 0 | 5 0 99
7: 0 0 0 | 4 0 99
8: 0 0 0 | 4 0 100
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 1 0 0
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 0 0 0 | 1 0 0
512: 0 0 0 | 2 0 0
1024: 0 0 0 | 4 0 0
2048: 0 0 0 | 8 0 0
4096: 0 0 0 | 16 0 0
8192: 0 0 0 | 32 0 0
16384: 0 0 0 | 64 0 0
32768: 0 0 0 | 128 0 0
65536: 0 0 0 | 256 0 0
131072: 0 0 0 | 512 0 1
262144: 0 0 0 | 1024 1 3
524288: 0 0 0 | 2048 3 6
1048576: 0 0 0 | 4096 6 12
2097152: 0 0 0 | 8192 12 25
4194304: 0 0 0 | 16384 25 51
8388608: 0 0 0 | 31232 48 100
2.3.63 client perf top (up to 1%):
5.03% [kernel] [k] copy_user_generic_string
3.79% [obdclass] [k] key_fini
3.78% [kernel] [k] _spin_lock
2.43% [obdclass] [k] keys_fill
2.13% [kernel] [k] kmem_cache_alloc
1.99% [kernel] [k] __clear_user
1.92% [kernel] [k] kmem_cache_free
1.78% [lvfs] [k] lprocfs_counter_add
1.48% [kernel] [k] kfree
1.34% [kernel] [k] __kmalloc
1.33% [kernel] [k] radix_tree_delete
1.31% [kernel] [k] _spin_lock_irqsave
1.28% [obdclass] [k] cl_page_put
1.15% [kernel] [k] radix_tree_insert
1.13% [lvfs] [k] lprocfs_counter_sub
1.04% [osc] [k] osc_lru_del
1.02% [obdclass] [k] cl_page_find0
1.01% [obdclass] [k] lu_context_key_get
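(The profiles above were taken interactively with 'perf top'; an equivalent non-interactive sketch, not taken from the ticket, that samples system-wide while the dd runs and prints a report afterwards:)

# perf record -a -g -o dd.perf.data -- dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/testfile bs=4M count=64000
# perf report -i dd.perf.data | head -40       # per-symbol profile, comparable to the perf top listings above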
Upon request from Jinshan I ran 'collectl -scml' during the dd:

#### 2.3.63 client:
[root@sabi22 ~]# collectl -scml -i 15
waiting for 15 second sample...
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead Reads KBWrite Writes
   0   0    78    192  60G 112K   1G  43M 170M  80M       0     0       0      0
   2   2  2578  10780  53G 112K   7G   5G   1G  85M       0     0  392124     96
   5   5  4639  20295  39G 112K  17G  16G   4G  85M       0     0  741103    181
   4   4  4000  16842  31G 112K  24G  22G   6G  85M       0     0  609758    149
   4   4  3046  11499  31G 112K  24G  22G   6G  85M       0     0  411785    101
   4   4  3060  11592  31G 112K  24G  22G   6G  85M       0     0  412058    101
   4   4  3096  11445  31G 112K  24G  23G   6G  85M       0     0  405231     99
   4   4  3080  11440  31G 112K  24G  23G   6G  85M       0     0  404958     99
   4   4  3052  11386  30G 112K  24G  23G   6G  85M       0     0  405777     99
   4   4  3066  11477  30G 112K  24G  23G   6G  85M       0     0  407689    100
   4   4  3044  11552  30G 112K  25G  23G   6G  85M       0     0  412058    101
   4   4  3058  11382  30G 112K  25G  23G   6G  85M       0     0  403319     98
   4   4  3095  11412  30G 112K  25G  23G   6G  85M       0     0  404958     99
   4   4  3067  11370  30G 112K  25G  24G   6G  85M       0     0  403593     99
   4   4  3083  11478  29G 112K  25G  24G   6G  85M       0     0  407689    100
   4   4  3107  11374  29G 112K  25G  24G   6G  85M       0     0  401135     98
   4   4  3094  11280  29G 112K  25G  24G   6G  85M       0     0  396766     97
   4   4  3078  11282  29G 112K  25G  24G   6G  85M       0     0  398404     97
   4   4  3098  11241  29G 112K  26G  24G   6G  85M       0     0  396493     97
   4   4  3067  11238  28G 112K  26G  24G   6G  85M       0     0  393762     96
   4   4  3080  11513  28G 112K  26G  24G   6G  85M       0     0  395947     97
   4   4  3068  11327  28G 112K  26G  24G   6G  85M       0     0  394854     96
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead Reads KBWrite Writes
   4   4  3069  11444  28G 112K  26G  25G   6G  85M       0     0  396766     97
   4   4  3060  11337  28G 112K  26G  25G   7G  85M       0     0  392397     96
   4   4  3050  11220  28G 112K  26G  25G   7G  85M       0     0  388301     95
   4   4  3076  11355  28G 112K  26G  25G   7G  85M       0     0  394854     96
   4   4  3096  11359  27G 112K  27G  25G   7G  85M       0     0  394308     96
   4   4  3049  11316  27G 112K  27G  25G   7G  85M       0     0  394854     96
   4   4  3060  11325  27G 112K  27G  25G   7G  85M       0     0  396493     97
   4   4  3058  11381  27G 112K  27G  25G   7G  85M       0     0  396220     97
   4   4  3071  11343  27G 112K  27G  26G   7G  85M       0     0  395127     96
   4   4  3015  11213  27G 112K  27G  26G   7G  85M       0     0  390485     95
   4   4  3046  11328  26G 112K  27G  26G   7G  85M       0     0  394308     96
   4   4  3069  11454  26G 112K  27G  26G   7G  85M       0     0  397858     97
   4   4  3058  11524  26G 112K  28G  26G   7G  85M       0     0  399497     98
   3   3  3033  11427  26G 112K  28G  26G   7G  85M       0     0  397858     97
   3   3  3048  11500  26G 112K  28G  26G   7G  85M       0     0  402227     98
   4   4  3056  11291  26G 112K  28G  26G   7G  85M       0     0  395127     96
   4   4  3067  11223  25G 112K  28G  27G   7G  85M       0     0  394035     96
   3   3  3055  11366  25G 112K  28G  27G   7G  85M       0     0  398404     97
   4   4  3036  11337  25G 112K  28G  27G   7G  85M       0     0  397585     97
   4   4  3057  11307  25G 112K  28G  27G   7G  85M       0     0  395947     97
   4   4  3080  11241  25G 112K  29G  27G   7G  85M       0     0  393489     96
   1   1  1392   5003  25G 112K  29G  27G   7G  80M       0     0  170667     42
#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead Reads KBWrite Writes
   0   0   101    216  25G 112K  29G  27G   7G  80M       0     0       0      0

Doesn't seem to fill the memory...

Further details:

[root@sabi22 ~]# lctl get_param llite.*.max_cached_mb
llite.lnec-ffff8808348d2c00.max_cached_mb=
users: 32
max_cached_mb: 48397
used_mb: 24610
unused_mb: 23787
reclaim_count: 0
[root@sabi22 ~]# lctl get_param osc.*.max_dirty_mb
osc.lnec-OST0000-osc-ffff8808348d2c00.max_dirty_mb=32
[...]
osc.lnec-OST0018-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST0019-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST001a-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST001b-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST001c-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST001d-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST001e-osc-ffff8808348d2c00.max_dirty_mb=32
osc.lnec-OST001f-osc-ffff8808348d2c00.max_dirty_mb=32

[root@sabi22 EF]# lfs getstripe /mnt/lnec/pool4/stripe_4_4096/test256a
/mnt/lnec/pool4/stripe_4_4096/test256a
lmm_stripe_count:   4
lmm_stripe_size:    4194304
lmm_layout_gen:     0
lmm_stripe_offset:  27
lmm_pool:           osb4
        obdidx           objid          objid            group
            27            6792         0x1a88                0
            31            5891         0x1703                0
            24            6983         0x1b47                0
            28            5251         0x1483                0
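For reference, a directory layout like the one shown above (stripe count 4, 4 MiB stripe size, pool osb4) would presumably have been set up along the following lines; the exact option spelling is an assumption and varies between lfs versions (-s vs. -S for the stripe size):

# lfs setstripe -c 4 -S 4M -p osb4 /mnt/lnec/pool4/stripe_4_4096    # older lfs spells the stripe size option -s 4M
# lfs getstripe -d /mnt/lnec/pool4/stripe_4_4096                    # verify the directory default layout |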
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/13 ] |
|
Can you please go back to commit 7bcf31e8fa8ccb900d3c015a5e7202b8bcca97bd and see how it goes? Also, please collect oprofile results while you're running the test, thanks!
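(A minimal oprofile sequence for such a run, as a generic sketch using the legacy opcontrol interface rather than anything prescribed in this ticket, might look like:)

# opcontrol --init
# opcontrol --no-vmlinux                  # or --vmlinux=<path to vmlinux> if kernel symbol resolution is wanted
# opcontrol --start
# dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/testfile bs=4M count=64000
# opcontrol --dump && opcontrol --stop
# opreport -l | head -40                  # per-symbol profile, comparable to the perf top output above |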
| Comment by Erich Focht [ 16/Apr/13 ] |
|
opreport for a dd run with checkout 0f1d64724480bdfa268d18153272c3484c90e578 |
| Comment by Erich Focht [ 16/Apr/13 ] |
|
Jinshan, I just finished a run with what I had previously (0f1d64724480bdfa268d18153272c3484c90e578). I will now move to the version you suggested. As mentioned before, I can change the client but not the servers; they are running 2.1.5. |
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/13 ] |
|
That's fine; it's not necessary to change the server version. BTW, there is an option to control the cache behavior on the OST (the write cache); you can try disabling it to see how it goes.
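(On the 2.1 OSS side the switches in question are presumably the obdfilter cache tunables; a minimal sketch, to be run on each OSS, with parameter names assumed from the 2.1 obdfilter interface:)

# lctl set_param obdfilter.*.writethrough_cache_enable=0   # stop caching data written through the OSTs
# lctl set_param obdfilter.*.read_cache_enable=0           # optionally disable the read cache as well
# lctl get_param obdfilter.*.*cache*                       # verify the current settings |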
| Comment by Erich Focht [ 16/Apr/13 ] |
|
I usually run with writethrough_cache disabled; only the oprofiled run was done with the caches enabled. I just did another measurement and the behavior is the same (still the 0f1d64... git checkout), only this time the client's memory filled up at some point. Also, there was a point where the better bandwidth came back for a short time. If you'd like the collectl or oprofile output of this run, I can upload it. Otherwise I'll report back when I have the 7bcf31 client results (tomorrow). |
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/13 ] |
|
Sure, please upload the collectl log. I just want to gather as much information as possible in case I really need it. |
| Comment by Erich Focht [ 16/Apr/13 ] |
|
collectl and oprofile of another dd run (4-way striped file, 256GB, writethrough_cache on servers disabled). Lustre git checkout was 0f1d64724480bdfa268d18153272c3484c90e578 |
| Comment by Erich Focht [ 17/Apr/13 ] |
|
Attached collectl and oprofile output for the client built with git checkout 7bcf31e8fa8ccb900d3c015a5e7202b8bcca97bd. The results are the best among all 2.X tests of the single-stream write to a striped file:

dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256f bs=4M count=64000
64000+0 records in
64000+0 records out
268435456000 bytes (268 GB) copied, 433.208 s, 620 MB/s |
| Comment by Jinshan Xiong (Inactive) [ 23/Apr/13 ] |
|
It looks like recent patches degraded the performance. Would you please try reverting the patch d5390bbde9b525b13ab91f5610abe1c4bb98eff9 from your git tree and check how it goes?
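(Reverting and rebuilding would look roughly like the following; the tree name and the client-only build options are assumptions:)

# cd lustre-release
# git revert d5390bbde9b525b13ab91f5610abe1c4bb98eff9
# sh autogen.sh && ./configure --disable-server && make rpms   # rebuild and reinstall the client packages |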
| Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ] |
|
Hello Erich, |
| Comment by John Fuchs-Chesney (Inactive) [ 15/Mar/14 ] |
|
Looks like we will not make any further progress on this issue. |
| Comment by Cory Spitz [ 26/Sep/14 ] |
|
> Looks like we will not make any further progress on this issue.

That's a shame. I don't think that "single stream write to striped file is slow with 2.x clients" (compared to 1.8.x) has been fixed yet in 2.6.0. Or is it simply that other tickets now track that issue? |