[LU-2481] performance issue of parallel read Created: 12/Dec/12 Updated: 13/May/13 Resolved: 13/May/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | Lustre 2.1.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Gregoire Pichon | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ptr |
| Environment: | lustre 2.1.3 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Epic: | client, performance, read |
| Rank (Obsolete): | 5822 |
| Description |
|
One of our customers (TGCC - lustre 2.1.3) is facing a performance issue when reading the same file from many tasks on one client at the same time. Performance drops as the number of tasks increases. I have reproduced the issue and obtained these results:
tasks Mean Total (MB/s)
dd 1 339 339
dd 2 107 214
dd 4 78 312
dd 8 64 512
dd 16 26 428
Applying the patch http://review.whamcloud.com/#change,3627 on lustre 2.1.3 gives no real improvement.
tasks Mean Total (MB/s)
dd 1 542 542
dd 2 118 236
dd 4 74 298
dd 8 73 591
dd 16 52 833
The profiling shows contention in the CLIO page routines.
samples % linenr info app name symbol name
1149893 34.8489 (no location information) libpython2.6.so.1.0 /usr/lib64/libpython2.6.so.1.0
338676 10.2640 dec_and_lock.c:21 vmlinux _atomic_dec_and_lock
335241 10.1599 cl_page.c:404 obdclass.ko cl_page_find0
293448 8.8933 cl_page.c:729 obdclass.ko cl_vmpage_page
235856 7.1479 filemap.c:963 vmlinux grab_cache_page_nowait
142686 4.3243 cl_page.c:661 obdclass.ko cl_page_put
80065 2.4265 filemap.c:667 vmlinux find_get_page
46623 1.4130 copy_user_64.S:240 vmlinux copy_user_generic_string
34127 1.0343 (no location information) libpthread-2.12.so pthread_rwlock_rdlock
33685 1.0209 (no location information) libc-2.12.so getenv
31725 0.9615 (no location information) libpthread-2.12.so pthread_rwlock_unlock
30492 0.9241 swap.c:288 vmlinux activate_page
28528 0.8646 (no location information) libc-2.12.so __dcigettext
23703 0.7183 (no location information) libc-2.12.so __strlen_sse42
22345 0.6772 vvp_page.c:120 lustre.ko vvp_page_unassume
20646 0.6257 swap.c:183 vmlinux put_page
18525 0.5614 intel_idle.c:215 vmlinux intel_idle
12774 0.3871 open.c:961 vmlinux sys_close
12702 0.3849 filemap.c:579 vmlinux unlock_page
12685 0.3844 (no location information) libc-2.12.so __memcpy_ssse3
12134 0.3677 cl_page.c:874 obdclass.ko cl_page_invoid
11734 0.3556 (no location information) libpthread-2.12.so __close_nocancel
10573 0.3204 cl_page.c:884 obdclass.ko cl_page_owner_clear
10510 0.3185 (no location information) libc-2.12.so strerror_r
9783 0.2965 cl_page.c:1068 obdclass.ko cl_page_unassume
9472 0.2871 (no location information) oprofiled /usr/bin/oprofiled
9438 0.2860 (no location information) libc-2.12.so __strncmp_sse2
8826 0.2675 rw.c:724 lustre.ko ll_readahead
8769 0.2658 filemap.c:527 vmlinux page_waitqueue
8095 0.2453 entry_64.S:470 vmlinux system_call_after_swapgs
7756 0.2351 cl_page.c:561 obdclass.ko cl_page_state_set0
7615 0.2308 (no location information) libc-2.12.so __stpcpy_sse2
7366 0.2232 entry_64.S:462 vmlinux system_call
7097 0.2151 sched.c:4430 vmlinux find_busiest_group
6898 0.2091 (no location information) libperl.so /usr/lib64/perl5/CORE/libperl.so
6887 0.2087 (no location information) libc-2.12.so __ctype_b_loc
6780 0.2055 wait.c:251 vmlinux __wake_up_bit
6631 0.2010 cl_page.c:139 obdclass.ko cl_page_at_trusted
6588 0.1997 (no location information) libc-2.12.so __strlen_sse2
6336 0.1920 lov_io.c:239 lov.ko lov_page_stripe
5289 0.1603 lov_io.c:208 lov.ko lov_sub_get
5263 0.1595 ring_buffer.c:2834 vmlinux rb_get_reader_page
4595 0.1393 lvfs_lib.c:76 lvfs.ko lprocfs_counter_add
4373 0.1325 (no location information) libc-2.12.so strerror
4347 0.1317 lov_io.c:252 lov.ko lov_page_subio
4337 0.1314 cl_page.c:1036 obdclass.ko cl_page_assume
4310 0.1306 (no location information) libc-2.12.so __mempcpy_sse2
3903 0.1183 lov_page.c:92 lov.ko lov_page_own
3866 0.1172 mutex.c:409 vmlinux __mutex_lock_slowpath
3841 0.1164 lcommon_cl.c:1097 lustre.ko cl2vm_page
3698 0.1121 radix-tree.c:414 vmlinux radix_tree_lookup_slot
3506 0.1063 vvp_page.c:109 lustre.ko vvp_page_assume
3491 0.1058 lu_object.c:1115 obdclass.ko lu_object_locate
3490 0.1058 (no location information) ipmi_si.ko port_inb
3457 0.1048 ring_buffer.c:3221 vmlinux ring_buffer_consume
3431 0.1040 cl_page.c:898 obdclass.ko cl_page_owner_set
...
I would have expected the per-task performance to stay at (or slightly below) the level of the single-reader case, which is what I observe on ext4, for instance. I can easily reproduce the issue and can provide additional information (Lustre traces, system traces, system dump, profiling, ...) if needed. Please note that this issue is serious for this customer, because its cluster has Lustre client nodes with many cores (up to 128 cores) and this dramatically impacts application run times. |
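A minimal sketch of this kind of reproducer (not the attached testcase.sh itself; the file path and task count below are placeholders):

```sh
#!/bin/bash
# Sketch of the parallel shared-file read test (assumed to resemble the
# attached testcase.sh; path and task count are placeholders).
FILE=/mnt/lustre/shared_input   # hypothetical striped test file
NTASKS=${1:-16}                 # number of concurrent reader tasks

# Drop the client page cache so every run really reads from the OSTs.
echo 3 > /proc/sys/vm/drop_caches

# Each task reads the whole file with 1 MB requests, as in the dd runs above.
for i in $(seq 1 "$NTASKS"); do
    dd if="$FILE" of=/dev/null bs=1M &
done
wait
```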
| Comments |
| Comment by Gregoire Pichon [ 12/Dec/12 ] |
|
Here are the reproducer (testcase.sh) and the complete profiling output with 16 tasks and |
| Comment by Gregoire Pichon [ 12/Dec/12 ] |
|
I have uploaded on ftp.whamcloud.com in /uploads/

ftp> ls -l
227 Entering Passive Mode (72,18,218,227,28,64).
150 Here comes the directory listing.
-rw-r--r-- 1 123 114 126464137 Dec 12 05:49 vmcore
-rw-r--r-- 1 123 114 121162472 Dec 12 05:54 vmlinux

Here are some elements of the dump analysis.
I suppose the page remains under PG_locked while being read over the network from the OSTs. Are the CLIO structures (cl_page and cl_object) for the file being read shared by all the tasks or duplicated for each task? |
| Comment by Jinshan Xiong (Inactive) [ 12/Dec/12 ] |
|
|
| Comment by Jinshan Xiong (Inactive) [ 14/Dec/12 ] |
|
Does the file stripe over multiple OSTs? So you just launched a bunch of processes with mpirun on one node, and each process reads the whole file. I'm not sure what the use case for this testing is. You can try making each process read a different portion of the file; I think you will get better performance. You can further try stripe-aligned reads so that the read RPCs arrive sequentially at each OST. |
| Comment by Gregoire Pichon [ 18/Dec/12 ] |
|
Yes, the file stripes over all OSTs (30 OSTs). I don't know the application our customer is running, but it does not seem unusual to me for an application to have its tasks first read some shared input data. Reading different portions of a file might improve performance, but it is another use case. How has the CLIO page layer been designed for single-shared-file access? Are the internal structures shared or duplicated for each task? |
| Comment by Alexandre Louvet [ 18/Dec/12 ] |
|
Let's say the application is embarrassingly parallel. In such a case, every process is fully independent and is not run using MPI, and we can imagine that all processes will read the same input parameter files. Since all processes are reading the same input file, I would imagine that:
|
| Comment by Jinshan Xiong (Inactive) [ 17/Jan/13 ] |
|
Yes, Alexandre Louvet pointed out the exact issue. Can you please confirm the behavior of your application? If it works exactly the same way as your test program, I'm afraid you will have to revise it. |
| Comment by Gregoire Pichon [ 18/Jan/13 ] |
|
Jinshan, |
| Comment by Jinshan Xiong (Inactive) [ 24/Jan/13 ] |
|
Hi Gregoire, In that case, I think you could try a real parallel read, which means each process reads only a part of the file, for example stripe by stripe. You can start with the test program before changing the application. Here is how I suggest revising the test program: Usually you will see some performance improvement this way; however, a single thread can't saturate an OST if you're using a complex storage system behind the OST. In that case, you can try M*2 threads and have two threads read each OST without overlapping. |
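A sketch of the stripe-by-stripe split described above, assuming a 1 MiB stripe size and M reader tasks (task RANK reads stripes RANK, RANK+M, RANK+2M, ...); the path and values are placeholders:

```sh
#!/bin/bash
# Sketch: stripe-aligned parallel read, one "stripe column" per task.
FILE=/mnt/lustre/shared_input   # hypothetical test file
STRIPE_MB=1                     # assumed stripe size in MiB
M=${1:?number of tasks}         # total number of reader tasks
RANK=${2:?task index 0..M-1}    # this task's index

NSTRIPES=$(( $(stat -c %s "$FILE") / (STRIPE_MB * 1024 * 1024) ))
# Read every M-th stripe starting at RANK, so tasks never touch the same pages.
for (( s = RANK; s < NSTRIPES; s += M )); do
    dd if="$FILE" of=/dev/null bs=${STRIPE_MB}M skip="$s" count=1 2>/dev/null
done
```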
| Comment by Alexandre Louvet [ 25/Jan/13 ] |
|
Jinshan, we know all of that, but this is not the problem. The application is a group of programs running independently on the same node. There is no MPI, and adding OpenMP or pthreads would increase the complexity of the application. We definitely need good performance while multiple processes read from the same file (and don't misunderstand us: we are not asking that one process taken from a group be faster than one process running alone in the box, we just want the aggregate bandwidth to be "reasonable"). The original issue was reported on larger nodes than the one used (I guess) by Gregoire. It is a 128-core system with 16 NHM sockets (so 16 NUMA zones). Here are some numbers on this box: while running the same test on the local disk (a single regular spindle) gives: I agree that the fact that the box is a large NUMA node emphasizes the issue. Running with more processes doesn't help and the aggregate performance continues to go down. I hope this is clearer. |
| Comment by Jinshan Xiong (Inactive) [ 25/Jan/13 ] |
|
Ah, I see what you mean. Actually you don't care about the raw throughput from OST to client; you just want to see aggregate performance increase as the number of threads increases, even if most of the CPUs just read data from the local cache. I'll investigate this issue. Before doing that, can you please run your test program again with readahead disabled? You can disable readahead by writing 0 to the max_read_ahead_mb lprocfs entry. |
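For example, on the client (a sketch; the tunable is the max_read_ahead_mb entry mentioned above, and 40 is assumed to be the default value to restore afterwards):

```sh
# Disable client-side readahead before the run...
lctl set_param llite.*.max_read_ahead_mb=0

# ...and restore the assumed default (40 on this setup) afterwards.
lctl set_param llite.*.max_read_ahead_mb=40
```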
| Comment by Jinshan Xiong (Inactive) [ 25/Jan/13 ] |
|
Also, please provide me the brw_stats when the test is finished. Please remember to reset it before the test starts. |
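A sketch of the reset/collect sequence on the OSS side (assuming the 2.1-era /proc layout, where writing to brw_stats clears the counters; the output path is a placeholder):

```sh
# On each OSS: clear the per-OST brw_stats before the test starts.
for f in /proc/fs/lustre/obdfilter/*/brw_stats; do
    echo 0 > "$f"
done

# Run the read test on the client, then collect the stats afterwards:
lctl get_param obdfilter.*.brw_stats > /tmp/brw_stats.$(hostname).log
```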
| Comment by Gregoire Pichon [ 28/Jan/13 ] |
|
Here are the stats you requested. There are four runs. The first two are with max_read_ahead_mb=40 (the default value). The last two are with max_read_ahead_mb=0 and only 30 MB read.
Run tasks Mean Total (MB/s)
20130128_104229 1 453 453
20130128_102900 16 9 150
20130128_111400 1 0.40 0.40
20130128_111541 16 0.39 6
When readahead is disabled, the RPCs seem to be made of 1 page only, even though the application (dd) performs reads of 1 MB. Amazing! |
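For reference, the pages-per-RPC distribution behind this observation can also be seen on the client in the osc rpc_stats counters, e.g. (a sketch, assuming the standard client-side lprocfs layout):

```sh
# Client-side: show the "pages per rpc" histogram for each OSC;
# with readahead disabled it should show mostly 1-page read RPCs.
lctl get_param osc.*.rpc_stats | grep -A 12 "pages per rpc"
```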
| Comment by Gregoire Pichon [ 28/Jan/13 ] |
|
Just to be clear: the client is 'berlin20' and the OSSs are 'berlin7' and 'berlin8'. |
| Comment by Jinshan Xiong (Inactive) [ 28/Jan/13 ] |
|
I now understand the problem. Actually, we have already fixed this problem in 2.4. I'll spend some time listing which patches need to be backported to fix it. I was confused about what you really wanted; otherwise your issue could have been fixed much earlier, sorry about that. |
| Comment by Jinshan Xiong (Inactive) [ 07/Feb/13 ] |
|
I'd like to backport patches 60e052d0 and 33257361 to 2.3; they may help reduce the contention on the global coh_page_guard. |
| Comment by Peter Jones [ 07/Feb/13 ] |
|
Jinshan, do you mean b2_1 rather than b2_3? It is the former that Bull are using... Peter |
| Comment by Gregoire Pichon [ 14/Feb/13 ] |
|
I have backported to b2_1 the two patches mentioned by Jinshan:
http://review.whamcloud.com/5428
http://review.whamcloud.com/5429
I will publish the test results soon. |
| Comment by Gregoire Pichon [ 19/Feb/13 ] |
|
Here are some results on a 32-core client node:
lustre 2.1.3
tasks Mean Total (MB/s)
dd 1 449 449
dd 2 114 228
dd 4 71 287
dd 8 29 236
dd 16 9 155
lustre 2.1.3 + #5428 + #5429
tasks Mean Total (MB/s)
dd 1 468 468
dd 2 115 230
dd 4 105 420
dd 8 103 824
dd 16 69 1110
dd 32 52 1676
The two patches combined provide a significant improvement. We are now waiting for the patch review and landing process. |
| Comment by Andreas Dilger [ 13/May/13 ] |
|
Both of the patches have landed to b2_1, so closing this bug. |