Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.1.5
    • Affects Version: Lustre 2.1.3
    • Environment:
        lustre 2.1.3
        kernel 2.6.32-220
        ofed 1.5.4
        bullxlinux 6.2 (based on redhat 6.2)

    Description

      One of our customers (TGCC, running Lustre 2.1.3) is facing a performance issue when reading the same file from many tasks on a single client at the same time. Performance drops as the number of tasks increases.
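      For reference, the test boils down to launching N concurrent dd readers of a single shared file and measuring each task's throughput. The exact script is the attached testcase.sh; the sketch below is only an illustration, with an assumed file path:

          #!/bin/bash
          # Hypothetical sketch of the test: N concurrent dd readers of the same
          # large file on a Lustre mount (the real script is the attached testcase.sh).
          FILE=/mnt/lustre/bigfile       # assumed path to a large pre-written file
          NTASKS=${1:-16}                # number of concurrent reader tasks

          # the client page cache is typically dropped between runs
          # echo 3 > /proc/sys/vm/drop_caches

          for i in $(seq 1 "$NTASKS"); do
              # each task reads the whole file in 1 MB chunks; dd reports its own MB/s
              dd if="$FILE" of=/dev/null bs=1M &
          done
          wait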

      I have reproduced the issue and obtained these results:

          tasks   Mean (MB/s)   Total (MB/s)
      dd      1      339      339
      dd      2      107      214
      dd      4       78      312
      dd      8       64      512
      dd     16       26      428
      

      Applying the patch http://review.whamcloud.com/#change,3627 to lustre 2.1.3 gives no real improvement.

          tasks   Mean (MB/s)   Total (MB/s)
      dd      1      542      542
      dd      2      118      236
      dd      4       74      298
      dd      8       73      591
      dd     16       52      833
      

      The profiling shows contention in CLIO page routines.
      (I suppose the libpython entry comes from the mpirun synchronization of the 16 tasks)

      samples  %        linenr info                 app name                 symbol name
      1149893  34.8489  (no location information)   libpython2.6.so.1.0      /usr/lib64/libpython2.6.so.1.0
      338676   10.2640  dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
      335241   10.1599  cl_page.c:404               obdclass.ko              cl_page_find0
      293448    8.8933  cl_page.c:729               obdclass.ko              cl_vmpage_page
      235856    7.1479  filemap.c:963               vmlinux                  grab_cache_page_nowait
      142686    4.3243  cl_page.c:661               obdclass.ko              cl_page_put
      80065     2.4265  filemap.c:667               vmlinux                  find_get_page
      46623     1.4130  copy_user_64.S:240          vmlinux                  copy_user_generic_string
      34127     1.0343  (no location information)   libpthread-2.12.so       pthread_rwlock_rdlock
      33685     1.0209  (no location information)   libc-2.12.so             getenv
      31725     0.9615  (no location information)   libpthread-2.12.so       pthread_rwlock_unlock
      30492     0.9241  swap.c:288                  vmlinux                  activate_page
      28528     0.8646  (no location information)   libc-2.12.so             __dcigettext
      23703     0.7183  (no location information)   libc-2.12.so             __strlen_sse42
      22345     0.6772  vvp_page.c:120              lustre.ko                vvp_page_unassume
      20646     0.6257  swap.c:183                  vmlinux                  put_page
      18525     0.5614  intel_idle.c:215            vmlinux                  intel_idle
      12774     0.3871  open.c:961                  vmlinux                  sys_close
      12702     0.3849  filemap.c:579               vmlinux                  unlock_page
      12685     0.3844  (no location information)   libc-2.12.so             __memcpy_ssse3
      12134     0.3677  cl_page.c:874               obdclass.ko              cl_page_invoid
      11734     0.3556  (no location information)   libpthread-2.12.so       __close_nocancel
      10573     0.3204  cl_page.c:884               obdclass.ko              cl_page_owner_clear
      10510     0.3185  (no location information)   libc-2.12.so             strerror_r
      9783      0.2965  cl_page.c:1068              obdclass.ko              cl_page_unassume
      9472      0.2871  (no location information)   oprofiled                /usr/bin/oprofiled
      9438      0.2860  (no location information)   libc-2.12.so             __strncmp_sse2
      8826      0.2675  rw.c:724                    lustre.ko                ll_readahead
      8769      0.2658  filemap.c:527               vmlinux                  page_waitqueue
      8095      0.2453  entry_64.S:470              vmlinux                  system_call_after_swapgs
      7756      0.2351  cl_page.c:561               obdclass.ko              cl_page_state_set0
      7615      0.2308  (no location information)   libc-2.12.so             __stpcpy_sse2
      7366      0.2232  entry_64.S:462              vmlinux                  system_call
      7097      0.2151  sched.c:4430                vmlinux                  find_busiest_group
      6898      0.2091  (no location information)   libperl.so               /usr/lib64/perl5/CORE/libperl.so
      6887      0.2087  (no location information)   libc-2.12.so             __ctype_b_loc
      6780      0.2055  wait.c:251                  vmlinux                  __wake_up_bit
      6631      0.2010  cl_page.c:139               obdclass.ko              cl_page_at_trusted
      6588      0.1997  (no location information)   libc-2.12.so             __strlen_sse2
      6336      0.1920  lov_io.c:239                lov.ko                   lov_page_stripe
      5289      0.1603  lov_io.c:208                lov.ko                   lov_sub_get
      5263      0.1595  ring_buffer.c:2834          vmlinux                  rb_get_reader_page
      4595      0.1393  lvfs_lib.c:76               lvfs.ko                  lprocfs_counter_add
      4373      0.1325  (no location information)   libc-2.12.so             strerror
      4347      0.1317  lov_io.c:252                lov.ko                   lov_page_subio
      4337      0.1314  cl_page.c:1036              obdclass.ko              cl_page_assume
      4310      0.1306  (no location information)   libc-2.12.so             __mempcpy_sse2
      3903      0.1183  lov_page.c:92               lov.ko                   lov_page_own
      3866      0.1172  mutex.c:409                 vmlinux                  __mutex_lock_slowpath
      3841      0.1164  lcommon_cl.c:1097           lustre.ko                cl2vm_page
      3698      0.1121  radix-tree.c:414            vmlinux                  radix_tree_lookup_slot
      3506      0.1063  vvp_page.c:109              lustre.ko                vvp_page_assume
      3491      0.1058  lu_object.c:1115            obdclass.ko              lu_object_locate
      3490      0.1058  (no location information)   ipmi_si.ko               port_inb
      3457      0.1048  ring_buffer.c:3221          vmlinux                  ring_buffer_consume
      3431      0.1040  cl_page.c:898               obdclass.ko              cl_page_owner_set
      ...
      

      I would have expected the per-task performance to stay at (or slightly below) the level of the single-reader case; that is what I observe on ext4, for instance.

      I can easily reproduce the issue and can provide additional information (lustre traces, system traces, system dump, profiling, ...) if needed.

      Please note that this issue is serious for this customer, because their cluster has Lustre client nodes with many cores (up to 128) and this dramatically impacts application run times.

      Attachments

        1. oprofile.report.lu1666.16tasks.txt
          220 kB
        2. run.stats.gz
          43 kB
        3. testcase.sh
          0.8 kB


          Activity

            [LU-2481] performance issue of parallel read

            adilger Andreas Dilger added a comment -

            Both of the patches have landed to b2_1, so closing this bug.

            pichong Gregoire Pichon added a comment -

            Here are some results on a 32-core client node:

            lustre 2.1.3
                tasks   Mean (MB/s)   Total (MB/s)
            dd      1      449      449
            dd      2      114      228
            dd      4       71      287
            dd      8       29      236
            dd     16        9      155
            
            
            lustre 2.1.3 + #5428 + #5429
                tasks   Mean (MB/s)   Total (MB/s)
            dd      1      468      468
            dd      2      115      230
            dd      4      105      420
            dd      8      103      824
            dd     16       69     1110
            dd     32       52     1676
            

            The two patches combined provide a significant improvement.
            The customer is happy with the performance gain on their application.

            We are now waiting for the patch review and landing process.


            pichong Gregoire Pichon added a comment -

            I have backported to b2_1 the two patches mentioned by Jinshan:

            http://review.whamcloud.com/5428
            http://review.whamcloud.com/5429

            I will publish the test results soon.
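            For reference, changes of this kind can typically be pulled straight from Gerrit and cherry-picked onto a b2_1 checkout, along these lines (the remote URL and the trailing patchset numbers below are assumptions; adjust as needed):

                # fetch each Gerrit change by its refs/changes/<nn>/<change>/<patchset> ref
                # (remote URL and patchset numbers are assumptions)
                git fetch git://git.whamcloud.com/fs/lustre-release.git refs/changes/28/5428/1
                git cherry-pick FETCH_HEAD
                git fetch git://git.whamcloud.com/fs/lustre-release.git refs/changes/29/5429/1
                git cherry-pick FETCH_HEAD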

            pjones Peter Jones added a comment -

            Jinshan

            Do you mean b2_1 rather than b2_3? It is the former that Bull are using...

            Peter


            jay Jinshan Xiong (Inactive) added a comment -

            I'd like to backport patches 60e052d0 and 33257361 to 2.3; they may help reduce contention on the global coh_page_guard lock.

            jay Jinshan Xiong (Inactive) added a comment -

            I understand the problem now. We have actually already fixed it in 2.4. I'll spend some time listing which patches need to be backported to fix it.

            I was confused about what you really wanted, otherwise your issue could have been fixed much earlier; sorry about that.

            pichong Gregoire Pichon added a comment -

            Just to be clear: the client is 'berlin20' and the OSSs are 'berlin7' and 'berlin8'.

            pichong Gregoire Pichon added a comment -

            Here are the stats you requested.

            There are four runs. The first two are with max_read_ahead_mb=40 (default value). The last two are with max_read_ahead_mb=0 and only 30MB read.

            Run                tasks   Mean (MB/s)   Total (MB/s)
            20130128_104229      1      453      453
            20130128_102900     16        9      150
            
            20130128_111400      1     0.40     0.40
            20130128_111541     16     0.39        6
            

            When readahead is disabled, RPCs seem to contain only 1 page, even though the application (dd) performs 1 MB reads. Amazing!

            jay Jinshan Xiong (Inactive) added a comment - edited

            Also, please provide me the brw_stats when the test is finished. Remember to reset it before the test starts.
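            For anyone repeating this, a typical way to reset and then collect brw_stats on the OSS side looks like the sketch below (exact parameter paths may vary by release):

                # on each OSS, before the test: clear the per-OST brw_stats counters
                lctl set_param obdfilter.*.brw_stats=0

                # after the test: dump the histograms (pages per RPC, I/O sizes, ...)
                lctl get_param obdfilter.*.brw_stats > brw_stats.$(hostname).txt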


            jay Jinshan Xiong (Inactive) added a comment -

            Ah, I see what you mean. You don't actually care about the raw throughput from the OSTs to the client; you just want to see the aggregate performance increase as the number of threads increases, even when most of the CPUs are just reading data from the local cache.

            I'll investigate this issue. Before that, could you please run your test program again with readahead disabled? You can disable readahead by writing 0 to the max_read_ahead_mb lprocfs entry.
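            For reference, a sketch of how that readahead setting can be toggled from the client via lctl:

                # disable client readahead (value is in MB); the lprocfs file is
                # /proc/fs/lustre/llite/<fsname>-<instance>/max_read_ahead_mb
                lctl set_param llite.*.max_read_ahead_mb=0

                # restore the default used in the runs above
                lctl set_param llite.*.max_read_ahead_mb=40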


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: pichong Gregoire Pichon
              Votes: 0
              Watchers: 12
