[LU-2481] performance issue of parallel read Created: 12/Dec/12  Updated: 13/May/13  Resolved: 13/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: Lustre 2.1.5

Type: Bug Priority: Major
Reporter: Gregoire Pichon Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: ptr
Environment:

lustre 2.1.3
kernel 2.6.32-220
ofed 1.5.4
bullxlinux 6.2 (based on redhat 6.2)


Attachments: Text File oprofile.report.lu1666.16tasks.txt     File run.stats.gz     File testcase.sh    
Issue Links:
Related
is related to LU-1666 hdr->coh_page_guard contention (singl... Resolved
Severity: 3
Epic: client, performance, read
Rank (Obsolete): 5822

 Description   

One of our customers (TGCC, Lustre 2.1.3) is facing a performance issue when many tasks on one client read the same file at the same time. Performance drops as the number of tasks increases.

I have reproduced the issue and obtained these results:

    tasks    Mean (MB/s)    Total (MB/s)
dd      1      339      339
dd      2      107      214
dd      4       78      312
dd      8       64      512
dd     16       26      428
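
For reference, since the attached testcase.sh is not shown inline, the runs above correspond to this kind of test; the sketch below uses a placeholder file path and drops the client cache before each run.

#!/bin/bash
# Minimal sketch of the concurrent single-file read test (placeholder path/values).
FILE=/mnt/lustre/shared_input   # large file striped over the OSTs
NTASKS=${1:-16}                 # number of concurrent dd readers

sync
echo 3 > /proc/sys/vm/drop_caches   # as root: start with a cold client page cache

start=$(date +%s)
for i in $(seq 1 "$NTASKS"); do
    dd if="$FILE" of=/dev/null bs=1M 2>/dev/null &
done
wait
elapsed=$(( $(date +%s) - start )); [ "$elapsed" -gt 0 ] || elapsed=1

size_mb=$(( $(stat -c %s "$FILE") / 1024 / 1024 ))
echo "tasks=$NTASKS  mean=$(( size_mb / elapsed )) MB/s  total=$(( size_mb * NTASKS / elapsed )) MB/s"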

Applying the patch http://review.whamcloud.com/#change,3627 on Lustre 2.1.3 gives no real improvement.

    tasks    Mean (MB/s)    Total (MB/s)
dd      1      542      542
dd      2      118      236
dd      4       74      298
dd      8       73      591
dd     16       52      833

The profiling shows contention in CLIO page routines.
(I suppose the libpython entry comes from the mpirun synchronization of the 16 tasks)

samples  %        linenr info                 app name                 symbol name
1149893  34.8489  (no location information)   libpython2.6.so.1.0      /usr/lib64/libpython2.6.so.1.0
338676   10.2640  dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
335241   10.1599  cl_page.c:404               obdclass.ko              cl_page_find0
293448    8.8933  cl_page.c:729               obdclass.ko              cl_vmpage_page
235856    7.1479  filemap.c:963               vmlinux                  grab_cache_page_nowait
142686    4.3243  cl_page.c:661               obdclass.ko              cl_page_put
80065     2.4265  filemap.c:667               vmlinux                  find_get_page
46623     1.4130  copy_user_64.S:240          vmlinux                  copy_user_generic_string
34127     1.0343  (no location information)   libpthread-2.12.so       pthread_rwlock_rdlock
33685     1.0209  (no location information)   libc-2.12.so             getenv
31725     0.9615  (no location information)   libpthread-2.12.so       pthread_rwlock_unlock
30492     0.9241  swap.c:288                  vmlinux                  activate_page
28528     0.8646  (no location information)   libc-2.12.so             __dcigettext
23703     0.7183  (no location information)   libc-2.12.so             __strlen_sse42
22345     0.6772  vvp_page.c:120              lustre.ko                vvp_page_unassume
20646     0.6257  swap.c:183                  vmlinux                  put_page
18525     0.5614  intel_idle.c:215            vmlinux                  intel_idle
12774     0.3871  open.c:961                  vmlinux                  sys_close
12702     0.3849  filemap.c:579               vmlinux                  unlock_page
12685     0.3844  (no location information)   libc-2.12.so             __memcpy_ssse3
12134     0.3677  cl_page.c:874               obdclass.ko              cl_page_invoid
11734     0.3556  (no location information)   libpthread-2.12.so       __close_nocancel
10573     0.3204  cl_page.c:884               obdclass.ko              cl_page_owner_clear
10510     0.3185  (no location information)   libc-2.12.so             strerror_r
9783      0.2965  cl_page.c:1068              obdclass.ko              cl_page_unassume
9472      0.2871  (no location information)   oprofiled                /usr/bin/oprofiled
9438      0.2860  (no location information)   libc-2.12.so             __strncmp_sse2
8826      0.2675  rw.c:724                    lustre.ko                ll_readahead
8769      0.2658  filemap.c:527               vmlinux                  page_waitqueue
8095      0.2453  entry_64.S:470              vmlinux                  system_call_after_swapgs
7756      0.2351  cl_page.c:561               obdclass.ko              cl_page_state_set0
7615      0.2308  (no location information)   libc-2.12.so             __stpcpy_sse2
7366      0.2232  entry_64.S:462              vmlinux                  system_call
7097      0.2151  sched.c:4430                vmlinux                  find_busiest_group
6898      0.2091  (no location information)   libperl.so               /usr/lib64/perl5/CORE/libperl.so
6887      0.2087  (no location information)   libc-2.12.so             __ctype_b_loc
6780      0.2055  wait.c:251                  vmlinux                  __wake_up_bit
6631      0.2010  cl_page.c:139               obdclass.ko              cl_page_at_trusted
6588      0.1997  (no location information)   libc-2.12.so             __strlen_sse2
6336      0.1920  lov_io.c:239                lov.ko                   lov_page_stripe
5289      0.1603  lov_io.c:208                lov.ko                   lov_sub_get
5263      0.1595  ring_buffer.c:2834          vmlinux                  rb_get_reader_page
4595      0.1393  lvfs_lib.c:76               lvfs.ko                  lprocfs_counter_add
4373      0.1325  (no location information)   libc-2.12.so             strerror
4347      0.1317  lov_io.c:252                lov.ko                   lov_page_subio
4337      0.1314  cl_page.c:1036              obdclass.ko              cl_page_assume
4310      0.1306  (no location information)   libc-2.12.so             __mempcpy_sse2
3903      0.1183  lov_page.c:92               lov.ko                   lov_page_own
3866      0.1172  mutex.c:409                 vmlinux                  __mutex_lock_slowpath
3841      0.1164  lcommon_cl.c:1097           lustre.ko                cl2vm_page
3698      0.1121  radix-tree.c:414            vmlinux                  radix_tree_lookup_slot
3506      0.1063  vvp_page.c:109              lustre.ko                vvp_page_assume
3491      0.1058  lu_object.c:1115            obdclass.ko              lu_object_locate
3490      0.1058  (no location information)   ipmi_si.ko               port_inb
3457      0.1048  ring_buffer.c:3221          vmlinux                  ring_buffer_consume
3431      0.1040  cl_page.c:898               obdclass.ko              cl_page_owner_set
...

I would have expected per-task performance to remain at (or slightly below) the level of the single-reader case. That is what I observe on ext4, for instance.

I can easily reproduce the issue and can provide additional information (lustre traces, system traces, system dump, profiling, ...) if needed.

Please note that this issue is serious for this customer: its cluster has Lustre client nodes with many cores (up to 128), and this dramatically impacts application run times.



 Comments   
Comment by Gregoire Pichon [ 12/Dec/12 ]

Here are the reproducer (testcase.sh) and the complete profiling output with 16 tasks and the LU-1666 patch applied.

Comment by Gregoire Pichon [ 12/Dec/12 ]

I have uploaded to ftp.whamcloud.com, in /uploads/LU-2481, a system dump taken during the test case with 16 tasks.

ftp> ls -l
227 Entering Passive Mode (72,18,218,227,28,64).
150 Here comes the directory listing.
-rw-r--r--    1 123      114      126464137 Dec 12 05:49 vmcore
-rw-r--r--    1 123      114      121162472 Dec 12 05:54 vmlinux

Here are some elements of the dump analysis.
The dd tasks are all in generic_file_aio_read()/do_generic_file_read(), waiting for the page to become up to date and unlocked after being copied from the remote OSTs into the system page cache.

  • tasks 3434, 3435, 3436, 3438, 3440, 3442, 3447, 3449
    are reading the file at offset 0x2d000000 and waiting on struct page 0xffffea000f0d8ab8
  • tasks 3437, 3439, 3441, 3443, 3444, 3445, 3446, 3448
    are reading the file at offset 0x2d900000 and waiting on struct page 0xffffea001d517300

I suppose the page remains PG_locked while being read over the network from the OSTs.
Unfortunately, this does not explain why there is such a difference between one task and several!

Are the CLIO structures (cl_page and cl_object) for the file being read shared by all tasks, or duplicated for each task?

Comment by Jinshan Xiong (Inactive) [ 12/Dec/12 ]

LU-1666 is discussing single file write performance.

Comment by Jinshan Xiong (Inactive) [ 14/Dec/12 ]

Is the file striped over multiple OSTs?

So you just launched a group of processes with mpirun on one node, and each process reads the whole file. I'm not sure what the use case for this test is. You can try to make each process read a different portion of the file, and I think you will get better performance. You can also try stripe-aligned reads so that the read RPCs are sequential when they reach each OST.

Comment by Gregoire Pichon [ 18/Dec/12 ]

Yes, the file is striped over all OSTs (30 OSTs).
It is already doing stripe-aligned reads: the dd block size is 1MB and the filesystem stripe size is 1MB.

I don't know the application our customer is running, but it does not seem unusual to me for an application to have its tasks first read some shared input data. Reading different portions of a file might improve performance, but that is another use case.

How has the CLIO page layer been designed for single shared-file access? Are some internal structures shared, or duplicated for each task?

Comment by Alexandre Louvet [ 18/Dec/12 ]

Let's say the application is embarrassingly parallel. In that case, every process is fully independent, is not run using MPI, and we can imagine that all processes will read the same input parameter files. Since all processes are reading the same input file, I would expect that:

  • one process does the real I/O (i.e. sends the RPCs and populates the buffer cache)
  • all the others read data from the buffer cache

In that case, the performance of the first process should be (almost) the same as if it were running alone, and the aggregate bandwidth of all processes should be around (#processes * performance_of_the_first_one). This is not what we observe: increasing the number of processes reduces the aggregate bandwidth. On our large systems (128 CPUs) the aggregate performance becomes extremely low.

Comment by Jinshan Xiong (Inactive) [ 17/Jan/13 ]

Yes, Alexandre Louvet pointed out the exact issue. Can you please confirm the behavior of your application: does it behave exactly the same way as your test program? If not, I'm afraid you will have to revise it.

Comment by Gregoire Pichon [ 18/Jan/13 ]

Jinshan,
Sorry, I am not sure I understand your previous comment. Could you elaborate?
The test program I have provided behaves the same way as the real application.

Comment by Jinshan Xiong (Inactive) [ 24/Jan/13 ]

Hi Gregoire,

In that case, I think you could try a real parallel read, meaning each process reads only a part of the file, for example stripe by stripe. You can start with the test program before changing the application.

Here is how I suggest revising the test program:
1. say the file to be read is striped over M OSTs, with a stripe size of N bytes;
2. use M threads and make each thread read from a specific OST: thread i, i=[0..M-1), reads the file portions [(i+j*M)*N, (i+1+j*M)*N), j=[0, 1, ... until EOF) (see the sketch below).

Usually you can see some performance improvement this way; however, a single thread cannot saturate an OST if you are using a complex storage system behind it. In that case, you can try M*2 threads and make two threads read from each OST without overlapping.
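
A rough sketch of that access pattern (a sketch only, with a placeholder file path, using one dd reader per "thread"):

#!/bin/bash
# Sketch of the suggested stripe-aligned parallel read (placeholder path/values).
FILE=/mnt/lustre/shared_input
M=30                       # stripe count (number of OSTs the file is striped over)
N=$((1024 * 1024))         # stripe size in bytes
BLOCKS=$(( $(stat -c %s "$FILE") / N ))

for i in $(seq 0 $(( M - 1 ))); do
    (
        j=0
        # reader i touches only stripes i, i+M, i+2M, ... so its RPCs land on one OST
        while [ $(( i + j * M )) -lt "$BLOCKS" ]; do
            dd if="$FILE" of=/dev/null bs="$N" skip=$(( i + j * M )) count=1 2>/dev/null
            j=$(( j + 1 ))
        done
    ) &
done
wait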

Comment by Alexandre Louvet [ 25/Jan/13 ]

Jinshan,

We know all of that, but this is not the problem. The application is a group of programs running independently on the same node. There is no MPI, and using OpenMP or pthreads would increase the complexity of the application. We definitely need good performance when multiple processes read from the same file (and don't misunderstand us: we are not asking that one process taken from a group be faster than a single process alone on the box, we just want the aggregate bandwidth to be "reasonable").

The original issue was reported on larger nodes than the one used (I guess) by Gregoire. It is a 128-core system with 16 NHM sockets (so 16 NUMA zones). Here are some numbers on this box:
1 : 493 MB/s per process (493 MB/s aggregate)
2 : 49 MB/s per process ( 98 MB/s aggregate)
3 : 54 MB/s per process (162 MB/s aggregate)
4 : 41 MB/s per process (164 MB/s aggregate)
8 : 12 MB/s per process ( 96 MB/s aggregate)

Running the same test on the local disk (a single regular spindle) gives:
1 : 102 MB/s per process (102 MB/s aggregate)
2 : 86 MB/s per process (172 MB/s aggregate)
4 : 101 MB/s per process (404 MB/s aggregate)
8 : 80 MB/s per process (640 MB/s aggregate)
16: 60 MB/s per process (960 MB/s aggregate)
This is what we are supposed to observe: the aggregate bandwidth grows as the number of processes increases (not proportionally to the number of processes, but it grows).

I agree that the box being a large NUMA node emphasizes the issue. Running with more processes doesn't help, and the aggregate performance continues to go down.

I hope this is clearer.

Comment by Jinshan Xiong (Inactive) [ 25/Jan/13 ]

Ah, I see what you mean. Actually you don't care about the raw throughput from OST to client; you just want to see the aggregate performance increase as the number of threads increases, even if most of the CPUs just read data from the local cache.

I'll investigate this issue. Before that, can you please run your test program again with readahead disabled? You can disable readahead by writing 0 to the max_read_ahead_mb lprocfs entry.
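
For example, on the client (the tunable is exposed via lctl under llite; 40 is the default mentioned later in this ticket):

# disable client readahead (run on the client, as root)
lctl set_param llite.*.max_read_ahead_mb=0

# check the current value, and restore the default after the test
lctl get_param llite.*.max_read_ahead_mb
lctl set_param llite.*.max_read_ahead_mb=40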

Comment by Jinshan Xiong (Inactive) [ 25/Jan/13 ]

Also, please provide the brw_stats when the test is finished. Please remember to reset it before the test starts.
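
For example, something like the following on each OSS (assuming the usual per-OST obdfilter lprocfs entries; writing 0 clears the counters):

# on each OSS, before the test: reset the per-OST brw_stats
lctl set_param obdfilter.*.brw_stats=0

# on each OSS, after the test: collect the stats
lctl get_param obdfilter.*.brw_stats > /tmp/brw_stats.$(hostname).txt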

Comment by Gregoire Pichon [ 28/Jan/13 ]

Here are stats you requested.

There are four runs. The first two are with max_read_ahead_mb=40 (default value). The last two are with max_read_ahead_mb=0 and only 30MB read.

Run              tasks    Mean (MB/s)    Total (MB/s)
20130128_104229      1      453      453
20130128_102900     16        9      150

20130128_111400      1     0.40     0.40
20130128_111541     16     0.39        6

When readahead is disabled, RPCs seem to contain only 1 page, even though the application (dd) performs 1MB reads. Amazing!

Comment by Gregoire Pichon [ 28/Jan/13 ]

Just to be clear: the client is 'berlin20' and the OSSs are 'berlin7' and 'berlin8'.

Comment by Jinshan Xiong (Inactive) [ 28/Jan/13 ]

I've identified the problem. Actually, we have already fixed it in 2.4. I'll spend some time listing which patches need to be backported to fix it.

I was confused about what you really wanted, otherwise your issue could have been fixed much earlier; sorry about that.

Comment by Jinshan Xiong (Inactive) [ 07/Feb/13 ]

I'd like to backport patches 60e052d0 and 33257361 to 2.3; they may help reduce contention on the global coh_page_guard.

Comment by Peter Jones [ 07/Feb/13 ]

Jinshan

Do you mean b2_1 rather than b2_3? It is the former that Bull are using...

Peter

Comment by Gregoire Pichon [ 14/Feb/13 ]

I have backported to b2_1 the two patches mentioned by Jinshan.

http://review.whamcloud.com/5428
http://review.whamcloud.com/5429

I will publish the test results soon.

Comment by Gregoire Pichon [ 19/Feb/13 ]

Here are some results on a 32-core client node:

lustre 2.1.3
    tasks    Mean (MB/s)    Total (MB/s)
dd      1      449      449
dd      2      114      228
dd      4       71      287
dd      8       29      236
dd     16        9      155


lustre 2.1.3 + #5428 + #5429
    tasks    Mean (MB/s)    Total (MB/s)
dd      1      468      468
dd      2      115      230
dd      4      105      420
dd      8      103      824
dd     16       69     1110
dd     32       52     1676

The two patches combined provide a significant improvement.
The customer is happy with the performance gain on their application.

We are now waiting for the patch review and landing process.

Comment by Andreas Dilger [ 13/May/13 ]

Both of the patches have landed to b2_1, so closing this bug.
