Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.1.5
    • Affects Version: Lustre 2.1.3
    • Environment: lustre 2.1.3
      kernel 2.6.32-220
      ofed 1.5.4
      bullxlinux 6.2 (based on redhat 6.2)

    Description

      One of our customers (TGCC, running Lustre 2.1.3) is facing a performance issue when reading the same file from many tasks on one client at the same time. Performance drops as the number of tasks increases.

      I have reproduced the issue and obtained these results:

          cmd   tasks   Mean (MB/s)   Total (MB/s)
          dd        1           339            339
          dd        2           107            214
          dd        4            78            312
          dd        8            64            512
          dd       16            26            428
      

      Applying the patch http://review.whamcloud.com/#change,3627 on Lustre 2.1.3 gives no real improvement.

          cmd   tasks   Mean (MB/s)   Total (MB/s)
          dd        1           542            542
          dd        2           118            236
          dd        4            74            298
          dd        8            73            591
          dd       16            52            833
      

      The profiling shows contention in the CLIO page routines.
      (I suppose the libpython entry comes from mpirun's synchronization of the 16 tasks.)

      samples  %        linenr info                 app name                 symbol name
      1149893  34.8489  (no location information)   libpython2.6.so.1.0      /usr/lib64/libpython2.6.so.1.0
      338676   10.2640  dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
      335241   10.1599  cl_page.c:404               obdclass.ko              cl_page_find0
      293448    8.8933  cl_page.c:729               obdclass.ko              cl_vmpage_page
      235856    7.1479  filemap.c:963               vmlinux                  grab_cache_page_nowait
      142686    4.3243  cl_page.c:661               obdclass.ko              cl_page_put
      80065     2.4265  filemap.c:667               vmlinux                  find_get_page
      46623     1.4130  copy_user_64.S:240          vmlinux                  copy_user_generic_string
      34127     1.0343  (no location information)   libpthread-2.12.so       pthread_rwlock_rdlock
      33685     1.0209  (no location information)   libc-2.12.so             getenv
      31725     0.9615  (no location information)   libpthread-2.12.so       pthread_rwlock_unlock
      30492     0.9241  swap.c:288                  vmlinux                  activate_page
      28528     0.8646  (no location information)   libc-2.12.so             __dcigettext
      23703     0.7183  (no location information)   libc-2.12.so             __strlen_sse42
      22345     0.6772  vvp_page.c:120              lustre.ko                vvp_page_unassume
      20646     0.6257  swap.c:183                  vmlinux                  put_page
      18525     0.5614  intel_idle.c:215            vmlinux                  intel_idle
      12774     0.3871  open.c:961                  vmlinux                  sys_close
      12702     0.3849  filemap.c:579               vmlinux                  unlock_page
      12685     0.3844  (no location information)   libc-2.12.so             __memcpy_ssse3
      12134     0.3677  cl_page.c:874               obdclass.ko              cl_page_invoid
      11734     0.3556  (no location information)   libpthread-2.12.so       __close_nocancel
      10573     0.3204  cl_page.c:884               obdclass.ko              cl_page_owner_clear
      10510     0.3185  (no location information)   libc-2.12.so             strerror_r
      9783      0.2965  cl_page.c:1068              obdclass.ko              cl_page_unassume
      9472      0.2871  (no location information)   oprofiled                /usr/bin/oprofiled
      9438      0.2860  (no location information)   libc-2.12.so             __strncmp_sse2
      8826      0.2675  rw.c:724                    lustre.ko                ll_readahead
      8769      0.2658  filemap.c:527               vmlinux                  page_waitqueue
      8095      0.2453  entry_64.S:470              vmlinux                  system_call_after_swapgs
      7756      0.2351  cl_page.c:561               obdclass.ko              cl_page_state_set0
      7615      0.2308  (no location information)   libc-2.12.so             __stpcpy_sse2
      7366      0.2232  entry_64.S:462              vmlinux                  system_call
      7097      0.2151  sched.c:4430                vmlinux                  find_busiest_group
      6898      0.2091  (no location information)   libperl.so               /usr/lib64/perl5/CORE/libperl.so
      6887      0.2087  (no location information)   libc-2.12.so             __ctype_b_loc
      6780      0.2055  wait.c:251                  vmlinux                  __wake_up_bit
      6631      0.2010  cl_page.c:139               obdclass.ko              cl_page_at_trusted
      6588      0.1997  (no location information)   libc-2.12.so             __strlen_sse2
      6336      0.1920  lov_io.c:239                lov.ko                   lov_page_stripe
      5289      0.1603  lov_io.c:208                lov.ko                   lov_sub_get
      5263      0.1595  ring_buffer.c:2834          vmlinux                  rb_get_reader_page
      4595      0.1393  lvfs_lib.c:76               lvfs.ko                  lprocfs_counter_add
      4373      0.1325  (no location information)   libc-2.12.so             strerror
      4347      0.1317  lov_io.c:252                lov.ko                   lov_page_subio
      4337      0.1314  cl_page.c:1036              obdclass.ko              cl_page_assume
      4310      0.1306  (no location information)   libc-2.12.so             __mempcpy_sse2
      3903      0.1183  lov_page.c:92               lov.ko                   lov_page_own
      3866      0.1172  mutex.c:409                 vmlinux                  __mutex_lock_slowpath
      3841      0.1164  lcommon_cl.c:1097           lustre.ko                cl2vm_page
      3698      0.1121  radix-tree.c:414            vmlinux                  radix_tree_lookup_slot
      3506      0.1063  vvp_page.c:109              lustre.ko                vvp_page_assume
      3491      0.1058  lu_object.c:1115            obdclass.ko              lu_object_locate
      3490      0.1058  (no location information)   ipmi_si.ko               port_inb
      3457      0.1048  ring_buffer.c:3221          vmlinux                  ring_buffer_consume
      3431      0.1040  cl_page.c:898               obdclass.ko              cl_page_owner_set
      ...
      

      I would have expected the per-task performance to remain at the level of the single-reader case, or slightly below it. That is what I observe on ext4, for instance.

      I can easily reproduce the issue and can provide additional information (Lustre traces, system traces, a system dump, profiling, ...) if needed.
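
      For reference, a reproducer in the same spirit can be sketched as a small shell script. This is a hypothetical stand-in, not the attached testcase.sh; the file path and sizes are illustrative, and on the real system the target would be a large file on the Lustre mount so the reads are not trivially served from a warm cache.

```shell
#!/bin/sh
# Hedged sketch of a concurrent-read reproducer: NTASKS dd processes all
# read the SAME file at the same time, which is the access pattern that
# triggers the reported slowdown.
FILE=${1:-./lu2481-testfile}   # assumed path; real test uses a Lustre file
NTASKS=${2:-4}

# Create a small stand-in file if it does not exist, so the sketch runs
# anywhere (the real test uses a multi-GB file).
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1M count=8 2>/dev/null

for i in $(seq 1 "$NTASKS"); do
    dd if="$FILE" of=/dev/null bs=1M 2>/dev/null &
done
wait
echo "done: $NTASKS concurrent readers of $FILE"
```

      Timing each dd (or the whole run) with the same file and increasing NTASKS yields the per-task and aggregate throughput columns shown in the tables above.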

      Please note that this issue is serious for this customer because their cluster has Lustre client nodes with many cores (up to 128), and this dramatically impacts application run times.

      Attachments

        1. oprofile.report.lu1666.16tasks.txt
          220 kB
        2. run.stats.gz
          43 kB
        3. testcase.sh
          0.8 kB

        Issue Links

          Activity

            [LU-2481] performance issue of parallel read
            pjones Peter Jones added a comment -

            Jinshan

            Do you mean b2_1 rather than b2_3? It is the former that Bull are using...

            Peter


            jay Jinshan Xiong (Inactive) added a comment -

            I'd like to backport patches 60e052d0 and 33257361 to 2.3; they may help reduce contention on the global coh_page_guard.

            jay Jinshan Xiong (Inactive) added a comment -

            I know about this problem. We have actually fixed it in 2.4. I'll spend some time listing which patches need to be backported to fix it.

            I was confused about what you really wanted, otherwise your issue could have been fixed much earlier; sorry about that.

            pichong Gregoire Pichon added a comment -

            Just to be clear: the client is 'berlin20' and the OSSs are 'berlin7' and 'berlin8'.

            pichong Gregoire Pichon added a comment -

            Here are the stats you requested.

            There are four runs. The first two are with max_read_ahead_mb=40 (the default value). The last two are with max_read_ahead_mb=0 and only 30MB read.

            Run              tasks     Mean    Total (MB/s)
            20130128_104229      1      453      453
            20130128_102900     16        9      150

            20130128_111400      1     0.40     0.40
            20130128_111541     16     0.39        6

            When readahead is disabled, the RPCs seem to be made of only 1 page each, even though the application (dd) performs 1MB reads. Amazing!
            jay Jinshan Xiong (Inactive) added a comment - - edited

            Also please provide me the brw_stats when test is finished. Please remember to reset it before test starts.


            jay Jinshan Xiong (Inactive) added a comment -

            Ah, I see what you mean. You don't actually care about the raw throughput from the OSTs to the client; you just want to see the aggregate performance increase as the number of threads increases, even when most of the CPUs are just reading data from the local cache.

            I'll investigate this issue. Before I do, can you please run your test program again with readahead disabled? You can disable readahead by writing 0 to the max_read_ahead_mb lprocfs entry.
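
            For reference, toggling the client readahead tunable mentioned above might look like this on a Lustre client (a sketch only; 40 is the default value cited elsewhere in this ticket):

```shell
# On the Lustre client (requires lctl):
lctl set_param llite.*.max_read_ahead_mb=0    # disable client readahead
# ... run the read test and collect stats ...
lctl set_param llite.*.max_read_ahead_mb=40   # restore the default
```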

            louveta Alexandre Louvet (Inactive) added a comment -

            Jinshan,

            We know about all of that, but this is not the problem. The application is a group of programs running independently on the same node. There is no MPI, and OpenMP or pthreads would increase the complexity of the application. We definitely need good performance when multiple processes read from the same file (and don't misunderstand us: we are not asking that one process out of a group become faster than a single process alone on the box, we just want the aggregate bandwidth to be "reasonable").

            The original issue was reported on larger nodes than the one used (I guess) by Gregoire. It is a 128-core system with 16 NHM sockets (so 16 NUMA zones). Here are some numbers on this box:
            1 : 493 MB/s per process (493 MB/s aggregate)
            2 :  49 MB/s per process ( 98 MB/s aggregate)
            3 :  54 MB/s per process (162 MB/s aggregate)
            4 :  41 MB/s per process (164 MB/s aggregate)
            8 :  12 MB/s per process ( 96 MB/s aggregate)

            Running the same test on the local disk (a single regular spindle) gives:
            1 : 102 MB/s per process (102 MB/s aggregate)
            2 :  86 MB/s per process (172 MB/s aggregate)
            4 : 101 MB/s per process (404 MB/s aggregate)
            8 :  80 MB/s per process (640 MB/s aggregate)
            16:  60 MB/s per process (960 MB/s aggregate)
            This is what we would expect to observe: the aggregate bandwidth grows as the number of processes increases (not proportionally to the number of processes, but it grows).

            I agree that the box being a large NUMA node emphasizes the issue. Running with more processes doesn't help, and the aggregate performance continues to go down.

            I hope this is clearer.

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Gregoire,

            In that case, I think you could try a real parallel read, where each process reads only a part of the file, for example stripe by stripe. You can start with the test program before changing the application.

            Here is how I suggest you revise the test program:
            1. say the file to be read is striped over M OSTs, with a stripe size of N bytes;
            2. start M threads and have each thread read from a specific OST: thread i, i=[0..M-1), reads the file portions [(i+j*M)*N, (i+1+j*M)*N), j=[0, 1, ... EoF).

            Usually you can see some performance improvement this way; however, a single thread can't saturate an OST if you're using a complex storage system for the OST. In that case, you can try M*2 threads and make two threads read each OST without overlapping.
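
            The stripe-by-stripe scheme above can be sketched in shell with dd. This is a hedged illustration: M, N and the file path are placeholders (on Lustre the real values come from `lfs getstripe -c` and `lfs getstripe -S`), and the stand-in file creation is only so the sketch runs outside of Lustre.

```shell
#!/bin/sh
# Sketch of the suggested stripe-aligned parallel read: for a file striped
# over M OSTs with stripe size N bytes, reader i fetches stripes
# i, i+M, i+2M, ... so each reader stays on a single OST.
FILE=${1:-./lu2481-striped}
M=${2:-2}        # stripe count (assumption; cf. lfs getstripe -c)
N=${3:-4096}     # stripe size in bytes (assumption; cf. lfs getstripe -S)

# Stand-in data file so the sketch runs anywhere.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs="$N" count=16 2>/dev/null

# File size in bytes (GNU stat, with a BSD fallback).
size=$(stat -c %s "$FILE" 2>/dev/null || stat -f %z "$FILE")

reader() {
    i=$1
    off=$((i * N))
    while [ "$off" -lt "$size" ]; do
        # Read one stripe-sized, stripe-aligned chunk: [(i+j*M)*N, (i+1+j*M)*N)
        dd if="$FILE" of=/dev/null bs="$N" skip=$((off / N)) count=1 2>/dev/null
        off=$((off + M * N))
    done
}

for i in $(seq 0 $((M - 1))); do
    reader "$i" &
done
wait
echo "finished: $M stripe-aligned readers"
```

            For the M*2-thread variant, each OST's stripe sequence would simply be split into two non-overlapping halves, one per thread.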
            pichong Gregoire Pichon added a comment - - edited

            Jinshan,
            Sorry, I am not sure I understand your previous comment. Could you elaborate?
            The test program I provided behaves the same as the real application.

            jay Jinshan Xiong (Inactive) added a comment -

            Yes, Alexandre Louvet made exactly the right point. Can you please verify that your application behaves exactly the same way as your test program? I'm afraid you may have revised it.

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: pichong Gregoire Pichon
              Votes: 0
              Watchers: 12
