Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.1.5
    • Affects Version: Lustre 2.1.3
    • Environment:
        lustre 2.1.3
        kernel 2.6.32-220
        ofed 1.5.4
        bullxlinux 6.2 (based on redhat 6.2)

    Description

      One of our customers (TGCC, running Lustre 2.1.3) is facing a performance issue when reading the same file from many tasks on a single client at the same time. Performance drops as the number of tasks increases.
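      For reference, the test boils down to launching N concurrent dd readers of a single shared file and measuring each task's throughput. The exact script is the attached testcase.sh; the sketch below is only an illustration, with an assumed file path:

          #!/bin/bash
          # Hypothetical sketch of the test: N concurrent dd readers of the same
          # large file on a Lustre mount (the real script is the attached testcase.sh).
          FILE=/mnt/lustre/bigfile       # assumed path to a large pre-written file
          NTASKS=${1:-16}                # number of concurrent reader tasks

          # the client page cache is typically dropped between runs
          # echo 3 > /proc/sys/vm/drop_caches

          for i in $(seq 1 "$NTASKS"); do
              # each task reads the whole file in 1 MB chunks; dd reports its own MB/s
              dd if="$FILE" of=/dev/null bs=1M &
          done
          wait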

      I have reproduced the issue and obtained these results:

          tasks   Mean (MB/s)   Total (MB/s)
      dd      1      339      339
      dd      2      107      214
      dd      4       78      312
      dd      8       64      512
      dd     16       26      428
      

      Applying the patch http://review.whamcloud.com/#change,3627 to lustre 2.1.3 gives no real improvement.

          tasks   Mean (MB/s)   Total (MB/s)
      dd      1      542      542
      dd      2      118      236
      dd      4       74      298
      dd      8       73      591
      dd     16       52      833
      

      The profiling shows contention in CLIO page routines.
      (I suppose the libpython entry comes from the mpirun synchronization of the 16 tasks)

      samples  %        linenr info                 app name                 symbol name
      1149893  34.8489  (no location information)   libpython2.6.so.1.0      /usr/lib64/libpython2.6.so.1.0
      338676   10.2640  dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
      335241   10.1599  cl_page.c:404               obdclass.ko              cl_page_find0
      293448    8.8933  cl_page.c:729               obdclass.ko              cl_vmpage_page
      235856    7.1479  filemap.c:963               vmlinux                  grab_cache_page_nowait
      142686    4.3243  cl_page.c:661               obdclass.ko              cl_page_put
      80065     2.4265  filemap.c:667               vmlinux                  find_get_page
      46623     1.4130  copy_user_64.S:240          vmlinux                  copy_user_generic_string
      34127     1.0343  (no location information)   libpthread-2.12.so       pthread_rwlock_rdlock
      33685     1.0209  (no location information)   libc-2.12.so             getenv
      31725     0.9615  (no location information)   libpthread-2.12.so       pthread_rwlock_unlock
      30492     0.9241  swap.c:288                  vmlinux                  activate_page
      28528     0.8646  (no location information)   libc-2.12.so             __dcigettext
      23703     0.7183  (no location information)   libc-2.12.so             __strlen_sse42
      22345     0.6772  vvp_page.c:120              lustre.ko                vvp_page_unassume
      20646     0.6257  swap.c:183                  vmlinux                  put_page
      18525     0.5614  intel_idle.c:215            vmlinux                  intel_idle
      12774     0.3871  open.c:961                  vmlinux                  sys_close
      12702     0.3849  filemap.c:579               vmlinux                  unlock_page
      12685     0.3844  (no location information)   libc-2.12.so             __memcpy_ssse3
      12134     0.3677  cl_page.c:874               obdclass.ko              cl_page_invoid
      11734     0.3556  (no location information)   libpthread-2.12.so       __close_nocancel
      10573     0.3204  cl_page.c:884               obdclass.ko              cl_page_owner_clear
      10510     0.3185  (no location information)   libc-2.12.so             strerror_r
      9783      0.2965  cl_page.c:1068              obdclass.ko              cl_page_unassume
      9472      0.2871  (no location information)   oprofiled                /usr/bin/oprofiled
      9438      0.2860  (no location information)   libc-2.12.so             __strncmp_sse2
      8826      0.2675  rw.c:724                    lustre.ko                ll_readahead
      8769      0.2658  filemap.c:527               vmlinux                  page_waitqueue
      8095      0.2453  entry_64.S:470              vmlinux                  system_call_after_swapgs
      7756      0.2351  cl_page.c:561               obdclass.ko              cl_page_state_set0
      7615      0.2308  (no location information)   libc-2.12.so             __stpcpy_sse2
      7366      0.2232  entry_64.S:462              vmlinux                  system_call
      7097      0.2151  sched.c:4430                vmlinux                  find_busiest_group
      6898      0.2091  (no location information)   libperl.so               /usr/lib64/perl5/CORE/libperl.so
      6887      0.2087  (no location information)   libc-2.12.so             __ctype_b_loc
      6780      0.2055  wait.c:251                  vmlinux                  __wake_up_bit
      6631      0.2010  cl_page.c:139               obdclass.ko              cl_page_at_trusted
      6588      0.1997  (no location information)   libc-2.12.so             __strlen_sse2
      6336      0.1920  lov_io.c:239                lov.ko                   lov_page_stripe
      5289      0.1603  lov_io.c:208                lov.ko                   lov_sub_get
      5263      0.1595  ring_buffer.c:2834          vmlinux                  rb_get_reader_page
      4595      0.1393  lvfs_lib.c:76               lvfs.ko                  lprocfs_counter_add
      4373      0.1325  (no location information)   libc-2.12.so             strerror
      4347      0.1317  lov_io.c:252                lov.ko                   lov_page_subio
      4337      0.1314  cl_page.c:1036              obdclass.ko              cl_page_assume
      4310      0.1306  (no location information)   libc-2.12.so             __mempcpy_sse2
      3903      0.1183  lov_page.c:92               lov.ko                   lov_page_own
      3866      0.1172  mutex.c:409                 vmlinux                  __mutex_lock_slowpath
      3841      0.1164  lcommon_cl.c:1097           lustre.ko                cl2vm_page
      3698      0.1121  radix-tree.c:414            vmlinux                  radix_tree_lookup_slot
      3506      0.1063  vvp_page.c:109              lustre.ko                vvp_page_assume
      3491      0.1058  lu_object.c:1115            obdclass.ko              lu_object_locate
      3490      0.1058  (no location information)   ipmi_si.ko               port_inb
      3457      0.1048  ring_buffer.c:3221          vmlinux                  ring_buffer_consume
      3431      0.1040  cl_page.c:898               obdclass.ko              cl_page_owner_set
      ...
      

      I would have expected the per-task performance to stay at (or slightly below) the level of the single-reader case; that is what I observe on ext4, for instance.

      I can easily reproduce the issue and can provide additional information (lustre traces, system traces, system dump, profiling, ...) if needed.

      Please note that this issue is serious for this customer, because their cluster has Lustre client nodes with many cores (up to 128) and this dramatically impacts application run times.

      Attachments

        1. oprofile.report.lu1666.16tasks.txt
          220 kB
        2. run.stats.gz
          43 kB
        3. testcase.sh
          0.8 kB


          Activity

            [LU-2481] performance issue of parallel read

            adilger Andreas Dilger added a comment -

            Both of the patches have landed to b2_1, so closing this bug.

            pichong Gregoire Pichon added a comment -

            Here are some results on a 32-core client node:

            lustre 2.1.3
                tasks   Mean (MB/s)   Total (MB/s)
            dd      1      449      449
            dd      2      114      228
            dd      4       71      287
            dd      8       29      236
            dd     16        9      155
            
            
            lustre 2.1.3 + #5428 + #5429
                tasks   Mean (MB/s)   Total (MB/s)
            dd      1      468      468
            dd      2      115      230
            dd      4      105      420
            dd      8      103      824
            dd     16       69     1110
            dd     32       52     1676
            

            The two patches combined provide a significant improvement.
            The customer is happy with the performance gain on their application.

            We are now waiting for the patch review and landing process.


            pichong Gregoire Pichon added a comment -

            I have backported to b2_1 the two patches mentioned by Jinshan:

            http://review.whamcloud.com/5428
            http://review.whamcloud.com/5429

            I will publish the test results soon.
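            For reference, changes of this kind can typically be pulled straight from Gerrit and cherry-picked onto a b2_1 checkout, along these lines (the remote URL and the trailing patchset numbers below are assumptions; adjust as needed):

                # fetch each Gerrit change by its refs/changes/<nn>/<change>/<patchset> ref
                # (remote URL and patchset numbers are assumptions)
                git fetch git://git.whamcloud.com/fs/lustre-release.git refs/changes/28/5428/1
                git cherry-pick FETCH_HEAD
                git fetch git://git.whamcloud.com/fs/lustre-release.git refs/changes/29/5429/1
                git cherry-pick FETCH_HEAD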

            pjones Peter Jones added a comment -

            Jinshan

            Do you mean b2_1 rather than b2_3? It is the former that Bull are using...

            Peter


            jay Jinshan Xiong (Inactive) added a comment -

            I'd like to backport patches 60e052d0 and 33257361 to 2.3; they may help reduce contention on the global coh_page_guard lock.

            jay Jinshan Xiong (Inactive) added a comment -

            I understand the problem now. We have actually already fixed it in 2.4. I'll spend some time listing which patches need to be backported to fix it.

            I was confused about what you really wanted, otherwise your issue could have been fixed much earlier; sorry about that.

            pichong Gregoire Pichon added a comment -

            Just to be clear: the client is 'berlin20' and the OSSs are 'berlin7' and 'berlin8'.

            pichong Gregoire Pichon added a comment -

            Here are the stats you requested.

            There are four runs. The first two are with max_read_ahead_mb=40 (default value). The last two are with max_read_ahead_mb=0 and only 30MB read.

            Run                tasks   Mean (MB/s)   Total (MB/s)
            20130128_104229      1      453      453
            20130128_102900     16        9      150
            
            20130128_111400      1     0.40     0.40
            20130128_111541     16     0.39        6
            

            When readahead is disabled, RPCs seem to contain only 1 page, even though the application (dd) performs 1 MB reads. Amazing!

            jay Jinshan Xiong (Inactive) added a comment - edited

            Also, please provide me the brw_stats when the test is finished. Remember to reset it before the test starts.
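            For anyone repeating this, a typical way to reset and then collect brw_stats on the OSS side looks like the sketch below (exact parameter paths may vary by release):

                # on each OSS, before the test: clear the per-OST brw_stats counters
                lctl set_param obdfilter.*.brw_stats=0

                # after the test: dump the histograms (pages per RPC, I/O sizes, ...)
                lctl get_param obdfilter.*.brw_stats > brw_stats.$(hostname).txt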


            jay Jinshan Xiong (Inactive) added a comment -

            Ah, I see what you mean. You don't actually care about the raw throughput from the OSTs to the client; you just want to see the aggregate performance increase as the number of threads increases, even when most of the CPUs are just reading data from the local cache.

            I'll investigate this issue. Before that, could you please run your test program again with readahead disabled? You can disable readahead by writing 0 to the max_read_ahead_mb lprocfs entry.
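            For reference, a sketch of how that readahead setting can be toggled from the client via lctl:

                # disable client readahead (value is in MB); the lprocfs file is
                # /proc/fs/lustre/llite/<fsname>-<instance>/max_read_ahead_mb
                lctl set_param llite.*.max_read_ahead_mb=0

                # restore the default used in the runs above
                lctl set_param llite.*.max_read_ahead_mb=40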


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: pichong Gregoire Pichon
              Votes: 0
              Watchers: 12
