Lustre / LU-3178

single stream write to striped file is slow with 2.X clients

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
    • Environment: Four OSS blocks, each with 2 servers and a NEC SNA460 (AKA NetApp E5560) RAID with 8 OSTs. CentOS 6.3. Servers run Lustre 2.1.5. Testing clients with Lustre 1.8.8, 2.1.5, and 2.3.63.

    Description

      Single stream write to a striped file has poor performance on Lustre 2.* clients. Discussion started on the hpdd-discuss mailing list (Subject: striped write performance in lustre 2.X?).

      With Lustre 1.8.* (servers and clients) it was fairly easy to exceed 1 GB/s with a 4-8 stripe file. With Lustre 2.1.5 servers, the Lustre 1.8.8 client is again the fastest option (though not as fast as it used to be). Comparing a Lustre 1.8.8 client run with a 2.3.63-ish client (including some tunings):

      1.8.8 client:

      # dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gbx bs=4M count=$((256/4*1000))
      268435456000 bytes (268 GB) copied, 349.921 s, 767 MB/s
      

      2.3.63 client:

      # dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256gby bs=4M count=$((256/4*1000))
      268435456000 bytes (268 GB) copied, 689.333 s, 389 MB/s
      

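      (The per-OSC histograms below come from the osc rpc_stats files. A minimal sketch for collecting them, assuming that, as with other Lustre proc counters, writing 0 resets them:)

      # reset the counters, run the dd test, then dump the histograms
      lctl set_param osc.*.rpc_stats=0
      lctl get_param osc.*.rpc_stats
      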
      1.8.8 client rpc_stats (part):
      (typical OST rpcs in flight for full run)

                              read                    write
      rpcs in flight        rpcs   % cum % |       rpcs   % cum %
      0:                       0   0   0   |      11594  18  18
      1:                       0   0   0   |      13130  20  38
      2:                       0   0   0   |      16592  25  64
      3:                       0   0   0   |      16543  25  90
      4:                       0   0   0   |       3762   5  96
      5:                       0   0   0   |       1598   2  98
      6:                       0   0   0   |        338   0  99
      7:                       0   0   0   |        401   0  99
      8:                       0   0   0   |         45   0  99
      9:                       0   0   0   |          1   0 100
      
                              read                    write
      offset                rpcs   % cum % |       rpcs   % cum %
      0:                       0   0   0   |      64000  99  99
      1:                       0   0   0   |          0   0  99
      2:                       0   0   0   |          0   0  99
      4:                       0   0   0   |          0   0  99
      8:                       0   0   0   |          0   0  99
      16:                      0   0   0   |          0   0  99
      32:                      0   0   0   |          0   0  99
      64:                      0   0   0   |          0   0  99
      128:                     0   0   0   |          4   0 100
      
      1.8.8 client perf top:
                   2151.00  9.3% ll_ra_read_init [lustre]
                   1781.00  7.7% copy_user_generic_string [kernel.kallsyms]
                   1734.00  7.5% _spin_lock [kernel.kallsyms]
                   1300.00  5.6% lov_putref [lov]
                    781.00  3.4% radix_tree_tag_clear [kernel.kallsyms]
                    720.00  3.1% _spin_lock_irqsave [kernel.kallsyms]
                    667.00  2.9% intel_idle [kernel.kallsyms]
                    607.00  2.6% osc_destroy [osc]
                    554.00  2.4% __clear_user [kernel.kallsyms]
                    551.00  2.4% refresh_entry [lvfs]
                    451.00  2.0% kiblnd_ni_fini_pools [ko2iblnd]
                    399.00  1.7% radix_tree_tag_set [kernel.kallsyms]
                    356.00  1.5% __percpu_counter_add [kernel.kallsyms]
                    352.00  1.5% lov_queue_group_io [lov]
                    340.00  1.5% mark_page_accessed [kernel.kallsyms]
                    338.00  1.5% _spin_lock_irq [kernel.kallsyms]
                    307.00  1.3% test_clear_page_writeback [kernel.kallsyms]
                    306.00  1.3% lov_stripe_unlock [lov]
                    302.00  1.3% _spin_unlock_irqrestore [kernel.kallsyms]
                    280.00  1.2% osc_update_grant [osc]
                    272.00  1.2% __dec_zone_state [kernel.kallsyms]
                    221.00  1.0% lop_makes_rpc [osc]
      

      2.3.63 client rpc_stats (part):
      (typical OST rpcs in flight for the full run)

                              read                    write
      rpcs in flight        rpcs   % cum % |       rpcs   % cum %
      0:                       0   0   0   |          0   0   0
      1:                       0   0   0   |      16030  25  25
      2:                       0   0   0   |      16516  25  50
      3:                       0   0   0   |      23525  36  87
      4:                       0   0   0   |       7909  12  99
      5:                       0   0   0   |          7   0  99
      6:                       0   0   0   |          5   0  99
      7:                       0   0   0   |          4   0  99
      8:                       0   0   0   |          4   0 100
      
                              read                    write
      offset                rpcs   % cum % |       rpcs   % cum %
      0:                       0   0   0   |          1   0   0
      1:                       0   0   0   |          0   0   0
      2:                       0   0   0   |          0   0   0
      4:                       0   0   0   |          0   0   0
      8:                       0   0   0   |          0   0   0
      16:                      0   0   0   |          0   0   0
      32:                      0   0   0   |          0   0   0
      64:                      0   0   0   |          0   0   0
      128:                     0   0   0   |          0   0   0
      256:                     0   0   0   |          1   0   0
      512:                     0   0   0   |          2   0   0
      1024:                    0   0   0   |          4   0   0
      2048:                    0   0   0   |          8   0   0
      4096:                    0   0   0   |         16   0   0
      8192:                    0   0   0   |         32   0   0
      16384:                   0   0   0   |         64   0   0
      32768:                   0   0   0   |        128   0   0
      65536:                   0   0   0   |        256   0   0
      131072:                  0   0   0   |        512   0   1
      262144:                  0   0   0   |       1024   1   3
      524288:                  0   0   0   |       2048   3   6
      1048576:                 0   0   0   |       4096   6  12
      2097152:                 0   0   0   |       8192  12  25
      4194304:                 0   0   0   |      16384  25  51
      8388608:                 0   0   0   |      31232  48 100
      
      2.3.63 client perf top (down to 1%):
        5.03%  [kernel]            [k] copy_user_generic_string
        3.79%  [obdclass]          [k] key_fini
        3.78%  [kernel]            [k] _spin_lock
        2.43%  [obdclass]          [k] keys_fill
        2.13%  [kernel]            [k] kmem_cache_alloc
        1.99%  [kernel]            [k] __clear_user
        1.92%  [kernel]            [k] kmem_cache_free
        1.78%  [lvfs]              [k] lprocfs_counter_add
        1.48%  [kernel]            [k] kfree
        1.34%  [kernel]            [k] __kmalloc
        1.33%  [kernel]            [k] radix_tree_delete
        1.31%  [kernel]            [k] _spin_lock_irqsave
        1.28%  [obdclass]          [k] cl_page_put
        1.15%  [kernel]            [k] radix_tree_insert
        1.13%  [lvfs]              [k] lprocfs_counter_sub
        1.04%  [osc]               [k] osc_lru_del
        1.02%  [obdclass]          [k] cl_page_find0
        1.01%  [obdclass]          [k] lu_context_key_get
      

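      Both rpc_stats histograms above show only a handful of write RPCs in flight per OST, so the client-side concurrency cap is worth checking alongside them. A minimal sketch (the default max_rpcs_in_flight is 8; 32 below is only an illustrative value, not a tuning tested in this ticket):

      # check and, if desired, raise the per-OSC RPC concurrency cap
      lctl get_param osc.*.max_rpcs_in_flight
      lctl set_param osc.*.max_rpcs_in_flight=32
      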
      Upon request from Jinshan I ran 'collectl -scml' during the dd:

      #### 2.3.63 client:
      [root@sabi22 ~]# collectl -scml -i 15
      waiting for 15 second sample...
      #<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
      #cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
         0   0    78    192  60G 112K   1G  43M 170M  80M       0      0        0      0
         2   2  2578  10780  53G 112K   7G   5G   1G  85M       0      0   392124     96
         5   5  4639  20295  39G 112K  17G  16G   4G  85M       0      0   741103    181
         4   4  4000  16842  31G 112K  24G  22G   6G  85M       0      0   609758    149
         4   4  3046  11499  31G 112K  24G  22G   6G  85M       0      0   411785    101
         4   4  3060  11592  31G 112K  24G  22G   6G  85M       0      0   412058    101
         4   4  3096  11445  31G 112K  24G  23G   6G  85M       0      0   405231     99
         4   4  3080  11440  31G 112K  24G  23G   6G  85M       0      0   404958     99
         4   4  3052  11386  30G 112K  24G  23G   6G  85M       0      0   405777     99
         4   4  3066  11477  30G 112K  24G  23G   6G  85M       0      0   407689    100
         4   4  3044  11552  30G 112K  25G  23G   6G  85M       0      0   412058    101
         4   4  3058  11382  30G 112K  25G  23G   6G  85M       0      0   403319     98
         4   4  3095  11412  30G 112K  25G  23G   6G  85M       0      0   404958     99
         4   4  3067  11370  30G 112K  25G  24G   6G  85M       0      0   403593     99
         4   4  3083  11478  29G 112K  25G  24G   6G  85M       0      0   407689    100
         4   4  3107  11374  29G 112K  25G  24G   6G  85M       0      0   401135     98
         4   4  3094  11280  29G 112K  25G  24G   6G  85M       0      0   396766     97
         4   4  3078  11282  29G 112K  25G  24G   6G  85M       0      0   398404     97
         4   4  3098  11241  29G 112K  26G  24G   6G  85M       0      0   396493     97
         4   4  3067  11238  28G 112K  26G  24G   6G  85M       0      0   393762     96
         4   4  3080  11513  28G 112K  26G  24G   6G  85M       0      0   395947     97
         4   4  3068  11327  28G 112K  26G  24G   6G  85M       0      0   394854     96
      #<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
      #cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
         4   4  3069  11444  28G 112K  26G  25G   6G  85M       0      0   396766     97
         4   4  3060  11337  28G 112K  26G  25G   7G  85M       0      0   392397     96
         4   4  3050  11220  28G 112K  26G  25G   7G  85M       0      0   388301     95
         4   4  3076  11355  28G 112K  26G  25G   7G  85M       0      0   394854     96
         4   4  3096  11359  27G 112K  27G  25G   7G  85M       0      0   394308     96
         4   4  3049  11316  27G 112K  27G  25G   7G  85M       0      0   394854     96
         4   4  3060  11325  27G 112K  27G  25G   7G  85M       0      0   396493     97
         4   4  3058  11381  27G 112K  27G  25G   7G  85M       0      0   396220     97
         4   4  3071  11343  27G 112K  27G  26G   7G  85M       0      0   395127     96
         4   4  3015  11213  27G 112K  27G  26G   7G  85M       0      0   390485     95
         4   4  3046  11328  26G 112K  27G  26G   7G  85M       0      0   394308     96
         4   4  3069  11454  26G 112K  27G  26G   7G  85M       0      0   397858     97
         4   4  3058  11524  26G 112K  28G  26G   7G  85M       0      0   399497     98
         3   3  3033  11427  26G 112K  28G  26G   7G  85M       0      0   397858     97
         3   3  3048  11500  26G 112K  28G  26G   7G  85M       0      0   402227     98
         4   4  3056  11291  26G 112K  28G  26G   7G  85M       0      0   395127     96
         4   4  3067  11223  25G 112K  28G  27G   7G  85M       0      0   394035     96
         3   3  3055  11366  25G 112K  28G  27G   7G  85M       0      0   398404     97
         4   4  3036  11337  25G 112K  28G  27G   7G  85M       0      0   397585     97
         4   4  3057  11307  25G 112K  28G  27G   7G  85M       0      0   395947     97
         4   4  3080  11241  25G 112K  29G  27G   7G  85M       0      0   393489     96
         1   1  1392   5003  25G 112K  29G  27G   7G  80M       0      0   170667     42
      #<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
      #cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
         0   0   101    216  25G 112K  29G  27G   7G  80M       0      0        0      0
      

      Doesn't seem to fill the memory...

      Further details:

      [root@sabi22 ~]# lctl get_param llite.*.max_cached_mb
      llite.lnec-ffff8808348d2c00.max_cached_mb=
      users: 32
      max_cached_mb: 48397
      used_mb: 24610
      unused_mb: 23787
      reclaim_count: 0
      
      [root@sabi22 ~]# lctl get_param osc.*.max_dirty_mb
      osc.lnec-OST0000-osc-ffff8808348d2c00.max_dirty_mb=32
      [...]
      osc.lnec-OST0018-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST0019-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST001a-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST001b-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST001c-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST001d-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST001e-osc-ffff8808348d2c00.max_dirty_mb=32
      osc.lnec-OST001f-osc-ffff8808348d2c00.max_dirty_mb=32
      
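      With max_dirty_mb=32 per OSC and a 4-stripe file, only about 4 x 32 = 128 MB of dirty data can be outstanding at once, so this is an obvious knob. A hedged sketch (256 is an illustrative value, not one tested here):

      # allow more dirty data per OSC before writeback is forced
      lctl set_param osc.*.max_dirty_mb=256
      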
      [root@sabi22 EF]# lfs getstripe /mnt/lnec/pool4/stripe_4_4096/test256a
      /mnt/lnec/pool4/stripe_4_4096/test256a
      lmm_stripe_count:   4
      lmm_stripe_size:    4194304
      lmm_layout_gen:     0
      lmm_stripe_offset:  27
      lmm_pool:           osb4
              obdidx           objid           objid           group
                  27            6792         0x1a88                0
                  31            5891         0x1703                0
                  24            6983         0x1b47                0
                  28            5251         0x1483                0
      
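      For reference, a layout like the one above (4 stripes, 4 MB stripe size, pool osb4) would be created with something like the following sketch, assuming a 2.x lfs where -S sets the stripe size (older clients use -s):

      # 4 stripes of 4 MB each, allocated from OST pool osb4
      lfs setstripe -c 4 -S 4M --pool osb4 /mnt/lnec/pool4/stripe_4_4096
      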

      Attachments

        1. oprofile-dd-7bcf31e
          3.21 MB
        2. oprofile-dd-0f1d647
          3.91 MB
        3. opreport_algdw__dd_2.3.63
          3.97 MB
        4. collectl-dd-7bcf31e
          3 kB
        5. collectl-dd-0f1d647
          6 kB

        Activity

          spitzcor Cory Spitz added a comment -

          > Looks like we will not make any further progress on this issue.

          That's a shame. I don't think that "single stream write to striped file is slow with 2.X clients" (compared to 1.8.x) has been fixed yet in 2.6.0. Or is it simply that other tickets track that issue now?

          jfc John Fuchs-Chesney (Inactive) added a comment -

          Looks like we will not make any further progress on this issue.

          jfc John Fuchs-Chesney (Inactive) added a comment -

          Hello Erich,
          Is there anything further we can do to help on this ticket?
          Thanks,
          ~ jfc.

          jay Jinshan Xiong (Inactive) added a comment -

          It looks like recent patches degraded the performance. Would you please try reverting the patch d5390bbde9b525b13ab91f5610abe1c4bb98eff9 from your git tree and check how it's going?
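          (For reference, the suggested check is a plain git revert plus a client rebuild; a minimal sketch:)

          git revert d5390bbde9b525b13ab91f5610abe1c4bb98eff9
          # rebuild and reinstall the client modules, then rerun the dd test
          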
          efocht Erich Focht added a comment -

          Attached collectl and oprofile output for the client built with git checkout 7bcf31e8fa8ccb900d3c015a5e7202b8bcca97bd. These results are the best among all 2.X tests of the single-stream write to a striped file:

          dd if=/dev/zero of=/mnt/lnec/pool4/stripe_4_4096/test256f bs=4M count=64000
          64000+0 records in
          64000+0 records out
          268435456000 bytes (268 GB) copied, 433.208 s, 620 MB/s
          
          efocht Erich Focht added a comment -

          collectl and oprofile of another dd run (4-way striped file, 256 GB, writethrough_cache on servers disabled). The Lustre git checkout was 0f1d64724480bdfa268d18153272c3484c90e578.

          jay Jinshan Xiong (Inactive) added a comment -

          Sure, please upload the collectl log. I just want to get as much information as possible in case I really need it.

          efocht Erich Focht added a comment -

          I usually run with writethrough_cache disabled; only the oprofiled run was with caches enabled. I just did another measurement and the behavior is the same (still the 0f1d64... git). Only this time, at some point the memory of the client filled up. Also: there was a point where the better bandwidth came back for a short time. If you care for collectl or oprofile output of this run, I can upload it. Otherwise I'll report back when I have the 7bcf31 client results (tomorrow).

          jay Jinshan Xiong (Inactive) added a comment -

          It's fine; it's not necessary to change the server version. BTW, there are options to control cache behavior on the OST (e.g. the write cache); you can try disabling them to see how it goes.
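          (A hedged sketch of disabling the OST-side caches mentioned above, assuming the usual obdfilter tunables on the 2.1.5 servers; run on each OSS:)

          lctl set_param obdfilter.*.writethrough_cache_enable=0
          lctl set_param obdfilter.*.read_cache_enable=0
          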
          efocht Erich Focht added a comment -

          Jinshan, just finished a run with what I had previously (0f1d64724480bdfa268d18153272c3484c90e578). Will now move to the version you suggested. As mentioned before: I can change the client but not the servers. They are 2.1.5.

          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: efocht Erich Focht
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: