
Speedup file creation under heavy concurrency

Details


    Description

      In general, there is some metadata performance regression with quota enabled.
      Now that project quota has been introduced, it is time to measure metadata performance and improve it for the quota-enabled case.
      Also, the upstream kernel appears to have some ext4 performance optimizations for when quota is enabled. It might be possible to bring those optimizations to Lustre.

      Attachments

        Activity

          [LU-9796] Speedup file creation under heavy concurrency
          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Attached are mds-survey test results (file creation) with and without the patch. https://jira.hpdd.intel.com/secure/attachment/28338/LU-9796.png
          1 x MDS(1 x Platinum 8160 CPU, 128GB DDR4 memory)
          1 x MDT(SFA7700X, RAID10 with 4 x Toshiba RI SSD)
          Lustre-2.10.1CR (user/group quota enabled) on CentOS7.3

          The patch speeds up file creation, especially at high thread counts. We are now reaching ~250K ops/sec for file creation with a single MDT on the RHEL7 kernel.


          gerrit Gerrit Updater added a comment -

          Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/29032
          Subject: LU-9796 ldiskfs: improve inode allocation performace
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 0f94e45786dc92f9ec048908234153a547d64462

          paf Patrick Farrell (Inactive) added a comment -

          Ah, I like that much better! And it looks like it's much faster too, at least for creates. Nice. (In fact, faster than the original code before the regression, isn't it?)
          wangshilong Wang Shilong (Inactive) added a comment - - edited

          Hello Patrick,

          I have refreshed the patch to use a new approach which looks better to me.
          https://patchwork.ozlabs.org/patch/799014/

          I don't like a timed sleep either.


          paf Patrick Farrell (Inactive) added a comment -

          Ah, sorry for my misunderstanding.

          So it seems your problem is "inode stealing": another thread uses the inode bit/number in the group before you can. You end up contending for the lock because you have to try over and over... (Which is also why it makes sense that you count attempts rather than look directly at lock contention.)

          I don't know what the comments on the list have been (it looks like silence so far), but it really bothers me to see "insert an arbitrary timer delay" as the solution here. That doesn't seem very future-proof. Isn't there something we can do directly about the stealing? Increase lock coverage, change how we find the bit, set it to in-use before we do all the testing and unset it if the testing fails, that sort of thing? It's hard to say exactly what would be safe.

          Because it looks like the current problem is that every thread is doing work for every inode, but only one really gets to use the work it does. So perhaps we should lock around the find_next_zero_bit and set the bit there. That complicates the error path, but it seems like it would (mostly) guarantee forward progress for each thread. Perhaps that's not safe for other reasons; perhaps we can't set that bit (even temporarily) until we know the other things we check... I don't know...

          But inserting a timed sleep...
          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Patrick,
          What you call a queued spinlock is CONFIG_QUEUED_SPINLOCKS, right?
          Good to know; we weren't aware of it, but it was enabled in all of our upstream testing.

          # grep -i spinlocks .config 
          CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
          CONFIG_QUEUED_SPINLOCKS=y
          CONFIG_PARAVIRT_SPINLOCKS=y
          # Lock Debugging (spinlocks, mutexes, etc...)
          

          But we saw the same contention even with that enabled.


          wangshilong Wang Shilong (Inactive) added a comment -

          Hello Patrick,

          We did run the same benchmark on RHEL7 and the latest upstream kernel; the result is the same. We hit the same lock contention problem, and the patch consistently improved performance. There are some benchmark numbers in the patch changelog that you may not have noticed...

          Thanks,
          Shilong

          paf Patrick Farrell (Inactive) added a comment -

          Shilong,

          I'm not on the ext4 list so I won't comment there, but that code should almost certainly use ext4_fs_is_busy rather than counting attempts directly.

          Separately, did you try that patch on the current upstream kernel?

          Is the problem spinlock contention (most time spent in lock/unlock, not just waiting for the lock, a problem which is fixed in newer kernels) or is it actually waiting for the lock (most time spent waiting, because it really is held)?

          Anyway, if you haven't, you should try this on a newer kernel. Spinlock contention as a performance problem is more or less fixed with queued spinlocks: multiple waiters for a spinlock now have minimal performance impact on lock/unlock, whereas in earlier kernels multiple waiters caused locking and unlocking to take many times longer. It won't fix the problem of having to wait for the lock, but it removes lock contention itself as the cause of performance issues. RHEL7 doesn't have them, sadly.

          If your problem really is spinlock contention, it's not going to show up on newer kernels. We might still want the patch for RHEL7.
          wangshilong Wang Shilong (Inactive) added a comment - upstream ext4 patch could be seen here: http://marc.info/?l=linux-ext4&m=150164892405614&w=2
          ihara Shuichi Ihara (Inactive) added a comment - - edited

          After more investigation, Shilong made a prototype patch to reduce contention on _raw_spin_lock(). He will submit the patch soon.
          Without patch (Unique Directory)

          #  for i in `seq 0 4`; do sleep 5; tests_str="create" thrhi=48 thrlo=48 file_count=2000000 \
          dir_count=48 mds-survey; sleep 5; tests_str="destroy" thrhi=48 thrlo=48 file_count=2000000 \
          dir_count=48 mds-survey; done
          run  create    destroy
          1    135,864   241,262
          2    137,432   245,305
          3    135,347   228,004
          4    135,641   223,845
          5    137,300   242,719
          

          With patch (Unique Directory)

          #  for i in `seq 0 4`; do sleep 5; tests_str="create" thrhi=48 thrlo=48 file_count=2000000 \
          dir_count=48 mds-survey; sleep 5; tests_str="destroy" thrhi=48 thrlo=48 file_count=2000000 \
          dir_count=48 mds-survey; done
          run  create    destroy
          1    223,178   273,882
          2    181,892   269,771
          3    202,841   235,435
          4    195,022   220,466
          5    193,564   263,998
          

          We are seeing roughly a 50% performance improvement for file creation so far.


          ihara Shuichi Ihara (Inactive) added a comment -

          patch https://review.whamcloud.com/28276 improves performance on Lustre, but causes a performance regression under mds-survey because of a large number of _raw_spin_lock() calls.

          Without patch

          [root@mds04 lustre-release-ee-ddn]# tests_str="create" thrhi=48 thrlo=48 file_count=2000000 dir_count=48 mds-survey
          Tue Aug  1 16:13:54 JST 2017 /usr/bin/mds-survey from mds04
          mdt 1 file 2000000 dir   48 thr   48 create 119346.25 [ 38996.65, 143993.95]
          
          

          With patch

          [root@mds04 ~]#  tests_str="create" thrhi=48 thrlo=48 file_count=2000000 dir_count=48 mds-survey
          Tue Aug  1 17:01:32 JST 2017 /usr/bin/mds-survey from mds04
          mdt 1 file 2000000 dir   48 thr   48 create 35689.72 [ 24997.45, 46995.91] 
          
          

          perf-tools/bin/funccost '_raw_spin_lock*,ldiskfs*,jbd2*' shows the difference in function costs with and without the patch.

          without patch

          perf-tools/bin/funccost  '_raw_spin_lock*,ldiskfs*,jbd2*'
          
          FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
          ldiskfs_create_inode           668090967            2000016      334.04      
          jbd2_journal_get_write_access  660607521            28364899     23.29       
          ldiskfs_dx_add_entry           76861625             1994470      38.54       
          ldiskfs_mark_inode_dirty       31383256             8050242      3.90        
          ldiskfs_reserve_inode_write    30442931             12050274     2.53        
          ldiskfs_mark_dquot_dirty       29298284             4051620      7.23        
          ldiskfs_write_dquot            27393055             4051620      6.76        
          ldiskfs_xattr_trusted_set      25413629             4000032      6.35        
          ldiskfs_xattr_set              24943944             4000032      6.24        
          ldiskfs_xattr_set_handle       22973138             4000032      5.74        
          jbd2_journal_dirty_metadata    20879428             23877421     0.87        
          ldiskfs_bread                  18514454             15808891     1.17        
          ldiskfs_mark_iloc_dirty        18151829             12050274     1.51        
          ldiskfs_get_inode_loc          16463367             18050473     0.91        
          ldiskfs_dirty_inode            15518054             4025922      3.85        
          ldiskfs_getblk                 15396652             15816550     0.97        
          _raw_spin_lock_irqsave         13349523             41367053     0.32        
          jbd2_journal_put_journal_head  10369726             52749533     0.20        
          ldiskfs_find_entry             10361625             2002129      5.18        
          ldiskfs_dx_find_entry          10064239             1994470      5.05        
          ldiskfs_quota_write            9133495              3600359      2.54        
          _raw_spin_lock                 8851352              104218345    0.08    
          
          

          with patch

          perf-tools/bin/funccost  '_raw_spin_lock*,ldiskfs*,jbd2*'
          
          FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
          ldiskfs_create_inode           2510051302           2000016      1255.02     
          _raw_spin_lock                 2382710363           191888590    12.42       
          ldiskfs_read_inode_bitmap      57616803             2002599      28.77       
          jbd2_journal_get_write_access  44399656             103374244    0.43        
          ldiskfs_dx_add_entry           42191184             1994470      21.15       
          ldiskfs_bread                  31150135             16197888     1.92        
          ldiskfs_mark_inode_dirty       19684052             8050194      2.45        
          ldiskfs_mark_dquot_dirty       18764931             4051620      4.63        
          ldiskfs_write_dquot            17041111             4051620      4.21        
          ldiskfs_xattr_trusted_set      14533087             4000032      3.63        
          ldiskfs_reserve_inode_write    14086373             12050226     1.17        
          ldiskfs_xattr_set              14036664             4000032      3.51        
          ldiskfs_getblk                 13523271             16205547     0.83        
          ldiskfs_xattr_set_handle       11857052             4000032      2.96        
          ldiskfs_mark_iloc_dirty        11561327             12050226     0.96        
          ldiskfs_get_inode_loc          11498196             18050426     0.64        
          ldiskfs_find_entry             10792485             2002129      5.39    
          
          

          People

            hongchao.zhang Hongchao Zhang
            ihara Shuichi Ihara (Inactive)
