  Lustre / LU-9796

Speedup file creation under heavy concurrency

Details


    Description

      In general, there is some metadata performance regression with quota enabled versus disabled.
      Now that project quota has been introduced, it is time to measure and improve metadata performance when quota is enabled.
      Also, it seems the upstream kernel has some performance optimizations for ext4 with quota enabled; it might be possible to bring those optimizations to Lustre.

      Attachments

        Activity

          [LU-9796] Speedup file creation under heavy concurrency

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28276/
          Subject: LU-9796 kernel: improve metadata performaces for RHEL7
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 17fe3c192e101ace75b2f4d7f7e9ff7d8d85480e


          bogl Bob Glossman (Inactive) added a comment -

          I see recent mods in master have landed for this issue, but only for RHEL 7.x.
          Is this not needed for RHEL 6.x, SLES 11/12, or Ubuntu?

          pjones Peter Jones added a comment -

          Does the remaining patch need to land too or can that be abandoned and this ticket marked resolved?


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29032/
          Subject: LU-9796 ldiskfs: improve inode allocation performance
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3f0a7241c434d9556308299eea069628715816c2

          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Attached are mds-survey test results (file creation) with and without the patch: https://jira.hpdd.intel.com/secure/attachment/28338/LU-9796.png
          1 x MDS(1 x Platinum 8160 CPU, 128GB DDR4 memory)
          1 x MDT(SFA7700X, RAID10 with 4 x Toshiba RI SSD)
          Lustre-2.10.1CR (user/group quota enabled) on CentOS7.3

          The patch speeds up file creation, especially at high thread counts. We are now reaching ~250K ops/sec for file creation with a single MDT on the RHEL7 kernel.


          gerrit Gerrit Updater added a comment -

          Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/29032
          Subject: LU-9796 ldiskfs: improve inode allocation performace
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 0f94e45786dc92f9ec048908234153a547d64462


          paf Patrick Farrell (Inactive) added a comment -

          Ah, I like that much better! And it looks like it's much faster too, at least for creates. Nice. (In fact, faster than the original code before the regression, isn't it?)

          wangshilong Wang Shilong (Inactive) added a comment - - edited

          Hello Patrick,

          I have refreshed the patch to use a new approach, which looks better to me.
          https://patchwork.ozlabs.org/patch/799014/

          I don't like a timed sleep either.


          paf Patrick Farrell (Inactive) added a comment -

          Ah, sorry for my misunderstanding.

          So then it seems your problem is "inode stealing", where another thread uses the inode bit/number in the group before you can do it. So you end up contending for that lock because you're having to try over and over... (Which also is why it makes some sense that you count attempts rather than look directly at lock contention.)

          I don't know what the comments on the list have been (it looks like silence so far), but it really bothers me to see "insert an arbitrary timer delay" as the solution here. That doesn't seem very future proof. Isn't there something we can do directly about the stealing? Increase lock coverage, change how we find the bit, set it to in use before we do all the testing and unset it if the testing fails, that sort of thing? It's hard to say exactly what would be safe.

          Because it looks like the current problem is every thread is doing work for every inode, but only one is really getting to use the work it does. So perhaps we should lock around the find_next_zero_bit and set the bit there. That complicates the error path, but it seems like it would (mostly) guarantee forward progress for each thread. Perhaps that's not safe for other reasons, perhaps we can't set that bit (even temporarily) until we know the other things we check... I don't know...

          But inserting a timed sleep...
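
          For illustration, a minimal sketch of the "claim the bit under the group lock" idea described above, assuming ext4/ldiskfs internals (ext4_lock_group(), ext4_find_next_zero_bit(), ext4_set_bit(), EXT4_INODES_PER_GROUP()); the helper name claim_free_inode is hypothetical, and this is not the patch that actually landed on this ticket:

          /*
           * Sketch only -- assumes the fs/ext4 internal headers (ialloc.c context).
           * Idea: search for and claim the inode bit while holding the group lock,
           * so a concurrent allocator in the same group cannot "steal" the slot
           * between the search and the set, which is what forces the retry loop
           * (and the lock contention) today.
           */
          static int claim_free_inode(struct super_block *sb, ext4_group_t group,
                                      struct buffer_head *inode_bitmap_bh,
                                      unsigned long *ret_ino)
          {
                  unsigned long max = EXT4_INODES_PER_GROUP(sb);
                  unsigned long ino;

                  ext4_lock_group(sb, group);
                  ino = ext4_find_next_zero_bit((unsigned long *)
                                                inode_bitmap_bh->b_data, max, 0);
                  if (ino >= max) {
                          ext4_unlock_group(sb, group);
                          return -ENOSPC;  /* group is full, caller tries the next group */
                  }
                  /* Claim it now; any later check that fails must clear it again under the lock. */
                  ext4_set_bit(ino, inode_bitmap_bh->b_data);
                  ext4_unlock_group(sb, group);

                  *ret_ino = ino;
                  return 0;
          }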

          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Patrick,
          The queued spinlock you mentioned is CONFIG_QUEUED_SPINLOCKS, right?
          Good to know; we weren't aware of it, but it was enabled in all of our upstream testing.

          # grep -i spinlocks .config 
          CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
          CONFIG_QUEUED_SPINLOCKS=y
          CONFIG_PARAVIRT_SPINLOCKS=y
          # Lock Debugging (spinlocks, mutexes, etc...)
          

          But we saw the same contention even with that enabled.


          wangshilong Wang Shilong (Inactive) added a comment -

          Hello Patrick,

          We did the same benchmark test on RHEL7 and the latest upstream kernel, and the result is the same: we hit
          the same lock contention problem, and the patch consistently improved the performance. At least there are
          some benchmark numbers in the patch changelog that you may not have noticed...

          Thanks,
          Shilong


          People

            hongchao.zhang Hongchao Zhang
            ihara Shuichi Ihara (Inactive)
            Votes: 0
            Watchers: 12
