Details
Type: Improvement
Resolution: Fixed
Priority: Minor
Labels: None
Description
In general, there is some metadata performance regression when comparing runs with and without quota.
Now that project quota has been introduced, it is time to measure and improve metadata performance when quota is enabled.
Also, it seems the upstream kernel has some performance optimizations for ext4 when quota is enabled. It might be possible to bring those optimizations to Lustre.
Attachments
Activity
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/29032
Subject: LU-9796 ldiskfs: improve inode allocation performace
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f94e45786dc92f9ec048908234153a547d64462
Ah, I like that much better! And it looks like it's much faster too, at least for creates. Nice. (In fact, faster than the original code before the regression, isn't it?)
Hello Patrick,
I have refreshed the patch to use a new approach, which looks better to me.
https://patchwork.ozlabs.org/patch/799014/
I don't like a timed sleep either.
Ah, sorry for my misunderstanding.
So then it seems your problem is "inode stealing", where another thread uses the inode bit/number in the group before you can do it. So you end up contending for that lock because you're having to try over and over... (Which also is why it makes some sense that you count attempts rather than look directly at lock contention.)
I don't know what the comments on the list have been (it looks like silence so far), but it really bothers me to see "insert an arbitrary timer delay" as the solution here. That doesn't seem very future proof. Isn't there something we can do directly about the stealing? Increase lock coverage, change how we find the bit, set it to in use before we do all the testing and unset it if the testing fails, that sort of thing? It's hard to say exactly what would be safe.
Because it looks like the current problem is every thread is doing work for every inode, but only one is really getting to use the work it does. So perhaps we should lock around the find_next_zero_bit and set the bit there. That complicates the error path, but it seems like it would (mostly) guarantee forward progress for each thread. Perhaps that's not safe for other reasons, perhaps we can't set that bit (even temporarily) until we know the other things we check... I don't know...
But inserting a timed sleep...
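To make the "lock around the search and claim the bit there" idea above concrete, here is a minimal user-space sketch. This is not ext4/ldiskfs code: the bitmap, the spinlock, and validate_candidate() are stand-ins for the real group bitmap, group lock, and post-search checks, and are only meant to show why each thread can then keep its forward progress instead of re-finding a bit another thread already took.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NBITS 1024
#define BITS_PER_LONG ((int)(8 * sizeof(unsigned long)))

static unsigned long bitmap[NBITS / (8 * sizeof(unsigned long))];
static pthread_spinlock_t group_lock;

static bool bit_is_set(int i) { return bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)); }
static void bit_set(int i)    { bitmap[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG); }
static void bit_clear(int i)  { bitmap[i / BITS_PER_LONG] &= ~(1UL << (i % BITS_PER_LONG)); }

/* Stand-in for the checks done after a free bit is found. */
static bool validate_candidate(int bit) { (void)bit; return true; }

/*
 * Find a free bit and mark it used while still holding the lock, so no
 * other thread can "steal" it between the search and the claim.  If the
 * later validation fails, the claim is undone under the same lock.
 */
static int claim_free_bit(void)
{
        int bit = -1;

        pthread_spin_lock(&group_lock);
        for (int i = 0; i < NBITS; i++) {
                if (!bit_is_set(i)) {
                        bit_set(i);        /* claim before validating */
                        bit = i;
                        break;
                }
        }
        pthread_spin_unlock(&group_lock);

        if (bit >= 0 && !validate_candidate(bit)) {
                pthread_spin_lock(&group_lock);
                bit_clear(bit);            /* error path: release the claim */
                pthread_spin_unlock(&group_lock);
                bit = -1;
        }
        return bit;
}

int main(void)
{
        pthread_spin_init(&group_lock, PTHREAD_PROCESS_PRIVATE);
        printf("claimed bit %d\n", claim_free_bit());
        printf("claimed bit %d\n", claim_free_bit());
        pthread_spin_destroy(&group_lock);
        return 0;
}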
Patrick,
The queued spinlock you mentioned is CONFIG_QUEUED_SPINLOCKS, right?
Good to know; we were not aware of it, but it was enabled in all of our upstream testing.
# grep -i spinlocks .config
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_PARAVIRT_SPINLOCKS=y
# Lock Debugging (spinlocks, mutexes, etc...)
But we saw the same contention even with that enabled.
Hello Patrick,
We did run the same benchmark test on RHEL7 and the latest upstream kernel, and the result is the same: we hit the
same lock contention problem, and the patch consistently improved the performance. There are
some benchmark numbers in the patch changelog that you may not have noticed...
Thanks,
Shilong
Shilong,
I'm not on the ext4 list so I won't comment there, but that code should almost certainly use ext4_fs_is_busy rather than counting attempts directly.
Separately, did you try that patch on the current upstream kernel?
Is the problem spinlock contention (most time spent in lock/unlock, not just waiting for the lock - a problem which is fixed in newer kernels) or is it actually waiting for the lock? (Most time spent waiting for lock, because it really is held.)
Anyway, if you haven't, you should try this on a newer kernel - spinlock contention as a performance problem is more or less fixed with queued spinlocks. (Multiple waiters for a spinlock now have minimal performance impact on lock/unlock, whereas in earlier kernels multiple waiters caused locking and unlocking to take many times longer. It won't fix the problem of having to wait for the lock, but it removes lock contention itself as the cause of performance issues.) RHEL7 doesn't have them, sadly.
If your problem really is spinlock contention, it's not going to show up on newer kernels. We might still want the patch for RHEL7.
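To illustrate the "check contention rather than count attempts" point, here is a small user-space sketch. It is not the real ext4_fs_is_busy() from the ext4 sources; the trylock-based busy counter, the threshold, and the function names below are purely illustrative assumptions for the backoff idea.

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_spinlock_t group_lock;
static atomic_int lock_busy;            /* rough "recent contention" score */

/* Take the lock, remembering whether we had to wait for it. */
static void lock_group(void)
{
        if (pthread_spin_trylock(&group_lock) == 0) {
                if (atomic_load(&lock_busy) > 0)
                        atomic_fetch_sub(&lock_busy, 1);   /* contention easing */
                return;
        }
        atomic_fetch_add(&lock_busy, 1);                   /* we had to wait */
        pthread_spin_lock(&group_lock);
}

static void unlock_group(void)
{
        pthread_spin_unlock(&group_lock);
}

/* Analogue of "is the fs busy?": driven by observed lock contention,
 * not by how many allocation attempts this caller has already made. */
static bool group_is_busy(void)
{
        return atomic_load(&lock_busy) > 4;   /* arbitrary threshold */
}

/* Caller side: back off (yield) when contention is high, instead of
 * retrying hard or sleeping for a fixed time. */
static void allocate_one(void)
{
        if (group_is_busy())
                sched_yield();
        lock_group();
        /* ... search and claim a free inode bit here ... */
        unlock_group();
}

int main(void)
{
        pthread_spin_init(&group_lock, PTHREAD_PROCESS_PRIVATE);
        allocate_one();
        pthread_spin_destroy(&group_lock);
        return 0;
}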
The upstream ext4 patch can be seen here:
http://marc.info/?l=linux-ext4&m=150164892405614&w=2
After more investigation, Shilong made a prototype patch to reduce contention on _raw_spin_lock(). He can submit the patch soon.
Without patch (Unique Directory)
# for i in `seq 0 4`; do sleep 5; tests_str="create" thrhi=48 thrlo=48 file_count=2000000 \
  dir_count=48 mds-survey; sleep 5; tests_str="destroy" thrhi=48 thrlo=48 file_count=2000000 \
  dir_count=48 mds-survey; done

    create     destroy
1   135,864    241,262
2   137,432    245,305
3   135,347    228,004
4   135,641    223,845
5   137,300    242,719
With patch (Unique Directory)
# for i in `seq 0 4`; do sleep 5; tests_str="create" thrhi=48 thrlo=48 file_count=2000000 \
  dir_count=48 mds-survey; sleep 5; tests_str="destroy" thrhi=48 thrlo=48 file_count=2000000 \
  dir_count=48 mds-survey; done

    create     destroy
1   223,178    273,882
2   181,892    269,771
3   202,841    235,435
4   195,022    220,466
5   193,564    263,998
We are getting about a 50% performance improvement for creation for now.
Patch https://review.whamcloud.com/28276 improves performance on Lustre, but causes a performance regression with mds-survey because of a lot of _raw_spin_lock() calls.
Without patch
[root@mds04 lustre-release-ee-ddn]# tests_str="create" thrhi=48 thrlo=48 file_count=2000000 dir_count=48 mds-survey
Tue Aug 1 16:13:54 JST 2017 /usr/bin/mds-survey from mds04
mdt 1 file 2000000 dir 48 thr 48 create 119346.25 [ 38996.65, 143993.95]
With patch
[root@mds04 ~]# tests_str="create" thrhi=48 thrlo=48 file_count=2000000 dir_count=48 mds-survey
Tue Aug 1 17:01:32 JST 2017 /usr/bin/mds-survey from mds04
mdt 1 file 2000000 dir 48 thr 48 create 35689.72 [ 24997.45, 46995.91]
Running perf-tools/bin/funccost '_raw_spin_lock*,ldiskfs*,jbd2*' shows the difference in function costs with and without the patch.
without patch
perf-tools/bin/funccost '_raw_spin_lock*,ldiskfs*,jbd2*'
FUNC                            TOTAL_TIME(us)  COUNT      AVG(us)
ldiskfs_create_inode            668090967       2000016    334.04
jbd2_journal_get_write_access   660607521       28364899   23.29
ldiskfs_dx_add_entry            76861625        1994470    38.54
ldiskfs_mark_inode_dirty        31383256        8050242    3.90
ldiskfs_reserve_inode_write     30442931        12050274   2.53
ldiskfs_mark_dquot_dirty        29298284        4051620    7.23
ldiskfs_write_dquot             27393055        4051620    6.76
ldiskfs_xattr_trusted_set       25413629        4000032    6.35
ldiskfs_xattr_set               24943944        4000032    6.24
ldiskfs_xattr_set_handle        22973138        4000032    5.74
jbd2_journal_dirty_metadata     20879428        23877421   0.87
ldiskfs_bread                   18514454        15808891   1.17
ldiskfs_mark_iloc_dirty         18151829        12050274   1.51
ldiskfs_get_inode_loc           16463367        18050473   0.91
ldiskfs_dirty_inode             15518054        4025922    3.85
ldiskfs_getblk                  15396652        15816550   0.97
_raw_spin_lock_irqsave          13349523        41367053   0.32
jbd2_journal_put_journal_head   10369726        52749533   0.20
ldiskfs_find_entry              10361625        2002129    5.18
ldiskfs_dx_find_entry           10064239        1994470    5.05
ldiskfs_quota_write             9133495         3600359    2.54
_raw_spin_lock                  8851352         104218345  0.08
with patch
perf-tools/bin/funccost '_raw_spin_lock*,ldiskfs*,jbd2*'
FUNC                            TOTAL_TIME(us)  COUNT       AVG(us)
ldiskfs_create_inode            2510051302      2000016     1255.02
_raw_spin_lock                  2382710363      191888590   12.42
ldiskfs_read_inode_bitmap       57616803        2002599     28.77
jbd2_journal_get_write_access   44399656        103374244   0.43
ldiskfs_dx_add_entry            42191184        1994470     21.15
ldiskfs_bread                   31150135        16197888    1.92
ldiskfs_mark_inode_dirty        19684052        8050194     2.45
ldiskfs_mark_dquot_dirty        18764931        4051620     4.63
ldiskfs_write_dquot             17041111        4051620     4.21
ldiskfs_xattr_trusted_set       14533087        4000032     3.63
ldiskfs_reserve_inode_write     14086373        12050226    1.17
ldiskfs_xattr_set               14036664        4000032     3.51
ldiskfs_getblk                  13523271        16205547    0.83
ldiskfs_xattr_set_handle        11857052        4000032     2.96
ldiskfs_mark_iloc_dirty         11561327        12050226    0.96
ldiskfs_get_inode_loc           11498196        18050426    0.64
ldiskfs_find_entry              10792485        2002129     5.39
Attached are the mds-survey test results (file creation) with/without the patch: https://jira.hpdd.intel.com/secure/attachment/28338/LU-9796.png
1 x MDS(1 x Platinum 8160 CPU, 128GB DDR4 memory)
1 x MDT(SFA7700X, RAID10 with 4 x Toshiba RI SSD)
Lustre-2.10.1CR (user/group quota enabled) on CentOS7.3
The patch helps speed up file creation, especially at high thread counts. We are now reaching ~250K ops/sec for file creation with a single MDT on the RHEL7 kernel.