[LU-3305] Quotas affect Metadata performance Created: 09/May/13 Updated: 17/Sep/13 Resolved: 05/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Hyperion/LLNL |
||
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 8187 |
| Description |
|
We performed a comparison between 2.3.0, 2.1.5 and current Lustre. We saw a regression in metadata performance compared to 2.3.0. Spreadsheet attached. |
| Comments |
| Comment by Keith Mannthey (Inactive) [ 09/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Do you have any other data from the runs to share? Did you confirm the storage was OK for all 3 runs? What did the I/O stats look like on the MDS? What sort of regular variation are you seeing between runs? What did the Lustre stats in /proc look like? Is there any oprofile data from the MDS? |
| Comment by Andreas Dilger [ 09/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Cliff, is it possible for you to easily run the same "mds-survey" test that you ran for 2.3.63 DNE testing on the same hardware with 2.3.0, so that we can compare MDS-local performance between the two versions? That would help us isolate whether the performance degradation is on the MDS side, or if it relates to the object creation. Failing that, is it possible to get Minh to run mds-survey (with and without object creation) on 2.3.0 and 2.3.63 for comparison. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Cliff White (Inactive) [ 09/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
The test itself is not difficult; it's a question of availability/timing versus other requests for Hyperion. |
| Comment by Cliff White (Inactive) [ 09/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Keith, the runs are an average of 5 iterations for IOR. The storage was fine for all three runs. We've stopped using the problematic nodes for these tests. There is no oprofile data from the MDS. |
| Comment by Robert Read (Inactive) [ 09/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Do you mean mdtest instead of IOR? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Cliff White (Inactive) [ 13/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Keith was referring to | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 14/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
OpenSFS cluster run: 2.3.65 vs 2.3.0 |
| Comment by Liang Zhen (Inactive) [ 15/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
2.3/2.4 mdtest performance data | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Liang Zhen (Inactive) [ 15/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
As we can see from these graphs, the major difference is in the directory-per-process creation performance, but the stddev of all creation tests is very high; for example, for the 2.3 directory-per-process creation case the stddev is above 50% of the average value. One possible reason could be journal size. I think our default journal size is still 1GB? I would suggest using at least a 2GB internal journal, and repeating the tests 10 or more times.
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 16/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I also hit the same regression on the latest master commit and started to track down where the regression was introduced; "git bisect" would help. I will post soon which commit causes this regression. |
| Comment by Minh Diep [ 17/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
comparison between 2.3 and 2.4.50 RC1 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 17/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Unique directory
It seems that there are several regression points between 2.3 and 2.4, but at least in my small-scale client testing (mdtest on 16 clients, 32 processes), commit 9ddf386035767a96b54e21559f3ea0be126dc8cd might be one of the regression points for the unlink operation. (I collected more data at other commit points, but this is one of the points with a big difference between commits.) |
| Comment by Peter Jones [ 17/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Niu, are you able to advise on the reported drop due to the quotas landing? Alex/Nasf, any comments? Thanks, Peter |
| Comment by Andreas Dilger [ 17/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This is a previously known problem discussed in It may well be that there is still a performance impact from the "on-by-default" quota accounting, which is worthwhile to test before trying to find some other cause for this regression. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 22/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
oprofile data for mds-survey run:

[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey |
| Comment by Minh Diep [ 22/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is some info about turning off quota:

quota on (default)
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey

quota off via tune2fs -O ^quota <dev>
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey |
| Comment by Niu Yawei (Inactive) [ 23/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
From Minh's result we can see that, because of quota file updating, when testing 256 threads over 256 directories (1 thread per directory, no contention on parent directory updating), create/unlink without quota is faster than create/unlink with quota on. I think the oprofile data confirms it:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name     symbol name
2276160  46.2251  vmlinux      vmlinux      dqput
963873   19.5747  vmlinux      vmlinux      dqget
335277    6.8089  ldiskfs      ldiskfs      /ldiskfs
258028    5.2401  vmlinux      vmlinux      dquot_mark_dquot_dirty
110819    2.2506  osd_ldiskfs  osd_ldiskfs  /osd_ldiskfs
76925     1.5622  obdclass     obdclass     /obdclass
58193     1.1818  mdd          mdd          /mdd
41931     0.8516  vmlinux      vmlinux      __find_get_block
32408     0.6582  lod          lod          /lod
20711     0.4206  jbd2.ko      jbd2.ko      jbd2_journal_add_journal_head
18598     0.3777  jbd2.ko      jbd2.ko      do_get_write_access
18579     0.3773  vmlinux      vmlinux      __find_get_block_slow
18364     0.3729  libcfs       libcfs       /libcfs
17833     0.3622  oprofiled    oprofiled    /usr/bin/oprofiled
17472     0.3548  vmlinux      vmlinux      mutex_lock

I'm not sure if we can still improve the performance (with quota) further in this respect, because the single quota file updating can always be the bottleneck. |
| Comment by Niu Yawei (Inactive) [ 23/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Looking closer into dqget()/dqput(), I realized that there are still quite a few global locks in the quota code: dq_list_lock, dq_state_lock, dq_data_lock. The fix of |
| Comment by Andreas Dilger [ 23/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I find it strange that dqget() is called 2M times, but it looks like only 20k blocks are being allocated (based on the ldiskfs and jbd2 call counts). Before trying to optimize the speed of that function, it is probably better to reduce the number of times it is called. It is also a case where the same quota entry is being accessed on every call (same UID and GID each time), so I wonder if that common case could be optimized in some way? Are any of these issues fixed in the original quota patches? Unfortunately, since all of the threads are contending to update the same record, there isn't an easy way to reduce contention. The only thing I can think of is to have a journal pre-commit callback that does only a single quota update to disk per transaction, and uses percpu counters for the per-quota-per-transaction updates in memory. That would certainly avoid contention, and is no less correct in the face of a crash. No idea how easy that would be to implement. |
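To make the batching idea above concrete, here is a rough, hypothetical C sketch (not ext4 or Lustre code): per-CPU counters absorb the in-memory per-ID updates without taking any global quota lock, and a journal pre-commit callback folds them into a single quota-file write per transaction. The names quota_batch, qb_blocks and quota_write_delta_to_disk() are invented for illustration.

```c
/*
 * Hypothetical sketch only -- not ext4/Lustre code.  Per-CPU counters
 * accumulate the per-ID quota deltas for the running transaction; the
 * journal pre-commit callback folds them into one quota-file update.
 */
#include <linux/percpu_counter.h>

/* hypothetical helper, provided elsewhere */
void quota_write_delta_to_disk(s64 delta);

struct quota_batch {
	struct percpu_counter qb_blocks;  /* pending block-usage delta for one ID */
};

static int quota_batch_init(struct quota_batch *qb)
{
	/* note: newer kernels take an extra gfp_t argument here */
	return percpu_counter_init(&qb->qb_blocks, 0);
}

/* Hot path: called by each operation; no global quota lock is taken. */
static void quota_batch_account(struct quota_batch *qb, s64 delta)
{
	percpu_counter_add(&qb->qb_blocks, delta);
}

/* Journal pre-commit callback: one quota-file update per transaction. */
static void quota_batch_precommit(struct quota_batch *qb)
{
	s64 delta = percpu_counter_sum(&qb->qb_blocks);

	if (delta)
		quota_write_delta_to_disk(delta);
	percpu_counter_set(&qb->qb_blocks, 0);
}

static void quota_batch_fini(struct quota_batch *qb)
{
	percpu_counter_destroy(&qb->qb_blocks);
}
```

Crash-safety would rely on the folded update being written within the same transaction that carries the metadata changes, as suggested above.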
| Comment by Niu Yawei (Inactive) [ 23/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
dqget()/dqput() mainly get/drop a reference on the in-memory per-ID dquot data, and they acquire global locks such as dq_list_lock & dq_state_lock (since they look up the dquot list and do some state checking), so contention on those global locks can be severe in this test case. If we could replace them with RCU or a read/write lock, things would be better. I heard from Lai that there were some old patches which tried to remove those global locks, but they didn't gain much interest from the community and were not reviewed. Lai, could you comment on this? Regarding the quota record commit (mark_dquot_dirty() -> ext4_mark_dquot_dirty() -> ext4_write_dquot() -> dquot_commit(), which should happen along with each transaction), it does require global locks: dqio_mutex & dq_list_lock. Surprisingly, I didn't see it in the top oprofile samples; that might just be because there are far more dqget()/dqput() calls than dquot commit calls. Once we resolve the bottleneck in dqget()/dqput(), the contention in dquot commit will probably come to light. |
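For reference, this is roughly where those global locks sit in the stock fs/quota/dquot.c of that era. The excerpt below is heavily abridged and simplified (hashing, allocation, waiting and error handling omitted), but it shows why every create/unlink pays for dq_list_lock on both the get and the put, even when the same dquot is already cached.

```c
/* Heavily abridged/simplified from fs/quota/dquot.c (RHEL6-era kernels). */
struct dquot *dqget(struct super_block *sb, qid_t id, int type)
{
	struct dquot *dquot;

	spin_lock(&dq_list_lock);          /* global lock on every lookup */
	spin_lock(&dq_state_lock);         /* global lock for the state check */
	/* ... sb_has_quota_active(sb, type) check ... */
	spin_unlock(&dq_state_lock);

	dquot = find_dquot(hashfn(sb, id, type), sb, id, type);
	if (dquot)
		atomic_inc(&dquot->dq_count);
	/* ... else allocate a new dquot and insert it into the hash ... */
	spin_unlock(&dq_list_lock);

	/* ... read the on-disk record if the dquot is not yet active ... */
	return dquot;
}

void dqput(struct dquot *dquot)
{
	spin_lock(&dq_list_lock);          /* global lock again on every put */
	if (atomic_read(&dquot->dq_count) > 1) {
		atomic_dec(&dquot->dq_count);
		spin_unlock(&dq_list_lock);
		return;
	}
	/* ... last reference: write back if dirty, move to the free list ... */
	spin_unlock(&dq_list_lock);
}
```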
| Comment by Niu Yawei (Inactive) [ 24/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Instead of eliminating the global locks entirely, maybe a small fix in dquot_initialize() could relieve the contention caused by dqget()/dqput(): in dquot_initialize(), we would call dqget() only when i_dquot is not yet initialized, which avoids two pairs of dqget()/dqput() calls in most cases. I'll propose a patch soon. |
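A minimal sketch of that idea against the stock dquot_initialize() path is shown below. This is simplified and not the exact patch (the real change is the Gerrit patch referenced in the next comment, and function names and signatures vary a bit between kernel versions): the point is simply to skip dqget() for quota types whose i_dquot slot is already populated, which is the common case after the inode's first initialization.

```c
/* Simplified sketch of the proposed check, not the exact patch. */
static void __dquot_initialize(struct inode *inode, int type)
{
	int cnt;
	qid_t id = 0;
	struct super_block *sb = inode->i_sb;
	struct dquot *got[MAXQUOTAS] = { NULL };

	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
		if (type != -1 && cnt != type)
			continue;
		/*
		 * i_dquot is usually already set up after the first call for
		 * this inode; checking it here avoids an unnecessary
		 * dqget()/dqput() pair (and the global locks they take).
		 */
		if (inode->i_dquot[cnt])
			continue;
		switch (cnt) {
		case USRQUOTA:
			id = inode->i_uid;
			break;
		case GRPQUOTA:
			id = inode->i_gid;
			break;
		}
		got[cnt] = dqget(sb, id, cnt);
	}
	/*
	 * ... the rest of the function installs got[] into i_dquot under the
	 * usual locks and dqput()s any references that were not used ...
	 */
}
```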
| Comment by Andreas Dilger [ 24/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Niu's patch is at http://review.whamcloud.com/6440. Minh, would you be able to run another set of tests with the latest patch applied, and produce a graph like: https://jira.hpdd.intel.com/secure/attachment/12415/mdtest_create.png so it is easier to see what the differences are? Presumably with 5 runs it would be useful to plot the standard deviation, since I see from the text results you posted above that the performance can vary dramatically between runs. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 24/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
yes, will do when the cluster's IB network is back online next week | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James A Simmons [ 28/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This patch will need to be ported to SLES11 SP[1/2] as well. Later in the week I can include it in the patch. |
| Comment by Andreas Dilger [ 28/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
James, please submit the SLES changes as a separate patch. Since this doesn't affect the API, the two changes do not need to be in the same commit. If the other patch needs to be refreshed for some other reason they can be merged. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by James A Simmons [ 28/May/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Fine with me. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 04/Jun/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
performance data for the patch | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Niu Yawei (Inactive) [ 05/Jun/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks a lot, Minh. It looks like the patch improves create/rm performance overall, but there are a few things in the figure that look strange to me, and I don't know why:
BTW: why do we put the create & rm data into the same figure? I think they are two distinct tests, aren't they? |
| Comment by Niu Yawei (Inactive) [ 08/Jul/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I did a few tests on the Rosso cluster; the result is similar to what we got in

patched:

mdtest-1.8.3 was launched with 32 total task(s) on 1 nodes
Command line used: mdtest -d /mnt/ldiskfs -i 10 -n 25000 -u -F -r
Path: /mnt
FS: 19.7 GiB   Used FS: 17.5%   Inodes: 1.2 Mi   Used Inodes: 4.6%

32 tasks, 800000 files

SUMMARY: (of 10 iterations)
   Operation        Max        Min       Mean    Std Dev
   ---------        ---        ---       ----    -------
   File creation :    0.000     0.000     0.000    0.000
   File stat     :    0.000     0.000     0.000    0.000
   File removal  : 4042.032  1713.613  2713.827  698.243
   Tree creation :    0.000     0.000     0.000    0.000
   Tree removal  :    2.164     1.861     2.020    0.088

-- finished at 07/07/2013 20:37:36 --

CPU: Intel Sandy Bridge microarchitecture, speed 2.601e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples   %        app name      symbol name
10826148  11.6911  vmlinux       schedule
7089656    7.6561  vmlinux       update_curr
6432166    6.9461  vmlinux       sys_sched_yield
4384494    4.7348  vmlinux       __audit_syscall_exit
4088507    4.4152  libc-2.12.so  sched_yield
3441346    3.7163  vmlinux       system_call_after_swapgs
3337224    3.6038  vmlinux       put_prev_task_fair
3244213    3.5034  vmlinux       audit_syscall_entry
2844216    3.0715  vmlinux       thread_return
2702323    2.9182  vmlinux       rb_insert_color
2636798    2.8475  vmlinux       native_read_tsc
2234644    2.4132  vmlinux       sched_clock_cpu
2182744    2.3571  vmlinux       native_sched_clock
2175482    2.3493  vmlinux       hrtick_start_fair
2152807    2.3248  vmlinux       pick_next_task_fair
2130024    2.3002  vmlinux       set_next_entity
2099576    2.2673  vmlinux       rb_erase
1790101    1.9331  vmlinux       update_stats_wait_end
1777328    1.9193  vmlinux       mutex_spin_on_owner
1701672    1.8376  vmlinux       sysret_check

unpatched:

mdtest-1.8.3 was launched with 32 total task(s) on 1 nodes
Command line used: mdtest -d /mnt/ldiskfs -i 10 -n 25000 -u -F -r
Path: /mnt
FS: 19.7 GiB   Used FS: 17.8%   Inodes: 1.2 Mi   Used Inodes: 4.6%

32 tasks, 800000 files

SUMMARY: (of 10 iterations)
   Operation        Max        Min       Mean    Std Dev
   ---------        ---        ---       ----    -------
   File creation :    0.000     0.000     0.000    0.000
   File stat     :    0.000     0.000     0.000    0.000
   File removal  : 2816.345  1673.085  2122.347  342.119
   Tree creation :    0.000     0.000     0.000    0.000
   Tree removal  :    2.296     0.111     1.361    0.866

-- finished at 07/07/2013 21:11:03 --

CPU: Intel Sandy Bridge microarchitecture, speed 2.601e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples   %        image name    app name      symbol name
23218790  18.3914  vmlinux       vmlinux       dqput
9549739    7.5643  vmlinux       vmlinux       __audit_syscall_exit
9116086    7.2208  vmlinux       vmlinux       schedule
8290558    6.5669  vmlinux       vmlinux       dqget
5576620    4.4172  vmlinux       vmlinux       update_curr
5343755    4.2327  vmlinux       vmlinux       sys_sched_yield
3251018    2.5751  libc-2.12.so  libc-2.12.so  sched_yield
2907579    2.3031  vmlinux       vmlinux       system_call_after_swapgs
2854863    2.2613  vmlinux       vmlinux       put_prev_task_fair
2793392    2.2126  vmlinux       vmlinux       audit_syscall_entry
2723949    2.1576  vmlinux       vmlinux       kfree
2551007    2.0206  vmlinux       vmlinux       mutex_spin_on_owner
2406364    1.9061  vmlinux       vmlinux       rb_insert_color
2321179    1.8386  vmlinux       vmlinux       thread_return
2184031    1.7299  vmlinux       vmlinux       native_read_tsc
2002277    1.5860  vmlinux       vmlinux       dquot_mark_dquot_dirty
1990135    1.5764  vmlinux       vmlinux       native_sched_clock
1970544    1.5608  vmlinux       vmlinux       set_next_entity
1967852    1.5587  vmlinux       vmlinux       pick_next_task_fair
1966282    1.5575  vmlinux       vmlinux       dquot_commit
1966271    1.5575  vmlinux       vmlinux       sysret_check
1919524    1.5204  vmlinux       vmlinux       unroll_tree_refs
1811281    1.4347  vmlinux       vmlinux       sched_clock_cpu
1810278    1.4339  vmlinux       vmlinux       rb_erase

The unlink rate increased ~28% with 32 threads, and the oprofile data shows that contention on dq_list_lock in dqput() is alleviated a lot. I think we should take this patch as a supplement to the fix of |
| Comment by Peter Jones [ 05/Sep/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Landed for 2.4.1 and 2.5 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Prakash Surya (Inactive) [ 16/Sep/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
It's hard to gather from the prior discussion, but is this the only patch that came out of this issue?

commit 58d2a322589ec13ee3c585c13b1c83f429d946ce
Author: Niu Yawei <yawei.niu@intel.com>
Date:   Thu May 23 23:49:03 2013 -0400

    LU-3305 quota: avoid unnecessary dqget/dqput calls

Thanks. |
| Comment by Andreas Dilger [ 17/Sep/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
There are two core kernel patches in the RHEL series on master that improve quota performance - both start with "quota". |