[LU-3305] Quotas affect Metadata performance Created: 09/May/13  Updated: 17/Sep/13  Resolved: 05/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Hyperion/LLNL


Attachments: Hyperion Performance 07 May 2013.xlsx     Opensfs Metadata Performance RC1.xlsx     Opensfs Metadata Performance quota patch.xlsx     Opensfs Metadata Performance.xlsx     Screen Shot 2013-05-15 at 2.18.52 PM.png     oprofile.tgz
Issue Links:
Blocker
is blocked by LU-3396 Disable quota by default for 2.4 Closed
Severity: 3
Rank (Obsolete): 8187

 Description   

We performed a comparison between 2.3.0, 2.1.5, and current Lustre. We saw a regression in metadata performance compared to 2.3.0. Spreadsheet attached.



 Comments   
Comment by Keith Mannthey (Inactive) [ 09/May/13 ]

Do you have any other data from the runs to share? Did you confirm the storage was OK for all 3 runs? What did the I/O stats look like on the MDS?

What sort of run-to-run variation are you normally seeing? The LU-3281 and 2.3.64 runs should be about the same.

What did the Lustre stats in /proc look like?

Is there any oprofile data from the MDS?

Comment by Andreas Dilger [ 09/May/13 ]

Cliff, is it possible for you to easily run the same "mds-survey" test that you ran for 2.3.63 DNE testing on the same hardware with 2.3.0, so that we can compare MDS-local performance between the two versions? That would help us isolate whether the performance degradation is on the MDS side, or if it relates to the object creation.

Failing that, is it possible to get Minh to run mds-survey (with and without object creation) on 2.3.0 and 2.3.63 for comparison?

Comment by Cliff White (Inactive) [ 09/May/13 ]

The test itself is not difficult, it's a question of availability/timing vs other requests for Hyperion.

Comment by Cliff White (Inactive) [ 09/May/13 ]

Keith, the runs are an average of 5 iterations for IOR. The storage was fine for all three runs. We've stopped using the problematic nodes for these tests. There is no oprofile data from the MDS.

Comment by Robert Read (Inactive) [ 09/May/13 ]

Do you mean mdtest instead of IOR?

Comment by Cliff White (Inactive) [ 13/May/13 ]

Keith was referring to LU-3281, which was a bug involving IOR.
We also run 5 iterations for mdtest.

Comment by Minh Diep [ 14/May/13 ]

Opensfs cluster run

2.3.65
mdtest fpp: https://maloo.whamcloud.com/test_sets/697bfbb2-bca3-11e2-b6d1-52540035b04c
mdtest ssf: https://maloo.whamcloud.com/test_sets/6b79a220-bca3-11e2-8441-52540035b04c

2.3.0
mdtest fpp: https://maloo.whamcloud.com/test_sets/0206a48c-bcb7-11e2-8441-52540035b04c
mdtest ssf: https://maloo.whamcloud.com/test_sets/054d925e-bcb7-11e2-8441-52540035b04c

Comment by Liang Zhen (Inactive) [ 15/May/13 ]

2.3/2.4 mdtest performance data

Comment by Liang Zhen (Inactive) [ 15/May/13 ]

As we can see from these graphs, the major difference is in directory-per-process creation performance, but the stddev of all creation tests is very high; for example, for 2.3 directory-per-process creation, the stddev is above 50% of the average value. One possible reason could be journal size; I think our default journal size is still 1GB? I would suggest using at least a 2GB internal journal, and repeating 10 or more times.
Another thing I cannot explain is why shared-directory file stat performance is so much worse than dir-per-process.

Comment by Shuichi Ihara (Inactive) [ 16/May/13 ]

I also hit the same regression on the latest master commit and started to track down where the regression began; "git bisect" should help. I will post soon which commit causes this regression.

Comment by Minh Diep [ 17/May/13 ]

comparison between 2.3 and 2.4.50 RC1

Comment by Shuichi Ihara (Inactive) [ 17/May/13 ]

Unique directory

version commit Dir creation Dir stat Dir removal File creation File stat File removal
v2_3_50_0-143 9ddf386035767a96b54e21559f3ea0be126dc8cd 22167 194845 32910 50090 132461 36762
v2_3_50_0-142 c61d09e9944ae47f68eb159224af7c5456cc180a 21835 203123 46178 45941 147281 67253
v2_3_50_0-141 48bad5d9db9baa7bca093de5c54294adf1cf8303 20542 202014 48271 46264 142450 66267
2.3.50 04ec54ff56b83a9114f7a25fbd4aa5f65e68ef7a 24727 226128 34064 27859 124521 34591
2.3 ee695bf0762f5dbcb2ac6d96354f8d01ad764903 23565 222482 44421 48333 125765 76709

It seems that there are several regression points between 2.3 and 2.4, but at least in my small-scale client testing (mdtest on 16 clients, 32 processes), commit 9ddf386035767a96b54e21559f3ea0be126dc8cd might be one of the regression points for the unlink operation. (I collected more data at other commit points, but this is one of the points with a big difference between commits.)
I'm still working to verify whether this is exactly the point; I will run in a larger client environment.

Comment by Peter Jones [ 17/May/13 ]

Niu

Are you able to advise on the reported drop due to the quotas landing?

Alex/Nasf

Any comments?

Thanks

Peter

Comment by Andreas Dilger [ 17/May/13 ]

This is a previously known problem discussed in LU-2442. The "on-by-default" quota accounting introduced serialization in the quota layer that broke the SMP scaling optimizations done in the 2.3 code. This wasn't fixed until v2_3_63_0-35-g6df197d (http://review.whamcloud.com/5010), so the quota overhead will hide any other metadata performance regressions when testing on a faster system, unless the quota feature is disabled on the MDS (tune2fs -O ^quota /dev/mdt).

It may well be that there is still a performance impact from the "on-by-default" quota accounting, which is worthwhile to test before trying to find some other cause for this regression.

Comment by Minh Diep [ 22/May/13 ]

oprofile data for mds-survey run:

[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 09:45:01 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 39545.05 [39545.05,39545.05] lookup 2577588.87 [2577588.87,2577588.87] md_getattr 1247558.37 [1247558.37,1247558.37] setxattr 57218.08 [57218.08,57218.08] destroy 33400.55 [33400.55,33400.55]
done!

Comment by Minh Diep [ 22/May/13 ]

Here is some info about turning off quota:

quota on (default)

[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:09:18 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 44082.00 [44082.00,44082.00] lookup 2355576.51 [2355576.51,2355576.51] md_getattr 1223878.40 [1223878.40,1223878.40] setxattr 29649.06 [29649.06,29649.06] destroy 46921.25 [46921.25,46921.25]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:10:48 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 44897.20 [44897.20,44897.20] lookup 2659938.47 [2659938.47,2659938.47] md_getattr 1321129.64 [1321129.64,1321129.64] setxattr 57581.92 [57581.92,57581.92] destroy 35684.58 [35684.58,35684.58]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:11:32 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 43014.95 [43014.95,43014.95] lookup 2802301.45 [2802301.45,2802301.45] md_getattr 1348641.14 [1348641.14,1348641.14] setxattr 32394.97 [32394.97,32394.97] destroy 39988.76 [39988.76,39988.76]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:12:10 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 42977.31 [42977.31,42977.31] lookup 2828152.94 [2828152.94,2828152.94] md_getattr 1357190.19 [1357190.19,1357190.19] setxattr 34235.71 [34235.71,34235.71] destroy 48435.62 [48435.62,48435.62]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:12:53 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 42984.65 [42984.65,42984.65] lookup 2782484.74 [2782484.74,2782484.74] md_getattr 1344989.45 [1344989.45,1344989.45] setxattr 58604.47 [58604.47,58604.47] destroy 34678.90 [34678.90,34678.90]
done!

quota off via tune2fs -O ^quota <dev>

[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:16:30 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 57635.67 [57635.67,57635.67] lookup 2949092.93 [2949092.93,2949092.93] md_getattr 1439917.94 [1439917.94,1439917.94] setxattr 59857.89 [59857.89,59857.89] destroy 58406.76 [58406.76,58406.76]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:16:53 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 57729.97 [57729.97,57729.97] lookup 2745549.65 [2745549.65,2745549.65] md_getattr 1450702.84 [1450702.84,1450702.84] setxattr 31255.98 [31255.98,31255.98] destroy 71909.76 [71909.76,71909.76]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:17:16 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 57145.08 [57145.08,57145.08] lookup 2610389.07 [2610389.07,2610389.07] md_getattr 1456521.11 [1456521.11,1456521.11] setxattr 31354.29 [31354.29,31354.29] destroy 75608.29 [75608.29,75608.29]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:17:55 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 53349.04 [53349.04,53349.04] lookup 2678071.11 [2678071.11,2678071.11] md_getattr 1390181.88 [1390181.88,1390181.88] setxattr 34123.68 [34123.68,34123.68] destroy 148934.27 [148934.27,148934.27]
done!
[root@mds03 lu3305]# file_count=250000 thrlo=256 thrhi=256 /usr/bin/mds-survey
Wed May 22 13:18:31 PDT 2013 /usr/bin/mds-survey from mds03
mdt 1 file 250000 dir 256 thr 256 create 54390.24 [54390.24,54390.24] lookup 2724295.09 [2724295.09,2724295.09] md_getattr 1428483.22 [1428483.22,1428483.22] setxattr 34446.16 [34446.16,34446.16] destroy 154758.66 [154758.66,154758.66]
done!

Comment by Niu Yawei (Inactive) [ 23/May/13 ]

From Minh's results we can see that, because of quota file updating, when testing 256 threads over 256 directories (1 thread per directory, so no contention on parent directory updates), create/unlink without quota is faster than create/unlink with quota enabled. I think the oprofile data confirms it:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name                 symbol name
2276160  46.2251  vmlinux                  vmlinux                  dqput
963873   19.5747  vmlinux                  vmlinux                  dqget
335277    6.8089  ldiskfs                  ldiskfs                  /ldiskfs
258028    5.2401  vmlinux                  vmlinux                  dquot_mark_dquot_dirty
110819    2.2506  osd_ldiskfs              osd_ldiskfs              /osd_ldiskfs
76925     1.5622  obdclass                 obdclass                 /obdclass
58193     1.1818  mdd                      mdd                      /mdd
41931     0.8516  vmlinux                  vmlinux                  __find_get_block
32408     0.6582  lod                      lod                      /lod
20711     0.4206  jbd2.ko                  jbd2.ko                  jbd2_journal_add_journal_head
18598     0.3777  jbd2.ko                  jbd2.ko                  do_get_write_access
18579     0.3773  vmlinux                  vmlinux                  __find_get_block_slow
18364     0.3729  libcfs                   libcfs                   /libcfs
17833     0.3622  oprofiled                oprofiled                /usr/bin/oprofiled
17472     0.3548  vmlinux                  vmlinux                  mutex_lock

I'm not sure whether we can improve the performance (with quota) much further in this respect, because updating the single quota file can always be the bottleneck.

Comment by Niu Yawei (Inactive) [ 23/May/13 ]

Looking closer at dqget()/dqput(), I realized that there are still quite a few global locks in the quota code: dq_list_lock, dq_state_lock, dq_data_lock. The fix for LU-2442 only removed the global dqptr_sem lock, which had the most significant impact on performance. Removing all of the quota global locks requires lots of changes in VFS code; that isn't a small project, so maybe we should open a new project for a future release?
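
For reference, a condensed sketch of the dqget()/dqput() hot path as it looks in RHEL6-era fs/quota/dquot.c (slow paths, write-back, and error handling omitted; this is illustrative, not the exact kernel source). Every reference get/put funnels through the global dq_list_lock, and dqget() also takes dq_state_lock, which is why those two functions dominate the oprofile samples above:

struct dquot *dqget(struct super_block *sb, unsigned int id, int type)
{
        struct dquot *dquot;

        spin_lock(&dq_list_lock);               /* global spinlock */
        spin_lock(&dq_state_lock);              /* global: is quota still active? */
        if (!sb_has_quota_active(sb, type)) {
                spin_unlock(&dq_state_lock);
                spin_unlock(&dq_list_lock);
                return NULL;
        }
        spin_unlock(&dq_state_lock);

        dquot = find_dquot(hashfn(sb, id, type), sb, id, type);
        if (dquot) {
                if (!atomic_read(&dquot->dq_count))
                        remove_free_dquot(dquot);
                atomic_inc(&dquot->dq_count);
        }
        spin_unlock(&dq_list_lock);
        /* ... on a miss: allocate a dquot and read it from disk ... */
        return dquot;
}

void dqput(struct dquot *dquot)
{
        spin_lock(&dq_list_lock);               /* the same global spinlock */
        if (atomic_dec_and_test(&dquot->dq_count))
                put_dquot_last(dquot);          /* back onto the global free list */
        spin_unlock(&dq_list_lock);
}

With 256 service threads all creating files under the same UID/GID, every operation hammers these locks for the same dquot, which is consistent with the ~66% of CPU samples landing in dqput()+dqget() above.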

Comment by Andreas Dilger [ 23/May/13 ]

I find it strange that dqget() is called 2M times, but it looks like only 20k blocks are being allocated (based on the ldiskfs and jbd2 call counts). Before trying to optimize the speed of that function, it is probably better to reduce the number of times it is called?

It is also a case where the same quota entry is being accessed for every call (same UID and GID each time), so I wonder if that common case could be optimized in some way?

Are any of these issues fixed in the original quota patches?

Unfortunately, since all of the threads are contending to update the same record, there isn't an easy way to reduce contention. The only thing I can think of is to have a journal pre-commit callback that does only a single quota update to disk per transaction, and uses percpu counters for the per-quota-per-transaction updates in memory. That would certainly avoid contention, and is no less correct in the face of a crash. No idea how easy that would be to implement.
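
A hypothetical sketch of that idea (the names txn_quota_delta, txn_quota_charge, and txn_quota_flush are invented here for illustration; this is not an existing kernel or Lustre API, just one way the pre-commit-callback scheme could look):

#include <linux/percpu_counter.h>

struct txn_quota_delta {                        /* one per open transaction */
        struct percpu_counter inodes;           /* inode-count delta */
        struct percpu_counter blocks;           /* block-count delta */
};

/* Called lock-free from every creating/unlinking thread. */
static void txn_quota_charge(struct txn_quota_delta *d, s64 ninodes, s64 nblocks)
{
        percpu_counter_add(&d->inodes, ninodes);
        percpu_counter_add(&d->blocks, nblocks);
}

/* Journal pre-commit callback: fold the accumulated deltas into the
 * dquot once per transaction, so the shared record is locked once
 * rather than once per create/unlink. */
static void txn_quota_flush(struct txn_quota_delta *d, struct dquot *dq)
{
        spin_lock(&dq_data_lock);               /* global data lock, taken once */
        dq->dq_dqb.dqb_curinodes += percpu_counter_sum(&d->inodes);
        dq->dq_dqb.dqb_curspace  += percpu_counter_sum(&d->blocks);
        spin_unlock(&dq_data_lock);
        mark_dquot_dirty(dq);                   /* written out with the commit */
}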

Comment by Niu Yawei (Inactive) [ 23/May/13 ]

dqget()/dqput() mainly gets/drops a reference on the in-memory per-ID dquot data, and it acquires global locks such as dq_list_lock & dq_state_lock (since it looks up the dquot list and does some state checking), so contention on those global locks could be severe in this test case. If we could replace them with RCU or a read/write lock, things would be better.

I heard from Lai that there were some old patches which tried to remove those global locks, but they didn't gain much interest from the community and were never reviewed. Lai, could you comment on this?

Regarding the quota record commit (mark_dquot_dirty() -> ext4_mark_dquot_dirty() -> ext4_write_dquot() -> dquot_commit(), which should happen along with each transaction), it does require global locks: dqio_mutex & dq_list_lock. Surprisingly, I didn't see it in the top oprofile samples; that might just be because the dqget()/dqput() calls are far more frequent than dquot commit calls. Once we resolve the bottleneck in dqget()/dqput(), the contention in dquot commit could well come to light.
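
For context, the commit path named above looks roughly like this in kernels of that era (condensed from fs/quota/dquot.c; error handling omitted, and ext4_write_dquot() wraps this in its own journal handle). It serializes on both the per-superblock dqio_mutex and the global dq_list_lock:

int dquot_commit(struct dquot *dquot)
{
        int ret = 0;
        struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);

        mutex_lock(&dqopt->dqio_mutex);         /* one quota-file writer at a time */
        spin_lock(&dq_list_lock);               /* global: clear the dirty bit */
        if (!clear_dquot_dirty(dquot)) {
                spin_unlock(&dq_list_lock);
                goto out_sem;                   /* someone else already wrote it */
        }
        spin_unlock(&dq_list_lock);
        /* rewrite this ID's record in the quota file */
        ret = dqopt->ops[dquot->dq_type]->commit_dqblk(dquot);
out_sem:
        mutex_unlock(&dqopt->dqio_mutex);
        return ret;
}

Since the commit runs once per transaction while dqget()/dqput() run on every operation, it is plausible that these locks stay below the oprofile noise floor until the dqget()/dqput() contention is removed.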

Comment by Niu Yawei (Inactive) [ 24/May/13 ]

Instead of eliminating the global locks entirely, maybe a small fix in dquot_initialize() could relieve the contention caused by dqget()/dqput(): in dquot_initialize(), we'd call dqget() only when i_dquot is not yet initialized, which avoids 2 pairs of dqget()/dqput() calls in most cases. I'll propose a patch soon.
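
A sketch of the proposed fast path (hedged: this mirrors the shape of the fix as it eventually landed in the upstream kernel as "quota: avoid unnecessary dqget()/dqput() calls"; the actual patch may differ in detail):

/* Check whether the inode's dquot pointers still need to be set up;
 * if they are already attached there is nothing for dqget() to do. */
static int dqinit_needed(struct inode *inode, int type)
{
        int cnt;

        if (IS_NOQUOTA(inode))
                return 0;
        if (type != -1)
                return !inode->i_dquot[type];
        for (cnt = 0; cnt < MAXQUOTAS; cnt++)
                if (!inode->i_dquot[cnt])
                        return 1;
        return 0;
}

void __dquot_initialize(struct inode *inode, int type)
{
        /* Fast path: most operations find i_dquot already populated
         * and return before touching any global quota lock. */
        if (!dquot_active(inode) || !dqinit_needed(inode, type))
                return;
        /* ... slow path: dqget() the dquots and attach them ... */
}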

Comment by Andreas Dilger [ 24/May/13 ]

Niu's patch is at http://review.whamcloud.com/6440.

Minh, would you be able to run another set of tests with the latest patch applied, and produce a graph like:

https://jira.hpdd.intel.com/secure/attachment/12415/mdtest_create.png

so it is easier to see what the differences are? Presumably with 5 runs it would be useful to plot the standard deviation, since I see from the text results you posted above that the performance can vary dramatically between runs.

Comment by Minh Diep [ 24/May/13 ]

yes, will do when the cluster's IB network is back online next week

Comment by James A Simmons [ 28/May/13 ]

This patch will need to be ported to SLES11 SP[1/2] as well. Later in the week I can include it in the patch.

Comment by Andreas Dilger [ 28/May/13 ]

James, please submit the SLES changes as a separate patch. Since this doesn't affect the API, the two changes do not need to be in the same commit. If the other patch needs to be refreshed for some other reason they can be merged.

Comment by James A Simmons [ 28/May/13 ]

Fine with me.

Comment by Minh Diep [ 04/Jun/13 ]

performance data for the patch

Comment by Niu Yawei (Inactive) [ 05/Jun/13 ]

Thanks a lot, Minh.

It looks like the patch improves create/rm performance overall, but there are some strange things in the figure that I can't explain:

  • The stat performance gets worse with the patch (or with quota disabled); I don't see how the quota code can affect read-only operations. Maybe we need to collect some oprofile data (for both 2.4 & patched 2.4) to investigate this further;
  • create/rm performance drops a lot at 3 threads;
  • For per-process create/rm with 4 threads, patched 2.4 (and quota disabled) is even worse than standard 2.4; this looks the same as what Siyao discovered in LU-2442 (unlink getting worse with 32 threads when quota is disabled). Is contention on a global semaphore better than contention on several spinlocks when the contention is heavy enough?

BTW: why do we put the create & rm data into the same figure? They are two distinct tests, aren't they?

Comment by Niu Yawei (Inactive) [ 08/Jul/13 ]

I did a few tests on the Rosso cluster; the results are similar to what we got in LU-2442, except that the performance drop (with 32 threads) shown in LU-2442 (with the LU-2442 patch) is resolved:

patched:

mdtest-1.8.3 was launched with 32 total task(s) on 1 nodes
Command line used: mdtest -d /mnt/ldiskfs -i 10 -n 25000 -u -F -r
Path: /mnt
FS: 19.7 GiB   Used FS: 17.5%   Inodes: 1.2 Mi   Used Inodes: 4.6%

32 tasks, 800000 files

SUMMARY: (of 10 iterations)
   Operation                  Max        Min       Mean    Std Dev
   ---------                  ---        ---       ----    -------
   File creation     :      0.000      0.000      0.000      0.000
   File stat         :      0.000      0.000      0.000      0.000
   File removal      :   4042.032   1713.613   2713.827    698.243
   Tree creation     :      0.000      0.000      0.000      0.000
   Tree removal      :      2.164      1.861      2.020      0.088

-- finished at 07/07/2013 20:37:36 --
CPU: Intel Sandy Bridge microarchitecture, speed 2.601e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
10826148 11.6911  vmlinux                  schedule
7089656   7.6561  vmlinux                  update_curr
6432166   6.9461  vmlinux                  sys_sched_yield
4384494   4.7348  vmlinux                  __audit_syscall_exit
4088507   4.4152  libc-2.12.so             sched_yield
3441346   3.7163  vmlinux                  system_call_after_swapgs
3337224   3.6038  vmlinux                  put_prev_task_fair
3244213   3.5034  vmlinux                  audit_syscall_entry
2844216   3.0715  vmlinux                  thread_return
2702323   2.9182  vmlinux                  rb_insert_color
2636798   2.8475  vmlinux                  native_read_tsc
2234644   2.4132  vmlinux                  sched_clock_cpu
2182744   2.3571  vmlinux                  native_sched_clock
2175482   2.3493  vmlinux                  hrtick_start_fair
2152807   2.3248  vmlinux                  pick_next_task_fair
2130024   2.3002  vmlinux                  set_next_entity
2099576   2.2673  vmlinux                  rb_erase
1790101   1.9331  vmlinux                  update_stats_wait_end
1777328   1.9193  vmlinux                  mutex_spin_on_owner
1701672   1.8376  vmlinux                  sysret_check

unpatched:

mdtest-1.8.3 was launched with 32 total task(s) on 1 nodes
Command line used: mdtest -d /mnt/ldiskfs -i 10 -n 25000 -u -F -r
Path: /mnt
FS: 19.7 GiB   Used FS: 17.8%   Inodes: 1.2 Mi   Used Inodes: 4.6%

32 tasks, 800000 files

SUMMARY: (of 10 iterations)
   Operation                  Max        Min       Mean    Std Dev
   ---------                  ---        ---       ----    -------
   File creation     :      0.000      0.000      0.000      0.000
   File stat         :      0.000      0.000      0.000      0.000
   File removal      :   2816.345   1673.085   2122.347    342.119
   Tree creation     :      0.000      0.000      0.000      0.000
   Tree removal      :      2.296      0.111      1.361      0.866

-- finished at 07/07/2013 21:11:03 --
CPU: Intel Sandy Bridge microarchitecture, speed 2.601e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name                 symbol name
23218790 18.3914  vmlinux                  vmlinux                  dqput
9549739   7.5643  vmlinux                  vmlinux                  __audit_syscall_exit
9116086   7.2208  vmlinux                  vmlinux                  schedule
8290558   6.5669  vmlinux                  vmlinux                  dqget
5576620   4.4172  vmlinux                  vmlinux                  update_curr
5343755   4.2327  vmlinux                  vmlinux                  sys_sched_yield
3251018   2.5751  libc-2.12.so             libc-2.12.so             sched_yield
2907579   2.3031  vmlinux                  vmlinux                  system_call_after_swapgs
2854863   2.2613  vmlinux                  vmlinux                  put_prev_task_fair
2793392   2.2126  vmlinux                  vmlinux                  audit_syscall_entry
2723949   2.1576  vmlinux                  vmlinux                  kfree
2551007   2.0206  vmlinux                  vmlinux                  mutex_spin_on_owner
2406364   1.9061  vmlinux                  vmlinux                  rb_insert_color
2321179   1.8386  vmlinux                  vmlinux                  thread_return
2184031   1.7299  vmlinux                  vmlinux                  native_read_tsc
2002277   1.5860  vmlinux                  vmlinux                  dquot_mark_dquot_dirty
1990135   1.5764  vmlinux                  vmlinux                  native_sched_clock
1970544   1.5608  vmlinux                  vmlinux                  set_next_entity
1967852   1.5587  vmlinux                  vmlinux                  pick_next_task_fair
1966282   1.5575  vmlinux                  vmlinux                  dquot_commit
1966271   1.5575  vmlinux                  vmlinux                  sysret_check
1919524   1.5204  vmlinux                  vmlinux                  unroll_tree_refs
1811281   1.4347  vmlinux                  vmlinux                  sched_clock_cpu
1810278   1.4339  vmlinux                  vmlinux                  rb_erase

The unlink rate increased ~28% with 32 threads, and the oprofile data shows that contention on dq_list_lock in dqput() is alleviated a lot.

I think we should take this patch as a supplement to the LU-2442 fix.

Comment by Peter Jones [ 05/Sep/13 ]

Landed for 2.4.1 and 2.5

Comment by Prakash Surya (Inactive) [ 16/Sep/13 ]

It's hard to gather from the prior discussion, but is this the only patch that came out of this issue:

commit 58d2a322589ec13ee3c585c13b1c83f429d946ce
Author: Niu Yawei <yawei.niu@intel.com>
Date:   Thu May 23 23:49:03 2013 -0400

    LU-3305 quota: avoid unnecessary dqget/dqput calls

? Thanks.

Comment by Andreas Dilger [ 17/Sep/13 ]

There are two core kernel patches in the RHEL series on master that improve quota performance - both start with "quota".
