[LU-6381] replace global dq_state_lock/dq_list_lock with per-sb spinlocks and per-sb hash table. Created: 18/Mar/15  Updated: 20/Jul/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Di Wang Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6378 Quota performance issue for 2.7 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

[3/18/15, 11:01:00 AM] Andreas Dilger: The other thought I had was to replace the global dq_state_lock and dq_list_lock with per-sb spinlocks and a per-sb hash table, and use a separate lock for the dq_list_lock for the quota format calls
[3/18/15, 11:02:07 AM] Andreas Dilger: that would at least avoid lock contention between MDTs
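
For illustration only, a minimal sketch of what the proposed per-sb state could look like, assuming the fields that are global today in fs/quota/dquot.c (dq_state_lock, dq_list_lock, the dquot hash) were moved into a per-superblock structure. The struct name sb_dquot_state, the hash size, and the init helper are hypothetical names for this sketch; only the kernel primitives (spinlocks, hashtable, lists) are real:

#include <linux/spinlock.h>
#include <linux/hashtable.h>
#include <linux/list.h>

/* Hypothetical per-sb replacement for the global dq_state_lock/dq_list_lock
 * and the global dquot hash table; the hash size is an assumption. */
#define SB_DQUOT_HASH_BITS	10

struct sb_dquot_state {
	spinlock_t	dq_state_lock;		/* protects quota state flags of this sb */
	spinlock_t	dq_list_lock;		/* protects the per-sb hash and lists */
	DECLARE_HASHTABLE(dquot_hash, SB_DQUOT_HASH_BITS);
	struct list_head free_dquots;		/* per-sb free/unused dquot list */
};

static inline void sb_dquot_state_init(struct sb_dquot_state *st)
{
	spin_lock_init(&st->dq_state_lock);
	spin_lock_init(&st->dq_list_lock);
	hash_init(st->dquot_hash);
	INIT_LIST_HEAD(&st->free_dquots);
}

/* dqget()/dqput() would then hash on (type, id) only and take
 * st->dq_list_lock instead of the global dq_list_lock, so MDTs on
 * different superblocks no longer contend with each other. */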



 Comments   
Comment by James A Simmons [ 18/Mar/15 ]

Will this need a special kernel patch or will it be done at the Lustre level?

Comment by Di Wang [ 18/Mar/15 ]

I believe the change will be in the kernel patch.

Comment by Jodi Levi (Inactive) [ 18/Mar/15 ]

Niu,
Could you please take a look at this one?
Thank you!

Comment by Andreas Dilger [ 18/Mar/15 ]

The scalability of dqget() is poor because it is holding two global spinlocks for code that is shared across multiple filesystems. In LU-6378 there was major contention seen on dqget() due to multiple MDTs on the same MDS on fast storage.

Several things that could be done to improve this:

  • replace the global dq_state_lock with a per-sb lock
  • replace most of the global dq_list_lock with a per-sb lock, and move the quota format handling under a different lock (I don't see how the quota formats relate to the hash buckets)
  • for better scaling between users there could be per-hash-bucket locking, or a blockgroup_lock that at least scales well per core without the overhead of one lock per bucket (a rough sketch follows this list). This won't help if all the dqget() calls are for a single user, as with most benchmarks and normal uses.
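
As a rough illustration of that last bullet, a blockgroup_lock-style scheme hashes the dquot hash-bucket index onto a small array of spinlocks whose size scales with NR_CPUS, so dqget()/dqput() for different users normally take different locks while the memory overhead stays bounded. The types and helpers from <linux/blockgroup_lock.h> (struct blockgroup_lock, bgl_lock_init(), bgl_lock_ptr()) are existing kernel primitives; the dq_hash_lock()/dq_hash_unlock() wrappers are hypothetical names for this sketch:

#include <linux/blockgroup_lock.h>
#include <linux/spinlock.h>

/* One lock array for the dquot hash; this could equally live per-sb. */
static struct blockgroup_lock dq_bucket_locks;

static inline void dq_bucket_locks_init(void)
{
	bgl_lock_init(&dq_bucket_locks);
}

/* Map a dquot hash bucket index to one of NR_BG_LOCKS spinlocks. */
static inline spinlock_t *dq_hash_lock_ptr(unsigned int bucket)
{
	return bgl_lock_ptr(&dq_bucket_locks, bucket);
}

static inline void dq_hash_lock(unsigned int bucket)
{
	spin_lock(dq_hash_lock_ptr(bucket));
}

static inline void dq_hash_unlock(unsigned int bucket)
{
	spin_unlock(dq_hash_lock_ptr(bucket));
}
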
Comment by Andreas Dilger [ 18/Mar/15 ]

Please also post patches to the upstream linux-fsdevel mailing list for review and feedback.

Comment by Niu Yawei (Inactive) [ 19/Mar/15 ]

The scalability of dqget() is poor because it is holding two global spinlocks for code that is shared across multiple filesystems. In LU-6378 there was major contention seen on dqget() due to multiple MDTs on the same MDS on fast storage.

Andreas, do we have the performance & oprofile data for multiple MDTs on the same MDS? I didn't find it in IU-4 (there are only numbers for multiple MDTs with quota disabled). It would be interesting to compare it with the data when quota is disabled, and even better to do such a comparison for multiple MDTs vs. a single MDT (with quota enabled), so we can verify whether there are other bottlenecks besides these two global locks. I'm not sure if it's appropriate to ask Nathan from IU to run the test for us?

An interesting observation is that the unlink test wasn't affected by the quota global locks the way the mknod test was (unlink calls dqput(), which takes the dq_list_lock). See the oprofile of the unlink test (64 threads, 32 mnt, single MDT, Lustre 2.6):

CPU: Intel Architectural Perfmon, speed 3292.01 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
vma      samples  %        image name               app name                 symbol name
ffffffff811be860 831572    4.7333  vmlinux                  vmlinux                  __find_get_block_slow
0000000000032bd0 523313    2.9787  obdclass.ko              obdclass.ko              class_handle2object
ffffffff811701c0 328672    1.8708  vmlinux                  vmlinux                  kmem_cache_free
ffffffff811bf060 319529    1.8188  vmlinux                  vmlinux                  __find_get_block
000000000001f6c0 288861    1.6442  libcfs.ko                libcfs.ko                cfs_percpt_lock
0000000000031cd0 264236    1.5040  obdclass.ko              obdclass.ko              lprocfs_counter_add
0000000000050530 234202    1.3331  obdclass.ko              obdclass.ko              lu_context_key_get
ffffffff81058e10 209186    1.1907  vmlinux                  vmlinux                  task_rq_lock
ffffffff8128f490 165519    0.9421  vmlinux                  vmlinux                  memset
ffffffff811708f0 150077    0.8542  vmlinux                  vmlinux                  kfree

The oprofile for the mknod test looks like this (64 threads, 32 mnt, Lustre 2.6):

CPU: Intel Architectural Perfmon, speed 3292.01 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
vma      samples  %        image name               app name                 symbol name
ffffffff811eb7f0 4744016  21.8990  vmlinux                  vmlinux                  dqput
ffffffff811eb180 3570862  16.4836  vmlinux                  vmlinux                  dquot_mark_dquot_dirty
ffffffff811ecb70 2488686  11.4881  vmlinux                  vmlinux                  dqget
ffffffff811be860 436818    2.0164  vmlinux                  vmlinux                  __find_get_block_slow
0000000000002630 383431    1.7700  ldiskfs.ko               ldiskfs.ko               ldiskfs_check_dir_entry
ffffffff811bf060 297226    1.3720  vmlinux                  vmlinux                  __find_get_block
0000000000026690 147702    0.6818  ldiskfs.ko               ldiskfs.ko               ldiskfs_dx_find_entry
0000000000050530 147313    0.6800  obdclass.ko              obdclass.ko              lu_context_key_get
000000000000a6a0 130593    0.6028  jbd2.ko                  jbd2.ko                  jbd2_journal_add_journal_head
ffffffff81058e10 121861    0.5625  vmlinux                  vmlinux                  task_rq_lock
0000000000031cd0 121519    0.5609  obdclass.ko              obdclass.ko              lprocfs_counter_add

And there is actually another global quota lock, 'dq_data_lock', which is taken on each inode/block allocation and deletion, but I'm not quite sure why the contention on this lock is negligible (as shown by the oprofile data).
