Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • Labels: None
    • Environment: Hyperion/LLNL
    • Severity: 3
    • 8187

    Description

      We performed a comparison between 2.3.0, 2.1.5, and current Lustre. We saw a regression in metadata performance compared to 2.3.0. Spreadsheet attached.


          Activity

            [LU-3305] Quotas affect Metadata performance
            mdiep Minh Diep added a comment -

            performance data for the patch


            simmonsja James A Simmons added a comment -

            Fine with me.


            adilger Andreas Dilger added a comment -

            James, please submit the SLES changes as a separate patch. Since this doesn't affect the API, the two changes do not need to be in the same commit. If the other patch needs to be refreshed for some other reason, the two can be merged.


            simmonsja James A Simmons added a comment -

            This patch will need to be ported to SLES11 SP1/SP2 as well. Later in the week I can include it in the patch.

            mdiep Minh Diep added a comment -

            Yes, will do when the cluster's IB network is back online next week.


            adilger Andreas Dilger added a comment -

            Niu's patch is at http://review.whamcloud.com/6440.

            Minh, would you be able to run another set of tests with the latest patch applied, and produce a graph like:

            https://jira.hpdd.intel.com/secure/attachment/12415/mdtest_create.png

            so it is easier to see what the differences are? Presumably with 5 runs it would be useful to plot the standard deviation, since I see from the text results you posted above that the performance can vary dramatically between runs.


            niu Niu Yawei (Inactive) added a comment -

            Instead of eliminating the global locks entirely, maybe a small fix in dquot_initialize() could relieve the contention caused by dqget()/dqput(): in dquot_initialize(), we'd call dqget() only when i_dquot is not yet initialized, which avoids two pairs of dqget()/dqput() in most cases. I'll propose a patch soon.
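
            A rough sketch of that idea (illustrative only, not the actual patch; it assumes the older dqget(sb, id, type) interface and MAXQUOTAS == 2, and the locking that publishes i_dquot is elided):

            /* Sketch only -- not the real dquot_initialize() and not the patch under review. */
            static void dquot_initialize_sketch(struct inode *inode)
            {
                    struct super_block *sb = inode->i_sb;
                    struct dquot *got[MAXQUOTAS] = { NULL };
                    unsigned int id;
                    int cnt;

                    for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
                            /* Fast path: pointer already set, so skip dqget() entirely
                             * and touch no global quota locks. */
                            if (inode->i_dquot[cnt])
                                    continue;
                            id = (cnt == USRQUOTA) ? inode->i_uid : inode->i_gid;
                            got[cnt] = dqget(sb, id, cnt);
                    }

                    for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
                            /* The real code re-checks under the appropriate lock in
                             * case another thread installed the pointer first. */
                            if (!inode->i_dquot[cnt] && got[cnt]) {
                                    inode->i_dquot[cnt] = got[cnt];
                                    got[cnt] = NULL;
                            }
                    }

                    /* Drop the references we ended up not using. */
                    for (cnt = 0; cnt < MAXQUOTAS; cnt++)
                            if (got[cnt])
                                    dqput(got[cnt]);
            }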


            niu Niu Yawei (Inactive) added a comment -

            dqget()/dqput() mainly get/drop a reference on the in-memory per-id dquot data, and they acquire global locks such as dq_list_lock and dq_state_lock (since they look up the dquot list and do some state checking), so contention on those global locks could be severe in this test case. If we could replace them with RCU or a read/write lock, things would be better.

            I heard from Lai that there were some old patches which tried to remove those global locks, but they didn't gain much interest from the community and were never reviewed. Lai, could you comment on this?

            Regarding the quota record commit (mark_dquot_dirty() -> ext4_mark_dquot_dirty() -> ext4_write_dquot() -> dquot_commit(), which should happen along with each transaction), it does require the global locks dqio_mutex and dq_list_lock, but surprisingly I didn't see it among the top oprofile samples. That might just be because the dqget()/dqput() calls greatly outnumber the dquot commit calls. Once we resolve the bottleneck in dqget()/dqput(), the contention in dquot commit will probably come to light.


            adilger Andreas Dilger added a comment -

            I find it strange that dqget() is called 2M times, but it only looks like 20k blocks are being allocated (based on the ldiskfs and jbd2 call counts). Before trying to optimize the speed of that function, it is probably better to reduce the number of times it is called?

            It is also a case where the same quota entry is being accessed for every call (same UID and GID each time), so I wonder if that common case could be optimized in some way?

            Are any of these issues fixed in the original quota patches?

            Unfortunately, since all of the threads are contending to update the same record, there isn't an easy way to reduce contention. The only thing I can think of is to have a journal pre-commit callback that does only a single quota update to disk per transaction, and uses percpu counters for the per-quota-per-transaction updates in memory. That would certainly avoid contention, and is no less correct in the face of a crash. No idea how easy that would be to implement.
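
            As a very rough illustration of that idea (the structure and function names below are hypothetical, the journal commit hook is left abstract, and the counters are assumed to be set up elsewhere with percpu_counter_init()):

            #include <linux/percpu_counter.h>
            #include <linux/quota.h>
            #include <linux/quotaops.h>

            /* Hypothetical per-quota, per-transaction delta; not an existing structure. */
            struct dquot_txn_delta {
                    struct dquot          *dd_dquot;   /* quota entry being updated */
                    struct percpu_counter  dd_space;   /* bytes accounted in this transaction */
                    struct percpu_counter  dd_inodes;  /* inodes accounted in this transaction */
            };

            /* Called from each allocating thread: no global quota lock is taken here. */
            static void dquot_txn_account(struct dquot_txn_delta *d, s64 bytes, s64 inodes)
            {
                    percpu_counter_add(&d->dd_space, bytes);
                    percpu_counter_add(&d->dd_inodes, inodes);
            }

            /*
             * Called once per journal transaction from some commit-time hook (left
             * abstract here): fold the percpu deltas into the real dquot and mark it
             * dirty a single time, so the global quota locks are taken once per
             * transaction rather than once per create/unlink.  Limit checking is
             * ignored in this sketch.
             */
            static void dquot_txn_flush(struct dquot_txn_delta *d)
            {
                    s64 space = percpu_counter_sum(&d->dd_space);
                    s64 inodes = percpu_counter_sum(&d->dd_inodes);

                    spin_lock(&dq_data_lock);
                    d->dd_dquot->dq_dqb.dqb_curspace += space;
                    d->dd_dquot->dq_dqb.dqb_curinodes += inodes;
                    spin_unlock(&dq_data_lock);

                    /* Reset the deltas for the next transaction. */
                    percpu_counter_set(&d->dd_space, 0);
                    percpu_counter_set(&d->dd_inodes, 0);

                    dquot_mark_dquot_dirty(d->dd_dquot);  /* one dirty/commit per transaction */
            }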


            niu Niu Yawei (Inactive) added a comment -

            Looking closer at dqget()/dqput(), I realized that there are still quite a few global locks in the quota code: dq_list_lock, dq_state_lock, dq_data_lock. The fix for LU-2442 only removes the global lock dqptr_sem, which had the most significant impact on performance. Removing all of the quota global locks requires lots of changes in VFS code; that isn't a small project, so maybe we should open a new project for a future release?


            niu Niu Yawei (Inactive) added a comment -

            From Minh's results we can see that, because of quota file updating, when testing 256 threads over 256 directories (1 thread per directory, so no contention on parent directory updates), create/unlink without quota is faster than create/unlink with quota enabled. I think the oprofile data confirms it:

            Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
            samples  %        image name               app name                 symbol name
            2276160  46.2251  vmlinux                  vmlinux                  dqput
            963873   19.5747  vmlinux                  vmlinux                  dqget
            335277    6.8089  ldiskfs                  ldiskfs                  /ldiskfs
            258028    5.2401  vmlinux                  vmlinux                  dquot_mark_dquot_dirty
            110819    2.2506  osd_ldiskfs              osd_ldiskfs              /osd_ldiskfs
            76925     1.5622  obdclass                 obdclass                 /obdclass
            58193     1.1818  mdd                      mdd                      /mdd
            41931     0.8516  vmlinux                  vmlinux                  __find_get_block
            32408     0.6582  lod                      lod                      /lod
            20711     0.4206  jbd2.ko                  jbd2.ko                  jbd2_journal_add_journal_head
            18598     0.3777  jbd2.ko                  jbd2.ko                  do_get_write_access
            18579     0.3773  vmlinux                  vmlinux                  __find_get_block_slow
            18364     0.3729  libcfs                   libcfs                   /libcfs
            17833     0.3622  oprofiled                oprofiled                /usr/bin/oprofiled
            17472     0.3548  vmlinux                  vmlinux                  mutex_lock
            

            I'm not sure if we can improve the performance (with quota) much further in this respect, because updating the single quota file can always be the bottleneck.


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 20
