[LU-1057] low performance maybe related to quota Created: 31/Jan/12  Updated: 22/Dec/12  Resolved: 27/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.3.0, Lustre 2.1.4

Type: Bug Priority: Major
Reporter: Gregoire Pichon Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: paj
Environment:

Lustre 2.1 with Bull patches, bullxlinux6.1 x86_64 (based on Redhat 6.1), server bullx S6010-4


Attachments: Text File oprofile.client.S6010-4.report.txt     Text File oprofile.client.S6010-4.root.report.txt     Text File oprofile.client.S6010.report.txt    
Severity: 3
Rank (Obsolete): 4513

 Description   

When running a performance test (sequential data IOs, 15 tasks, each writing to its own file) on a Lustre file system installed with Lustre 2.1 plus a few Bull patches, I observe very low throughput compared to what I usually measure on the same hardware.

Write bandwidth varies between 150 MB/s and 500 MB/s when running as a standard user. With the exact same parameters and configuration, but running as the root user, I get around 2000 MB/s write bandwidth. This second value is what I usually observe.
With the root user, I suppose the flag OBD_BRW_NOQUOTA is set (but I have not been able to confirm that from the source code), which makes the request processing skip the lquota_chkdq() quota check in osc_queue_async_io().

Profiling of the Lustre client indicates that more than 50% of the time is spent in the osc_quota_chkdq() routine. So this seems related to the quota subsystem, and it certainly explains why the root user is not impacted by the problem. I will attach the profiling reports to this ticket.

The Lustre client is a bullx S6010-4, which has 128 cores and a large NUMIOA factor. The same performance measurement on a bullx S6010, which has only 32 cores and a smaller NUMIOA factor, gives around 3000 MB/s write bandwidth, so it is not impacted by the performance issue.

I have recompiled the lquota module after removing the cfs_spin_lock()/cfs_spin_unlock() calls on qinfo_list_lock in the osc_quota_chkdq() routine, and the performance is back to the expected level. Note that the qinfo_hash[] table on the Lustre client is empty since quotas are disabled.
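
For illustration, here is a heavily simplified, hypothetical sketch of the pattern described above (the names client_obd_sketch, osc_quota_chkdq_sketch and the hash size are invented; this is not the verbatim Lustre 2.1 source): a single global spinlock is taken on every asynchronous write request, even though the hash it protects is empty when quota is disabled.

#include <linux/spinlock.h>
#include <linux/list.h>

#define QUOTA_OK   0
#define NR_DQHASH 45                          /* hash size; value is illustrative */

struct client_obd_sketch;                     /* stand-in for struct client_obd */

static DEFINE_SPINLOCK(qinfo_list_lock);      /* one global lock for all clients */
static struct list_head qinfo_hash[NR_DQHASH];

static int osc_quota_chkdq_sketch(struct client_obd_sketch *cli,
                                  unsigned int uid, unsigned int gid)
{
        int rc = QUOTA_OK;

        /* Every asynchronous write from every core serializes on this one
         * lock; on a 128-core NUMA machine the cache line holding the lock
         * bounces between sockets on each acquisition. */
        spin_lock(&qinfo_list_lock);
        /* ... look up (cli, uid) and (cli, gid) in qinfo_hash[]; when quota
         * is disabled nothing is ever found, so the lock is pure overhead ... */
        spin_unlock(&qinfo_list_lock);

        return rc;
}

Removing the lock, as in the experiment described above, eliminates the cache-line bouncing, but it is only safe here because the hash is known to be empty.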

How many asynchronous IO requests can be generated by only 15 writing tasks? Are there so many requests in parallel that the qinfo_list_lock becomes a congestion point?

Is there more latency in the spin_lock()/spin_unlock() routines when the NUMIOA factor is high?



 Comments   
Comment by Johann Lombardi (Inactive) [ 31/Jan/12 ]

To speed up the case where quota isn't enforced (as in this case), we could simply record the number of osc_quota_info entries we have for each cli and skip the hash lookup as well as the locking entirely.

When quota is enforced, I think we should first have one hash per cli instead of a global hash and spinlock.
In fact, we might just want to use the generic cfs_hash_t to handle this (which already uses rw spinlocks).
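
A minimal sketch combining the two suggestions above, with hypothetical names (client_obd_sketch, cl_quota_count, cl_quota_lock) rather than the real Lustre structures: a per-client entry counter provides a lock-free fast path when quota is not enforced, and a per-client lock replaces the global one on the slow path.

#include <linux/atomic.h>
#include <linux/spinlock.h>

#define QUOTA_OK 0

struct client_obd_sketch {
        atomic_t   cl_quota_count;     /* number of osc_quota_info entries */
        spinlock_t cl_quota_lock;      /* protects this client's hash only */
        /* ... per-client hash of osc_quota_info entries ... */
};

static int osc_quota_chkdq_sketch(struct client_obd_sketch *cli,
                                  unsigned int uid, unsigned int gid)
{
        /* Fast path: quota is not enforced for this client, so there are no
         * entries to check and no reason to take any lock at all. */
        if (atomic_read(&cli->cl_quota_count) == 0)
                return QUOTA_OK;

        /* Slow path: at least one ID is over quota on this client. */
        spin_lock(&cli->cl_quota_lock);
        /* ... look up (uid, gid) in the per-client hash ... */
        spin_unlock(&cli->cl_quota_lock);

        return QUOTA_OK;
}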

Comment by Gregoire Pichon [ 31/Jan/12 ]

Here are the oprofile reports for:

  • a Lustre client machine S6010-4 (128 cores and a large NUMIOA factor),
  • a Lustre client machine S6010-4, but running as the root user,
  • a Lustre client machine S6010 (32 cores).

The reports were collected during the benchmark that performs sequential data IO writes with 15 tasks over 15 files (one per task); each file is striped on one of the 15 OSTs of the file system.

Comment by Peter Jones [ 31/Jan/12 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Johann Lombardi (Inactive) [ 31/Jan/12 ]

Actually, we might be able to just use a radix tree with RCU
http://lxr.linux.no/#linux+v3.2.2/include/linux/radix-tree.h#L88
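
For illustration, a minimal sketch of the generic Linux radix-tree-with-RCU pattern (this is not the actual patch; the names quota_ids, osc_quota_info_sketch and mark_id_over_quota are invented for this example): lookups run locklessly under rcu_read_lock(), while insertions are serialized by a spinlock and use radix_tree_preload() to pre-allocate tree nodes; freeing of entries would go through call_rcu() (not shown).

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/types.h>

struct osc_quota_info_sketch {
        unsigned long   oqi_id;         /* uid or gid currently over quota */
        struct rcu_head oqi_rcu;        /* for deferred freeing via call_rcu() */
};

static RADIX_TREE(quota_ids, GFP_ATOMIC);   /* in practice, one tree per quota type */
static DEFINE_SPINLOCK(quota_ids_lock);     /* serializes writers only */

/* Read side: called on every async IO request, takes no lock. */
static bool id_over_quota(unsigned long id)
{
        struct osc_quota_info_sketch *oqi;

        rcu_read_lock();
        oqi = radix_tree_lookup(&quota_ids, id);
        rcu_read_unlock();

        return oqi != NULL;
}

/* Write side: called only when an ID is reported as over quota. */
static int mark_id_over_quota(unsigned long id)
{
        struct osc_quota_info_sketch *oqi;
        int rc;

        oqi = kmalloc(sizeof(*oqi), GFP_NOFS);
        if (oqi == NULL)
                return -ENOMEM;
        oqi->oqi_id = id;

        rc = radix_tree_preload(GFP_NOFS);  /* pre-allocate tree nodes */
        if (rc != 0) {
                kfree(oqi);
                return rc;
        }

        spin_lock(&quota_ids_lock);
        /* Insert the pointer value itself, not the address of the local
         * variable holding it. */
        rc = radix_tree_insert(&quota_ids, id, oqi);
        spin_unlock(&quota_ids_lock);
        radix_tree_preload_end();

        if (rc != 0)
                kfree(oqi);
        return rc;
}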

Comment by Johann Lombardi (Inactive) [ 31/Jan/12 ]

I have just pushed an (untested) patch using RCU and a radix tree:
http://review.whamcloud.com/2074

Comment by Gregoire Pichon [ 02/Feb/12 ]

Thank you Johann.

I have tested your patch (patch set 2) and the results are good. The performance is at the expected level and the profiling report does not show much time spent in the osc_quota_chkdq() routine (0.0170% of the profiling samples).

Note that my configuration still has quota disabled and therefore there are no osc_quota_info entries.

Comment by Johann Lombardi (Inactive) [ 02/Feb/12 ]

Thanks for testing this patch, Grégoire. I'm now waiting for autotest results to check whether the patch broke quota.

Comment by Johann Lombardi (Inactive) [ 04/Feb/12 ]

Please note that there was a bug in the patch:

rc = radix_tree_insert(&cli->cl_quota_ids[type], qid[type], &oqi);
                                                            ^^^^ this should be oqi

I have pushed the corrected version. That said, the bug only shows up when you start using quotas.

Comment by Gregoire Pichon [ 21/Jun/12 ]

Hi Johann,

What is the status of this ticket? Do you plan to provide a new version of the patch with a hash table implementation?

This issue is going to become critical as many of these bullx S6010-4 machines (with a large NUMA factor) are being installed in the June/July timeframe at the TGCC customer site.

Thanks.

Comment by Ian Colle (Inactive) [ 04/Jul/12 ]

The support team can pick up and refresh Johann's last patch.

Comment by Peter Jones [ 04/Jul/12 ]

Yujian

Could you please take care of this one?

Thanks

Peter

Comment by Peter Jones [ 05/Jul/12 ]

Reassign to Hongchao

Comment by Hongchao Zhang [ 09/Jul/12 ]

Status update:

The updated patch using cfs_hash_t is under test.

Comment by Hongchao Zhang [ 05/Aug/12 ]

The patch has been merged (commit 1b044fecb42c1f72ca2d2bc2bf80a4345b9ccf11).

Comment by Jodi Levi (Inactive) [ 27/Sep/12 ]

Please let me know if there is outstanding work on this ticket.

Comment by Gregoire Pichon [ 04/Oct/12 ]

I have backported the patch into b2_1: http://review.whamcloud.com/#change,4184.

The tests show that the contention on quota (in the osc_quota_chkdq() routine) has been fixed.

Could this patch be reviewed?

Thanks.
