[LU-16691] optimize ldiskfs prealloc (PA) under random read workloads
Created: 31/Mar/23  Updated: 29/Jul/23  Resolved: 09/Jul/23
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0, Lustre 2.15.2 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ldiskfs |
| Attachments: | oss07.perf.svg |
| Description |
In some cases, ldiskfs can consume a large amount of CPU handling block allocations and cause OST threads to become blocked:

crmd[16542]: notice: High CPU load detected: 261.019989
crmd[16542]: notice: High CPU load detected: 258.720001
crmd[16542]: notice: High CPU load detected: 265.029999
crmd[16542]: notice: High CPU load detected: 270.309998

INFO: task ll_ost00_027:20788 blocked for more than 90 seconds.
ll_ost00_027    D ffff92242eda9080     0 20788      2 0x00000080
Call Trace:
 schedule+0x29/0x70
 wait_transaction_locked+0x85/0xd0 [jbd2]
 add_transaction_credits+0x278/0x310 [jbd2]
 start_this_handle+0x1a1/0x430 [jbd2]
 jbd2__journal_start+0xf3/0x1f0 [jbd2]
 __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
 osd_trans_start+0x1e7/0x570 [osd_ldiskfs]
 ofd_trans_start+0x75/0xf0 [ofd]
 ofd_attr_set+0x586/0xb00 [ofd]
 ofd_setattr_hdl+0x31d/0x960 [ofd]
 tgt_request_handle+0xb7e/0x1700 [ptlrpc]
 ptlrpc_server_handle_request+0x253/0xbd0 [ptlrpc]
 ptlrpc_main+0xc09/0x1c30 [ptlrpc]

Perf stats show that a large amount of CPU time is spent in preallocation handling:

Samples: 86M of event 'cycles', 4000 Hz, Event count (approx.): 25480688920 lost: 0/0 drop: 0/0
Overhead  Shared Object  Symbol
  23,81%  [kernel]       [k] _raw_qspin_lock
  21,90%  [kernel]       [k] ldiskfs_mb_use_preallocated
  20,16%  [kernel]       [k] __raw_callee_save___pv_queued_spin_unlock
  15,46%  [kernel]       [k] ldiskfs_mb_normalize_request
   1,21%  [kernel]       [k] rwsem_spin_on_owner
   0,98%  [kernel]       [k] native_write_msr_safe
   0,54%  [kernel]       [k] apic_timer_interrupt
   0,51%  [kernel]       [k] ktime_get
| Comments |
| Comment by Andreas Dilger [ 31/Mar/23 ] |
Looking at the flame graphs, I suspect that something may be wrong with the preallocation (PA), for example too many PA regions, or something else that is causing these functions to be slow. According to the flame graph oss07.perf.svg:
ldiskfs_fsblk_t ldiskfs_mb_new_blocks(handle_t *handle,
                struct ldiskfs_allocation_request *ar, int *errp)
{
        :
        :
        if (!ldiskfs_mb_use_preallocated(ac)) {
                ac->ac_op = LDISKFS_MB_HISTORY_ALLOC;
                ldiskfs_mb_normalize_request(ac, ar);
repeat:
                /* allocate space in core */
                *errp = ldiskfs_mb_regular_allocator(ac);
so these heavy functions run before ldiskfs_mb_regular_allocator() is even called. There is a loop in ldiskfs_mb_use_preallocated() that repeatedly takes a spinlock, but it does not appear to succeed in finding a good PA, since the function ends up returning 0 and then ldiskfs_mb_normalize_request() is called anyway:
ldiskfs_mb_use_preallocated(struct ldiskfs_allocation_context *ac)
{
        /* first, try per-file preallocation */
        list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
                :
                /* found preallocated blocks, use them */
                spin_lock(&pa->pa_lock);
                if (pa->pa_deleted == 0 && pa->pa_free) {
                        :
                        /* this branch is never taken */
                        :
                        return 1;
                }
                spin_unlock(&pa->pa_lock);
        }
        :
        /*
         * search for the prealloc space that is having
         * minimal distance from the goal block.
         */
        for (i = order; i < PREALLOC_TB_SIZE; i++) {
                list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
                                        pa_inode_list) {
                        spin_lock(&pa->pa_lock);
                        if (pa->pa_deleted == 0 &&
                            pa->pa_free >= ac->ac_o_ex.fe_len) {
                                cpa = ldiskfs_mb_check_group_pa(goal_block,
                                                                pa, cpa);
                        }
                        spin_unlock(&pa->pa_lock);
                }
and then in ldiskfs_mb_normalize_request() it looks like the same PA lists are walked again and the same locks are contended:
ldiskfs_mb_normalize_request(struct ldiskfs_allocation_context *ac,
                             struct ldiskfs_allocation_request *ar)
{
        :
        list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
                ldiskfs_lblk_t pa_end;

                if (pa->pa_deleted)
                        continue;
                spin_lock(&pa->pa_lock);
                :
                /* lots of checks */
                :
                spin_unlock(&pa->pa_lock);
        }
}
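Note that neither walk typically finds a usable PA here, yet every pass still takes and releases one spinlock per list entry; with many OST threads doing this for every allocation, the lock traffic alone would match the _raw_qspin_lock and spin-unlock overhead in the perf output above. A minimal user-space sketch of that pattern (all names here are hypothetical and pthread spinlocks stand in for the kernel spinlocks; this is not the ldiskfs code itself):

/*
 * Simulate N threads repeatedly walking a shared "PA list" where every
 * entry must be locked and unlocked but none is ever usable, mirroring
 * the list_for_each_entry_rcu() loops shown above.
 * Build with: gcc -O2 -pthread pa_walk.c
 */
#include <pthread.h>
#include <stdio.h>

#define NPA      512            /* simulated PA list length */
#define NTHREADS 8              /* concurrent "OST threads" */
#define NWALKS   100000         /* allocations per thread */

static struct pa {
        pthread_spinlock_t lock;
        int deleted;
        int free_blocks;
} pa_list[NPA];                 /* zero-initialized: nothing usable */

static void *walker(void *arg)
{
        long hits = 0;

        (void)arg;
        for (int w = 0; w < NWALKS; w++) {
                for (int i = 0; i < NPA; i++) {
                        pthread_spin_lock(&pa_list[i].lock);
                        if (!pa_list[i].deleted && pa_list[i].free_blocks)
                                hits++;         /* "never taken" here */
                        pthread_spin_unlock(&pa_list[i].lock);
                }
        }
        return (void *)hits;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (int i = 0; i < NPA; i++)
                pthread_spin_init(&pa_list[i].lock, PTHREAD_PROCESS_PRIVATE);
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, walker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        printf("%d threads x %d walks x %d lock/unlock pairs each\n",
               NTHREADS, NWALKS, NPA);
        return 0;
}

The cost per simulated allocation scales linearly with NPA, which is why bounding the list length helps even without removing the locks themselves.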
By all rights, since these PA lists are on a single inode, there shouldn't be much contention, but it fits the pattern shown by the flame graphs. Unfortunately, it isn't possible to tell whether the slow threads were all accessing a single file or different files. I think it makes sense to backport either https://patchwork.ozlabs.org/project/linux-ext4/list/?series=346731 to ldiskfs, or at least the prealloc-list fixed-limit patch https://lore.kernel.org/all/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com/ to prevent the PA list from getting too long...
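For illustration, here is a minimal user-space sketch of the fixed-limit idea (assuming the approach is to cap the per-inode PA list and discard the least useful entry once the cap is exceeded; the names MAX_INODE_PA, pa_add() and the structures are hypothetical, not the actual patch):

#include <stdio.h>
#include <stdlib.h>

#define MAX_INODE_PA 32         /* hypothetical per-inode cap */

struct pa_entry {
        struct pa_entry *next;
        int pa_free;            /* free blocks left in this PA */
};

struct inode_pa_list {
        struct pa_entry *head;
        int count;
};

/*
 * Insert a new PA at the head; once over the cap, unlink the entry
 * with the fewest free blocks, so the walks in
 * ldiskfs_mb_use_preallocated()/ldiskfs_mb_normalize_request() never
 * touch more than MAX_INODE_PA locks per allocation.
 */
static void pa_add(struct inode_pa_list *ipa, int nfree)
{
        struct pa_entry *pa = malloc(sizeof(*pa));

        pa->pa_free = nfree;
        pa->next = ipa->head;
        ipa->head = pa;
        if (++ipa->count <= MAX_INODE_PA)
                return;

        /* find and unlink the least useful entry */
        struct pa_entry **pp, **victim = &ipa->head;

        for (pp = &ipa->head; *pp; pp = &(*pp)->next)
                if ((*pp)->pa_free < (*victim)->pa_free)
                        victim = pp;

        struct pa_entry *dead = *victim;

        *victim = dead->next;
        free(dead);
        ipa->count--;
}

int main(void)
{
        struct inode_pa_list ipa = { NULL, 0 };

        for (int i = 0; i < 1000; i++)
                pa_add(&ipa, rand() % 2048);
        printf("PA entries kept: %d (cap %d)\n", ipa.count, MAX_INODE_PA);
        return 0;
}

Whatever discard policy the real patch uses, the key property is the bound: the list length, and therefore the per-allocation lock count, can no longer grow without limit.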
| Comment by Alex Zhuravlev [ 31/Mar/23 ] |
This one looks simple enough.
| Comment by Gerrit Updater [ 31/Mar/23 ] |
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50481 |
| Comment by Gerrit Updater [ 08/Jul/23 ] |
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50481/ |
| Comment by Peter Jones [ 09/Jul/23 ] |
Landed for 2.16 |