[LU-16691] optimize ldiskfs prealloc (PA) under random read workloads Created: 31/Mar/23  Updated: 29/Jul/23  Resolved: 09/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0, Lustre 2.15.2
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: ldiskfs

Attachments: File oss07.perf.svg    
Issue Links:
Related
is related to LU-12970 improve mballoc for huge filesystems Open
Rank (Obsolete): 9223372036854775807

 Description   

In some cases, ldiskfs block allocation can consume a large number of CPU cycles and cause OST threads to become blocked:

crmd[16542]:  notice: High CPU load detected: 261.019989
crmd[16542]:  notice: High CPU load detected: 258.720001
crmd[16542]:  notice: High CPU load detected: 265.029999
crmd[16542]:  notice: High CPU load detected: 270.309998

 INFO: task ll_ost00_027:20788 blocked for more than 90 seconds.
 ll_ost00_027    D ffff92242eda9080     0 20788      2 0x00000080
 Call Trace:
 schedule+0x29/0x70
 wait_transaction_locked+0x85/0xd0 [jbd2]
 add_transaction_credits+0x278/0x310 [jbd2]
 start_this_handle+0x1a1/0x430 [jbd2]
 jbd2__journal_start+0xf3/0x1f0 [jbd2]
 __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
 osd_trans_start+0x1e7/0x570 [osd_ldiskfs]
 ofd_trans_start+0x75/0xf0 [ofd]
 ofd_attr_set+0x586/0xb00 [ofd]
 ofd_setattr_hdl+0x31d/0x960 [ofd]
 tgt_request_handle+0xb7e/0x1700 [ptlrpc]
 ptlrpc_server_handle_request+0x253/0xbd0 [ptlrpc]
 ptlrpc_main+0xc09/0x1c30 [ptlrpc]

Perf stats show that a large fraction of CPU time is spent in preallocation handling:

Samples: 86M of event 'cycles', 4000 Hz, Event count (approx.): 25480688920 lost: 0/0 drop: 0/0
Overhead  Shared Object               Symbol
  23,81%  [kernel]                    [k] _raw_qspin_lock
  21,90%  [kernel]                    [k] ldiskfs_mb_use_preallocated
  20,16%  [kernel]                    [k] __raw_callee_save___pv_queued_spin_unlock
  15,46%  [kernel]                    [k] ldiskfs_mb_normalize_request
   1,21%  [kernel]                    [k] rwsem_spin_on_owner
   0,98%  [kernel]                    [k] native_write_msr_safe
   0,54%  [kernel]                    [k] apic_timer_interrupt
   0,51%  [kernel]                    [k] ktime_get


 Comments   
Comment by Andreas Dilger [ 31/Mar/23 ]

Looking at the flame graphs, I suspect that something may be wrong with the preallocation (PA), for example too many PA regions, or something else that is causing these functions to be slow. According to the flame graph oss07.perf.svg, each call to ldiskfs_mb_new_blocks() spends a large amount of time in _raw_spin_lock(), ldiskfs_mb_normalize_request(), and ldiskfs_mb_use_preallocated().

ldiskfs_fsblk_t ldiskfs_mb_new_blocks(handle_t *handle,
                                struct ldiskfs_allocation_request *ar, int *errp)
{
        :
        :
        if (!ldiskfs_mb_use_preallocated(ac)) {
                ac->ac_op = LDISKFS_MB_HISTORY_ALLOC;
                ldiskfs_mb_normalize_request(ac, ar);
repeat:
                /* allocate space in core */
                *errp = ldiskfs_mb_regular_allocator(ac);

So these heavy functions run before ldiskfs_mb_regular_allocator() is even called. There is a loop in ldiskfs_mb_use_preallocated() that repeatedly takes a spinlock, but it does not appear to find a usable PA, since the function ends up returning 0 and ldiskfs_mb_normalize_request() is called anyway:

ldiskfs_mb_use_preallocated(struct ldiskfs_allocation_context *ac)
{
        /* first, try per-file preallocation */
        list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
                :
                /* found preallocated blocks, use them */
                spin_lock(&pa->pa_lock);
                if (pa->pa_deleted == 0 && pa->pa_free) {
                        :
                        /* this branch is never taken */
                        :
                        return 1;
                }
                spin_unlock(&pa->pa_lock);
        }
        :
        /*
         * search for the prealloc space that is having
         * minimal distance from the goal block.
         */
        for (i = order; i < PREALLOC_TB_SIZE; i++) {
                list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
                                        pa_inode_list) {
                        spin_lock(&pa->pa_lock);
                        if (pa->pa_deleted == 0 &&
                            pa->pa_free >= ac->ac_o_ex.fe_len) {
        
                                cpa = ldiskfs_mb_check_group_pa(goal_block,
                                                                pa, cpa);
                        }
                        spin_unlock(&pa->pa_lock);
                }

and then ldiskfs_mb_normalize_request() walks the same PA lists again and contends on the same locks:

ldiskfs_mb_normalize_request(struct ldiskfs_allocation_context *ac,
                                struct ldiskfs_allocation_request *ar)
{
         :
        list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
                ldiskfs_lblk_t pa_end;

                if (pa->pa_deleted)
                        continue;
                spin_lock(&pa->pa_lock);
                :
                /* lots of checks */
                :
                spin_unlock(&pa->pa_lock);
        }
}

By all rights, since these PA lists are per-inode, there shouldn't be much contention, but this fits the pattern shown by the flame graphs. Unfortunately, it isn't possible to tell whether the slow threads were all accessing a single file or different files.
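The cost pattern above can be modeled in a small userspace sketch (illustrative only, not kernel code): every allocation request walks the whole per-inode PA list, taking and releasing one spinlock per entry, so a list of n entries costs n lock round-trips per request even when no PA is ever usable, exactly as in the "this branch is never taken" case quoted earlier.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model of the per-inode PA list walk in
 * ldiskfs_mb_use_preallocated(); struct and field names mirror the
 * quoted kernel code but this is a simulation, not the real thing. */

struct pa {
	struct pa *next;
	int pa_deleted;
	unsigned int pa_free;
};

/* Build a list of n PAs that are all empty (pa_free == 0), matching
 * the observation that the "use them" branch is never taken. */
static struct pa *build_pa_list(int n)
{
	struct pa *head = NULL;

	while (n-- > 0) {
		struct pa *pa = calloc(1, sizeof(*pa));

		pa->next = head;
		head = pa;
	}
	return head;
}

/* Walk the list the way ldiskfs_mb_use_preallocated() does and count
 * lock round-trips; the full walk happens when no PA is usable. */
static unsigned long walk_pa_list(struct pa *head)
{
	unsigned long locks_taken = 0;
	struct pa *pa;

	for (pa = head; pa; pa = pa->next) {
		/* spin_lock(&pa->pa_lock); */
		locks_taken++;
		if (pa->pa_deleted == 0 && pa->pa_free)
			return locks_taken;	/* usable PA found */
		/* spin_unlock(&pa->pa_lock); */
	}
	return locks_taken;	/* no usable PA: paid for every entry */
}
```

With an unbounded list this walk happens twice per allocation (once in ldiskfs_mb_use_preallocated() and again in ldiskfs_mb_normalize_request()), which matches the two hot symbols in the perf output.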

I think it makes sense to backport either https://patchwork.ozlabs.org/project/linux-ext4/list/?series=346731 to ldiskfs, or at least the fixed-limit prealloc list patch https://lore.kernel.org/all/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com/ to prevent the PA list from growing too long.
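The core idea of the fixed-limit patch is to bound the per-inode prealloc list: when adding a PA would push the list past a limit, an old entry is discarded first, so the walks above stay O(limit). A simplified userspace sketch of that trimming logic follows; the names (pa_list, MAX_INODE_PREALLOC, add_inode_pa) and the limit value are illustrative, not the actual kernel identifiers or tunable.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_INODE_PREALLOC 32	/* illustrative limit, not the kernel default */

struct pa_node {
	struct pa_node *next;
};

struct pa_list {
	struct pa_node *head;
	int count;
};

/* Discard the tail entry, analogous to discarding an old
 * preallocation when the list is full. */
static void drop_tail_pa(struct pa_list *list)
{
	struct pa_node **pp = &list->head;

	if (!list->head)
		return;
	while ((*pp)->next)
		pp = &(*pp)->next;
	free(*pp);
	*pp = NULL;
	list->count--;
}

/* Add a PA at the head; trim first if the list is already at the
 * limit, so every later list walk is bounded by MAX_INODE_PREALLOC. */
static void add_inode_pa(struct pa_list *list)
{
	struct pa_node *pa = calloc(1, sizeof(*pa));

	if (list->count >= MAX_INODE_PREALLOC)
		drop_tail_pa(list);
	pa->next = list->head;
	list->head = pa;
	list->count++;
}
```

This trades a little preallocation reuse for a hard cap on the per-allocation lock traffic seen in the perf profile.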

Comment by Alex Zhuravlev [ 31/Mar/23 ]

https://lore.kernel.org/all/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com/

This one looks simple enough.

Comment by Gerrit Updater [ 31/Mar/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50481
Subject: LU-16691 ldiskfs: limit preallocation list
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: acf2f540db47d223e6999e5923aec8549be52d0b

Comment by Gerrit Updater [ 08/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50481/
Subject: LU-16691 ldiskfs: limit length of per-inode prealloc list
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b16c9333a00802faea419dfe6fbb013c4477c9c6

Comment by Peter Jones [ 09/Jul/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:29:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.