[LU-16272] High CPU usage in libcfs_hash_for_each_empty() Created: 27/Oct/22  Updated: 09/Feb/24  Resolved: 13/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related

 Description   

Here are ftrace logs of ldlm_cancel_locks_for_export() / cfs_hash_for_each_relax() cancelling 10k ldlm locks on client disconnect.

The first function calls are fast:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0) + 30.305 us   |  cfs_hash_for_each_relax [libcfs]();
 0) + 13.570 us   |  cfs_hash_for_each_relax [libcfs]();
 0) + 16.458 us   |  cfs_hash_for_each_relax [libcfs]();
 1) + 22.303 us   |  cfs_hash_for_each_relax [libcfs]();
 0) + 11.710 us   |  cfs_hash_for_each_relax [libcfs]();
 0) + 10.227 us   |  cfs_hash_for_each_relax [libcfs]();
 1) + 12.137 us   |  cfs_hash_for_each_relax [libcfs]();
 0) + 10.632 us   |  cfs_hash_for_each_relax [libcfs]();
 1) + 11.427 us   |  cfs_hash_for_each_relax [libcfs]();

The last ones are critically slow:

 3) ! 408.016 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 425.611 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 411.456 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 408.476 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 410.266 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 409.032 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 408.790 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 409.062 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 409.492 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 839.014 us  |  cfs_hash_for_each_relax [libcfs]();
 3) ! 409.712 us  |  cfs_hash_for_each_relax [libcfs]();

That is, cfs_hash_for_each_relax() becomes slow once the first hash buckets have been emptied. This matches the perf (CPU profiling) results below: the CPU is consumed mostly by cfs_hash_hh_hhead(), the function used to check whether the hlists in a bucket are empty.

     3.83%     0.00%  [k] ret_from_fork                                         -      -            
            |          
             --3.81%--kthread
                       |          
                       |--1.85%--ldlm_bl_thread_main
                       |          ldlm_export_cancel_locks
                       |          cfs_hash_for_each_empty
                       |          cfs_hash_for_each_relax
                       |          |          
                       |           --1.09%--cfs_hash_hh_hhead
                       |          
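
To make the hot loop concrete, here is a minimal userspace model of the pre-patch behaviour (the types and names below are illustrative, not the real libcfs API, and the real cfs_hash_for_each_relax() logic is more involved): every pass restarts from bucket 0, so once the leading buckets have been drained, each remaining lock pays a full rescan of empty buckets, roughly O(nbkt * nitems) in total.

#include <stddef.h>

struct toy_node { struct toy_node *next; };
struct toy_hash { unsigned nbkt; struct toy_node **bkts; };

/* One pass over the table, always starting from bucket 0. */
static int toy_pass(struct toy_hash *hs, void (*cancel)(struct toy_node *))
{
	unsigned i;

	for (i = 0; i < hs->nbkt; i++) {
		struct toy_node *n = hs->bkts[i];	/* cfs_hash_hh_hhead() */

		if (n == NULL)
			continue;	/* empty bucket: pure overhead */
		hs->bkts[i] = n->next;
		cancel(n);
		return 1;	/* the callback may block, so rescan from scratch */
	}
	return 0;
}

/* Drain the table, as cfs_hash_for_each_empty() does: keep rescanning
 * until a pass finds nothing left to do. */
static void toy_drain(struct toy_hash *hs, void (*cancel)(struct toy_node *))
{
	while (toy_pass(hs, cancel))
		;
}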
 

An ldlm_export_cancel_locks() call consumes about 2.4 seconds:

[root@devvm3 tests]# cat ~/trace.txt
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 2) $ 2397334 us  |  ldlm_export_cancel_locks [ptlrpc]();
 1)               |  ldlm_export_cancel_locks [ptlrpc]() {
 0) $ 2384598 us  |  } /* ldlm_export_cancel_locks [ptlrpc] */
[root@devvm3 tests]# 

With a small libcfs hash improvement (to be submitted soon), the same call takes only about 0.1 s:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  ldlm_export_cancel_locks [ptlrpc]() {
 0)   8.154 us    |    lu_env_init [obdclass]();
 0) * 91503.19 us |    cfs_hash_for_each_empty [libcfs]();
 3)               |  ldlm_export_cancel_locks [ptlrpc]() {
 3)   7.818 us    |    lu_env_init [obdclass]();
 3) * 92138.99 us |    cfs_hash_for_each_empty [libcfs]();
 0) + 34.008 us   |    ldlm_reprocess_list [ptlrpc]();
 0)   1.183 us    |    libcfs_debug_msg [libcfs]();
 0) + 13.812 us   |    lu_env_fini [obdclass]();
 0) * 91564.21 us |  }
 3) + 36.190 us   |    ldlm_reprocess_list [ptlrpc]();
 3)   1.249 us    |    libcfs_debug_msg [libcfs]();
 3)   9.312 us    |    lu_env_fini [obdclass]();
 3) * 92197.71 us |  }
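
For reference, the shape of the optimization (my reconstruction from the patch subject only; see the Gerrit change below for the actual diff): let the drain loop resume from the bucket where the previous pass stopped instead of rescanning from bucket 0, so drained leading buckets are never revisited.

/* Hypothetical resume-from-cursor variant of the toy model above;
 * it illustrates the idea, not the merged libcfs patch. */
static int toy_pass_from(struct toy_hash *hs, unsigned *cursor,
			 void (*cancel)(struct toy_node *))
{
	unsigned i;

	for (i = *cursor; i < hs->nbkt; i++) {
		struct toy_node *n = hs->bkts[i];

		if (n == NULL)
			continue;
		*cursor = i;	/* skip the drained buckets on the next pass */
		hs->bkts[i] = n->next;
		cancel(n);
		return 1;
	}
	return 0;
}

static void toy_drain_fast(struct toy_hash *hs,
			   void (*cancel)(struct toy_node *))
{
	unsigned cursor = 0;

	/* Each bucket is now visited O(1 + its item count) times, so the
	 * total cost drops to about O(nbkt + nitems). */
	while (toy_pass_from(hs, &cursor, cancel))
		;
}

One caveat of any resume-style scan: entries inserted behind the cursor while the drain runs can survive a pass, so the table may be non-empty when the loop exits; whether that is related to the assertion failures reported in the comments below, I leave to the discussion there.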

My test script, for reference:

# Configure ftrace: function_graph on ldlm_export_cancel_locks(), depth 1.
cd /sys/kernel/debug/tracing/
echo ldlm_export_cancel_locks > set_graph_function
# echo cfs_hash_for_each_relax > set_graph_function
# echo ldlm_cancel_locks_for_export_cb > set_graph_function
# echo ldlm_export_cancel_locks > /sys/kernel/debug/tracing/set_ftrace_filter
echo > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > max_graph_depth
echo function_graph > current_tracer
echo > trace
cd -

# Populate the client with lockahead ldlm locks on 20 files.
for n in {1..20}; do touch /mnt/lustre/foo-$n; for x in {1..1000}; do echo ladvise -a lockahead -s $(( 2 * x ))M -l 1M -m WRITE /mnt/lustre/foo-$n; done; done | ../utils/lfs
../utils/lctl get_param ldlm.namespaces.filter-*.lock_count
# Evict the client so the server cancels all of its locks at once.
LUUID=$(../utils/lctl get_param -n llite.*.uuid)
../utils/lctl set_param obdfilter.lustre-OST*.evict_client=$LUUID
sleep 3
../utils/lctl get_param ldlm.namespaces.filter-*.lock_count
cat /sys/kernel/debug/tracing/trace > ~/trace.txt
cat ~/trace.txt


 Comments   
Comment by Gerrit Updater [ 27/Oct/22 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48972
Subject: LU-16272 libcfs: cfs_hash_for_each_empty optimization
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7969901d3a41c5eabe8e113bb721a0685aa9759d

Comment by Alex Zhuravlev [ 27/Oct/22 ]

https://review.whamcloud.com/c/fs/lustre-release/+/45512

Comment by Alexander Zarochentsev [ 27/Oct/22 ]

Looks like there is a similar patch in LU-15204 https://review.whamcloud.com/c/fs/lustre-release/+/45512

Comment by Gerrit Updater [ 13/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48972/
Subject: LU-16272 libcfs: cfs_hash_for_each_empty optimization
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 306a9b666e5ea2882f704d93483355e7e147544f

Comment by Peter Jones [ 13/Jan/23 ]

Landed for 2.16

Comment by Alex Zhuravlev [ 16/Jan/23 ]

Hit this on fresh master:

[  193.056624] Lustre: dir [0x240000402:0x180:0x0] stripe 1 readdir failed: -2, directory is partially accessed!
[  193.515228] LustreError: 5402:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: 00000000209e1bb3/0x83b6d793c0eaa64f lrc: 3/0,0 mode: PR/PR res: [0x200000403:0x5c:0x0].0x0 bits 0x13/0x0 rrc: 12 type: IBT gid 0 flags: 0x60200400000020 nid: 0@lo remote: 0x83b6d793c0eaa625 expref: 73 pid: 7057 timeout: 193 lvb_type: 0
[  193.516385] LustreError: 8007:0:(hash.c:1751:cfs_hash_for_each_empty()) ASSERTION( atomic_read(&hs->hs_count) == 0 ) failed: 
[  193.516479] LustreError: 8007:0:(hash.c:1751:cfs_hash_for_each_empty()) LBUG

the trace:

PID: 8007   TASK: ffff9d14ca4f1b40  CPU: 0   COMMAND: "ldlm_bl_04"
 #0 [ffff9d14d4b4bd60] panic at ffffffffa30b9786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff9d14d4b4bdf0] cfs_hash_for_each_empty at ffffffffc02b9bd8 [libcfs]
    /home/lustre/master-mine/libcfs/libcfs/hash.c: 1751
 #2 [ffff9d14d4b4be30] ldlm_export_cancel_locks at ffffffffc078092b [ptlrpc]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
 #3 [ffff9d14d4b4be98] ldlm_bl_thread_main at ffffffffc07a911f [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c: 2975
 #4 [ffff9d14d4b4bf10] kthread at ffffffffa30d5199
    /tmp/kernel/kernel/kthread.c: 340
 #5 [ffff9d14d4b4bf50] ret_from_fork at ffffffffa360019f
    /tmp/kernel/arch/x86/entry/entry_64.S: 325

the test:

FSTYPE=ldiskfs MDSCOUNT=2 SLOW=yes DURATION=2700 MDSSIZE=250000 OSTSIZE=400000 OSTCOUNT=2 REFORMAT=yes bash racer.sh

Comment by Alex Zhuravlev [ 16/Jan/23 ]

one more similar crash:

LustreError: 13000:0:(ldlm_lockd.c:1055:ldlm_server_completion_ast()) ASSERTION( data != ((void *)0) ) failed: in recovery-small / 27
...
PID: 13000  TASK: ffff985970e31b00  CPU: 0   COMMAND: "umount"
 #0 [ffff985970d1f8f8] panic at ffffffff8d0b9786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff985970d1f978] ldlm_server_completion_ast at ffffffffc06f8ba2 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c: 1055
 #2 [ffff985970d1f9e8] cleanup_resource_queue at ffffffffc06d504f [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1084
 #3 [ffff985970d1fa38] ldlm_resource_cleanup at ffffffffc06d5114 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1109
 #4 [ffff985970d1fa50] ldlm_resource_clean_hash at ffffffffc06d514c [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1122
 #5 [ffff985970d1fa60] cfs_hash_for_each_relax at ffffffffc020913f [libcfs]
    /home/lustre/master-mine/libcfs/libcfs/hash.c: 1645
 #6 [ffff985970d1fae0] cfs_hash_for_each_nolock at ffffffffc020c906 [libcfs]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_hash.h: 402
 #7 [ffff985970d1fb08] ldlm_namespace_cleanup at ffffffffc06d54e6 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1166
 #8 [ffff985970d1fb20] __ldlm_namespace_free at ffffffffc06d55ad [ptlrpc]
    /home/lustre/linux-4.18.0-305.25.1.el8_4/include/linux/compiler.h: 276
 #9 [ffff985970d1fb98] ldlm_namespace_free_prior at ffffffffc06d5a86 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_resource.c: 1253
#10 [ffff985970d1fbc8] mdt_device_fini at ffffffffc0ddc901 [mdt]
    /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 5874
#11 [ffff985970d1fc20] class_cleanup at ffffffffc0354856 [obdclass]
    /home/lustre/master-mine/lustre/include/obd_class.h: 613
#12 [ffff985970d1fc98] class_process_config at ffffffffc03552bb [obdclass]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
#13 [ffff985970d1fd28] class_manual_cleanup at ffffffffc0357d3c [obdclass]
    /home/lustre/master-mine/lustre/obdclass/obd_config.c: 2359
#14 [ffff985970d1fdc0] server_put_super at ffffffffc07682bb [ptlrpc]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_fail.h: 79
#15 [ffff985970d1fe90] generic_shutdown_super at ffffffff8d1d17cf
    /tmp/kernel/./include/linux/compiler.h: 276

Comment by Alex Zhuravlev [ 10/Mar/23 ]

Still hitting this from time to time:

[ 3153.685924] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:934
[ 3153.698414] in_atomic(): 1, irqs_disabled(): 0, pid: 5896, name: ldlm_bl_01
[ 3153.698514] 2 locks held by ldlm_bl_01/5896:
[ 3153.698586]  #0: ffff8c9de894ae18 (&hs->hs_lock.rw){....}-{2:2}, at: cfs_hash_for_each_relax+0x291/0x470 [libcfs]
[ 3153.698741]  #1: ffff8c9dabd29e18 (&new_bkts[i]->hsb_lock.rw){....}-{2:2}, at: cfs_hash_for_each_relax+0x2a2/0x470 [libcfs]
[ 3153.704358] CPU: 0 PID: 5896 Comm: ldlm_bl_01 Tainted: G        W  O     --------- -  - 4.18.0 #2
[ 3153.704488] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3153.704576] Call Trace:
[ 3153.704621]  dump_stack+0x5c/0x80
[ 3153.704679]  ___might_sleep.cold.21+0x9b/0xa8
[ 3153.704753]  __mutex_lock+0x41/0x930
[ 3153.704826]  ? qmt_lvbo_free+0xb1/0x250 [lquota]
[ 3153.704907]  qmt_lvbo_free+0xb1/0x250 [lquota]
[ 3153.705047]  ldlm_resource_putref+0x148/0x220 [ptlrpc]
[ 3153.705170]  ldlm_lock_put+0x265/0x670 [ptlrpc]
[ 3153.705256]  cfs_hash_for_each_relax+0x2e0/0x470 [libcfs]
[ 3153.705367]  ? ldlm_cancel_lock_for_export.isra.12+0x340/0x340 [ptlrpc]
[ 3153.705497]  ? ldlm_export_lock_object+0x10/0x10 [ptlrpc]
[ 3153.705611]  ? ldlm_cancel_lock_for_export.isra.12+0x340/0x340 [ptlrpc]
[ 3153.705709]  cfs_hash_for_each_empty+0x87/0x200 [libcfs]
[ 3153.705819]  ldlm_export_cancel_locks+0x8b/0x150 [ptlrpc]
[ 3153.707282]  ldlm_bl_thread_main+0x70f/0x8c0 [ptlrpc]
[ 3153.707365]  ? wait_woken+0xa0/0xa0
[ 3153.707458]  ? ldlm_handle_bl_callback+0x3e0/0x3e0 [ptlrpc]
[ 3153.707538]  kthread+0x129/0x140
[ 3153.707594]  ? kthread_flush_work_fn+0x10/0x10

I guess it's a matter of time until production hits this.
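
The splat boils down to a sleep-under-spinlock pattern (a minimal sketch; the lock names are illustrative stand-ins for the hash bucket lock and whatever mutex the qmt_lvbo_free() path takes):

#include <linux/spinlock.h>
#include <linux/mutex.h>

static DEFINE_SPINLOCK(hsb_lock);	/* per-bucket lock held by the hash walk */
static DEFINE_MUTEX(lvbo_mutex);	/* sleeping lock taken on item release */

static void walk_item(void)
{
	spin_lock(&hsb_lock);		/* atomic context from here on */
	/* dropping the last lock reference frees the resource, which
	 * ends up in qmt_lvbo_free()... */
	mutex_lock(&lvbo_mutex);	/* BUG: sleeping call in atomic context */
	mutex_unlock(&lvbo_mutex);
	spin_unlock(&hsb_lock);
}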

Comment by Alexander Zarochentsev [ 27/Mar/23 ]

bzzz, can you enable spinlock debugging and repeat the test?

Comment by Alex Zhuravlev [ 13/Oct/23 ]

Here is the race: a new lock is enqueued on an already-destroyed export while ldlm_export_cancel_locks() is draining its lock hash, leaving one lock on the hash table after the scan:

00010000:00020000:1.0:1697224811.088493:0:31361:0:(ldlm_lockd.c:1479:ldlm_handle_enqueue()) ### lock on destroyed export 00000000fc58c50a ns: mdt-lustre-MDT0001_UUID lock: 0000000088760baf/0x8b83035ffb756f74 lrc: 3/0,0 mode: PR/PR res: [0x240000bd0:0x7f:0x0].0x0 bits 0x1b/0x0 rrc: 2 type: IBT gid 0 flags: 0x50200000000000 nid: 0@lo remote: 0x8b83035ffb756f58 expref: 6 pid: 31361 timeout: 0 lvb_type: 0
00000001:00000001:0.0:1697224811.091533:0:7065:0:(hash.c:1714:cfs_hash_for_each_nolock()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:0.0:1697224811.091536:0:7065:0:(ldlm_lock.c:2473:ldlm_reprocess_recovery_done()) Process leaving
00010000:00020000:0.0:1697224811.091537:0:7065:0:(ldlm_lock.c:2748:ldlm_export_cancel_locks()) Export 00000000fc58c50a, canceled 51 locks, left on hash table 1.
...
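
A plausible interleaving, from my reading of the timestamps above (a reconstruction, not a confirmed analysis):

/*
 * service thread 31361                  ldlm_bl thread 7065
 * --------------------                  -------------------
 * ldlm_handle_enqueue()
 *   export already being destroyed      ldlm_export_cancel_locks()
 *   ("lock on destroyed export")          cfs_hash_for_each_empty()
 *   the new lock still reaches             cancels 51 locks,
 *   the export lock hash                   then finds no more work
 *                                        "canceled 51 locks,
 *                                         left on hash table 1"
 */
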
Comment by Gerrit Updater [ 27/Nov/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53252
Subject: LU-16272 libcfs: cfs_hash_for_each_empty optimization
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: e48faf2b37c01eaf99f7c97c6c6c33c7dd1dc12c

Comment by Andreas Dilger [ 08/Dec/23 ]

Alex, do your comments mean that the 48972 patch is, or is not, fixing the problem?
