[LU-14139]  batched statahead processing Created: 17/Nov/20  Updated: 07/Jul/23  Resolved: 22/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: New Feature Priority: Minor
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Fixed Votes: 0
Labels: None

Attachments: File ls.svg    
Issue Links:
Related
is related to LU-15975 Statahead_V1 Features Resolved
is related to LU-16139 statahead: avoid new RPC and long wai... Resolved
is related to LU-10938 Metadata writeback cache support Open
is related to LU-13045 WBC2: batch metadata update Open
is related to LU-16096 recovery: handle compatibility durin... Open
is related to LU-14361 Add support for statahead pattern wit... Open
is related to LU-14393 Batching Processing Framework for Lustre Resolved
is related to LU-15550 WBC: retry the batched RPC when the r... Resolved
is related to LU-15562 statahead: using try lock for batched... Resolved
is related to LU-16365 cached 'ls -l' is slow Open
is related to LU-8582 Interop: master<->b2_8 - sanity test... Resolved
is related to LU-10334 Ubuntu1604 client sanity-103a: FAIL: ... Resolved
is related to LU-14868 sanity: all subtests pass but test su... Resolved
is related to LU-16247 do open()+readahead for DoM files as ... Open
is related to LU-16355 batch dirty buffered write of small f... Open
is related to LU-23 Improve performance for traversing la... Resolved
is related to LU-10280 DoM cross-file open+readahead via sta... Open
is related to LU-14380 Make statahead better support Breadth... Open
is related to LU-14138 Move more members in PTLRPC request i... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Batched metadata processing can provide a significant performance boost.
Batched statahead in particular can improve the performance of directory listing operations such as "ls -l".
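
A minimal sketch of enabling and observing batched statahead on a client, using the tunables that appear in the test output later in this ticket (the specific values are illustrative assumptions, not recommendations):

# widen the statahead window and allow up to 64 getattr sub-requests
# to be aggregated into a single batch RPC (statahead_batch_max only
# exists with the batched statahead patch applied)
lctl set_param llite.*.statahead_max=1024
lctl set_param llite.*.statahead_batch_max=64

# check how well sub-requests are being aggregated
lctl get_param mdc.*.batch_stats
lctl get_param llite.*.statahead_stats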



 Comments   
Comment by Gerrit Updater [ 19/Nov/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40712
Subject: LU-14139 llite: simply callback handling for async getattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bdbf864aa6d02b93ed2977b2e48c3890d055f1b3

Comment by Gerrit Updater [ 20/Nov/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40720
Subject: LU-14139 statahead: batched statahead processing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 993903398a61e4dd515b3fcc2ae264511e34fad5

Comment by Gerrit Updater [ 11/Dec/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40943
Subject: LU-14139 statahead: add stats for batch rpc requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fe3ba9bd088fa5b218a288acf0e34a4485dcbb41

Comment by Gerrit Updater [ 11/Dec/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40945
Subject: LU-14139 ptlrpc: grow reply buffer properly for batch request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b39b4e7897dda3737eb874bbdd2e69ce19a84f93

Comment by Gerrit Updater [ 14/Jan/21 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/41220
Subject: LU-14139 statahead: add test for batch statahead processing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6de31acdd0804ad8b5d656a00a1f687423c1c042

Comment by Shuichi Ihara [ 21/Jan/21 ]

Here are benchmark results for the batch statahead patch.
A single shared directory was created containing 1M files (47001 bytes each).
The workload is "ls -l" on that directory from the client, with no caching on the servers or the client, measuring the elapsed time.

# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
# time ls -l /ai400x/dir/test-dir.0-0/mdtest_tree.0
statahead_max  statahead_batch_max  master  master+patch
               (only with patch)    (sec)   (sec)
32             0                    -       181
32             32                   180     40
64             64                   183     35
128            128                  181     33
256            256                  183     30
512            512                  178     29
1024           1024                 179     28

This is a great improvement: the patch demonstrated a 6x speedup (180 sec down to 30 sec) for the "ls -l" workload.
However, the use cases for the current statahead are still limited. Statahead only starts when stat() is invoked on the files before close() of that directory.
"ls -l" is a good example, but statahead does not trigger for several other common commands today (e.g. "du <dir>" or "find <dir> -size X"). It would be nice to have additional improvements for those cases if possible.

Comment by Qian Yingjin [ 21/Jan/21 ]

Verified that it works with the command "du -a $dir", using the following test script:

        mkdir $dir || error "failed to mkdir $dir"
        touch $dir/$tfile.{0..9}

        $LCTL set_param mdc.*.batch_stats=0
        lctl get_param -n llite.*.statahead_stats
        strace -o output du -a $dir
        lctl get_param -n llite.*.statahead_stats
        lctl get_param -n mdc.*.batch_stats

The output of the test scripts:

mdc.lustre-MDT0000-mdc-ffff97f3bde68000.batch_stats=0
statahead total: 0
statahead wrong: 0
agl total: 0
0	/mnt/lustre/d123d.sanity/f123d.sanity.6
0	/mnt/lustre/d123d.sanity/f123d.sanity.2
0	/mnt/lustre/d123d.sanity/f123d.sanity.5
0	/mnt/lustre/d123d.sanity/f123d.sanity.3
0	/mnt/lustre/d123d.sanity/f123d.sanity.7
0	/mnt/lustre/d123d.sanity/f123d.sanity.9
0	/mnt/lustre/d123d.sanity/f123d.sanity.4
0	/mnt/lustre/d123d.sanity/f123d.sanity.8
0	/mnt/lustre/d123d.sanity/f123d.sanity.1
0	/mnt/lustre/d123d.sanity/f123d.sanity.0
4	/mnt/lustre/d123d.sanity
statahead total: 1
statahead wrong: 0
agl total: 1
snapshot_time:         1611218329. 88052008 (secs.nsecs)
subreqs per batch   batchs   % cum %
1:		         1  50  50
2:		         0   0  50
4:		         0   0  50
8:		         1  50 100
Comment by Andreas Dilger [ 21/Jan/21 ]

Yingjin, for the next refresh of this patch, please include the benchmark results from Shuichi in the commit message of the patch.

Note that different versions of userspace tools (e.g. du, find) may sometimes behave differently between e.g. RHEL7 and 8, or if run as root or a regular user, so it is always useful to include the version of the utility that is working or not.
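
For example, one way to record that information alongside a test run (assuming GNU coreutils/findutils on an RPM-based distribution; adjust for other environments):

# record the tool versions and which user ran the test
du --version | head -1
find --version | head -1
rpm -q coreutils findutils
id -un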

Comment by Shuichi Ihara [ 21/Jan/21 ]

Please see and try the following test cases. Something is still wrong: batching does not trigger properly as the number of files in the directory increases.

# mkdir dir0 dir1 dir2
# for i in `seq 0 100`; do  touch dir0/file.$i; done
# for i in `seq 0 1000`; do  touch dir1/file.$i; done
# for i in `seq 0 10000`; do  touch dir2/file.$i; done

100 files

# lctl set_param mdc.*MDT0000*.batch_stats=clear; lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=1024; clush -w es400nvx1-vm[1-4],ec[01-40] " echo 3 > /proc/sys/vm/drop_caches" 
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=clear
llite.ai400x-ffff97559d533800.statahead_max=1024
llite.ai400x-ffff97559d533800.statahead_batch_max=1024

# du -a /ai400x/dir0 > /dev/null; lctl get_param mdc.*MDT0000*.batch_stats
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=
snapshot_time:         1611227271. 35992632 (secs.nsecs)
subreqs per batch   batchs   % cum %
1:		         0   0   0
2:		         0   0   0
4:		         0   0   0
8:		         1  20  20
16:		         1  20  40
32:		         1  20  60
64:		         2  40 100

1000 files

# lctl set_param mdc.*MDT0000*.batch_stats=clear; lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=1024; clush -w es400nvx1-vm[1-4],ec[01-40] " echo 3 > /proc/sys/vm/drop_caches" 
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=clear
llite.ai400x-ffff97559d533800.statahead_max=1024
llite.ai400x-ffff97559d533800.statahead_batch_max=1024

# du -a /ai400x/dir1 > /dev/null; lctl get_param mdc.*MDT0000*.batch_stats
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=
snapshot_time:         1611227330.733571614 (secs.nsecs)
subreqs per batch   batchs   % cum %
1:		         0   0   0
2:		         0   0   0
4:		         0   0   0
8:		         1  25  25
16:		         1  25  50
32:		         1  25  75
64:		         0   0  75
128:		         0   0  75
256:		         0   0  75
512:		         0   0  75
1024:		         1  25 100

10000 files

# lctl set_param mdc.*MDT0000*.batch_stats=clear; lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=1024; clush -w es400nvx1-vm[1-4],ec[01-40] " echo 3 > /proc/sys/vm/drop_caches" 
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=clear
llite.ai400x-ffff97559d533800.statahead_max=1024
llite.ai400x-ffff97559d533800.statahead_batch_max=1024

# du -a /ai400x/dir2 > /dev/null; lctl get_param mdc.*MDT0000*.batch_stats
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=
snapshot_time:         1611227444.263062015 (secs.nsecs)
subreqs per batch   batchs   % cum %
1:		         0   0   0
Comment by Qian Yingjin [ 21/Jan/21 ]

I think I have found the reason why statahead does not trigger:

9 dentries:
0	/mnt/lustre/d123d.sanity/f123d.sanity.6
0	/mnt/lustre/d123d.sanity/f123d.sanity.2
0	/mnt/lustre/d123d.sanity/f123d.sanity.5
0	/mnt/lustre/d123d.sanity/f123d.sanity.3
0	/mnt/lustre/d123d.sanity/f123d.sanity.7
0	/mnt/lustre/d123d.sanity/f123d.sanity.9
0	/mnt/lustre/d123d.sanity/f123d.sanity.4
0	/mnt/lustre/d123d.sanity/f123d.sanity.8
0	/mnt/lustre/d123d.sanity/f123d.sanity.1
0	/mnt/lustre/d123d.sanity/f123d.sanity.0
4	/mnt/lustre/d123d.sanity
10000 dentries
newfstatat(4, "f123d.sanity.1", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.2", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.3", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.4", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.5", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.6", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.7", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.8", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.9", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.10", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.11", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.12", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.13", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
...

When the entry count within a directory is larger than 10000, the 'du' program first sorts the entries by file name.
In Lustre, however, the dentries are returned from the MDT in name-hash order, so the first dentry stat'ed is not the same as the first entry returned by readdir, and the statahead thread is never started...

I think that if the file names follow a regular pattern, i.e. mdtest.$(i+step), we could optimize statahead for this case as well (a purely illustrative sketch follows the list below).

i.e.

  • mdtest.0000~mdtest.1000 could be handled by batched/async statahead thread A;
  • mdtest.1001~mdtest.2000 could be handled by batched/async statahead thread B;
    ...
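
A purely illustrative userspace analogue of the idea above, not the proposed client-side implementation: if the names follow a regular pattern, the range can be split and the attributes prefetched by parallel workers. The directory path below is hypothetical; the "mdtest.N" naming is taken from the example above.

# prefetch_range DIR FIRST LAST: stat a contiguous range of pattern-named files
prefetch_range() {
    local dir=$1 first=$2 last=$3
    for ((i = first; i <= last; i++)); do
        stat "$dir/mdtest.$i" > /dev/null 2>&1
    done
}
prefetch_range /mnt/lustre/testdir 0 1000 &      # "thread A"
prefetch_range /mnt/lustre/testdir 1001 2000 &   # "thread B"
wait
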
Comment by Shuichi Ihara [ 21/Jan/21 ]

I think that if the file names follow a regular pattern, i.e. mdtest.$(i+step), we could optimize statahead for this case as well.
i.e.

mdtest.0000~mdtest.1000 could be handled by batched/async statahead thread A;
mdtest.1001~mdtest.2000 could be handled by batched/async statahead thread B;

Okay, then "ls -l dir/*" could also be optimized if the file names follow a pattern. This would be very useful for ML/AI workloads, where ingested data typically follow a file-naming rule within the directory.

Comment by Gerrit Updater [ 14/May/21 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/43707
Subject: LU-14139 ptlrpc: grow PtlRPC properly when prepare sub request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dd29ac0b580430c12b838f7a16fe66cb18638eb9

Comment by Gerrit Updater [ 08/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40712/
Subject: LU-14139 llite: simplify callback handling for async getattr
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cbaaa7cde45f59372c75577d7274f7e2e38acd24

Comment by Qian Yingjin [ 04/Aug/21 ]

Hi Andreas, SiYao

Any ideas or suggestions on how to reproduce the hang caused by LU-14139?
I added a racer test case racing "rm -rf" against "ls -l", but cannot reproduce the bug...

Thanks,
Qian

Comment by Gerrit Updater [ 04/Aug/21 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/44488
Subject: LU-14139 llite: simplify callback handling for async getattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 93e998a154bef2aaea5b1e527da57e1bcce11a5a

Comment by Qian Yingjin [ 05/Aug/21 ]

After discussing with Lai, I found the following stack dump:

schedule+0x38/0xa0
[24373.096771]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24373.097828]  ? finish_wait+0x80/0x80
[24373.098595]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24373.099578]  ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
[24373.100929]  ? ll_md_need_convert+0x160/0x160 [lustre]
[24373.101985]  ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[24373.103062]  mdc_enqueue_base+0x335/0x1340 [mdc]
[24373.103996]  ? mdc_revalidate_lock+0x1a4/0x1d0 [mdc]
[24373.104972]  mdc_intent_lock+0x210/0x530 [mdc]
[24373.105872]  ? ll_md_need_convert+0x160/0x160 [lustre]
[24373.106904]  ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
[24373.108080]  ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[24373.109211]  lmv_revalidate_slaves+0x607/0x9e0 [lmv]
[24373.110236]  ? ll_md_need_convert+0x160/0x160 [lustre]
[24373.111258]  lmv_merge_attr+0x32/0x170 [lmv]
[24373.112183]  ll_update_lsm_md+0xb10/0xe60 [lustre]
[24373.113163]  ll_update_inode+0x41f/0x650 [lustre]
[24373.114109]  ll_read_inode2+0x4e/0x3a0 [lustre]
[24373.115023]  ll_iget+0xbd/0x320 [lustre]
[24373.115843]  ll_prep_inode+0x374/0x990 [lustre]
[24373.116754]  ? mdc_revalidate_lock+0x5b/0x1d0 [mdc]
[24373.117739]  ? lprocfs_counter_add+0xd2/0x140 [obdclass]
[24373.118798]  ll_statahead_interpret+0x51b/0xb90 [lustre]

In the statahead interpret callback, new RPCs are generated for a striped directory: lmv_merge_attr() -> lmv_revalidate_slaves() -> mdc_intent_lock(). This can exhaust the RPC slots.
There are two solutions for this problem:
1. The original solution: keep the intermediate list and let the statahead thread revalidate the slaves for a striped directory. This could be improved by putting only striped directories on the intermediate list handled by the statahead thread.
2. The simplest solution: once the entry is found to be a striped directory, drop the result and let the scanning process do the stat() on the striped directory synchronously.

Which one should we use? Any suggestions?

Comment by Andreas Dilger [ 05/Aug/21 ]

I don't have a good idea on how complex each solution is. The first one sounds like it would provide better performance (not causing a stall in the statahead pipeline), but for now the more simple/robust solution is best so that you can move ahead with the other statahead patches.

Comment by Qian Yingjin [ 23/Nov/21 ]

After the patch
LU-15121 llite: skip request slot for lmv_revalidate_slaves() https://review.whamcloud.com/#/c/45275/,

I think the problem of exhausting the RPC slots is solved.

The old patch should no longer cause a hang waiting for RPC slots.

Comment by Gerrit Updater [ 23/Nov/21 ]

"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45648
Subject: LU-14139 llite: simplify callback handling for async getattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4a891e844afdc8c2c18155efba5a1881fda92f0b

Comment by Gerrit Updater [ 26/Jan/22 ]

"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46309
Subject: LU-14139 statahead: add total hit/miss count stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c626be11590c894aa5167060e392b7b582b6b2e6

Comment by Gerrit Updater [ 19/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45648/
Subject: LU-14139 llite: simplify callback handling for async getattr
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 509d7305ce8a01351cb77c26aaad078edca0e09c

Comment by Gerrit Updater [ 19/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46309/
Subject: LU-14139 statahead: add total hit/miss count stats
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b9167201a00e38ce8f63d3013ab89571c3492f52

Comment by Alex Zhuravlev [ 06/Sep/22 ]

I'm hitting a freeze in replay-dual/26 quite often; bisection gives the following stats:

COMMIT          TESTED  PASSED  FAILED  VERDICT  COMMIT DESCRIPTION
76c3fa96dc      10      5       4       BAD      LU-16082 ldiskfs: old-style EA inode handling fix
2447564e12      6       1       5       BAD      LU-16011 lnet: use preallocate bulk for server
3f0bee2502      6       2       4       BAD      LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]
cf35c54224      5       3       2       BAD      LU-14745 tests: ensure sanity/51d has enough objects
bbe5e98180      3       0       3       BAD      LU-15874 kernel: new kernel [RHEL 9.0 5.14.0-70.22.1.el9_0]
1ef1fa06b2      7       3       4       BAD      LU-16078 o2iblnd: Salt comp_vector
0109cee261      10      4       6       BAD      LU-13813 tests: fix stack_trap in conf-sanity test 110/111
b9167201a0      3       0       3       BAD      LU-14139 statahead: add total hit/miss count stats
509d7305ce      6       4       1       BAD      LU-14139 llite: simplify callback handling for async getattr
73ac8e35e5      10      10      0       GOOD     LU-16061 osd-ldiskfs: clear EXTENT_FL for symlink agent inode
907dc0a2d3      10      10      0       GOOD     LU-16060 osd-ldiskfs: copy nul byte terminator in writelink
33efefc496      10      10      0       GOOD     LU-16037 build: remove quotes from %{mkconf_options}
Comment by Qian Yingjin [ 07/Sep/22 ]

Hi Alex,
What is your testing environment (how many MDTs and clients)?
Is there any Maloo testing log available for the freeze in replay-dual/26?

Thanks,
Qian

Comment by Alex Zhuravlev [ 07/Sep/22 ]

just local testing:
FSTYPE=ldiskfs MDSCOUNT=2 MDSSIZE=500000 OSTSIZE=1000000 OSTCOUNT=2 REFORMAT=yes REFORMAT=yes bash replay-dual.sh

These are the unique stack traces:

schedule,osc_extent_wait,osc_cache_wait_range,osc_io_fsync_end,cl_io_end,lov_io_end_wrapper,lov_io_fsync_end,cl_io_end,cl_io_loop,cl_sync_file_range,ll_writepages,do_writepages,__writeback_single_inode,writeback_sb_inodes,wb_writeback,wb_workfn,process_one_work,worker_thread
	PIDs(1): "kworker/u4:3":190 

schedule,osc_extent_wait,osc_cache_wait_range,osc_io_fsync_end,cl_io_end,lov_io_end_wrapper,lov_io_fsync_end,cl_io_end,cl_io_loop,cl_sync_file_range,vvp_prune,cl_object_prune,lov_conf_set,cl_conf_set,cl_file_inode_init,ll_update_inode,ll_iget,ll_prep_inode,ll_statahead_interpret_common,ll_statahead_interpret,mdc_intent_getattr_async_interpret,ptlrpc_check_set,ptlrpcd
	PIDs(2): "ptlrpcd_00_00":3041 "ptlrpcd_00_01":3042 

schedule,obd_get_request_slot,ldlm_cli_enqueue,mdc_enqueue_base,mdc_intent_lock,lmv_intent_lock,ll_lookup_it.constprop.4,ll_lookup_nd,__lookup_slow,lookup_slow,walk_component,path_lookupat,filename_lookup,vfs_statx,__se_sys_newstat
	PIDs(1): "dbench":31243 

schedule,rwsem_down_read_slowpath,down_read,lov_io_init,cl_io_init0,cl_sync_file_range,ll_writepages,do_writepages,__writeback_single_inode,writeback_sb_inodes,__writeback_inodes_wb,wb_writeback,get_nr_inodes,wb_workfn,_raw_spin_unlock_irq,process_one_work,worker_thread
	PIDs(1): "kworker/u4:4":31792 

schedule,wait_for_common,osc_io_setattr_end,cl_io_end,lov_io_end_wrapper,lov_io_call,lov_io_end,cl_io_end,cl_io_loop,cl_setattr_ost,ll_setattr_raw,notify_change,utimes_common,do_utimes,__se_sys_utimensat
	PIDs(1): "tar":33454 
Comment by Shuichi Ihara [ 03/Dec/22 ]

I re-ran the "ls -l" test (1M x 47001-byte files in a single directory) with the latest patch series https://review.whamcloud.com/#/c/fs/lustre-release/+/49291/
master (commit: c74c630)

[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real	0m33.757s
user	0m9.363s
sys	0m14.982s

patched

[root@ec01 ~]# lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=64
[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m27.296s
user	0m9.089s
sys	0m13.051s

I see the elapsed time went down from 33 sec to 27 sec with the patch.

subreqs per batch   batchs   % cum %
1:		     92322  76  76
2:		      4742   3  80
4:		       115   0  80
8:		       657   0  80
16:		     10736   8  89
32:		       107   0  89
64:		     12168  10 100

As far as I can see from batch_stats, more than 950K (64 * 12168 + 32 * 107 + 16 * 10736) of the 1M requests were batched with aggregations of 16 or more sub-requests. Maybe that is not so bad? However, we are still spending more than 27 sec here.

Here are additional tests, something like "drop cache; ls -l; ls -l":

[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m27.385s
user	0m8.994s
sys	0m13.131s


llite.exafs-ffff9b96b8157800.stats=
snapshot_time             482063.475018642 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.475018642 secs.nsecs
open                      1 samples [usecs] 154 154 154 23716
close                     1 samples [usecs] 131 131 131 17161
readdir                   1223 samples [usecs] 0 38879 258067 1731301375
getattr                   1000002 samples [usecs] 2 9770 5776755 3873882865
getxattr                  1 samples [usecs] 70 70 70 4900
inode_permission          10000021 samples [usecs] 0 393 1848868 2164824
opencount                 1 samples [reqs] 1 1 1 1


mdc.exafs-MDT0002-mdc-ffff9b96b8157800.stats=
snapshot_time             482063.474876615 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474876615 secs.nsecs
req_waittime              181839 samples [usecs] 57 2043896 22268808747 10620189011411911
req_active                181839 samples [reqs] 1 23615 644666200 9255887581600
ldlm_ibits_enqueue        6 samples [reqs] 1 1 6 6
mds_close                 1 samples [usecs] 110 110 110 12100
mds_readpage              56 samples [usecs] 250 38375 464045 5002627263
mds_batch                 119274 samples [usecs] 61 84527 353967107 2052370493893
ldlm_cancel               62501 samples [usecs] 182 725561 21912324500 10613954056275484
obd_ping                  1 samples [usecs] 90 90 90 8100

osc.exafs-OST0000-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474252382 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474252382 secs.nsecs
req_waittime              132963 samples [usecs] 29 408417 1867410218 500106174718090
req_active                132963 samples [reqs] 1 3612 17237320 44241112018
ldlm_glimpse_enqueue      125139 samples [reqs] 1 1 125139 125139
ost_connect               1 samples [usecs] 97 97 97 9409
ost_disconnect            1 samples [usecs] 253 253 253 64009
ldlm_cancel               7822 samples [usecs] 263 408417 1857609504 500103314555706
osc.exafs-OST0001-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474322305 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474322305 secs.nsecs
req_waittime              132958 samples [usecs] 31 400200 1868293807 500417445772075
req_active                132958 samples [reqs] 1 3620 17260697 44413040365
ldlm_glimpse_enqueue      125134 samples [reqs] 1 1 125134 125134
ost_connect               1 samples [usecs] 134 134 134 17956
ost_disconnect            1 samples [usecs] 255 255 255 65025
ldlm_cancel               7822 samples [usecs] 249 400200 1858535688 500414552572698
osc.exafs-OST0002-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474385742 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474385742 secs.nsecs
req_waittime              132922 samples [usecs] 30 403354 1585074597 373963525463297
req_active                132922 samples [reqs] 1 3139 14993568 33140340186
ldlm_glimpse_enqueue      125100 samples [reqs] 1 1 125100 125100
ost_connect               1 samples [usecs] 104 104 104 10816
ost_disconnect            1 samples [usecs] 229 229 229 52441
ldlm_cancel               7820 samples [usecs] 137 403354 1576064547 373959847577499
osc.exafs-OST0003-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474439284 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474439284 secs.nsecs
req_waittime              132923 samples [usecs] 29 426088 1586055175 373991604658201
req_active                132923 samples [reqs] 1 3131 15010819 33188857463
ldlm_glimpse_enqueue      125100 samples [reqs] 1 1 125100 125100
ost_connect               1 samples [usecs] 140 140 140 19600
ost_disconnect            1 samples [usecs] 230 230 230 52900
ldlm_cancel               7821 samples [usecs] 188 426088 1576987697 373987536028513
osc.exafs-OST0004-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474493870 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474493870 secs.nsecs
req_waittime              133220 samples [usecs] 30 402683 1938227551 496692342133501
req_active                133220 samples [reqs] 1 3741 17164793 41132712891
ldlm_glimpse_enqueue      125381 samples [reqs] 1 1 125381 125381
ost_connect               1 samples [usecs] 106 106 106 11236
ost_disconnect            1 samples [usecs] 236 236 236 55696
ldlm_cancel               7837 samples [usecs] 327 402683 1791475030 496280422422484
osc.exafs-OST0005-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474549629 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474549629 secs.nsecs
req_waittime              133213 samples [usecs] 29 393951 1936411387 496417875801293
req_active                133213 samples [reqs] 1 3740 17146480 41076362838
ldlm_glimpse_enqueue      125376 samples [reqs] 1 1 125376 125376
ost_connect               1 samples [usecs] 118 118 118 13924
ost_disconnect            1 samples [usecs] 199 199 199 39601
ldlm_cancel               7835 samples [usecs] 315 393951 1790421213 496012032268869
osc.exafs-OST0006-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474603969 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474603969 secs.nsecs
req_waittime              132941 samples [usecs] 27 394743 1649085257 398814721042191
req_active                132941 samples [reqs] 1 3382 15661909 36181334593
ldlm_glimpse_enqueue      125116 samples [reqs] 1 1 125116 125116
ost_connect               1 samples [usecs] 99 99 99 9801
ost_disconnect            1 samples [usecs] 187 187 187 34969
ldlm_cancel               7823 samples [usecs] 347 394743 1639760732 398808893873090
osc.exafs-OST0007-osc-ffff9b96b8157800.stats=
snapshot_time             482063.474659084 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482063.474659084 secs.nsecs
req_waittime              132954 samples [usecs] 31 392261 1644646149 397155067488985
req_active                132954 samples [reqs] 1 3364 15624468 36037631194
ldlm_glimpse_enqueue      125131 samples [reqs] 1 1 125131 125131
ost_connect               1 samples [usecs] 121 121 121 14641
ost_disconnect            1 samples [usecs] 169 169 169 28561

In fact, we still spent 25 sec even without dropping caches; there were no requests from the client to the servers,
because everything came from the caches on the local client.

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m25.309s
user	0m8.937s
sys	0m16.327s

llite.exafs-ffff9b96b8157800.stats=
snapshot_time             482212.579971327 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579971327 secs.nsecs
open                      1 samples [usecs] 189 189 189 35721
close                     1 samples [usecs] 159 159 159 25281
readdir                   1223 samples [usecs] 0 149 171544 24104442
getattr                   1000002 samples [usecs] 3 34 9690974 103987264
getxattr                  1 samples [usecs] 2 2 2 4
getxattr_hits             1 samples [reqs]
inode_permission          10000021 samples [usecs] 0 17 2009221 2041697
opencount                 1 samples [reqs] 1 1 1 1
openclosetime             1 samples [usecs] 102201708 102201708 102201708 10445189118117264

mdc.exafs-MDT0002-mdc-ffff9b96b8157800.stats=
snapshot_time             482212.579840759 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579840759 secs.nsecs
req_waittime              5 samples [usecs] 67 132 484 50542
req_active                5 samples [reqs] 1 1 5 5
ldlm_ibits_enqueue        1 samples [reqs] 1 1 1 1
mds_close                 1 samples [usecs] 126 126 126 15876
obd_ping                  3 samples [usecs] 67 87 226 17242

osc.exafs-OST0000-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579286532 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579286532 secs.nsecs
req_waittime              3 samples [usecs] 76 83 235 18441
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 76 83 235 18441
osc.exafs-OST0001-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579346652 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579346652 secs.nsecs
req_waittime              3 samples [usecs] 69 78 222 16470
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 69 78 222 16470
osc.exafs-OST0002-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579396346 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579396346 secs.nsecs
req_waittime              3 samples [usecs] 50 61 164 9030
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 50 61 164 9030
osc.exafs-OST0003-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579440245 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579440245 secs.nsecs
req_waittime              3 samples [usecs] 57 79 196 13090
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 57 79 196 13090
osc.exafs-OST0004-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579489618 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579489618 secs.nsecs
req_waittime              3 samples [usecs] 76 84 240 19232
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 76 84 240 19232
osc.exafs-OST0005-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579536314 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579536314 secs.nsecs
req_waittime              3 samples [usecs] 76 82 236 18584
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 76 82 236 18584
osc.exafs-OST0006-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579579660 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579579660 secs.nsecs
req_waittime              3 samples [usecs] 46 56 158 8388
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 46 56 158 8388
osc.exafs-OST0007-osc-ffff9b96b8157800.stats=
snapshot_time             482212.579628342 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              482212.579628342 secs.nsecs
req_waittime              3 samples [usecs] 60 71 198 13130
req_active                3 samples [reqs] 1 1 3 3
obd_ping                  3 samples [usecs] 60 71 198 13130

So, somehow, "ls -l" for 1M files take 25sec even everything comes from caches.
Without patch, we needed another 8sec for round-trip between clients and servers, but it can downed to 2sec by batched statahead patch.

Comment by Andreas Dilger [ 03/Dec/22 ]

It looks like the client is still sending 125k x 8=1M glimpse RPCs to the OSTs to fetch the file size. I guess this is because the client is not caching all 1M of the OST DLM locks on the objects between "ls -l" calls? Did you check OSC LRU size? I suspect this would have a sudden jump in speed once the number of files is small enough that all the OST DLM locks are cached.

Comment by Shuichi Ihara [ 03/Dec/22 ]

So, somehow, "ls -l" for 1M files take 25sec even everything comes from caches.

'ls -l' is doing other things after recived results. sorting, writing output, etc. sorting some impacts, but majority of time are still spent 1M times for statx() and getxattr() even all data in the cache.

1670051344.758188 getxattr("/exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/file.mdtest.0.945973", "system.posix_acl_access", NULL, 0) = -1 ENODATA (No data available) <0.000008>
1670051344.758214 statx(AT_FDCWD, "/exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/file.mdtest.0.877046", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_MODE|STATX_NLINK|STATX_MTIME|STATX_SIZE, {stx_mask=STATX_MODE|STATX_NLINK|STATX_MTIME|STATX_SIZE, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=47001, ...}) = 0 <0.000013>

Just in case, here are the test results with and without caches, where ls does not sort the result list ("ls -fl"):

[root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 " echo 3 > /proc/sys/vm/drop_caches "
[sihara@ec01 ~]$ time ls -fl /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m20.023s
user	0m1.903s
sys	0m13.363s
[sihara@ec01 ~]$ time ls -fl /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m18.164s
user	0m1.703s
sys	0m16.417s

Still the same 2 sec of overhead for the client/server round trips. So batched statahead reduced that overhead from 8 sec to 2 sec, which is a 4x speedup.

Comment by Shuichi Ihara [ 03/Dec/22 ]

It might not be related to LU-14139 directly, but I am still interested in why Lustre takes 25 sec for the second 'ls -l'.

It looks like the client is still sending 125k x 8=1M glimpse RPCs to the OSTs to fetch the file size. I guess this is because the client is not caching all 1M of the OST DLM locks on the objects between "ls -l" calls? Did you check OSC LRU size? I suspect this would have a sudden jump in speed once the number of files is small enough that all the OST DLM locks are cached. 

As far as I observed, all locks seem to be kept; nothing shrinks or re-grants locks between the two 'ls -l' runs.

[root@ec01 ~]# lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size
ldlm.namespaces.MGC10.0.11.208@o2ib12.lock_count=4
ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.MGC10.0.11.208@o2ib12.lru_size=3200
ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lru_size=0

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m27.249s
user	0m9.111s
sys	0m13.544s

[root@ec01 ~]# lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size
ldlm.namespaces.MGC10.0.11.208@o2ib12.lock_count=4
ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lock_count=2
ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lock_count=4
ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lock_count=1000004
ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lock_count=125089
ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lock_count=125073
ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lock_count=125091
ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lock_count=125085
ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lock_count=125295
ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lock_count=125285
ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lock_count=125103
ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lock_count=125099
ldlm.namespaces.MGC10.0.11.208@o2ib12.lru_size=3200
ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lru_size=2
ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lru_size=4
ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lru_size=1000004
ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lru_size=125089
ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lru_size=125073
ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lru_size=125091
ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lru_size=125085
ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lru_size=125295
ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lru_size=125285
ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lru_size=125103
ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lru_size=125099

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m25.615s
user	0m8.982s
sys	0m16.588s

[root@ec01 ~]# lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size
ldlm.namespaces.MGC10.0.11.208@o2ib12.lock_count=4
ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lock_count=2
ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lock_count=4
ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lock_count=1000004
ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lock_count=0
ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lock_count=125089
ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lock_count=125073
ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lock_count=125091
ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lock_count=125085
ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lock_count=125295
ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lock_count=125285
ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lock_count=125103
ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lock_count=125099
ldlm.namespaces.MGC10.0.11.208@o2ib12.lru_size=3200
ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lru_size=2
ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lru_size=4
ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lru_size=1000004
ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lru_size=0
ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lru_size=125089
ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lru_size=125073
ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lru_size=125091
ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lru_size=125085
ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lru_size=125295
ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lru_size=125285
ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lru_size=125103
ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lru_size=125099

Here is the same 1M-file test on ext4 on the client's local disk.

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null

real	0m16.999s
user	0m8.956s
sys	0m5.855s
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null

real	0m11.832s
user	0m8.765s
sys	0m3.051s

I'm also attaching the flame graph (ls.svg), captured during the second 'ls -l' on Lustre.

Comment by Shuichi Ihara [ 03/Dec/22 ]

There are no LNet messages going out during the 2nd 'ls -l'.
I've opened a separate ticket for further investigation: https://jira.whamcloud.com/browse/LU-16365

Comment by Qian Yingjin [ 03/Dec/22 ]

From the attached flame graph, the big time costs are:
__ldlm_handle2lock() -> class_handle2object()
ldlm_resource_get() -> cfs_hash_bd_lookup_intent() -> ldlm_res_hop_keycmp()

Each of these costs about 5% for data and for metadata DLM lock matching on the client side; all together they account for about 20% of the total samples...
It seems the hash lookups for the lock handle and the resource take a lot of time.
ldlm_res_hop_keycmp() reaching 5% means there are many elements in the same bucket of the hash table. We should enlarge the hash table and use a resizable hash table.

For class_handle2object() on locks, we should maintain the lock handles per target (osc/mdc) lock namespace rather than sharing a single global handle hash table on the client, or even on the server.

Comment by Andreas Dilger [ 03/Dec/22 ]

Is there a patch that converts ldlm to use rhashtable?

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/40720/
Subject: LU-14139 statahead: batched statahead processing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4435d0121f72aac3ad01c98a33b265a496359890

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/40943/
Subject: LU-14139 statahead: add stats for batch RPC requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a20f25d24b5f0ce7b5e77f7c596bffd0450cbdae

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/40945/
Subject: LU-14139 ptlrpc: grow reply buffer properly for batch request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5edb883b44ac707528ce2c0bc812d65b9ffb4a50

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/43707/
Subject: LU-14139 ptlrpc: grow PtlRPC properly when prepare sub request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5a2dfd36f9c2b6c10ab7ba44b0e9e86372623fde

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/41220/
Subject: LU-14139 statahead: add test for batch statahead processing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cd1e7bdd903ada4467064eaf8613ec62a358fa1c

Comment by Peter Jones [ 22/Apr/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:07:10 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.