[LU-14139] batched statahead processing  Created: 17/Nov/20  Updated: 07/Jul/23  Resolved: 22/Apr/23
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Description |
Batched metadata processing may get a big performance boost.
| Comments |
| Comment by Gerrit Updater [ 19/Nov/20 ] |
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40712
| Comment by Gerrit Updater [ 20/Nov/20 ] |
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40720
| Comment by Gerrit Updater [ 11/Dec/20 ] |
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40943
| Comment by Gerrit Updater [ 11/Dec/20 ] |
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40945
| Comment by Gerrit Updater [ 14/Jan/21 ] |
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/41220
| Comment by Shuichi Ihara [ 21/Jan/21 ] |
Here are benchmark results for the batched statahead patch:

# clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
# time ls -l /ai400x/dir/test-dir.0-0/mdtest_tree.0

This is a great improvement: the patch demonstrated a 6x speedup (180 s down to 30 s) for the "ls -l" workload.
| Comment by Qian Yingjin [ 21/Jan/21 ] |
Verified that it works with the command "du -a $dir":

mkdir $dir || error "failed to mkdir $dir"
touch $dir/$tfile.{0..9}
$LCTL set_param mdc.*.batch_stats=0
lctl get_param -n llite.*.statahead_stats
strace -o output du -a $dir
lctl get_param -n llite.*.statahead_stats
lctl get_param -n mdc.*.batch_stats

The output of the test script:

mdc.lustre-MDT0000-mdc-ffff97f3bde68000.batch_stats=0
statahead total: 0
statahead wrong: 0
agl total: 0
0 /mnt/lustre/d123d.sanity/f123d.sanity.6
0 /mnt/lustre/d123d.sanity/f123d.sanity.2
0 /mnt/lustre/d123d.sanity/f123d.sanity.5
0 /mnt/lustre/d123d.sanity/f123d.sanity.3
0 /mnt/lustre/d123d.sanity/f123d.sanity.7
0 /mnt/lustre/d123d.sanity/f123d.sanity.9
0 /mnt/lustre/d123d.sanity/f123d.sanity.4
0 /mnt/lustre/d123d.sanity/f123d.sanity.8
0 /mnt/lustre/d123d.sanity/f123d.sanity.1
0 /mnt/lustre/d123d.sanity/f123d.sanity.0
4 /mnt/lustre/d123d.sanity
statahead total: 1
statahead wrong: 0
agl total: 1
snapshot_time: 1611218329.88052008 (secs.nsecs)
subreqs per batch  batchs  %  cum %
1:  1  50   50
2:  0   0   50
4:  0   0   50
8:  1  50  100
| Comment by Andreas Dilger [ 21/Jan/21 ] |
Yingjin, for the next refresh of this patch, please include the benchmark results from Shuichi in the commit message of the patch. Note that different versions of userspace tools (e.g. du, find) may behave differently between e.g. RHEL7 and RHEL8, or when run as root vs. a regular user, so it is always useful to include the version of the utility that is working or not.
| Comment by Shuichi Ihara [ 21/Jan/21 ] |
See and try the following test cases. Something is still wrong: batching doesn't trigger properly as the number of files in the directory increases.

# mkdir dir0 dir1 dir2
# for i in `seq 0 100`; do touch dir0/file.$i; done
# for i in `seq 0 1000`; do touch dir1/file.$i; done
# for i in `seq 0 10000`; do touch dir2/file.$i; done

100 files:

# lctl set_param mdc.*MDT0000*.batch_stats=clear; lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=1024; clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=clear
llite.ai400x-ffff97559d533800.statahead_max=1024
llite.ai400x-ffff97559d533800.statahead_batch_max=1024
# du -a /ai400x/dir0 > /dev/null; lctl get_param mdc.*MDT0000*.batch_stats
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=
snapshot_time: 1611227271.35992632 (secs.nsecs)
subreqs per batch  batchs  %  cum %
1:  0   0    0
2:  0   0    0
4:  0   0    0
8:  1  20   20
16: 1  20   40
32: 1  20   60
64: 2  40  100

1000 files:

# lctl set_param mdc.*MDT0000*.batch_stats=clear; lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=1024; clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=clear
llite.ai400x-ffff97559d533800.statahead_max=1024
llite.ai400x-ffff97559d533800.statahead_batch_max=1024
# du -a /ai400x/dir1 > /dev/null; lctl get_param mdc.*MDT0000*.batch_stats
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=
snapshot_time: 1611227330.733571614 (secs.nsecs)
subreqs per batch  batchs  %  cum %
1:    0   0    0
2:    0   0    0
4:    0   0    0
8:    1  25   25
16:   1  25   50
32:   1  25   75
64:   0   0   75
128:  0   0   75
256:  0   0   75
512:  0   0   75
1024: 1  25  100

10000 files:

# lctl set_param mdc.*MDT0000*.batch_stats=clear; lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=1024; clush -w es400nvx1-vm[1-4],ec[01-40] "echo 3 > /proc/sys/vm/drop_caches"
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=clear
llite.ai400x-ffff97559d533800.statahead_max=1024
llite.ai400x-ffff97559d533800.statahead_batch_max=1024
# du -a /ai400x/dir2 > /dev/null; lctl get_param mdc.*MDT0000*.batch_stats
mdc.ai400x-MDT0000-mdc-ffff97559d533800.batch_stats=
snapshot_time: 1611227444.263062015 (secs.nsecs)
subreqs per batch  batchs  %  cum %
1:  0  0  0
| Comment by Qian Yingjin [ 21/Jan/21 ] |
I think I found the reason why statahead does not trigger.

9 dentries:

0 /mnt/lustre/d123d.sanity/f123d.sanity.6
0 /mnt/lustre/d123d.sanity/f123d.sanity.2
0 /mnt/lustre/d123d.sanity/f123d.sanity.5
0 /mnt/lustre/d123d.sanity/f123d.sanity.3
0 /mnt/lustre/d123d.sanity/f123d.sanity.7
0 /mnt/lustre/d123d.sanity/f123d.sanity.9
0 /mnt/lustre/d123d.sanity/f123d.sanity.4
0 /mnt/lustre/d123d.sanity/f123d.sanity.8
0 /mnt/lustre/d123d.sanity/f123d.sanity.1
0 /mnt/lustre/d123d.sanity/f123d.sanity.0
4 /mnt/lustre/d123d.sanity

10000 dentries:

newfstatat(4, "f123d.sanity.1", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.2", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.3", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.4", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.5", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.6", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.7", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.8", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.9", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.10", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.11", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.12", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(4, "f123d.sanity.13", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
...

When the entry count within a directory is larger than 10000, du first sorts by file name. I thought that if the file names follow a regular pattern, e.g. mdtest.$(i+step), we could optimize statahead for that case as well.
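The pattern-based prediction idea above could be sketched in userspace terms roughly as follows. This is only a hypothetical illustration of the heuristic (detect a common prefix plus an arithmetic index, then predict upcoming names); the function name and structure are made up and do not reflect the kernel implementation:

```python
import re

def predict_next_names(seen, count=8):
    """Given recently stat'ed names like 'mdtest.10', 'mdtest.20', ...,
    detect a fixed prefix plus an arithmetic index and predict the next
    'count' names.  Returns [] when no such pattern is present."""
    # Split each name into a prefix and a trailing integer index.
    matches = [re.match(r"^(.*?)(\d+)$", name) for name in seen]
    if len(matches) < 3 or not all(matches):
        return []
    prefixes = {m.group(1) for m in matches}
    indices = [int(m.group(2)) for m in matches]
    # The pattern holds only if the prefix is constant and the step is fixed.
    steps = {b - a for a, b in zip(indices, indices[1:])}
    if len(prefixes) != 1 or len(steps) != 1:
        return []
    prefix, step, last = prefixes.pop(), steps.pop(), indices[-1]
    if step == 0:
        return []
    return [f"{prefix}{last + step * i}" for i in range(1, count + 1)]

print(predict_next_names(["mdtest.1", "mdtest.2", "mdtest.3"], 3))
# -> ['mdtest.4', 'mdtest.5', 'mdtest.6']
```

In the kernel the equivalent check would feed the predicted names into the statahead thread instead of relying on readdir order.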
| Comment by Shuichi Ihara [ 21/Jan/21 ] |
Okay, then "ls -l dir/*" could also be optimized if the filenames follow a pattern. This is very useful for ML/AI workloads, where ingested data typically follows a filename convention within a directory.
| Comment by Gerrit Updater [ 14/May/21 ] |
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/43707
| Comment by Gerrit Updater [ 08/Jul/21 ] |
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40712/
| Comment by Qian Yingjin [ 04/Aug/21 ] |
Hi Andreas, SiYao,
Any ideas or suggestions about how to reproduce the hang caused by
Thanks,
| Comment by Gerrit Updater [ 04/Aug/21 ] |
| Comment by Qian Yingjin [ 05/Aug/21 ] |
After discussing with Lai, I found the following stack dump:

schedule+0x38/0xa0
[24373.096771] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24373.097828] ? finish_wait+0x80/0x80
[24373.098595] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24373.099578] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
[24373.100929] ? ll_md_need_convert+0x160/0x160 [lustre]
[24373.101985] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[24373.103062] mdc_enqueue_base+0x335/0x1340 [mdc]
[24373.103996] ? mdc_revalidate_lock+0x1a4/0x1d0 [mdc]
[24373.104972] mdc_intent_lock+0x210/0x530 [mdc]
[24373.105872] ? ll_md_need_convert+0x160/0x160 [lustre]
[24373.106904] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
[24373.108080] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
[24373.109211] lmv_revalidate_slaves+0x607/0x9e0 [lmv]
[24373.110236] ? ll_md_need_convert+0x160/0x160 [lustre]
[24373.111258] lmv_merge_attr+0x32/0x170 [lmv]
[24373.112183] ll_update_lsm_md+0xb10/0xe60 [lustre]
[24373.113163] ll_update_inode+0x41f/0x650 [lustre]
[24373.114109] ll_read_inode2+0x4e/0x3a0 [lustre]
[24373.115023] ll_iget+0xbd/0x320 [lustre]
[24373.115843] ll_prep_inode+0x374/0x990 [lustre]
[24373.116754] ? mdc_revalidate_lock+0x5b/0x1d0 [mdc]
[24373.117739] ? lprocfs_counter_add+0xd2/0x140 [obdclass]
[24373.118798] ll_statahead_interpret+0x51b/0xb90 [lustre]

In the statahead interpret callback, new RPCs are generated for a striped directory: lmv_merge_attr() -> lmv_revalidate_slaves() -> mdc_intent_lock(). This can exhaust the RPC slots. Which approach should we use? Any suggestions?
| Comment by Andreas Dilger [ 05/Aug/21 ] |
I don't have a good idea of how complex each solution is. The first one sounds like it would provide better performance (not causing a stall in the statahead pipeline), but for now the simpler/more robust solution is best, so that you can move ahead with the other statahead patches.
| Comment by Qian Yingjin [ 23/Nov/21 ] |
After the patch, I think it solves the problem of exhausting the RPC slots. The old patch should no longer cause the hang waiting for RPC slots.
| Comment by Gerrit Updater [ 23/Nov/21 ] |
"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45648
| Comment by Gerrit Updater [ 26/Jan/22 ] |
"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46309
| Comment by Gerrit Updater [ 19/Aug/22 ] |
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45648/
| Comment by Gerrit Updater [ 19/Aug/22 ] |
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46309/
| Comment by Alex Zhuravlev [ 06/Sep/22 ] |
I'm hitting a freeze in replay-dual/26 quite often; bisection gives the following stats:

COMMIT     TESTED PASSED FAILED STATUS COMMIT DESCRIPTION
76c3fa96dc 10     5      4      BAD    LU-16082 ldiskfs: old-style EA inode handling fix
2447564e12 6      1      5      BAD    LU-16011 lnet: use preallocate bulk for server
3f0bee2502 6      2      4      BAD    LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]
cf35c54224 5      3      2      BAD    LU-14745 tests: ensure sanity/51d has enough objects
bbe5e98180 3      0      3      BAD    LU-15874 kernel: new kernel [RHEL 9.0 5.14.0-70.22.1.el9_0]
1ef1fa06b2 7      3      4      BAD    LU-16078 o2iblnd: Salt comp_vector
0109cee261 10     4      6      BAD    LU-13813 tests: fix stack_trap in conf-sanity test 110/111
b9167201a0 3      0      3      BAD    LU-14139 statahead: add total hit/miss count stats
509d7305ce 6      4      1      BAD    LU-14139 llite: simplify callback handling for async getattr
73ac8e35e5 10     10     0      GOOD   LU-16061 osd-ldiskfs: clear EXTENT_FL for symlink agent inode
907dc0a2d3 10     10     0      GOOD   LU-16060 osd-ldiskfs: copy nul byte terminator in writelink
33efefc496 10     10     0      GOOD   LU-16037 build: remove quotes from %{mkconf_options}
| Comment by Qian Yingjin [ 07/Sep/22 ] |
Hi Alex,
Thanks,
| Comment by Alex Zhuravlev [ 07/Sep/22 ] |
Just local testing; these are the unique stack traces:

schedule,osc_extent_wait,osc_cache_wait_range,osc_io_fsync_end,cl_io_end,lov_io_end_wrapper,lov_io_fsync_end,cl_io_end,cl_io_loop,cl_sync_file_range,ll_writepages,do_writepages,__writeback_single_inode,writeback_sb_inodes,wb_writeback,wb_workfn,process_one_work,worker_thread
PIDs(1): "kworker/u4:3":190

schedule,osc_extent_wait,osc_cache_wait_range,osc_io_fsync_end,cl_io_end,lov_io_end_wrapper,lov_io_fsync_end,cl_io_end,cl_io_loop,cl_sync_file_range,vvp_prune,cl_object_prune,lov_conf_set,cl_conf_set,cl_file_inode_init,ll_update_inode,ll_iget,ll_prep_inode,ll_statahead_interpret_common,ll_statahead_interpret,mdc_intent_getattr_async_interpret,ptlrpc_check_set,ptlrpcd
PIDs(2): "ptlrpcd_00_00":3041 "ptlrpcd_00_01":3042

schedule,obd_get_request_slot,ldlm_cli_enqueue,mdc_enqueue_base,mdc_intent_lock,lmv_intent_lock,ll_lookup_it.constprop.4,ll_lookup_nd,__lookup_slow,lookup_slow,walk_component,path_lookupat,filename_lookup,vfs_statx,__se_sys_newstat
PIDs(1): "dbench":31243

schedule,rwsem_down_read_slowpath,down_read,lov_io_init,cl_io_init0,cl_sync_file_range,ll_writepages,do_writepages,__writeback_single_inode,writeback_sb_inodes,__writeback_inodes_wb,wb_writeback,get_nr_inodes,wb_workfn,_raw_spin_unlock_irq,process_one_work,worker_thread
PIDs(1): "kworker/u4:4":31792

schedule,wait_for_common,osc_io_setattr_end,cl_io_end,lov_io_end_wrapper,lov_io_call,lov_io_end,cl_io_end,cl_io_loop,cl_setattr_ost,ll_setattr_raw,notify_change,utimes_common,do_utimes,__se_sys_utimensat
PIDs(1): "tar":33454
| Comment by Shuichi Ihara [ 03/Dec/22 ] |
I re-ran the "ls -l" test (1M x 47001-byte files in a single directory) with the latest patch series https://review.whamcloud.com/#/c/fs/lustre-release/+/49291/

[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real 0m33.757s
user 0m9.363s
sys 0m14.982s

patched:

[root@ec01 ~]# lctl set_param llite.*.statahead_max=1024 llite.*.statahead_batch_max=64
[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real 0m27.296s
user 0m9.089s
sys 0m13.051s

The elapsed time dropped from 33 s to 27 s with the patch.

subreqs per batch  batchs  %  cum %
1:  92322  76   76
2:   4742   3   80
4:    115   0   80
8:    657   0   80
16: 10736   8   89
32:   107   0   89
64: 12168  10  100

As far as I can see from batch_stats, more than 950K (64 * 12168 + 32 * 107 + 16 * 10736) of the 1M requests were batched with aggregations of 16 or more requests. Maybe this is not so bad? However, we are still spending more than 27 seconds here.

Here are additional tests, something like "drop cache; ls -l; ls -l":

[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real 0m27.385s
user 0m8.994s
sys 0m13.131s

llite.exafs-ffff9b96b8157800.stats=
snapshot_time 482063.475018642 secs.nsecs
start_time 0.000000000 secs.nsecs
elapsed_time 482063.475018642 secs.nsecs
open 1 samples [usecs] 154 154 154 23716
close 1 samples [usecs] 131 131 131 17161
readdir 1223 samples [usecs] 0 38879 258067 1731301375
getattr 1000002 samples [usecs] 2 9770 5776755 3873882865
getxattr 1 samples [usecs] 70 70 70 4900
inode_permission 10000021 samples [usecs] 0 393 1848868 2164824
opencount 1 samples [reqs] 1 1 1 1
mdc.exafs-MDT0002-mdc-ffff9b96b8157800.stats=
snapshot_time 482063.474876615 secs.nsecs
start_time 0.000000000 secs.nsecs
elapsed_time 482063.474876615 secs.nsecs
req_waittime 181839 samples [usecs] 57 2043896 22268808747 10620189011411911
req_active 181839 samples [reqs] 1 23615 644666200 9255887581600
ldlm_ibits_enqueue 6 samples [reqs] 1 1 6 6
mds_close 1 samples [usecs] 110 110 110 12100
mds_readpage 56 samples [usecs] 250 38375 464045 5002627263
mds_batch 119274 samples [usecs] 61 84527 353967107 2052370493893
ldlm_cancel 62501 samples [usecs] 182 725561 21912324500 10613954056275484
obd_ping 1 samples [usecs] 90 90 90 8100
osc.exafs-OST0000-osc-ffff9b96b8157800.stats= snapshot_time 482063.474252382 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474252382 secs.nsecs req_waittime 132963 samples [usecs] 29 408417 1867410218 500106174718090 req_active 132963 samples [reqs] 1 3612 17237320 44241112018 ldlm_glimpse_enqueue 125139 samples [reqs] 1 1 125139 125139 ost_connect 1 samples [usecs] 97 97 97 9409 ost_disconnect 1 samples [usecs] 253 253 253 64009 ldlm_cancel 7822 samples [usecs] 263 408417 1857609504 500103314555706
osc.exafs-OST0001-osc-ffff9b96b8157800.stats= snapshot_time 482063.474322305 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474322305 secs.nsecs req_waittime 132958 samples [usecs] 31 400200 1868293807 500417445772075 req_active 132958 samples [reqs] 1 3620 17260697 44413040365 ldlm_glimpse_enqueue 125134 samples [reqs] 1 1 125134 125134 ost_connect 1 samples [usecs] 134 134 134 17956 ost_disconnect 1 samples [usecs] 255 255 255 65025 ldlm_cancel 7822 samples [usecs] 249 400200 1858535688 500414552572698
osc.exafs-OST0002-osc-ffff9b96b8157800.stats= snapshot_time 482063.474385742 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474385742 secs.nsecs req_waittime 132922 samples [usecs] 30 403354 1585074597 373963525463297 req_active 132922 samples [reqs] 1 3139 14993568 33140340186 ldlm_glimpse_enqueue 125100 samples [reqs] 1 1 125100 125100 ost_connect 1 samples [usecs] 104 104 104 10816 ost_disconnect 1 samples [usecs] 229 229 229 52441 ldlm_cancel 7820 samples [usecs] 137 403354 1576064547 373959847577499
osc.exafs-OST0003-osc-ffff9b96b8157800.stats= snapshot_time 482063.474439284 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474439284 secs.nsecs req_waittime 132923 samples [usecs] 29 426088 1586055175 373991604658201 req_active 132923 samples [reqs] 1 3131 15010819 33188857463 ldlm_glimpse_enqueue 125100 samples [reqs] 1 1 125100 125100 ost_connect 1 samples [usecs] 140 140 140 19600 ost_disconnect 1 samples [usecs] 230 230 230 52900 ldlm_cancel 7821 samples [usecs] 188 426088 1576987697 373987536028513
osc.exafs-OST0004-osc-ffff9b96b8157800.stats= snapshot_time 482063.474493870 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474493870 secs.nsecs req_waittime 133220 samples [usecs] 30 402683 1938227551 496692342133501 req_active 133220 samples [reqs] 1 3741 17164793 41132712891 ldlm_glimpse_enqueue 125381 samples [reqs] 1 1 125381 125381 ost_connect 1 samples [usecs] 106 106 106 11236 ost_disconnect 1 samples [usecs] 236 236 236 55696 ldlm_cancel 7837 samples [usecs] 327 402683 1791475030 496280422422484
osc.exafs-OST0005-osc-ffff9b96b8157800.stats= snapshot_time 482063.474549629 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474549629 secs.nsecs req_waittime 133213 samples [usecs] 29 393951 1936411387 496417875801293 req_active 133213 samples [reqs] 1 3740 17146480 41076362838 ldlm_glimpse_enqueue 125376 samples [reqs] 1 1 125376 125376 ost_connect 1 samples [usecs] 118 118 118 13924 ost_disconnect 1 samples [usecs] 199 199 199 39601 ldlm_cancel 7835 samples [usecs] 315 393951 1790421213 496012032268869
osc.exafs-OST0006-osc-ffff9b96b8157800.stats= snapshot_time 482063.474603969 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474603969 secs.nsecs req_waittime 132941 samples [usecs] 27 394743 1649085257 398814721042191 req_active 132941 samples [reqs] 1 3382 15661909 36181334593 ldlm_glimpse_enqueue 125116 samples [reqs] 1 1 125116 125116 ost_connect 1 samples [usecs] 99 99 99 9801 ost_disconnect 1 samples [usecs] 187 187 187 34969 ldlm_cancel 7823 samples [usecs] 347 394743 1639760732 398808893873090
osc.exafs-OST0007-osc-ffff9b96b8157800.stats= snapshot_time 482063.474659084 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482063.474659084 secs.nsecs req_waittime 132954 samples [usecs] 31 392261 1644646149 397155067488985 req_active 132954 samples [reqs] 1 3364 15624468 36037631194 ldlm_glimpse_enqueue 125131 samples [reqs] 1 1 125131 125131 ost_connect 1 samples [usecs] 121 121 121 14641 ost_disconnect 1 samples [usecs] 169 169 169 28561

In fact, we still spend 25 seconds even without dropping caches; there were almost no requests from client to servers.

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real 0m25.309s
user 0m8.937s
sys 0m16.327s

llite.exafs-ffff9b96b8157800.stats=
snapshot_time 482212.579971327 secs.nsecs
start_time 0.000000000 secs.nsecs
elapsed_time 482212.579971327 secs.nsecs
open 1 samples [usecs] 189 189 189 35721
close 1 samples [usecs] 159 159 159 25281
readdir 1223 samples [usecs] 0 149 171544 24104442
getattr 1000002 samples [usecs] 3 34 9690974 103987264
getxattr 1 samples [usecs] 2 2 2 4
getxattr_hits 1 samples [reqs]
inode_permission 10000021 samples [usecs] 0 17 2009221 2041697
opencount 1 samples [reqs] 1 1 1 1
openclosetime 1 samples [usecs] 102201708 102201708 102201708 10445189118117264
mdc.exafs-MDT0002-mdc-ffff9b96b8157800.stats=
snapshot_time 482212.579840759 secs.nsecs
start_time 0.000000000 secs.nsecs
elapsed_time 482212.579840759 secs.nsecs
req_waittime 5 samples [usecs] 67 132 484 50542
req_active 5 samples [reqs] 1 1 5 5
ldlm_ibits_enqueue 1 samples [reqs] 1 1 1 1
mds_close 1 samples [usecs] 126 126 126 15876
obd_ping 3 samples [usecs] 67 87 226 17242
osc.exafs-OST0000-osc-ffff9b96b8157800.stats= snapshot_time 482212.579286532 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579286532 secs.nsecs req_waittime 3 samples [usecs] 76 83 235 18441 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 76 83 235 18441
osc.exafs-OST0001-osc-ffff9b96b8157800.stats= snapshot_time 482212.579346652 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579346652 secs.nsecs req_waittime 3 samples [usecs] 69 78 222 16470 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 69 78 222 16470
osc.exafs-OST0002-osc-ffff9b96b8157800.stats= snapshot_time 482212.579396346 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579396346 secs.nsecs req_waittime 3 samples [usecs] 50 61 164 9030 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 50 61 164 9030
osc.exafs-OST0003-osc-ffff9b96b8157800.stats= snapshot_time 482212.579440245 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579440245 secs.nsecs req_waittime 3 samples [usecs] 57 79 196 13090 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 57 79 196 13090
osc.exafs-OST0004-osc-ffff9b96b8157800.stats= snapshot_time 482212.579489618 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579489618 secs.nsecs req_waittime 3 samples [usecs] 76 84 240 19232 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 76 84 240 19232
osc.exafs-OST0005-osc-ffff9b96b8157800.stats= snapshot_time 482212.579536314 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579536314 secs.nsecs req_waittime 3 samples [usecs] 76 82 236 18584 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 76 82 236 18584
osc.exafs-OST0006-osc-ffff9b96b8157800.stats= snapshot_time 482212.579579660 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579579660 secs.nsecs req_waittime 3 samples [usecs] 46 56 158 8388 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 46 56 158 8388
osc.exafs-OST0007-osc-ffff9b96b8157800.stats= snapshot_time 482212.579628342 secs.nsecs start_time 0.000000000 secs.nsecs elapsed_time 482212.579628342 secs.nsecs req_waittime 3 samples [usecs] 60 71 198 13130 req_active 3 samples [reqs] 1 1 3 3 obd_ping 3 samples [usecs] 60 71 198 13130

So, somehow, "ls -l" for 1M files takes 25 seconds even when everything comes from caches.
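The back-of-envelope aggregation estimate quoted above (more than 950K of the 1M requests batched with 16 or more sub-requests) can be reproduced from the batch_stats histogram. Note this treats every batch in a bucket as carrying exactly the bucket size of sub-requests, so it is a rough sketch of the arithmetic, not an exact count:

```python
# batch_stats histogram from the patched "ls -l" run above:
# bucket size -> number of batches observed in that bucket.
hist = {1: 92322, 2: 4742, 4: 115, 8: 657, 16: 10736, 32: 107, 64: 12168}

# Sub-requests carried by batches of 16 or more (the figure cited above).
big = sum(size * batches for size, batches in hist.items() if size >= 16)
print(big)  # -> 953952, i.e. "more than 950K"
```

The dominant contribution is the 64-subrequest bucket (64 * 12168 = 778752), which is why raising statahead_batch_max matters for large directories.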
| Comment by Andreas Dilger [ 03/Dec/22 ] |
It looks like the client is still sending 125k x 8 = 1M glimpse RPCs to the OSTs to fetch the file size. I guess this is because the client is not caching all 1M of the OST DLM locks on the objects between "ls -l" calls? Did you check the OSC LRU size? I suspect this would show a sudden jump in speed once the number of files is small enough that all the OST DLM locks are cached.
| Comment by Shuichi Ihara [ 03/Dec/22 ] |
'ls -l' does other things after receiving the results: sorting, writing output, etc. Sorting has some impact, but the majority of the time is still spent on the 1M statx() and getxattr() calls, even with all data in the cache.

1670051344.758188 getxattr("/exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/file.mdtest.0.945973", "system.posix_acl_access", NULL, 0) = -1 ENODATA (No data available) <0.000008>
1670051344.758214 statx(AT_FDCWD, "/exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/file.mdtest.0.877046", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_MODE|STATX_NLINK|STATX_MTIME|STATX_SIZE, {stx_mask=STATX_MODE|STATX_NLINK|STATX_MTIME|STATX_SIZE, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=47001, ...}) = 0 <0.000013>

Just in case, here are test results with and without caches, but with ls not sorting the result list (-f):

[root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -fl /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real 0m20.023s
user 0m1.903s
sys 0m13.363s
[sihara@ec01 ~]$ time ls -fl /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
real 0m18.164s
user 0m1.703s
sys 0m16.417s

There is still the same 2 s of client/server round-trip overhead. So batched statahead reduced that overhead from 8 s to 2 s, which is a 4x speedup.
| Comment by Shuichi Ihara [ 03/Dec/22 ] | ||||||||||||||||||||||||||||||||
|
it might be not related to
As far as I observed, all locks seems to be kept as well. nothing shrinks or re-granting locks between two 'ls -l'. [root@ec01 ~]# lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size ldlm.namespaces.MGC10.0.11.208@o2ib12.lock_count=4 ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.MGC10.0.11.208@o2ib12.lru_size=3200 ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lru_size=0 [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null real 0m27.249s user 0m9.111s sys 0m13.544s [root@ec01 ~]# lctl get_param 
ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size ldlm.namespaces.MGC10.0.11.208@o2ib12.lock_count=4 ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lock_count=2 ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lock_count=4 ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lock_count=1000004 ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lock_count=125089 ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lock_count=125073 ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lock_count=125091 ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lock_count=125085 ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lock_count=125295 ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lock_count=125285 ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lock_count=125103 ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lock_count=125099 ldlm.namespaces.MGC10.0.11.208@o2ib12.lru_size=3200 ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lru_size=2 ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lru_size=4 ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lru_size=1000004 ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lru_size=125089 ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lru_size=125073 ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lru_size=125091 ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lru_size=125085 ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lru_size=125295 ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lru_size=125285 ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lru_size=125103 ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lru_size=125099 [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null real 0m25.615s user 0m8.982s sys 0m16.588s [root@ec01 ~]# lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size 
ldlm.namespaces.MGC10.0.11.208@o2ib12.lock_count=4 ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lock_count=2 ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lock_count=4 ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lock_count=1000004 ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lock_count=0 ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lock_count=125089 ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lock_count=125073 ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lock_count=125091 ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lock_count=125085 ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lock_count=125295 ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lock_count=125285 ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lock_count=125103 ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lock_count=125099 ldlm.namespaces.MGC10.0.11.208@o2ib12.lru_size=3200 ldlm.namespaces.exafs-MDT0000-mdc-ffff9b96b8157800.lru_size=2 ldlm.namespaces.exafs-MDT0001-mdc-ffff9b96b8157800.lru_size=4 ldlm.namespaces.exafs-MDT0002-mdc-ffff9b96b8157800.lru_size=1000004 ldlm.namespaces.exafs-MDT0003-mdc-ffff9b96b8157800.lru_size=0 ldlm.namespaces.exafs-OST0000-osc-ffff9b96b8157800.lru_size=125089 ldlm.namespaces.exafs-OST0001-osc-ffff9b96b8157800.lru_size=125073 ldlm.namespaces.exafs-OST0002-osc-ffff9b96b8157800.lru_size=125091 ldlm.namespaces.exafs-OST0003-osc-ffff9b96b8157800.lru_size=125085 ldlm.namespaces.exafs-OST0004-osc-ffff9b96b8157800.lru_size=125295 ldlm.namespaces.exafs-OST0005-osc-ffff9b96b8157800.lru_size=125285 ldlm.namespaces.exafs-OST0006-osc-ffff9b96b8157800.lru_size=125103 ldlm.namespaces.exafs-OST0007-osc-ffff9b96b8157800.lru_size=125099 Here is same 1M files tests for ext4 on local disk of client. 
[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches

[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m16.999s
user    0m8.956s
sys     0m5.855s

[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m11.832s
user    0m8.765s
sys     0m3.051s

I'm also attaching a FlameGraph captured during the second 'ls -l' on Lustre.
| Comment by Shuichi Ihara [ 03/Dec/22 ] | ||||||||||||||||||||||||||||||||
|
There are no LNet messages going out during the second 'ls -l'. | ||||||||||||||||||||||||||||||||
| Comment by Qian Yingjin [ 03/Dec/22 ] | ||||||||||||||||||||||||||||||||
|
From the attached FlameGraph, much of the time is spent in DLM lock matching: data and metadata lock matching each cost about 5% on the client side, and together they account for about 20% of the total samples... As for class_handle2object() for locks, we should maintain lock handles in a per-target (osc/mdc) lock namespace instead of sharing a single global handle hash table on the client, or even on the server. | ||||||||||||||||||||||||||||||||
| Comment by Andreas Dilger [ 03/Dec/22 ] | ||||||||||||||||||||||||||||||||
|
Is there a patch that converts ldlm to use rhashtable? | ||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 11/Apr/23 ] | ||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/40720/ | ||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 22/Apr/23 ] | ||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/40943/ | ||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 22/Apr/23 ] | ||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/40945/ | ||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 22/Apr/23 ] | ||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/43707/ | ||||||||||||||||||||||||||||||||
| Comment by Gerrit Updater [ 22/Apr/23 ] | ||||||||||||||||||||||||||||||||
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/41220/ | ||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 22/Apr/23 ] | ||||||||||||||||||||||||||||||||
|
Landed for 2.16 |