[LU-16365] cached 'ls -l' is slow    Created: 03/Dec/22    Updated: 31/Jan/24

| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
While testing:

# echo 3 > /proc/sys/vm/drop_caches
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/

In theory, when the 1st 'ls -l' finishes, the client keeps the data, metadata, and locks in its cache, so the 2nd 'ls -l' should be served entirely from that cache. Here are the 'ls -l' results for 1M files in a single directory.

[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 0m27.385s
user 0m8.994s
sys 0m13.131s

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 0m25.309s
user 0m8.937s
sys 0m16.327s

No RPCs go out during the 2nd 'ls -l' below: I saw only 16 LNET messages on the 2nd 'ls -l' against 1.1M LNET messages on the 1st, yet the elapsed time is almost the same. Most of the time cost is in 'ls' itself and on the Lustre client side.

[root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 " echo 3 > /proc/sys/vm/drop_caches "
[root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
send_count: 0
recv_count: 0
drop_count: 0
send_count: 65363661
recv_count: 62095891
drop_count: 1
real 0m26.145s
user 0m9.070s
sys 0m13.552s
send_count: 0
recv_count: 0
drop_count: 0
send_count: 66482277
recv_count: 63233245
drop_count: 1
[root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
send_count: 0
recv_count: 0
drop_count: 0
send_count: 66482277
recv_count: 63233245
drop_count: 1
real 0m25.569s
user 0m8.987s
sys 0m16.537s
send_count: 0
recv_count: 0
drop_count: 0
send_count: 66482293
recv_count: 63233261
drop_count: 1
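
Besides the raw LNET counters, the client's metadata RPC stats give the same answer at a higher level. A minimal sketch of that cross-check, assuming mdc.*.stats exists on this release and is cleared by writing to it:

[root@ec01 ~]# lctl set_param mdc.*.stats=clear
[sihara@ec01 ~]$ ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
[root@ec01 ~]# lctl get_param mdc.*.stats    # near-zero request counts => listing served from cache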
This is the same test for 1M files on local-disk ext4 and on /dev/shm on the client.

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 0m16.999s
user 0m8.956s
sys 0m5.855s

[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 0m11.832s
user 0m8.765s
sys 0m3.051s

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 0m8.296s
user 0m5.465s
sys 0m2.813s

[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 0m8.273s
user 0m5.414s
sys 0m2.847s

Shouldn't Lustre be able to reach performance similar to ext4 and memcache when everything is in the cache?
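
Another way to confirm that the locks really stay cached between runs is to watch the client-side DLM lock counts; a sketch, assuming the standard ldlm namespace parameters:

[root@ec01 ~]# lctl get_param ldlm.namespaces.*mdc*.lock_count   # should stay near 1M after the 1st 'ls -l'
[root@ec01 ~]# lctl get_param ldlm.namespaces.*mdc*.lru_size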
| Comments |
| Comment by Shuichi Ihara [ 03/Dec/22 ] |
In order to simplify the case, the number of files was reduced: 100K x 47001-byte files.

[root@ec01 ~]# lctl set_param llite.*.stats=clear
[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.27    1.180271          11    100001           statx
 31.71    0.556354           5    100002    100002 getxattr
  0.95    0.016659         134       124           getdents64

[root@ec01 ~]# lctl get_param llite.*.stats
llite.exafs-ffff9b96b8157800.stats=
snapshot_time         508528.306272142 secs.nsecs
start_time            0.000000000 secs.nsecs
elapsed_time          508528.306272142 secs.nsecs
open                  1 samples [usecs] 163 163 163 26569
close                 1 samples [usecs] 107 107 107 11449
readdir               124 samples [usecs] 0 14912 31175 224577441
getattr               100002 samples [usecs] 2 565 432562 2313162
getxattr              1 samples [usecs] 88 88 88 7744
inode_permission      1000021 samples [usecs] 0 314 195050 299488
opencount             1 samples [reqs] 1 1 1 1

[root@ec01 ~]# lctl set_param llite.*.stats=clear
[sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.73    1.142941          11    100001           statx
 33.22    0.577662           5    100002    100002 getxattr
  0.93    0.016237         130       124           getdents64

[root@ec01 ~]# lctl get_param llite.*.stats
llite.exafs-ffff9b96b8157800.stats=
snapshot_time         508578.202659913 secs.nsecs
start_time            0.000000000 secs.nsecs
elapsed_time          508578.202659913 secs.nsecs
open                  1 samples [usecs] 177 177 177 31329
close                 1 samples [usecs] 124 124 124 15376
readdir               124 samples [usecs] 0 142 16484 2223244
getattr               100002 samples [usecs] 4 21 530944 2862706
getxattr              1 samples [usecs] 2 2 2 4
getxattr_hits         1 samples [reqs]
inode_permission      1000021 samples [usecs] 0 13 191123 193591
opencount             1 samples [reqs] 2 2 2 4
openclosetime         1 samples [usecs] 44568082 44568082 44568082 1986313933158724

In the 1st 'ls -l', the total time cost of the 100K statx() calls is 1.18 sec, of which Lustre spent 0.43 sec in getattr() and 0.19 sec in inode_permission().
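
For comparing runs, the per-call average latency can be derived from each llite stats line, since the fields after [usecs] are min/max/sum/sumsq and field 2 is the call count. A small helper sketch, assuming the line layout shown above:

lctl get_param -n llite.*.stats | \
    awk '$3 == "samples" && $4 == "[usecs]" && NF >= 7 {
        printf "%-18s calls=%-9d total=%8.3fs avg=%6.2fus\n", $1, $2, $7/1e6, $7/$2 }'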
| Comment by Shuichi Ihara [ 03/Dec/22 ] |
Same test, but with 2M x 47001-byte files in a single directory; it clarifies the problem.

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 1m0.494s
user 0m19.378s
sys 0m28.927s

llite.exafs-ffff9b8b77585000.stats=
snapshot_time         545315.042991581 secs.nsecs
start_time            0.000000000 secs.nsecs
elapsed_time          545315.042991581 secs.nsecs
open                  1 samples [usecs] 206 206 206 42436
close                 1 samples [usecs] 141 141 141 19881
readdir               2502 samples [usecs] 0 40943 518532 2304508620
getattr               2048002 samples [usecs] 2 22144 12111659 9279727979
getxattr              1 samples [usecs] 90 90 90 8100
inode_permission      20480021 samples [usecs] 0 206 5211214 8280488
opencount             1 samples [reqs] 1 1 1 1

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 1m6.651s
user 0m19.172s
sys 0m47.345s

llite.exafs-ffff9b8b77585000.stats=
snapshot_time         545428.911309921 secs.nsecs
start_time            0.000000000 secs.nsecs
elapsed_time          545428.911309921 secs.nsecs
open                  1 samples [usecs] 8493 8493 8493 72131049
close                 1 samples [usecs] 181 181 181 32761
readdir               2502 samples [usecs] 0 164 377300 56936396
getattr               2048002 samples [usecs] 4 52 31198624 564334832
getxattr              1 samples [usecs] 2 2 2 4
getxattr_hits         1 samples [reqs]
inode_permission      20480021 samples [usecs] 0 24 4861557 6459227
opencount             1 samples [reqs] 2 2 2 4
openclosetime         1 samples [usecs] 55693604 55693604 55693604 3101777526508816

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real 1m6.656s
user 0m19.146s
sys 0m47.384s

llite.exafs-ffff9b8b77585000.stats=
snapshot_time         545596.209353102 secs.nsecs
start_time            0.000000000 secs.nsecs
elapsed_time          545596.209353102 secs.nsecs
open                  1 samples [usecs] 233 233 233 54289
close                 1 samples [usecs] 188 188 188 35344
readdir               2502 samples [usecs] 0 160 377048 56861048
getattr               2048002 samples [usecs] 4 50 31185467 563959387
getxattr              1 samples [usecs] 2 2 2 4
getxattr_hits         1 samples [reqs]
inode_permission      20480021 samples [usecs] 0 25 4879991 6515725
opencount             1 samples [reqs] 4 4 4 16
openclosetime         1 samples [usecs] 24859044 24859044 24859044 617972068593936

The cached cases (2nd and 3rd 'ls -l') were slower than the non-cached case (1st 'ls -l'): the client spent a total of 31 sec in getattr() against 12 sec in the non-cached case.
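
Since the extra time in the cached runs is pure client-side CPU inside getattr(), a kernel profile of the cached listing should show where those cycles go. A hedged sketch with standard perf, assuming kernel symbols are resolvable on the client:

[root@ec01 ~]# perf record -g -- ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
[root@ec01 ~]# perf report --stdio --sort symbol | head -30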
| Comment by Oleg Drokin [ 05/Dec/22 ] |
That many seconds of user time is strange; is ls sorting things in RAM? Can you do ls -U and see what sort of a difference it makes (see the sketch below)? Also, when everything is cached there is still huge CPU time use in the Lustre case compared to ext4, which, given there are no RPCs, means it is being spent somewhere inside Lustre. And it does not change much between the cached and uncached cases; in fact the cached run takes more system time (13s -> 16s).
ext4, on the other hand, drops its system time from 6s to 3s (user time remains the same).
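
A minimal sketch of the comparison suggested above, separating sort cost (-U disables sorting) from per-file stat cost (plain -U without -l also skips the stat calls); the path is the test directory from the report:

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches    # root; omit for the cached case
[sihara@ec01 ~]$ for opt in -l -lU -U; do time ls $opt /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; done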
| Comment by Andreas Dilger [ 06/Dec/22 ] |
The 19s of user time is caused by "ls -l" sorting the files before printing them, and that remains constant between calls (cached/uncached) and between filesystems (Lustre/ext4), so we can't do anything about that. It looks like almost all of the system time is used by the getattr and inode_permission calls:

sys 0m28.927s (17.32s in getattr + inode_permission)
getattr               2048002 samples [usecs] 2 22144 12111659 9279727979
inode_permission      20480021 samples [usecs] 0 206 5211214 8280488

sys 0m47.345s (36.06s in getattr + inode_permission)
getattr               2048002 samples [usecs] 4 52 31198624 564334832
inode_permission      20480021 samples [usecs] 0 24 4861557 6459227

sys 0m47.384s (36.07s in getattr + inode_permission)
getattr               2048002 samples [usecs] 4 50 31185467 563959387
inode_permission      20480021 samples [usecs] 0 25 4879991 6515725
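
The totals in parentheses are just the [usecs] sums (field 7) of those two lines added together; the arithmetic can be reproduced from any llite stats snapshot with, for example:

lctl get_param -n llite.*.stats | \
    awk '$1 == "getattr" || $1 == "inode_permission" { t += $7 }
        END { printf "getattr + inode_permission: %.2fs\n", t / 1e6 }'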
| Comment by Andreas Dilger [ 06/Dec/22 ] |
According to Yingjin's comments in
There is a WIP patch, https://review.whamcloud.com/45882 "LU-8130 ldlm: convert ldlm_resource hash to rhashtable", that would be interesting to try: rhashtable is supposed to be faster than cfs_hash, and since the code is being converted to rhashtable anyway, it would be good to know if it makes performance worse for some reason. However, that patch is over a year old and it failed every test session, so it would need to be updated/fixed before it could be tested here.
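
For anyone who wants to test it, the change can be pulled from Gerrit in the usual way; a sketch, where the trailing patchset number (/1) is a placeholder to be replaced with the latest one shown on the review page:

git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/45882/1
git checkout -b lu8130-rhashtable FETCH_HEAD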
| Comment by James A Simmons [ 06/Dec/22 ] |
Sorry, I haven't worked out the details yet; I need to find the cycles.