[LU-16365] cached 'ls -l' is slow Created: 03/Dec/22  Updated: 31/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File ls.svg    
Issue Links:
Related
is related to LU-14139 batched statahead processing Resolved
is related to LU-8130 Migrate from libcfs hash to rhashtable Open
is related to LU-17329 Relaxed POSIX Consistency for Lustre Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing LU-14139, we observed unexpected performance behavior.
Here is the test workload:

# echo 3 > /proc/sys/vm/drop_caches
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ 
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ 

In theory, once the 1st 'ls -l' finishes, the client keeps the data, metadata and locks in its cache, so the 2nd 'ls -l' output should be served entirely from it.
One would expect the 2nd 'ls -l' to be significantly faster than the 1st, but it is only marginally faster.
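
To verify that the second run really is served from cache, something along these lines could be used to snapshot the cached DLM lock count and the client-side VFS stats around each run (an illustrative sketch, not part of the original test; parameter names as on a recent client):

# number of DLM locks currently cached by the client (per namespace)
lctl get_param ldlm.namespaces.*.lock_count
# reset and re-read the client VFS operation stats around one run
lctl set_param llite.*.stats=clear
time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
lctl get_param llite.*.stats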

Here are the 'ls -l' results for 1M files in a single directory.

[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m27.385s
user	0m8.994s
sys	0m13.131s

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m25.309s
user	0m8.937s
sys	0m16.327s

No RPCs go out during the 2nd 'ls -l' below: I saw only 16 LNET messages on the 2nd 'ls -l' versus 1.1M LNET messages on the 1st, yet the elapsed time is almost the same. Most of the time is spent in 'ls' itself and on the Lustre client side.

[root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 " echo 3 > /proc/sys/vm/drop_caches "
[root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
              send_count: 0
              recv_count: 0
              drop_count: 0
              send_count: 65363661
              recv_count: 62095891
              drop_count: 1

real	0m26.145s
user	0m9.070s
sys	0m13.552s
              send_count: 0
              recv_count: 0
              drop_count: 0
              send_count: 66482277
              recv_count: 63233245
              drop_count: 1
[root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
              send_count: 0
              recv_count: 0
              drop_count: 0
              send_count: 66482277
              recv_count: 63233245
              drop_count: 1

real	0m25.569s
user	0m8.987s
sys	0m16.537s
              send_count: 0
              recv_count: 0
              drop_count: 0
              send_count: 66482293
              recv_count: 63233261
              drop_count: 1
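
Besides the raw LNET message counts, the per-import RPC statistics would show which request types (if any) are still sent during the cached run. A possible check, sketched here under the assumption of the standard mdc/osc stat files (not part of the original report; counter names vary by version):

lctl set_param mdc.*.stats=clear
lctl set_param osc.*.stats=clear
time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
lctl get_param mdc.*.stats    # e.g. mds_getattr / ldlm_enqueue counts for this run
lctl get_param osc.*.stats    # OST-side RPCs, expected to stay near zero for 'ls -l'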

Here is the same test for 1M files on a local ext4 disk and on /dev/shm on the client.

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null

real	0m16.999s
user	0m8.956s
sys	0m5.855s
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null

real	0m11.832s
user	0m8.765s
sys	0m3.051s

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m8.296s
user	0m5.465s
sys	0m2.813s
[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	0m8.273s
user	0m5.414s
sys	0m2.847s

Lustre should be able to deliver performance similar to ext4 and /dev/shm when everything is in the cache, shouldn't it?



 Comments   
Comment by Shuichi Ihara [ 03/Dec/22 ]

To simplify the case, the number of files was reduced to 100K files of 47001 bytes each.

[root@ec01 ~]# lctl set_param llite.*.stats=clear
[root@ec01 ~]#  clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.27    1.180271          11    100001           statx
 31.71    0.556354           5    100002    100002 getxattr
  0.95    0.016659         134       124           getdents64

[root@ec01 ~]# lctl get_param llite.*.stats
llite.exafs-ffff9b96b8157800.stats=
snapshot_time             508528.306272142 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              508528.306272142 secs.nsecs
open                      1 samples [usecs] 163 163 163 26569
close                     1 samples [usecs] 107 107 107 11449
readdir                   124 samples [usecs] 0 14912 31175 224577441
getattr                   100002 samples [usecs] 2 565 432562 2313162
getxattr                  1 samples [usecs] 88 88 88 7744
inode_permission          1000021 samples [usecs] 0 314 195050 299488
opencount                 1 samples [reqs] 1 1 1 1

[root@ec01 ~]# lctl set_param llite.*.stats=clear
[sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.73    1.142941          11    100001           statx
 33.22    0.577662           5    100002    100002 getxattr
  0.93    0.016237         130       124           getdents64

[root@ec01 ~]# lctl get_param llite.*.stats
llite.exafs-ffff9b96b8157800.stats=
snapshot_time             508578.202659913 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              508578.202659913 secs.nsecs
open                      1 samples [usecs] 177 177 177 31329
close                     1 samples [usecs] 124 124 124 15376
readdir                   124 samples [usecs] 0 142 16484 2223244
getattr                   100002 samples [usecs] 4 21 530944 2862706
getxattr                  1 samples [usecs] 2 2 2 4
getxattr_hits             1 samples [reqs]
inode_permission          1000021 samples [usecs] 0 13 191123 193591
opencount                 1 samples [reqs] 2 2 2 4
openclosetime             1 samples [usecs] 44568082 44568082 44568082 1986313933158724

In the 1st 'ls -l', the total cost of the 100K statx() calls is 1.18 sec, of which Lustre spends 0.43 sec in getattr() and 0.19 sec in inode_permission().
In the 2nd 'ls -l', the 100K getattr() calls inside Lustre take longer (0.53 sec) than in the 1st 'ls -l'. I repeated the test several times and the 2nd run is always slower.
In either case, cached or not, about half of the time is spent inside Lustre.
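
For reference, the per-operation times quoted above come from the 'sum' column of llite.*.stats, which is in microseconds. A quick way to convert them (a small sketch, assuming the "name samples [unit] min max sum ..." layout shown above):

lctl get_param -n llite.*.stats | \
  awk '$4 == "[usecs]" { printf "%-20s %10d calls %8.2f sec total\n", $1, $2, $7/1e6 }'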

Comment by Shuichi Ihara [ 03/Dec/22 ]

Same test, but with 2M x 47001-byte files in a single directory, which makes the problem clearer.

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	1m0.494s
user	0m19.378s
sys	0m28.927s

llite.exafs-ffff9b8b77585000.stats=
snapshot_time             545315.042991581 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              545315.042991581 secs.nsecs
open                      1 samples [usecs] 206 206 206 42436
close                     1 samples [usecs] 141 141 141 19881
readdir                   2502 samples [usecs] 0 40943 518532 2304508620
getattr                   2048002 samples [usecs] 2 22144 12111659 9279727979
getxattr                  1 samples [usecs] 90 90 90 8100
inode_permission          20480021 samples [usecs] 0 206 5211214 8280488
opencount                 1 samples [reqs] 1 1 1 1

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	1m6.651s
user	0m19.172s
sys	0m47.345s

llite.exafs-ffff9b8b77585000.stats=
snapshot_time             545428.911309921 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              545428.911309921 secs.nsecs
open                      1 samples [usecs] 8493 8493 8493 72131049
close                     1 samples [usecs] 181 181 181 32761
readdir                   2502 samples [usecs] 0 164 377300 56936396
getattr                   2048002 samples [usecs] 4 52 31198624 564334832
getxattr                  1 samples [usecs] 2 2 2 4
getxattr_hits             1 samples [reqs]
inode_permission          20480021 samples [usecs] 0 24 4861557 6459227
opencount                 1 samples [reqs] 2 2 2 4
openclosetime             1 samples [usecs] 55693604 55693604 55693604 3101777526508816


[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real	1m6.656s
user	0m19.146s
sys	0m47.384s

llite.exafs-ffff9b8b77585000.stats=
snapshot_time             545596.209353102 secs.nsecs
start_time                0.000000000 secs.nsecs
elapsed_time              545596.209353102 secs.nsecs
open                      1 samples [usecs] 233 233 233 54289
close                     1 samples [usecs] 188 188 188 35344
readdir                   2502 samples [usecs] 0 160 377048 56861048
getattr                   2048002 samples [usecs] 4 50 31185467 563959387
getxattr                  1 samples [usecs] 2 2 2 4
getxattr_hits             1 samples [reqs]
inode_permission          20480021 samples [usecs] 0 25 4879991 6515725
opencount                 1 samples [reqs] 4 4 4 16
openclosetime             1 samples [usecs] 24859044 24859044 24859044 617972068593936

The cached runs (2nd and 3rd 'ls -l') were slower than the non-cached run (1st 'ls -l'): the client spent a total of 31 sec in getattr() versus 12 sec in the non-cached case.

Comment by Oleg Drokin [ 05/Dec/22 ]

That much user time is strange; is ls sorting things in RAM? Can you run 'ls -U' and see what sort of difference it makes? Also, when everything is cached there is still huge CPU time in the Lustre case compared to ext4, which, given that there are no RPCs, means it is being spent somewhere inside Lustre? And it does not change much between the cached and uncached cases; in fact the cached run takes more system time (13s uncached vs 16s cached).

ext4, on the other hand, drops its system time from 6 to 3 seconds (user time remains the same).
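
A quick way to separate the sorting and stat costs, as suggested above (sketch only, using the same directory as in the report; GNU ls, unaliased):

echo 3 > /proc/sys/vm/drop_caches
time ls -l  /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null   # sorted + statx/getxattr per entry
time ls -lU /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null   # -U skips sorting, keeps the stat calls
time ls -U  /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null   # readdir only, no per-file stat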

Comment by Andreas Dilger [ 06/Dec/22 ]

The 19s of user time is caused by "ls -l" sorting the files before printing them, and that remains constant between calls (cached/uncached) and between filesystems (lustre/ext4), so we can't do anything about that. It looks like almost all of the system time is used by the getattr and inode_permission calls:

sys	0m28.927s  (17.32s in getattr + inode_permission)
getattr                   2048002 samples [usecs] 2 22144 12111659 9279727979
inode_permission          20480021 samples [usecs] 0 206 5211214 8280488

sys	0m47.345s (36.06s in getattr + inode_permission)
getattr                   2048002 samples [usecs] 4 52 31198624 564334832
inode_permission          20480021 samples [usecs] 0 24 4861557 6459227

sys	0m47.384s (36.07s in getattr + inode_permission)
getattr                   2048002 samples [usecs] 4 50 31185467 563959387
inode_permission          20480021 samples [usecs] 0 25 4879991 6515725

Comment by Andreas Dilger [ 06/Dec/22 ]

According to Yingjin's comments in LU-14139 there is quite a bit of overhead in the LDLM hash code:

From the attached flame graph, the major time costs are:

__ldlm_handle2lock() -> class_handle2object()
ldlm_resource_get() -> cfs_hash_bd_lookup_intent() -> ldlm_res_hop_keycmp()

Each of them costs about 5% for data and metadata DLM lock matching on the client side, and together they account for about 20% of the total samples.
It seems that the hash lookups for the lock handle and the resource take a lot of time.
ldlm_res_hop_keycmp() reaching 5% means there are many elements in the same bucket of the hash table; we should enlarge the hash table and use a resizable hash table.

For class_handle2object() on locks, we should maintain the lock handles per target (osc/mdc) lock namespace instead of sharing a single global handle hash table on the client, or even on the server.

There is a WIP patch, https://review.whamcloud.com/45882 "LU-8130 ldlm: convert ldlm_resource hash to rhashtable", that would be interesting to try, both because rhashtable is supposed to be faster than cfs_hash and because the code is being converted to rhashtable anyway, so it would be good to know if the conversion makes performance worse for some reason. However, that patch is over a year old and has failed every test session, so it would need to be updated/fixed before it could be tested here.
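
For anyone who wants to reproduce the attached flame graph (ls.svg) or A/B-test the rhashtable patch, something along these lines should work. This is a sketch: the Gerrit refs path and patch set number are assumptions, and the stack-collapse scripts are from Brendan Gregg's FlameGraph repository (https://github.com/brendangregg/FlameGraph):

# in an existing lustre-release checkout, fetch the WIP change (replace <PS> with the current patch set)
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/45882/<PS>
git checkout FETCH_HEAD

# profile the cached 'ls -l' run and render a flame graph for comparison with ls.svg
perf record -F 99 -a -g -- ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > ls-cached.svg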

Comment by James A Simmons [ 06/Dec/22 ]

Sorry, I haven't worked out the details yet. I need to find the cycles.
