Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      While testing LU-14139, we observed unexpected performance behavior.
      Here is the test workload:

      # echo 3 > /proc/sys/vm/drop_caches
      # time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ 
      # time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ 
      

      In theory, once the 1st 'ls -l' finishes, the client keeps the data, metadata, and locks in its cache, so the output of the second 'ls -l' should be served entirely from that cache.
      One would therefore expect the 2nd 'ls -l' to be significantly faster than the 1st, but it is not by much. (A quick way to confirm what is actually cached is sketched below.)
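      (The following is a sketch, not from the original report, of how to confirm that locks and inodes really stay cached between the two runs; the exact namespace names depend on the mount, so adjust as needed:)

      # DLM locks held by the client; this should roughly track the number of files touched after the 1st 'ls -l'
      lctl get_param ldlm.namespaces.*.lock_count
      # per-mount VFS operation counters on the Lustre client
      lctl get_param llite.*.stats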

      Here are the 'ls -l' results for 1M files in a single directory.

      [root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
      [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m27.385s
      user	0m8.994s
      sys	0m13.131s
      
      [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m25.309s
      user	0m8.937s
      sys	0m16.327s
      

      No RPCs go out during the 2nd 'ls -l' below: I saw only 16 LNet messages on the 2nd 'ls -l' versus 1.1M LNet messages on the 1st, yet the elapsed time is almost the same. Most of the time is spent in 'ls' itself and on the Lustre client side. (A per-target RPC counter cross-check is sketched after the output below.)

      [root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 " echo 3 > /proc/sys/vm/drop_caches "
      [root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 65363661
                    recv_count: 62095891
                    drop_count: 1
      
      real	0m26.145s
      user	0m9.070s
      sys	0m13.552s
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 66482277
                    recv_count: 63233245
                    drop_count: 1
      [root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 66482277
                    recv_count: 63233245
                    drop_count: 1
      
      real	0m25.569s
      user	0m8.987s
      sys	0m16.537s
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 66482293
                    recv_count: 63233261
                    drop_count: 1
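
      As an additional cross-check (a sketch, not from the original report; the target names depend on the filesystem), the per-target RPC counters on the client should tell the same story as the LNet counters. Few or no new request counts after the cached run would confirm that the listing is served from the client cache:

      lctl set_param mdc.*.stats=clear osc.*.stats=clear
      time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
      lctl get_param mdc.*.stats osc.*.stats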
      

      This is the same test for 1M files on a local ext4 disk and on /dev/shm on the client.

      [root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
      [sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null
      
      real	0m16.999s
      user	0m8.956s
      sys	0m5.855s
      [sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null
      
      real	0m11.832s
      user	0m8.765s
      sys	0m3.051s
      
      [root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
      [sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m8.296s
      user	0m5.465s
      sys	0m2.813s
      [sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m8.273s
      user	0m5.414s
      sys	0m2.847s
      

      Lustre should be able to match the performance of ext4 and tmpfs (/dev/shm) when everything is in the cache, shouldn't it?

          Activity

            [LU-16365] cached 'ls -l' is slow
            simmonsja James A Simmons added a comment -

            Patch https://review.whamcloud.com/45882 is ready for testing. Let's see if we can resolve these performance issues.


            simmonsja James A Simmons added a comment -

            Sorry I haven't worked out the details yet. Need to find the cycles.

            adilger Andreas Dilger added a comment -

            According to Yingjin's comments in LU-14139, there is quite a bit of overhead in the LDLM hash code:

            From the attached flame graph, a lot of the time is spent in:
            {{__ldlm_handle2lock() -> class_handle2object()}}
            {{ldlm_resource_get() -> cfs_hash_bd_lookup_intent() -> ldlm_res_hop_keycmp()}}

            Each of these costs about 5% for data and metadata DLM lock matching on the client side, and together they account for about 20% of the total samples.
            It seems that the hash lookups for the lock handle and the resource take a lot of time.
            ldlm_res_hop_keycmp() reaching 5% means there are many elements in the same bucket of the hash table; we should increase the hash table size and use a resizable hash table.

            For class_handle2object() on locks, we should maintain the lock handles per target (osc/mdc) lock namespace instead of sharing a single global handle hash table on the client, or even on the server.

            There is the WIP patch https://review.whamcloud.com/45882 "LU-8130 ldlm: convert ldlm_resource hash to rhashtable" that would be interesting to try, both because rhashtable is supposed to be faster than cfs_hash, and because the code is being converted to rhashtable anyway, so it would be good to know if it makes performance worse for some reason. However, that patch is over a year old and failed every test session, so it would need to be updated/fixed before it could be tested here.
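
            (For anyone who wants to try that change, the usual Gerrit checkout flow should work; a sketch, with the patchset number left as a placeholder since it changes whenever the patch is refreshed:)

            git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
            # for change 45882 the refspec is refs/changes/<last two digits>/<change>/<patchset>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/45882/<patchset>
            git checkout FETCH_HEAD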


            adilger Andreas Dilger added a comment -

            The 19s of user time is caused by "ls -l" sorting the files before printing them, and that remains constant between runs (cached/uncached) and between filesystems (Lustre/ext4), so we can't do anything about that. It looks like almost all of the system time is spent in the getattr and inode_permission calls:

            sys	0m28.927s  (17.32s in getattr + inode_permission)
            getattr                   2048002 samples [usecs] 2 22144 12111659 9279727979
            inode_permission          20480021 samples [usecs] 0 206 5211214 8280488
            
            sys	0m47.345s (36.06s in getattr + inode_permission)
            getattr                   2048002 samples [usecs] 4 52 31198624 564334832
            inode_permission          20480021 samples [usecs] 0 24 4861557 6459227
            
            sys	0m47.384s (36.07s in getattr + inode_permission)
            getattr                   2048002 samples [usecs] 4 50 31185467 563959387
            inode_permission          20480021 samples [usecs] 0 25 4879991 6515725
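
            (The parenthesized totals above are just the sums of the getattr and inode_permission 'usecs' columns from the llite stats; a sketch of the arithmetic, assuming the standard min/max/sum/sumsq layout of the stats lines:)

            # e.g. 12111659 + 5211214 usecs ~= 17.32s for the first run above
            lctl get_param -n llite.*.stats | \
              awk '$1 == "getattr" || $1 == "inode_permission" { usecs += $7 }
                   END { printf "%.2fs in getattr + inode_permission\n", usecs / 1e6 }'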
            
            green Oleg Drokin added a comment -

             The ~9 seconds of user time is strange: is ls sorting things in RAM? Can you run 'ls -U' and see what sort of difference it makes? Also, when everything is cached there is still huge CPU time in the Lustre case compared to ext4, which, given there are no RPCs, means the time is spent somewhere inside Lustre. And it does not change much between the cached and uncached cases; in fact the cached run takes more system time (13s -> 16s).

             ext4, on the other hand, drops its system time from 6s to 3s (user time remains the same).

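            (A sketch of the comparison asked for here, not from the original report: 'ls -U' skips the sort, and 'strace -c' separates syscall time from userspace time; paths as in the description:)

            clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
            time ls -U -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null   # unsorted
            time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null      # sorted
            strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null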

            sihara Shuichi Ihara added a comment -

            Same test, but with 2M x 47001-byte files in a single directory; it makes the problem clearer.

            [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            
            real	1m0.494s
            user	0m19.378s
            sys	0m28.927s
            
            llite.exafs-ffff9b8b77585000.stats=
            snapshot_time             545315.042991581 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              545315.042991581 secs.nsecs
            open                      1 samples [usecs] 206 206 206 42436
            close                     1 samples [usecs] 141 141 141 19881
            readdir                   2502 samples [usecs] 0 40943 518532 2304508620
            getattr                   2048002 samples [usecs] 2 22144 12111659 9279727979
            getxattr                  1 samples [usecs] 90 90 90 8100
            inode_permission          20480021 samples [usecs] 0 206 5211214 8280488
            opencount                 1 samples [reqs] 1 1 1 1
            
            [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            
            real	1m6.651s
            user	0m19.172s
            sys	0m47.345s
            
            llite.exafs-ffff9b8b77585000.stats=
            snapshot_time             545428.911309921 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              545428.911309921 secs.nsecs
            open                      1 samples [usecs] 8493 8493 8493 72131049
            close                     1 samples [usecs] 181 181 181 32761
            readdir                   2502 samples [usecs] 0 164 377300 56936396
            getattr                   2048002 samples [usecs] 4 52 31198624 564334832
            getxattr                  1 samples [usecs] 2 2 2 4
            getxattr_hits             1 samples [reqs]
            inode_permission          20480021 samples [usecs] 0 24 4861557 6459227
            opencount                 1 samples [reqs] 2 2 2 4
            openclosetime             1 samples [usecs] 55693604 55693604 55693604 3101777526508816
            
            
            [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            
            real	1m6.656s
            user	0m19.146s
            sys	0m47.384s
            
            llite.exafs-ffff9b8b77585000.stats=
            snapshot_time             545596.209353102 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              545596.209353102 secs.nsecs
            open                      1 samples [usecs] 233 233 233 54289
            close                     1 samples [usecs] 188 188 188 35344
            readdir                   2502 samples [usecs] 0 160 377048 56861048
            getattr                   2048002 samples [usecs] 4 50 31185467 563959387
            getxattr                  1 samples [usecs] 2 2 2 4
            getxattr_hits             1 samples [reqs]
            inode_permission          20480021 samples [usecs] 0 25 4879991 6515725
            opencount                 1 samples [reqs] 4 4 4 16
            openclosetime             1 samples [usecs] 24859044 24859044 24859044 617972068593936
            

            The cached cases (2nd and 3rd 'ls -l') were slower than the non-cached case (1st 'ls -l'), and the client spent a total of 31 sec in getattr() versus 12 sec in the non-cached case.


            sihara Shuichi Ihara added a comment -

            To simplify the case, the number of files is reduced: 100K x 47001-byte files.

            [root@ec01 ~]# lctl set_param llite.*.stats=clear
            [root@ec01 ~]#  clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
            [sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            % time     seconds  usecs/call     calls    errors syscall
            ------ ----------- ----------- --------- --------- ----------------
             67.27    1.180271          11    100001           statx
             31.71    0.556354           5    100002    100002 getxattr
              0.95    0.016659         134       124           getdents64
            
            [root@ec01 ~]# lctl get_param llite.*.stats
            llite.exafs-ffff9b96b8157800.stats=
            snapshot_time             508528.306272142 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              508528.306272142 secs.nsecs
            open                      1 samples [usecs] 163 163 163 26569
            close                     1 samples [usecs] 107 107 107 11449
            readdir                   124 samples [usecs] 0 14912 31175 224577441
            getattr                   100002 samples [usecs] 2 565 432562 2313162
            getxattr                  1 samples [usecs] 88 88 88 7744
            inode_permission          1000021 samples [usecs] 0 314 195050 299488
            opencount                 1 samples [reqs] 1 1 1 1
            
            [root@ec01 ~]# lctl set_param llite.*.stats=clear
            [sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            % time     seconds  usecs/call     calls    errors syscall
            ------ ----------- ----------- --------- --------- ----------------
             65.73    1.142941          11    100001           statx
             33.22    0.577662           5    100002    100002 getxattr
              0.93    0.016237         130       124           getdents64
            
            [root@ec01 ~]# lctl get_param llite.*.stats
            llite.exafs-ffff9b96b8157800.stats=
            snapshot_time             508578.202659913 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              508578.202659913 secs.nsecs
            open                      1 samples [usecs] 177 177 177 31329
            close                     1 samples [usecs] 124 124 124 15376
            readdir                   124 samples [usecs] 0 142 16484 2223244
            getattr                   100002 samples [usecs] 4 21 530944 2862706
            getxattr                  1 samples [usecs] 2 2 2 4
            getxattr_hits             1 samples [reqs]
            inode_permission          1000021 samples [usecs] 0 13 191123 193591
            opencount                 1 samples [reqs] 2 2 2 4
            openclosetime             1 samples [usecs] 44568082 44568082 44568082 1986313933158724
            

            In the 1st 'ls -l', the total time for 100K statx() calls is 1.18 sec, of which 0.43 sec is spent in getattr() and 0.19 sec in inode_permission() inside Lustre.
            In the 2nd 'ls -l', the 100K getattr() calls in Lustre take longer (0.53 sec) than in the 1st 'ls -l'. I ran the same test several times, and the 2nd run is always slower.
            In all cases, cached or not, about half of the total time is spent inside Lustre.
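
            (To attribute the remaining client-side kernel time, a flame-graph-style profile of the cached run is probably the next step, as Yingjin did in LU-14139; a sketch, assuming perf is available on the client:)

            # profile only the cached (2nd) run, so no RPC latency is mixed in
            perf record -g -o ls-cached.data -- ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            perf report -i ls-cached.data --stdio | head -40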


            People

              Assignee: wc-triage WC Triage
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 7