Details
- Bug
- Resolution: Unresolved
- Minor
Description
While testing LU-14139, we observed unexpected performance behavior.
Here is the test workload:
# echo 3 > /proc/sys/vm/drop_caches
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/
In theory, once the 1st 'ls -l' finishes, the client keeps the data, metadata, and locks in its cache, so the 2nd 'ls -l' should be served entirely from that cache.
One would expect the 2nd 'ls -l' to be significantly faster than the 1st, but it is not by much.
Here are the 'ls -l' results for 1M files in a single directory:
[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m27.385s
user    0m8.994s
sys     0m13.131s

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m25.309s
user    0m8.937s
sys     0m16.327s
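As a quick sanity check on the warm-run numbers above (a small awk sketch, not part of the test setup): nearly all of the warm run's wall time is CPU time, which supports the claim that the cost is on the client rather than in the network.

```shell
# Fraction of the warm run's elapsed time spent on CPU, from the timings
# quoted above: real 25.309s, user 8.937s, sys 16.327s.
awk 'BEGIN {
  cpu = 8.937 + 16.327
  printf "warm ls -l: %.1f%% of wall time is CPU (user+sys)\n", 100 * cpu / 25.309
}'
```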
No RPCs go out during the 2nd 'ls -l' below. I saw only 16 LNet messages during the 2nd 'ls -l', versus about 1.1M LNet messages during the 1st, yet the elapsed time is almost the same. Most of the cost is in 'ls' itself and on the Lustre client side.
[root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# lnetctl net show -v | grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 65363661
          recv_count: 62095891
          drop_count: 1

real    0m26.145s
user    0m9.070s
sys     0m13.552s

          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 66482277
          recv_count: 63233245
          drop_count: 1
[root@ec01 ~]# lnetctl net show -v | grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 66482277
          recv_count: 63233245
          drop_count: 1

real    0m25.569s
user    0m8.987s
sys     0m16.537s

          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 66482293
          recv_count: 63233261
          drop_count: 1
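The message counts come straight from the send_count deltas in the transcript above (a one-off arithmetic sketch):

```shell
# Delta of the send_count values quoted above, before/after each 'ls -l'.
first=$((66482277 - 65363661))    # LNet sends during the 1st 'ls -l'
second=$((66482293 - 66482277))   # LNet sends during the 2nd 'ls -l'
echo "1st ls: $first LNet sends"
echo "2nd ls: $second LNet sends"
```

So the 1st run generated about 1.1M sends and the 2nd only 16, while the elapsed times differ by well under a second.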
Here is the same test with 1M files in ext4 on a local disk, and in /dev/shm, on the client:
[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m16.999s
user    0m8.956s
sys     0m5.855s

[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m11.832s
user    0m8.765s
sys     0m3.051s

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m8.296s
user    0m5.465s
sys     0m2.813s

[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m8.273s
user    0m5.414s
sys     0m2.847s
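Putting the real-time numbers above side by side (an awk sketch over the quoted timings) shows how little Lustre gains from its warm cache compared with ext4:

```shell
# Cold/warm speedup of 'ls -l' real time for each filesystem, from the
# transcripts quoted above.
awk 'BEGIN {
  printf "Lustre: %.2fx\n", 27.385 / 25.309   # barely faster when warm
  printf "ext4:   %.2fx\n", 16.999 / 11.832
  printf "tmpfs:  %.2fx\n",  8.296 /  8.273   # already entirely in memory
}'
```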
Shouldn't Lustre perform similarly to ext4 and the memory cache when everything is already in the cache?