Details
- Bug
- Resolution: Unresolved
- Minor
Description
While testing LU-14139, we observed unexpected performance behavior.
Here is the test workload:
# echo 3 > /proc/sys/vm/drop_caches
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/
# time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/
In theory, once the 1st 'ls -l' finishes, the client keeps the data, metadata, and locks in its cache, so the 2nd 'ls -l' should be served entirely from that cache.
One would expect the 2nd 'ls -l' to be significantly faster than the 1st, but it is not by much.
Here are the 'ls -l' results for 1M files in a single directory:
[root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m27.385s
user    0m8.994s
sys     0m13.131s

[sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m25.309s
user    0m8.937s
sys     0m16.327s
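As a quick sanity check on the warm-run numbers above (a small awk sketch, not part of the test setup): nearly all of the warm run's wall time is CPU time, which supports the claim that the cost is on the client rather than in the network.

```shell
# Fraction of the warm run's elapsed time spent on CPU, from the timings
# quoted above: real 25.309s, user 8.937s, sys 16.327s.
awk 'BEGIN {
  cpu = 8.937 + 16.327
  printf "warm ls -l: %.1f%% of wall time is CPU (user+sys)\n", 100 * cpu / 25.309
}'
```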
No RPCs go out during the 2nd 'ls -l' below. I saw only 16 LNet messages during the 2nd 'ls -l', versus about 1.1M LNet messages during the 1st, yet the elapsed time is almost the same. Most of the cost is in 'ls' itself and on the Lustre client side.
[root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 "echo 3 > /proc/sys/vm/drop_caches"
[root@ec01 ~]# lnetctl net show -v | grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 65363661
          recv_count: 62095891
          drop_count: 1

real    0m26.145s
user    0m9.070s
sys     0m13.552s

          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 66482277
          recv_count: 63233245
          drop_count: 1
[root@ec01 ~]# lnetctl net show -v | grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 66482277
          recv_count: 63233245
          drop_count: 1

real    0m25.569s
user    0m8.987s
sys     0m16.537s

          send_count: 0
          recv_count: 0
          drop_count: 0
          send_count: 66482293
          recv_count: 63233261
          drop_count: 1
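The message counts come straight from the send_count deltas in the transcript above (a one-off arithmetic sketch):

```shell
# Delta of the send_count values quoted above, before/after each 'ls -l'.
first=$((66482277 - 65363661))    # LNet sends during the 1st 'ls -l'
second=$((66482293 - 66482277))   # LNet sends during the 2nd 'ls -l'
echo "1st ls: $first LNet sends"
echo "2nd ls: $second LNet sends"
```

So the 1st run generated about 1.1M sends and the 2nd only 16, while the elapsed times differ by well under a second.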
Here is the same test with 1M files in ext4 on a local disk, and in /dev/shm, on the client:
[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m16.999s
user    0m8.956s
sys     0m5.855s

[sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m11.832s
user    0m8.765s
sys     0m3.051s

[root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m8.296s
user    0m5.465s
sys     0m2.813s

[sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null

real    0m8.273s
user    0m5.414s
sys     0m2.847s
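Putting the real-time numbers above side by side (an awk sketch over the quoted timings) shows how little Lustre gains from its warm cache compared with ext4:

```shell
# Cold/warm speedup of 'ls -l' real time for each filesystem, from the
# transcripts quoted above.
awk 'BEGIN {
  printf "Lustre: %.2fx\n", 27.385 / 25.309   # barely faster when warm
  printf "ext4:   %.2fx\n", 16.999 / 11.832
  printf "tmpfs:  %.2fx\n",  8.296 /  8.273   # already entirely in memory
}'
```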
Shouldn't Lustre perform similarly to ext4 and the memory cache when everything is already in the cache?