Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      While testing LU-14139, we observed unexpected performance behavior.
      Here is the test workload:

      # echo 3 > /proc/sys/vm/drop_caches
      # time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ 
      # time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ 
      

      In theory, once the 1st 'ls -l' finishes, the client keeps the data, metadata, and locks in its cache, so the output of the second 'ls -l' should be served entirely from that cache.
      One would therefore expect the 2nd 'ls -l' to be significantly faster than the 1st, but it is not by much. (A quick way to confirm what is actually cached is sketched below.)
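      (The following is a sketch, not from the original report, of how to confirm that locks and inodes really stay cached between the two runs; the exact namespace names depend on the mount, so adjust as needed:)

      # DLM locks held by the client; this should roughly track the number of files touched after the 1st 'ls -l'
      lctl get_param ldlm.namespaces.*.lock_count
      # per-mount VFS operation counters on the Lustre client
      lctl get_param llite.*.stats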

      Here are the 'ls -l' results for 1M files in a single directory.

      [root@ec01 ~]# clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
      [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m27.385s
      user	0m8.994s
      sys	0m13.131s
      
      [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m25.309s
      user	0m8.937s
      sys	0m16.327s
      

      No RPCs go out during the 2nd 'ls -l' below: I saw only 16 LNet messages on the 2nd 'ls -l' versus 1.1M LNet messages on the 1st, yet the elapsed time is almost the same. Most of the time is spent in 'ls' itself and on the Lustre client side. (A per-target RPC counter cross-check is sketched after the output below.)

      [root@ec01 ~]# clush -w ai400x2-1-vm[1-4],ec01 " echo 3 > /proc/sys/vm/drop_caches "
      [root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 65363661
                    recv_count: 62095891
                    drop_count: 1
      
      real	0m26.145s
      user	0m9.070s
      sys	0m13.552s
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 66482277
                    recv_count: 63233245
                    drop_count: 1
      [root@ec01 ~]# lnetctl net show -v| grep _count; time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null; lnetctl net show -v | grep _count
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 66482277
                    recv_count: 63233245
                    drop_count: 1
      
      real	0m25.569s
      user	0m8.987s
      sys	0m16.537s
                    send_count: 0
                    recv_count: 0
                    drop_count: 0
                    send_count: 66482293
                    recv_count: 63233261
                    drop_count: 1
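
      As an additional cross-check (a sketch, not from the original report; the target names depend on the filesystem), the per-target RPC counters on the client should tell the same story as the LNet counters. Few or no new request counts after the cached run would confirm that the listing is served from the client cache:

      lctl set_param mdc.*.stats=clear osc.*.stats=clear
      time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
      lctl get_param mdc.*.stats osc.*.stats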
      

      This is the same test for 1M files on a local ext4 disk and on /dev/shm on the client.

      [root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
      [sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null
      
      real	0m16.999s
      user	0m8.956s
      sys	0m5.855s
      [sihara@ec01 ~]$ time ls -l /tmp/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/  > /dev/null
      
      real	0m11.832s
      user	0m8.765s
      sys	0m3.051s
      
      [root@ec01 ~]# echo 3 > /proc/sys/vm/drop_caches
      [sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m8.296s
      user	0m5.465s
      sys	0m2.813s
      [sihara@ec01 ~]$ time ls -l /dev/shm/testdir/test-dir.0-0/mdtest_tree.0/ > /dev/null
      
      real	0m8.273s
      user	0m5.414s
      sys	0m2.847s
      

      Lustre should be able to match the performance of ext4 and tmpfs (/dev/shm) when everything is in the cache, shouldn't it?

          Activity

            [LU-16365] cached 'ls -l' is slow
            simmonsja James A Simmons added a comment -

            Patch https://review.whamcloud.com/45882 is ready for testing. Let's see if we can resolve these performance issues.


            simmonsja James A Simmons added a comment -

            Sorry I haven't worked out the details yet. Need to find the cycles.

            adilger Andreas Dilger added a comment -

            According to Yingjin's comments in LU-14139, there is quite a bit of overhead in the LDLM hash code:

            From the attached flame graph, a lot of the time is spent in:
            {{__ldlm_handle2lock() -> class_handle2object()}}
            {{ldlm_resource_get() -> cfs_hash_bd_lookup_intent() -> ldlm_res_hop_keycmp()}}

            Each of these costs about 5% for data and metadata DLM lock matching on the client side, and together they account for about 20% of the total samples.
            It seems that the hash lookups for the lock handle and the resource take a lot of time.
            ldlm_res_hop_keycmp() reaching 5% means there are many elements in the same bucket of the hash table; we should increase the hash table size and use a resizable hash table.

            For class_handle2object() on locks, we should maintain the lock handles per target (osc/mdc) lock namespace instead of sharing a single global handle hash table on the client, or even on the server.

            There is the WIP patch https://review.whamcloud.com/45882 "LU-8130 ldlm: convert ldlm_resource hash to rhashtable" that would be interesting to try, both because rhashtable is supposed to be faster than cfs_hash, and because the code is being converted to rhashtable anyway, so it would be good to know if it makes performance worse for some reason. However, that patch is over a year old and failed every test session, so it would need to be updated/fixed before it could be tested here.
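
            (For anyone who wants to try that change, the usual Gerrit checkout flow should work; a sketch, with the patchset number left as a placeholder since it changes whenever the patch is refreshed:)

            git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
            # for change 45882 the refspec is refs/changes/<last two digits>/<change>/<patchset>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/45882/<patchset>
            git checkout FETCH_HEAD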


            adilger Andreas Dilger added a comment -

            The 19s of user time is caused by "ls -l" sorting the files before printing them, and that remains constant between runs (cached/uncached) and between filesystems (Lustre/ext4), so we can't do anything about that. It looks like almost all of the system time is spent in the getattr and inode_permission calls:

            sys	0m28.927s  (17.32s in getattr + inode_permission)
            getattr                   2048002 samples [usecs] 2 22144 12111659 9279727979
            inode_permission          20480021 samples [usecs] 0 206 5211214 8280488
            
            sys	0m47.345s (36.06s in getattr + inode_permission)
            getattr                   2048002 samples [usecs] 4 52 31198624 564334832
            inode_permission          20480021 samples [usecs] 0 24 4861557 6459227
            
            sys	0m47.384s (36.07s in getattr + inode_permission)
            getattr                   2048002 samples [usecs] 4 50 31185467 563959387
            inode_permission          20480021 samples [usecs] 0 25 4879991 6515725
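
            (The parenthesized totals above are just the sums of the getattr and inode_permission 'usecs' columns from the llite stats; a sketch of the arithmetic, assuming the standard min/max/sum/sumsq layout of the stats lines:)

            # e.g. 12111659 + 5211214 usecs ~= 17.32s for the first run above
            lctl get_param -n llite.*.stats | \
              awk '$1 == "getattr" || $1 == "inode_permission" { usecs += $7 }
                   END { printf "%.2fs in getattr + inode_permission\n", usecs / 1e6 }'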
            
            green Oleg Drokin added a comment -

             The ~9 seconds of user time is strange: is ls sorting things in RAM? Can you run 'ls -U' and see what sort of difference it makes? Also, when everything is cached there is still huge CPU time in the Lustre case compared to ext4, which, given there are no RPCs, means the time is spent somewhere inside Lustre. And it does not change much between the cached and uncached cases; in fact the cached run takes more system time (13s -> 16s).

             ext4, on the other hand, drops its system time from 6s to 3s (user time remains the same).

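            (A sketch of the comparison asked for here, not from the original report: 'ls -U' skips the sort, and 'strace -c' separates syscall time from userspace time; paths as in the description:)

            clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
            time ls -U -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null   # unsorted
            time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null      # sorted
            strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null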

            sihara Shuichi Ihara added a comment -

            Same test, but with 2M x 47001-byte files in a single directory; it makes the problem clearer.

            [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            
            real	1m0.494s
            user	0m19.378s
            sys	0m28.927s
            
            llite.exafs-ffff9b8b77585000.stats=
            snapshot_time             545315.042991581 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              545315.042991581 secs.nsecs
            open                      1 samples [usecs] 206 206 206 42436
            close                     1 samples [usecs] 141 141 141 19881
            readdir                   2502 samples [usecs] 0 40943 518532 2304508620
            getattr                   2048002 samples [usecs] 2 22144 12111659 9279727979
            getxattr                  1 samples [usecs] 90 90 90 8100
            inode_permission          20480021 samples [usecs] 0 206 5211214 8280488
            opencount                 1 samples [reqs] 1 1 1 1
            
            [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            
            real	1m6.651s
            user	0m19.172s
            sys	0m47.345s
            
            llite.exafs-ffff9b8b77585000.stats=
            snapshot_time             545428.911309921 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              545428.911309921 secs.nsecs
            open                      1 samples [usecs] 8493 8493 8493 72131049
            close                     1 samples [usecs] 181 181 181 32761
            readdir                   2502 samples [usecs] 0 164 377300 56936396
            getattr                   2048002 samples [usecs] 4 52 31198624 564334832
            getxattr                  1 samples [usecs] 2 2 2 4
            getxattr_hits             1 samples [reqs]
            inode_permission          20480021 samples [usecs] 0 24 4861557 6459227
            opencount                 1 samples [reqs] 2 2 2 4
            openclosetime             1 samples [usecs] 55693604 55693604 55693604 3101777526508816
            
            
            [sihara@ec01 ~]$ time ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            
            real	1m6.656s
            user	0m19.146s
            sys	0m47.384s
            
            llite.exafs-ffff9b8b77585000.stats=
            snapshot_time             545596.209353102 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              545596.209353102 secs.nsecs
            open                      1 samples [usecs] 233 233 233 54289
            close                     1 samples [usecs] 188 188 188 35344
            readdir                   2502 samples [usecs] 0 160 377048 56861048
            getattr                   2048002 samples [usecs] 4 50 31185467 563959387
            getxattr                  1 samples [usecs] 2 2 2 4
            getxattr_hits             1 samples [reqs]
            inode_permission          20480021 samples [usecs] 0 25 4879991 6515725
            opencount                 1 samples [reqs] 4 4 4 16
            openclosetime             1 samples [usecs] 24859044 24859044 24859044 617972068593936
            

            The cached cases (2nd and 3rd 'ls -l') were slower than the non-cached case (1st 'ls -l'), and the client spent a total of 31 sec in getattr() versus 12 sec in the non-cached case.


            sihara Shuichi Ihara added a comment -

            To simplify the case, the number of files is reduced: 100K x 47001-byte files.

            [root@ec01 ~]# lctl set_param llite.*.stats=clear
            [root@ec01 ~]#  clush -w ec01,ai400x2-1-vm[1-4] "echo 3 > /proc/sys/vm/drop_caches"
            [sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            % time     seconds  usecs/call     calls    errors syscall
            ------ ----------- ----------- --------- --------- ----------------
             67.27    1.180271          11    100001           statx
             31.71    0.556354           5    100002    100002 getxattr
              0.95    0.016659         134       124           getdents64
            
            [root@ec01 ~]# lctl get_param llite.*.stats
            llite.exafs-ffff9b96b8157800.stats=
            snapshot_time             508528.306272142 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              508528.306272142 secs.nsecs
            open                      1 samples [usecs] 163 163 163 26569
            close                     1 samples [usecs] 107 107 107 11449
            readdir                   124 samples [usecs] 0 14912 31175 224577441
            getattr                   100002 samples [usecs] 2 565 432562 2313162
            getxattr                  1 samples [usecs] 88 88 88 7744
            inode_permission          1000021 samples [usecs] 0 314 195050 299488
            opencount                 1 samples [reqs] 1 1 1 1
            
            [root@ec01 ~]# lctl set_param llite.*.stats=clear
            [sihara@ec01 ~]$ strace -c ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            % time     seconds  usecs/call     calls    errors syscall
            ------ ----------- ----------- --------- --------- ----------------
             65.73    1.142941          11    100001           statx
             33.22    0.577662           5    100002    100002 getxattr
              0.93    0.016237         130       124           getdents64
            
            [root@ec01 ~]# lctl get_param llite.*.stats
            llite.exafs-ffff9b96b8157800.stats=
            snapshot_time             508578.202659913 secs.nsecs
            start_time                0.000000000 secs.nsecs
            elapsed_time              508578.202659913 secs.nsecs
            open                      1 samples [usecs] 177 177 177 31329
            close                     1 samples [usecs] 124 124 124 15376
            readdir                   124 samples [usecs] 0 142 16484 2223244
            getattr                   100002 samples [usecs] 4 21 530944 2862706
            getxattr                  1 samples [usecs] 2 2 2 4
            getxattr_hits             1 samples [reqs]
            inode_permission          1000021 samples [usecs] 0 13 191123 193591
            opencount                 1 samples [reqs] 2 2 2 4
            openclosetime             1 samples [usecs] 44568082 44568082 44568082 1986313933158724
            

            In the 1st 'ls -l', the total time for 100K statx() calls is 1.18 sec, of which 0.43 sec is spent in getattr() and 0.19 sec in inode_permission() inside Lustre.
            In the 2nd 'ls -l', the 100K getattr() calls in Lustre take longer (0.53 sec) than in the 1st 'ls -l'. I ran the same test several times, and the 2nd run is always slower.
            In all cases, cached or not, about half of the total time is spent inside Lustre.
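
            (To attribute the remaining client-side kernel time, a flame-graph-style profile of the cached run is probably the next step, as Yingjin did in LU-14139; a sketch, assuming perf is available on the client:)

            # profile only the cached (2nd) run, so no RPC latency is mixed in
            perf record -g -o ls-cached.data -- ls -l /exafs/testdir/mdtest.out/test-dir.0-0/mdtest_tree.0/ > /dev/null
            perf report -i ls-cached.data --stdio | head -40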


            People

              Assignee: wc-triage WC Triage
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 7