LU-2600: lustre metadata performance is very slow on zfs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Fix Version/s: Lustre 2.4.0
    • 3
    • 6060

    Description

      mds-survey shows that create and unlink are very slow:

      [root@mds01 mds01]# tests_str="create lookup destroy" thrlo=192 thrhi=192 file_count=3840000 mds-survey
      Wed Jan 2 16:43:00 PST 2013 /usr/bin/mds-survey from mds01
      mdt 1 file 3840000 dir 192 thr 192 create 2220.02 [ 0.00,19997.98] lookup 9429.79 [ 0.00,41998.40] destroy 1545.46 [ 0.00,15998.32]
      done!
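
      As a side note, a minimal sketch of a thread-count sweep using only the mds-survey parameters already shown above (with file_count reduced so each run stays short) could help show how the create/destroy rates scale:

      # Sweep thread counts; same knobs as the run above,
      # just with a smaller file_count per iteration.
      for thr in 16 32 64 128 192; do
          tests_str="create lookup destroy" thrlo=$thr thrhi=$thr \
          file_count=384000 mds-survey
      done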

      Attachments

        1. oprofile.tar.gz
          1.76 MB
        2. oprofile2.tar.gz
          1.01 MB

        Issue Links

          Activity

            [LU-2600] lustre metadata performance is very slow on zfs

            ZAP prefetching at object creation should improve metadata performance.

            bzzz Alex Zhuravlev added a comment
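
            One rough way to check whether metadata/ZAP prefetching is actually paying off on a ZFS-on-Linux MDS (a sketch only; the kstat fields below exist in /proc/spl/kstat/zfs/arcstats, and the mds-survey parameters are the ones from the description):

            # Snapshot ARC metadata/prefetch counters around a create-heavy run;
            # a large rise in demand_metadata_misses relative to hits suggests
            # the ZAP blocks are not being prefetched or cached effectively.
            grep -E 'demand_metadata_(hits|misses)|prefetch_metadata_(hits|misses)' \
                /proc/spl/kstat/zfs/arcstats > arcstats.before

            tests_str="create" thrlo=192 thrhi=192 file_count=384000 mds-survey

            grep -E 'demand_metadata_(hits|misses)|prefetch_metadata_(hits|misses)' \
                /proc/spl/kstat/zfs/arcstats > arcstats.after

            diff arcstats.before arcstats.after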

            Just a note: if a patch that uses dsl_sync_task lands again, we'd need to patch ZFS so as not to increase async writes when there are only nowaiter sync tasks pending. See:
            https://github.com/zfsonlinux/zfs/pull/2716#issuecomment-58540555

            isaac Isaac Huang (Inactive) added a comment

            http://review.whamcloud.com/#/c/7157/ was reverted to fix LU-4968. If this is resubmitted, please include the changes that happened for LU-4944 (http://review.whamcloud.com/#/c/10064/).

            utopiabound Nathaniel Clark added a comment

            I did some benchmarking with Lustre-ZFS vs. Lustre-LDISKFS, and with plain ZFS vs. EXT4, using mdtest. The results suggest that the slow metadata performance is probably due to ZFS rather than to Lustre. The following setup was used:

            1 MGS/MDT server, formatted with ldiskfs(ext4) or ZFS [build: 2.4.0-RC2-gd3f91c4-PRISTINE-2.6.32-358.6.2.el6_lustre.g230b174.x86_64]
            1 OSS/OST server, formatted with ZFS [build: v2_4_92_0-ge089a51-CHANGED-3.6.11-lustre-tstibor-build]
            1 Client [build: v2_4_92_0-ge089a51-CHANGED-3.6.11-lustre-tstibor-build]
            (lustre mountpoint /mnt)

            The benchmark is performed on the client and gives the following results:

            ** Setup, single MDT0 with ZFS, OSS/OST with ZFS and mdtest executed on the client
            -- started at 10/07/2013 16:43:48 --
            
            mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
            Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /mnt/mdtest/
            Path: /mnt/mdtest
            FS: 98.7 TiB   Used FS: 0.0%   Inodes: 0.5 Mi   Used Inodes: 0.0%
            
            1 tasks, 5040 files/directories
            
            SUMMARY: (of 20 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:       1948.194       1717.011       1814.171         58.454
               Directory stat    :       8550.010       7276.497       8112.847        415.032
               Directory removal :       2045.658       1892.629       1963.691         46.917
               File creation     :       1188.975       1118.650       1152.378         18.880
               File stat         :       3398.468       3222.576       3328.069         53.387
               File read         :       8630.149       8034.409       8421.248        151.027
               File removal      :       1393.756       1296.246       1340.168         28.650
               Tree creation     :       1853.699        713.171       1713.243        234.610
               Tree removal      :       1811.968       1600.404       1734.573         42.491
            
            -- finished at 10/07/2013 16:49:14 --
            
            ** Setup, single MDT0 with ldiskfs (ext4), OSS/OST with ZFS and mdtest executed on the client
            -- started at 10/07/2013 15:17:41 --
            mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
            Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /mnt/mdtest/
            Path: /mnt/mdtest
            FS: 98.7 TiB   Used FS: 0.0%   Inodes: 32.0 Mi   Used Inodes: 0.0%
            
            1 tasks, 5040 files/directories
            
            SUMMARY: (of 20 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:       3797.437       3241.010       3581.207        179.154
               Directory stat    :       8885.475       8488.148       8680.477         89.058
               Directory removal :       3815.363       3292.796       3638.044        159.870
               File creation     :       2451.821       2284.533       2364.546         49.688
               File stat         :       3532.868       3284.716       3426.642         68.167
               File read         :       8745.646       7888.261       8479.615        199.443
               File removal      :       2659.047       2475.945       2573.788         64.199
               Tree creation     :       3522.699        797.295       3290.452        578.813
               Tree removal      :       3246.246       2869.909       3151.856         75.039
            
            -- finished at 10/07/2013 15:20:52 --
            

            Roughly speaking, ldiskfs is nearly twice as fast as ZFS on these artificial metadata tests, except for the stat and read operations.

            Repeating the experiment, this time on plain formatted ext4 and ZFS file systems (no Lustre involved), on the same hardware as the original MGS/MDT server, gives the following results:

            *** EXT4
            -- started at 10/08/2013 10:26:55 --
            
            mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
            Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /ext4/mdtest
            Path: /ext4
            FS: 63.0 GiB   Used FS: 0.3%   Inodes: 4.0 Mi   Used Inodes: 0.0%
            
            1 tasks, 5040 files/directories
            
            SUMMARY: (of 20 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      40562.779      30483.751      35626.407       3019.069
               Directory stat    :     146904.697     144106.646     145177.353        735.623
               Directory removal :      45658.402      18579.207      42666.602       7721.446
               File creation     :      55150.631      54306.775      54710.376        272.139
               File stat         :     145148.567     142614.316     143752.697        712.729
               File read         :     118738.722     115982.356     117299.713        677.185
               File removal      :      74535.433      72932.338      73898.577        552.812
               Tree creation     :      45488.234      19224.529      30160.072       8360.361
               Tree removal      :      21829.091      21270.317      21597.907        166.265
            
            -- finished at 10/08/2013 10:27:06 --
            
            *** ZFS
            -- started at 10/08/2013 10:24:13 --
            
            mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
            Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /zfs/mdtest
            Path: /zfs
            FS: 63.0 GiB   Used FS: 0.0%   Inodes: 126.0 Mi   Used Inodes: 0.0%
            
            1 tasks, 5040 files/directories
            
            SUMMARY: (of 20 iterations)
               Operation                      Max            Min           Mean        Std Dev
               ---------                      ---            ---           ----        -------
               Directory creation:      17430.759       3494.324      13857.069       3667.221
               Directory stat    :     126509.106     124125.352     125720.502        641.879
               Directory removal :      17380.099       1341.726      16070.861       3468.179
               File creation     :      19416.201       1946.750      14450.802       4466.843
               File stat         :     126687.275     124279.327     125842.726        602.232
               File read         :     109161.802     106555.834     107863.681        674.730
               File removal      :      18087.791       1073.455      15315.115       5133.140
               Tree creation     :      19085.674       3313.867      17736.690       3428.476
               Tree removal      :      11679.683       1222.614      10843.046       2247.838
            
            -- finished at 10/08/2013 10:24:58 --
            

            Of course one can question how well such metadata benchmarks reflect real working sets; however, just by comparing plain ZFS against ext4, one could conclude that the slow metadata performance is NOT due to Lustre.

            Thomas.

            thomas.stibor Thomas Stibor added a comment
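
            For reference, the plain (non-Lustre) comparison above can be reproduced with something like the following sketch; the device names and the pool/mountpoint names are placeholders, and the mdtest arguments are the ones used in the runs above:

            # Plain ext4 target (device names below are placeholders)
            mkfs.ext4 /dev/sdX
            mkdir -p /ext4
            mount /dev/sdX /ext4
            mkdir -p /ext4/mdtest

            # Plain ZFS target on a second, identical device
            zpool create testpool /dev/sdY
            zfs create -o mountpoint=/zfs testpool/local
            mkdir -p /zfs/mdtest

            # Same mdtest invocation as above, once per target
            ./mdtest -i 20 -b 2 -I 80 -z 5 -d /ext4/mdtest
            ./mdtest -i 20 -b 2 -I 80 -z 5 -d /zfs/mdtest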

            Some improvements have been made to ZFS performance, but this is still an ongoing issue, so moving this to 2.5.1 along with LU-2476.

            adilger Andreas Dilger added a comment

            Actually, my previous comment is incorrect. That patch may help with some real-world workloads like untar, but would not help mds-survey or similar that are not doing chown/chmod.

            In LU-2476, Alex posted a link to http://review.whamcloud.com/7157 "a proto for optimized object accounting", which I think is actually more relevant to this bug. It batches the quota accounting updates, which was part of the change in Minh's first test that doubled the ZFS performance. However, it wasn't clear whether it was the quota ZAP or the LMA/LOV xattrs that were the main bottleneck, so it would be good to test those separately.

            adilger Andreas Dilger added a comment
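
            To help separate the two, one could profile the MDS during a create run, along the lines of the oprofile data already attached. This is only a sketch using the legacy opcontrol workflow, and the vmlinux path is an assumption for this particular build:

            # Profile the MDS during a create-only mds-survey run and look at
            # where the time goes (e.g. zap_* for the quota/OI ZAPs vs. sa_*
            # for the SA-based xattrs).
            opcontrol --init
            opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
            opcontrol --start

            tests_str="create" thrlo=192 thrhi=192 file_count=384000 mds-survey

            opcontrol --dump
            opreport -l | head -50
            opcontrol --shutdown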

            The patch from LU-3671 (http://review.whamcloud.com/7257 "mdd: sync perm for dir and perm reduction only") may help this a little bit, but there are still other issues that need to be worked on.

            adilger Andreas Dilger added a comment

            Yeah, let's test this in the next session.

            I will help set up some basic iostat so we can get a little better picture of the data rates to the disks themselves.

            keith Keith Mannthey (Inactive) added a comment
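
            Something along these lines on the MDS during the run should be enough (the interval and the pool name are placeholders):

            # Per-device request sizes and utilization, 5-second samples
            iostat -xm 5 > /tmp/iostat.mds.log &

            # Per-vdev view from ZFS itself; "mdt0pool" is a placeholder pool name
            zpool iostat -v mdt0pool 5 > /tmp/zpool-iostat.mds.log &

            # ... run mds-survey / mdtest here ...

            kill %1 %2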

            We may be able to do this in the next test session. Keith, there are no brw_stats available under ZFS.

            cliffw Cliff White (Inactive) added a comment

            I remember Brian B. said it's doing OK locally. Would you mind running a few createmany instances in parallel against a locally mounted ZFS, please, so we have some basic numbers for pure ZFS? Lustre is doing much more than that (OI, a few EAs, etc.), but the numbers would still give us some idea.

            bzzz Alex Zhuravlev added a comment
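
            A rough sketch of such a run with the Lustre createmany/unlinkmany test utilities against a locally mounted ZFS dataset (the /tank/createmany mountpoint and the counts are placeholders):

            # Several createmany instances in parallel, one directory each,
            # to get a pure-ZFS create baseline.
            for i in $(seq 1 8); do
                mkdir -p /tank/createmany/dir$i
                createmany -o /tank/createmany/dir$i/f 100000 &
            done
            wait

            # Remove them again for an unlink baseline.
            for i in $(seq 1 8); do
                unlinkmany /tank/createmany/dir$i/f 100000 &
            done
            wait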

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: mdiep Minh Diep
              Votes: 0
              Watchers: 16
