[LU-2600] lustre metadata performance is very slow on zfs Created: 10/Jan/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Minh Diep Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: JL, performance, prz, zfs
Environment:

mdt is a zpool with 3 sata drives

pool: pool2
state: ONLINE
scan: none requested
config:

NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0


Attachments: File oprofile.tar.gz     File oprofile2.tar.gz    
Issue Links:
Blocker
is blocked by LU-5041 FID Prefetching Open
is blocked by LU-7235 ZFS: dmu_object_alloc() serializes ob... Closed
Duplicate
is duplicated by LU-2476 poor OST file creation rate performan... Closed
Related
is related to LU-4108 Failure on test suite performance-san... Resolved
is related to LU-4968 Test failure sanity test_132: umount ... Resolved
is related to LU-4696 Test timeout on sanity test_51ba: nli... Resolved
Severity: 3
Rank (Obsolete): 6060

 Description   

mds-survey shows that create and unlink are very slow

[root@mds01 mds01]# tests_str="create lookup destroy" thrlo=192 thrhi=192 file_count=3840000 mds-survey
Wed Jan 2 16:43:00 PST 2013 /usr/bin/mds-survey from mds01
mdt 1 file 3840000 dir 192 thr 192 create 2220.02 [ 0.00,19997.98] lookup 9429.79 [ 0.00,41998.40] destroy 1545.46 [ 0.00,15998.32]
done!
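The summary line packs per-operation average rates (ops/sec) together with [min,max] per-interval bounds. A minimal parsing sketch (this helper is illustrative, not part of mds-survey itself):

```python
import re

def parse_mds_survey(line):
    """Parse an mds-survey summary line into {operation: (avg, min, max)}.

    Assumes the 'op avg [ min,max]' layout shown above; illustrative only.
    """
    results = {}
    for op, avg, lo, hi in re.findall(
            r"(\w+)\s+([\d.]+)\s+\[\s*([\d.]+),\s*([\d.]+)\]", line):
        results[op] = (float(avg), float(lo), float(hi))
    return results

line = ("mdt 1 file 3840000 dir 192 thr 192 "
        "create 2220.02 [ 0.00,19997.98] lookup 9429.79 [ 0.00,41998.40] "
        "destroy 1545.46 [ 0.00,15998.32]")
print(parse_mds_survey(line)["create"])  # → (2220.02, 0.0, 19997.98)
```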



 Comments   
Comment by Alex Zhuravlev [ 10/Jan/13 ]

if possible it'd be interesting to try with:
1) quota accounting disabled (comment out the two zap_increment_int() calls in osd_object_create() in osd-zfs)
2) LMA setting disabled (comment out osd_init_lma() in that same osd_object_create())

Comment by Minh Diep [ 14/Jan/13 ]

1) quota accounting disabled (comment two zap_increment_int() in osd_object_create() in osd-zfs)

zfs

[root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
Tue Jan 15 08:46:01 PST 2013 /usr/bin/mds-survey from mds01
mdt 1 file 3840000 dir 192 thr 192 create 3033.82 [ 0.00,20997.59] lookup 8937.11 [ 0.00,38997.93] md_getattr 2386.50 [ 0.00,19996.00] setxattr 2788.85 [ 0.00,16998.22] destroy 1572.63 [ 0.00,15998.16]
done!

ldiskfs

[root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
Tue Jan 15 08:27:37 PST 2013 /usr/bin/mds-survey from mds01
mdt 1 file 3840000 dir 192 thr 192 create 12324.83 [ 0.00,191982.34] lookup 2082165.28 [2082165.28,2082165.28] md_getattr 849267.98 [807941.83,807941.83] setxattr 13708.98 [ 0.00,191982.53] destroy 15192.13 [ 0.00,191980.23]
done!

2) LMA setting disabled (comment out osd_init_lma() in that osd_object_create()) + quota accounting disabled

zfs

[root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
Mon Jan 14 22:17:39 PST 2013 /usr/bin/mds-survey from mds01
mdt 1 file 3840000 dir 192 thr 192 create 6278.19 [ 0.00,20997.61] lookup 60380.25 [ 0.00,128983.49] md_getattr 57846.24 [ 0.00,179987.40] setxattr 3235.05 [ 0.00,57991.13] destroy 2234.80 [ 0.00,14998.41]
done!

ldiskfs

[root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
Mon Jan 14 23:26:56 PST 2013 /usr/bin/mds-survey from mds01
mdt 1 file 3840000 dir 192 thr 192 create 13928.46 [ 0.00,191979.27] lookup 2028421.14 [2028421.14,2028421.14] md_getattr 829627.73 [809944.92,809944.92] setxattr 16770.19 [ 0.00,191985.22] destroy 14236.86 [ 0.00,183984.18]
done!

Comment by Alex Zhuravlev [ 17/Jan/13 ]

please collect:

1) /proc/spl/kstat/zfs/dmu_tx - just before the run and right after,
to see how often txgs are overflowed

2) /proc/spl/kstat/zfs/txgs-*-mdt1 - a few times during the run,
to see the amount of reads/writes and the lifetime of txgs

also, on my local setup (4GB RAM) I observed that with 16+ threads txgs are overflowed all the time,
and only 8 threads let the txgs go well. but with 8 threads overall performance isn't that great.
please try with different numbers of threads and collect (1) and (2)
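Diffing a before/after snapshot of the dmu_tx kstat makes the overflow counters easy to read. A minimal sketch, assuming the usual SPL kstat layout of whitespace-separated rows with an integer counter in the last column (the exact columns vary by kstat file):

```python
def diff_kstat(before, after):
    """Diff two snapshots of a /proc/spl/kstat/zfs text file.

    Assumes 'name ... value' rows with an integer in the last column;
    header lines without a trailing integer are skipped. Illustrative only.
    """
    def parse(text):
        stats = {}
        for line in text.splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[-1].isdigit():
                stats[parts[0]] = int(parts[-1])
        return stats
    b, a = parse(before), parse(after)
    return {k: a[k] - b.get(k, 0) for k in a}

# Synthetic snapshots in the assumed format:
before = "name type data\ndmu_tx_assigned 4 100\ndmu_tx_delay 4 2\n"
after  = "name type data\ndmu_tx_assigned 4 960\ndmu_tx_delay 4 57\n"
print(diff_kstat(before, after))  # → {'dmu_tx_assigned': 860, 'dmu_tx_delay': 55}
```

Capturing the file once before and once right after the mds-survey run, then diffing, shows how many transactions were assigned vs. delayed during the run.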

Comment by Cliff White (Inactive) [ 21/May/13 ]

agb5 -MDS/MGS, iwc client, dit29 -OSS

Comment by Peter Jones [ 21/May/13 ]

Alex

Could you please comment on this?

Thanks

Peter

Comment by Cliff White (Inactive) [ 21/May/13 ]

Oprofile data from second mdtest run. MDS/MGS = agb5, OSS = agb14, client = iwc44

Comment by Cliff White (Inactive) [ 21/May/13 ]

We have done further testing with oprofile on Hyperion. We see ZFS performance at about 1/2 to 1/3 of ldiskfs performance on the same hardware.
Sample ZFS run of mdtest 64 clients:
MDTEST RESULTS
0000: SUMMARY: (of 2 iterations)
0000: Operation Max Min Mean Std Dev
0000: --------- --- --- ---- -------
0000: Directory creation: 6763.114 6344.339 6553.727 209.388
0000: Directory stat : 90793.846 81346.095 86069.971 4723.875
0000: Directory removal : 8081.377 7495.103 7788.240 293.137
0000: File creation : 5954.588 4746.843 5350.716 603.872
0000: File stat : 106938.996 106643.378 106791.187 147.809
0000: File removal : 6058.458 5910.333 5984.395 74.062
0000: Tree creation : 4.561 1.916 3.239 1.323
0000: Tree removal : 4.861 4.851 4.856 0.005
ldiskfs:
0000: SUMMARY: (of 5 iterations)
0000: Operation Max Min Mean Std Dev
0000: --------- --- --- ---- -------
0000: Directory creation: 23037.696 13215.305 17736.740 4389.854
0000: Directory stat : 109887.449 108488.532 109053.155 461.796
0000: Directory removal : 29968.170 18835.706 26369.453 4035.133
0000: File creation : 25026.952 21532.197 23428.767 1151.335
0000: File stat : 109340.749 107623.272 108451.866 724.519
0000: File removal : 31242.455 19217.523 24616.635 4673.904
0000: Tree creation : 25.988 22.719 24.484 1.050
0000: Tree removal : 16.938 13.962 15.872 1.069
----------
oprofile of MDS, one client and one OSS attached.
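Using the mean rates from the two summaries above, the gap on the modify-heavy operations can be quantified directly (numbers copied verbatim from this comment):

```python
# Mean rates (ops/sec) taken from the ZFS and ldiskfs mdtest summaries above.
zfs = {"Directory creation": 6553.727, "Directory removal": 7788.240,
       "File creation": 5350.716, "File removal": 5984.395}
ldiskfs = {"Directory creation": 17736.740, "Directory removal": 26369.453,
           "File creation": 23428.767, "File removal": 24616.635}

for op in zfs:
    ratio = zfs[op] / ldiskfs[op]
    print(f"{op}: ZFS runs at {ratio:.2f}x of ldiskfs")
```

For these runs the create/remove operations land at roughly 0.23x to 0.37x of ldiskfs, consistent with the "1/2 to 1/3" observation above (and a bit worse for file creation).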

Comment by Keith Mannthey (Inactive) [ 22/May/13 ]
vma      samples  %        image name               app name                 symbol name
ffffffff812d3cd0 89202     8.2903  vmlinux                  vmlinux                  intel_idle
ffffffff8127f7d0 37003     3.4390  vmlinux                  vmlinux                  format_decode
ffffffff812811b0 26557     2.4682  vmlinux                  vmlinux                  vsnprintf
ffffffff8127f3f0 24524     2.2792  vmlinux                  vmlinux                  number
0000000000040540 24463     2.2736  zfs.ko                   zfs.ko                   lzjb_decompress
ffffffff812834c0 20643     1.9185  vmlinux                  vmlinux                  memcpy
ffffffff812d93e0 19812     1.8413  vmlinux                  vmlinux                  port_inb
ffffffff8150f460 17173     1.5960  vmlinux                  vmlinux                  mutex_lock
ffffffff81059540 14331     1.3319  vmlinux                  vmlinux                  find_busiest_group
ffffffff81169160 14164     1.3164  vmlinux                  vmlinux                  kfree
ffffffff81415450 13765     1.2793  vmlinux                  vmlinux                  poll_idle
ffffffff8150f1a0 13696     1.2729  vmlinux                  vmlinux                  mutex_unlock
0000000000040610 13003     1.2085  zfs.ko                   zfs.ko                   lzjb_compress
ffffffff81283780 11341     1.0540  vmlinux                  vmlinux                  memset
0000000000003090 10486     0.9746  spl.ko                   spl.ko                   taskq_thread
ffffffff8150db90 9346      0.8686  vmlinux                  vmlinux                  schedule
ffffffff8127eb70 9255      0.8602  vmlinux                  vmlinux                  strrchr
ffffffff81052130 9172      0.8524  vmlinux                  vmlinux                  mutex_spin_on_owner
ffffffff8127ec90 8590      0.7983  vmlinux                  vmlinux                  strlen
ffffffff8109b960 8120      0.7547  vmlinux                  vmlinux                  __hrtimer_start_range_ns

Why are print functions so high?

Comment by Alex Zhuravlev [ 23/May/13 ]

in my local testing it was the EAs that were pretty expensive. also, I'd expect quota accounting contributes to this as well.

Comment by Liang Zhen (Inactive) [ 23/May/13 ]

I would say printf is fine; CDEBUG should contribute most of those samples.
Actually I cannot get much information from the oprofile output. Everything looks reasonable, which suggests there could be heavy operations protected by a mutex/semaphore.

Comment by Keith Mannthey (Inactive) [ 23/May/13 ]

Cliff, I wonder, do you have any iostat data or Lustre /proc stats?

Comment by Alex Zhuravlev [ 27/May/13 ]

I remember Brian B. said it's doing OK locally. would you mind trying a few createmany runs in parallel against locally mounted ZFS, please, so we have some baseline numbers for pure ZFS? Lustre is doing much more than that (OI, a few EAs, etc.), but the numbers would still give us some idea.
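In the same spirit as parallel createmany runs, a rough stand-in that measures the aggregate file-creation rate from several processes against any locally mounted filesystem (point the directory at the ZFS mount; this is an illustration, not the Lustre createmany tool):

```python
import os
import tempfile
import time
from multiprocessing import Pool

def create_files(args):
    """Create `count` empty files in `directory`; return elapsed seconds."""
    directory, worker, count = args
    start = time.time()
    for i in range(count):
        open(os.path.join(directory, f"f{worker}-{i}"), "w").close()
    return time.time() - start

def parallel_create(directory, workers=4, count=1000):
    """Aggregate create rate (files/sec) across `workers` processes,
    using the slowest worker's elapsed time as the wall-clock bound."""
    with Pool(workers) as pool:
        slowest = max(pool.map(
            create_files, [(directory, w, count) for w in range(workers)]))
    return workers * count / slowest

if __name__ == "__main__":
    # For a real baseline, point this at a directory on the ZFS pool
    # instead of a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        print(f"{parallel_create(d):.0f} creates/sec")
```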

Comment by Cliff White (Inactive) [ 16/Jul/13 ]

We may be able to do this in the next test session. Keith, there are no brw_stats available under ZFS.

Comment by Keith Mannthey (Inactive) [ 16/Jul/13 ]

Yes, let's test this in the next session.

I will help set up some basic iostat monitoring so we can get a somewhat better picture of the data rates to the disks themselves.

Comment by Andreas Dilger [ 03/Sep/13 ]

The patch from LU-3671 (http://review.whamcloud.com/7257 "mdd: sync perm for dir and perm reduction only") may help this a little bit, but there are still other issues that need to be worked on.

Comment by Andreas Dilger [ 20/Sep/13 ]

Actually, my previous comment is incorrect. That patch may help with some real-world workloads like untar, but would not help mds-survey or similar that are not doing chown/chmod.

In LU-2476 Alex posted a link to http://review.whamcloud.com/7157 "a proto for optimized object accounting" which I think is actually more relevant to this bug. It batches the quota accounting updates, which was part of the change in Minh's first test that doubled the ZFS performance. However it wasn't clear if it was the quota zap or the LMA/LIV xattrs that were the main bottleneck, so it would be good to test those separately.
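The batching idea can be illustrated abstractly (hypothetical names, not the osd-zfs or prototype API): instead of updating the backing accounting store once per create, per-UID increments are coalesced in memory and applied once per flush:

```python
from collections import defaultdict

class BatchedAccounting:
    """Illustration of batched accounting updates (hypothetical names,
    not the osd-zfs API): per-UID increments are coalesced in memory
    and applied to the backing store once, instead of once per create."""

    def __init__(self, store):
        self.store = store             # stands in for the accounting ZAP
        self.pending = defaultdict(int)

    def increment(self, uid, delta=1):
        self.pending[uid] += delta     # cheap in-memory update per create

    def flush(self):
        # One backing-store update per UID per flush, regardless of
        # how many creates happened in between.
        for uid, delta in self.pending.items():
            self.store[uid] = self.store.get(uid, 0) + delta
        self.pending.clear()

store = {}
acct = BatchedAccounting(store)
for _ in range(1000):                  # 1000 creates by uid 500
    acct.increment(500)
acct.flush()                           # a single store update, not 1000
print(store[500])                      # → 1000
```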

Comment by Andreas Dilger [ 01/Oct/13 ]

Some improvements have been made to ZFS performance, but this is still an ongoing issue, so this is being moved to 2.5.1 along with LU-2476.

Comment by Thomas Stibor [ 08/Oct/13 ]

I did some benchmarking of Lustre-ZFS vs. Lustre-ldiskfs, and of plain ZFS vs. ext4, with mdtest. The results suggest that the slow metadata performance is probably due to ZFS rather than to Lustre. The following setup is used:

1 MGS/MDT server, formatted with ldiskfs(ext4) or ZFS [build: 2.4.0-RC2-gd3f91c4-PRISTINE-2.6.32-358.6.2.el6_lustre.g230b174.x86_64]
1 OSS/OST server, formatted with ZFS [build: v2_4_92_0-ge089a51-CHANGED-3.6.11-lustre-tstibor-build]
1 Client [build: v2_4_92_0-ge089a51-CHANGED-3.6.11-lustre-tstibor-build]
(lustre mountpoint /mnt)

The benchmark is performed on the client
and gives the following results:

** Setup, single MDT0 with ZFS, OSS/OST with ZFS and mdtest executed on the client
-- started at 10/07/2013 16:43:48 --

mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /mnt/mdtest/
Path: /mnt/mdtest
FS: 98.7 TiB   Used FS: 0.0%   Inodes: 0.5 Mi   Used Inodes: 0.0%

1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:       1948.194       1717.011       1814.171         58.454
   Directory stat    :       8550.010       7276.497       8112.847        415.032
   Directory removal :       2045.658       1892.629       1963.691         46.917
   File creation     :       1188.975       1118.650       1152.378         18.880
   File stat         :       3398.468       3222.576       3328.069         53.387
   File read         :       8630.149       8034.409       8421.248        151.027
   File removal      :       1393.756       1296.246       1340.168         28.650
   Tree creation     :       1853.699        713.171       1713.243        234.610
   Tree removal      :       1811.968       1600.404       1734.573         42.491

-- finished at 10/07/2013 16:49:14 --
** Setup, single MDT0 with ldiskfs (ext4), OSS/OST with ZFS and mdtest executed on the client
-- started at 10/07/2013 15:17:41 --
mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /mnt/mdtest/
Path: /mnt/mdtest
FS: 98.7 TiB   Used FS: 0.0%   Inodes: 32.0 Mi   Used Inodes: 0.0%

1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:       3797.437       3241.010       3581.207        179.154
   Directory stat    :       8885.475       8488.148       8680.477         89.058
   Directory removal :       3815.363       3292.796       3638.044        159.870
   File creation     :       2451.821       2284.533       2364.546         49.688
   File stat         :       3532.868       3284.716       3426.642         68.167
   File read         :       8745.646       7888.261       8479.615        199.443
   File removal      :       2659.047       2475.945       2573.788         64.199
   Tree creation     :       3522.699        797.295       3290.452        578.813
   Tree removal      :       3246.246       2869.909       3151.856         75.039

-- finished at 10/07/2013 15:20:52 --

Roughly speaking, ldiskfs is nearly twice as fast as ZFS on these
artificial metadata tests, except for the stat and read calls.

Repeating the experiment, this time on plainly formatted ext4 and ZFS filesystems (no Lustre involved), on the same hardware as the original MGS/MDT server, gives:

*** EXT4
-- started at 10/08/2013 10:26:55 --

mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /ext4/mdtest
Path: /ext4
FS: 63.0 GiB   Used FS: 0.3%   Inodes: 4.0 Mi   Used Inodes: 0.0%

1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      40562.779      30483.751      35626.407       3019.069
   Directory stat    :     146904.697     144106.646     145177.353        735.623
   Directory removal :      45658.402      18579.207      42666.602       7721.446
   File creation     :      55150.631      54306.775      54710.376        272.139
   File stat         :     145148.567     142614.316     143752.697        712.729
   File read         :     118738.722     115982.356     117299.713        677.185
   File removal      :      74535.433      72932.338      73898.577        552.812
   Tree creation     :      45488.234      19224.529      30160.072       8360.361
   Tree removal      :      21829.091      21270.317      21597.907        166.265

-- finished at 10/08/2013 10:27:06 --
*** ZFS
-- started at 10/08/2013 10:24:13 --

mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /zfs/mdtest
Path: /zfs
FS: 63.0 GiB   Used FS: 0.0%   Inodes: 126.0 Mi   Used Inodes: 0.0%

1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      17430.759       3494.324      13857.069       3667.221
   Directory stat    :     126509.106     124125.352     125720.502        641.879
   Directory removal :      17380.099       1341.726      16070.861       3468.179
   File creation     :      19416.201       1946.750      14450.802       4466.843
   File stat         :     126687.275     124279.327     125842.726        602.232
   File read         :     109161.802     106555.834     107863.681        674.730
   File removal      :      18087.791       1073.455      15315.115       5133.140
   Tree creation     :      19085.674       3313.867      17736.690       3428.476
   Tree removal      :      11679.683       1222.614      10843.046       2247.838

-- finished at 10/08/2013 10:24:58 --

Of course one can question how well such metadata benchmarks reflect true working sets; however, just by observing plain ZFS vs. ext4 one could conclude that the slow metadata performance is NOT due to Lustre.

Thomas.

Comment by Nathaniel Clark [ 09/May/14 ]

http://review.whamcloud.com/#/c/7157/ was reverted to fix LU-4968. If this is resubmitted, please include the changes that happened for LU-4944 (http://review.whamcloud.com/#/c/10064/).

Comment by Isaac Huang (Inactive) [ 09/Oct/14 ]

Just a note: if a patch that uses dsl_sync_task lands again, we'd need to patch ZFS so as not to increase async writes when only no-waiter sync tasks are pending. See:
https://github.com/zfsonlinux/zfs/pull/2716#issuecomment-58540555

Comment by Alex Zhuravlev [ 07/Oct/15 ]

ZAP prefetching at object creation should improve metadata performance.

Generated at Sat Feb 10 01:26:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.