[LU-2600] lustre metadata performance is very slow on zfs Created: 10/Jan/13 Updated: 09/Jan/20 Resolved: 09/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Minh Diep | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | JL, performance, prz, zfs |
| Environment: |
MDT is a zpool (pool: pool2) with 3 SATA drives. |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6060 |
| Description |
|
mds-survey shows that create and unlink are very slow.
[root@mds01 mds01]# tests_str="create lookup destroy" thrlo=192 thrhi=192 file_count=3840000 mds-survey |
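For context, a hedged sketch of a complete lustre-iokit mds-survey invocation with the parameters quoted above (dir_count comes from the later comments; the targets value is an example MDT name, not from this report):

  # Run the metadata survey against a single MDT with 192 threads.
  tests_str="create lookup destroy" \
  thrlo=192 thrhi=192 \
  file_count=3840000 dir_count=192 \
  targets="lustre-MDT0000" \
    mds-survey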
| Comments |
| Comment by Alex Zhuravlev [ 10/Jan/13 ] |
|
If possible, it'd be interesting to try with: |
| Comment by Minh Diep [ 14/Jan/13 ] |
|
1) quota accounting disabled (commented out the two zap_increment_int() calls in osd_object_create() in osd-zfs)
zfs: [root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
ldiskfs: [root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
2) LMA setting disabled (commented out osd_init_lma() in that same osd_object_create()) + quota accounting disabled
zfs: [root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
ldiskfs: [root@mds01 mds01]# dir_count=192 thrlo=192 thrhi=192 file_count=3840000 mds-survey
Mon Jan 14 23:26:56 PST 2013 /usr/bin/mds-survey from mds01 |
| Comment by Alex Zhuravlev [ 17/Jan/13 ] |
|
Please collect:
1) /proc/spl/kstat/zfs/dmu_tx - just before the run and right after
2) /proc/spl/kstat/zfs/txgs-*-mdt1 - a few times during the run
Also, on my local setup (4GB RAM) I observed that with 16+ threads the txgs are overflowing all the time. |
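A minimal collection sketch for the kstats requested above, assuming the per-dataset txg kstat file matches the txgs-*-mdt1 pattern quoted in the comment and that the survey is started from another shell:

  #!/bin/bash
  # Capture dmu_tx counters before and after the run, and sample the
  # txg history periodically while the run is in progress.
  OUT=/tmp/lu2600-kstats
  mkdir -p "$OUT"
  cat /proc/spl/kstat/zfs/dmu_tx > "$OUT/dmu_tx.before"
  for i in $(seq 1 30); do
      cat /proc/spl/kstat/zfs/txgs-*-mdt1 > "$OUT/txgs.$i"
      sleep 10
  done
  cat /proc/spl/kstat/zfs/dmu_tx > "$OUT/dmu_tx.after"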
| Comment by Cliff White (Inactive) [ 21/May/13 ] |
|
agb5 - MDS/MGS, iwc - client, dit29 - OSS |
| Comment by Peter Jones [ 21/May/13 ] |
|
Alex, could you please comment on this? Thanks, Peter |
| Comment by Cliff White (Inactive) [ 21/May/13 ] |
|
Oprofile data from second mdtest run. MDS/MGS = agb5, OSS = agb14, client = iwc44 |
| Comment by Cliff White (Inactive) [ 21/May/13 ] |
|
We have done further testing with oprofile on Hyperion. We see ZFS performance at about 1/2 to 1/3 of ldiskfs performance on the same hardware. |
| Comment by Keith Mannthey (Inactive) [ 22/May/13 ] |
vma              samples  %       image name  app name  symbol name
ffffffff812d3cd0 89202    8.2903  vmlinux     vmlinux   intel_idle
ffffffff8127f7d0 37003    3.4390  vmlinux     vmlinux   format_decode
ffffffff812811b0 26557    2.4682  vmlinux     vmlinux   vsnprintf
ffffffff8127f3f0 24524    2.2792  vmlinux     vmlinux   number
0000000000040540 24463    2.2736  zfs.ko      zfs.ko    lzjb_decompress
ffffffff812834c0 20643    1.9185  vmlinux     vmlinux   memcpy
ffffffff812d93e0 19812    1.8413  vmlinux     vmlinux   port_inb
ffffffff8150f460 17173    1.5960  vmlinux     vmlinux   mutex_lock
ffffffff81059540 14331    1.3319  vmlinux     vmlinux   find_busiest_group
ffffffff81169160 14164    1.3164  vmlinux     vmlinux   kfree
ffffffff81415450 13765    1.2793  vmlinux     vmlinux   poll_idle
ffffffff8150f1a0 13696    1.2729  vmlinux     vmlinux   mutex_unlock
0000000000040610 13003    1.2085  zfs.ko      zfs.ko    lzjb_compress
ffffffff81283780 11341    1.0540  vmlinux     vmlinux   memset
0000000000003090 10486    0.9746  spl.ko      spl.ko    taskq_thread
ffffffff8150db90 9346     0.8686  vmlinux     vmlinux   schedule
ffffffff8127eb70 9255     0.8602  vmlinux     vmlinux   strrchr
ffffffff81052130 9172     0.8524  vmlinux     vmlinux   mutex_spin_on_owner
ffffffff8127ec90 8590     0.7983  vmlinux     vmlinux   strlen
ffffffff8109b960 8120     0.7547  vmlinux     vmlinux   __hrtimer_start_range_ns

Why are print functions so high? |
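For reference, a hedged sketch of how such a profile is typically collected with the legacy oprofile tools (paths are examples; the image path must point at the installed zfs/spl modules):

  # Profile the MDS while the benchmark runs, then report per-symbol samples.
  opcontrol --init
  opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
  opcontrol --start
  # ... run mdtest / mds-survey against this node ...
  opcontrol --stop
  opreport -l --image-path=/lib/modules/$(uname -r)/extra > /tmp/oprofile-mds.txt
  opcontrol --shutdown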
| Comment by Alex Zhuravlev [ 23/May/13 ] |
|
In my local testing it was the EAs that were pretty expensive. Also, I'd expect quota accounting contributes to this as well. |
| Comment by Liang Zhen (Inactive) [ 23/May/13 ] |
|
I would say the printf cost is expected; CDEBUG should contribute most of those calls. |
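One hedged way to test that hypothesis is to clear the Lustre debug mask on the MDS and re-run the profile; if CDEBUG formatting dominates, the vsnprintf/format_decode samples should drop noticeably:

  # Record and clear the debug mask, re-run the benchmark, then restore it.
  lctl get_param debug           # note the current mask
  lctl set_param debug=0         # stop generating CDEBUG messages
  # ... re-run mds-survey / mdtest and re-profile ...
  lctl set_param debug="<previous mask>"   # restore the value noted above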
| Comment by Keith Mannthey (Inactive) [ 23/May/13 ] |
|
Cliff, I wonder, do you have any iostat data or Lustre /proc stats? |
| Comment by Alex Zhuravlev [ 27/May/13 ] |
|
I remember Brian B. said it's doing OK locally. Would you mind trying a few createmany runs in parallel on a locally mounted ZFS filesystem, please, so we have some baseline numbers for pure ZFS? Lustre is doing much more than that (OI, a few EAs, etc.), but the numbers would still give us some idea. |
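A hedged sketch of such a baseline, using the createmany utility from lustre-tests against a locally mounted ZFS dataset (pool and dataset names are examples):

  # Create a scratch dataset and run several createmany instances in parallel;
  # each instance prints its creation rate when it finishes.
  zfs create -o mountpoint=/localtest pool2/localtest
  for i in $(seq 1 4); do
      mkdir -p /localtest/dir$i
      createmany -o /localtest/dir$i/f 100000 &
  done
  wait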
| Comment by Cliff White (Inactive) [ 16/Jul/13 ] |
|
We may be able to do this in the next test session. Keith, there are no brw_stats available under ZFS. |
| Comment by Keith Mannthey (Inactive) [ 16/Jul/13 ] |
|
Yes, let's test this in the next session. I will help set up some basic iostat collection so we can get a better picture of the data rates to the disks themselves. |
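A hedged sketch of the kind of iostat collection that could run alongside the next session's tests (interval and output path are examples):

  # Extended per-device stats in MB/s with timestamps, sampled every 10s
  # for the duration of the benchmark on the MDS and OSS nodes.
  iostat -xmt 10 > /tmp/iostat-$(hostname).log &
  IOSTAT_PID=$!
  # ... run the test session ...
  kill $IOSTAT_PID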
| Comment by Andreas Dilger [ 03/Sep/13 ] |
|
The patch from |
| Comment by Andreas Dilger [ 20/Sep/13 ] |
|
Actually, my previous comment is incorrect. That patch may help with some real-world workloads like untar, but would not help mds-survey or similar workloads that are not doing chown/chmod. In |
| Comment by Andreas Dilger [ 01/Oct/13 ] |
|
Some improvements have been made to ZFS performance, but this is still an ongoing issue, so this is being moved to 2.5.1 along with |
| Comment by Thomas Stibor [ 08/Oct/13 ] |
|
I did some benchmarking with Lustre-ZFS vs. Lustre-LDISKFS and ZFS vs. EXT4 with mdtest. The results suggest that the slow metadata performance is probably due to ZFS rather than to Lustre. The following setup is used: 1 MGS/MDT server, formatted with ldiskfs (ext4) or ZFS [build: 2.4.0-RC2-gd3f91c4-PRISTINE-2.6.32-358.6.2.el6_lustre.g230b174.x86_64]. The benchmark is performed on the client.

** Setup: single MDT0 with ZFS, OSS/OST with ZFS, mdtest executed on the client

-- started at 10/07/2013 16:43:48 --
mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /mnt/mdtest/
Path: /mnt/mdtest
FS: 98.7 TiB   Used FS: 0.0%   Inodes: 0.5 Mi   Used Inodes: 0.0%
1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation            Max        Min        Mean      Std Dev
   ---------            ---        ---        ----      -------
   Directory creation:  1948.194   1717.011   1814.171    58.454
   Directory stat    :  8550.010   7276.497   8112.847   415.032
   Directory removal :  2045.658   1892.629   1963.691    46.917
   File creation     :  1188.975   1118.650   1152.378    18.880
   File stat         :  3398.468   3222.576   3328.069    53.387
   File read         :  8630.149   8034.409   8421.248   151.027
   File removal      :  1393.756   1296.246   1340.168    28.650
   Tree creation     :  1853.699    713.171   1713.243   234.610
   Tree removal      :  1811.968   1600.404   1734.573    42.491
-- finished at 10/07/2013 16:49:14 --

** Setup: single MDT0 with ldiskfs (ext4), OSS/OST with ZFS, mdtest executed on the client

-- started at 10/07/2013 15:17:41 --
mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /mnt/mdtest/
Path: /mnt/mdtest
FS: 98.7 TiB   Used FS: 0.0%   Inodes: 32.0 Mi   Used Inodes: 0.0%
1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation            Max        Min        Mean      Std Dev
   ---------            ---        ---        ----      -------
   Directory creation:  3797.437   3241.010   3581.207   179.154
   Directory stat    :  8885.475   8488.148   8680.477    89.058
   Directory removal :  3815.363   3292.796   3638.044   159.870
   File creation     :  2451.821   2284.533   2364.546    49.688
   File stat         :  3532.868   3284.716   3426.642    68.167
   File read         :  8745.646   7888.261   8479.615   199.443
   File removal      :  2659.047   2475.945   2573.788    64.199
   Tree creation     :  3522.699    797.295   3290.452   578.813
   Tree removal      :  3246.246   2869.909   3151.856    75.039
-- finished at 10/07/2013 15:20:52 --

Roughly speaking, ldiskfs is nearly twice as fast as ZFS. Repeating the experiment, however, this time on plain formatted ext4 and ZFS filesystems (no Lustre involved):

*** EXT4

-- started at 10/08/2013 10:26:55 --
mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /ext4/mdtest
Path: /ext4
FS: 63.0 GiB   Used FS: 0.3%   Inodes: 4.0 Mi   Used Inodes: 0.0%
1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation            Max          Min          Mean        Std Dev
   ---------            ---          ---          ----        -------
   Directory creation:   40562.779    30483.751    35626.407   3019.069
   Directory stat    :  146904.697   144106.646   145177.353    735.623
   Directory removal :   45658.402    18579.207    42666.602   7721.446
   File creation     :   55150.631    54306.775    54710.376    272.139
   File stat         :  145148.567   142614.316   143752.697    712.729
   File read         :  118738.722   115982.356   117299.713    677.185
   File removal      :   74535.433    72932.338    73898.577    552.812
   Tree creation     :   45488.234    19224.529    30160.072   8360.361
   Tree removal      :   21829.091    21270.317    21597.907    166.265
-- finished at 10/08/2013 10:27:06 --

*** ZFS

-- started at 10/08/2013 10:24:13 --
mdtest-1.9.1 was launched with 1 total task(s) on 1 node(s)
Command line used: ./mdtest -i 20 -b 2 -I 80 -z 5 -d /zfs/mdtest
Path: /zfs
FS: 63.0 GiB   Used FS: 0.0%   Inodes: 126.0 Mi   Used Inodes: 0.0%
1 tasks, 5040 files/directories

SUMMARY: (of 20 iterations)
   Operation            Max          Min          Mean        Std Dev
   ---------            ---          ---          ----        -------
   Directory creation:   17430.759     3494.324    13857.069   3667.221
   Directory stat    :  126509.106   124125.352   125720.502    641.879
   Directory removal :   17380.099     1341.726    16070.861   3468.179
   File creation     :   19416.201     1946.750    14450.802   4466.843
   File stat         :  126687.275   124279.327   125842.726    602.232
   File read         :  109161.802   106555.834   107863.681    674.730
   File removal      :   18087.791     1073.455    15315.115   5133.140
   Tree creation     :   19085.674     3313.867    17736.690   3428.476
   Tree removal      :   11679.683     1222.614    10843.046   2247.838
-- finished at 10/08/2013 10:24:58 --

Of course one can question how well such metadata benchmarks reflect true working sets; however, just by observing plain ZFS vs. ext4 one can conclude that the slow metadata performance is NOT due to Lustre. Thomas. |
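For reference, a hedged sketch of how the plain ext4 vs. ZFS comparison above can be reproduced locally (device names and mountpoints are examples; the mdtest arguments match the runs quoted above):

  # Local metadata comparison with no Lustre involved.
  mkfs.ext4 /dev/sdb
  mkdir -p /ext4 && mount /dev/sdb /ext4
  zpool create testpool /dev/sdc
  zfs create -o mountpoint=/zfs testpool/mdtest
  ./mdtest -i 20 -b 2 -I 80 -z 5 -d /ext4/mdtest
  ./mdtest -i 20 -b 2 -I 80 -z 5 -d /zfs/mdtest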
| Comment by Nathaniel Clark [ 09/May/14 ] |
|
http://review.whamcloud.com/#/c/7157/ was reverted to fix |
| Comment by Isaac Huang (Inactive) [ 09/Oct/14 ] |
|
Just a note that if a patch that uses dsl_sync_task is landed again, we'd need to patch ZFS so as not to increase async writes when there are only no-waiter sync tasks pending. See: |
| Comment by Alex Zhuravlev [ 07/Oct/15 ] |
|
ZAP prefetching at object creation should improve metadata performance. |