<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:08:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14320] Poor zfs performance (particularly reads) with ZFS 0.8.5 on RHEL 7.9</title>
                <link>https://jira.whamcloud.com/browse/LU-14320</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Creating a new issue as a follow-on to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14293&quot; title=&quot;Poor lnet/ksocklnd(?) performance on 2x100G bonded ethernet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14293&quot;&gt;&lt;del&gt;LU-14293&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This issue is affecting one production file system and one that&apos;s currently in acceptance.&lt;/p&gt;

&lt;p&gt;When we stood up the system in acceptance, we ran some benchmarks on the raw block storage, so we&apos;re confident that the block storage can provide ~7GB/s read per LUN, with ~65GB/s read across the 12 LUNs in aggregate. What we did not do, however, was run any benchmarks on ZFS after the zpools were created on top of the LUNs. Since LNET was no longer our bottleneck, we figured it would make sense to verify the stack from the bottom up, starting with the zpools. We set the zpools to &lt;tt&gt;canmount=on&lt;/tt&gt; and changed the mountpoints, then mounted them and ran fio on them. Performance is &lt;b&gt;terrible&lt;/b&gt;.&lt;/p&gt;
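
&lt;p&gt;For reference, the mount-for-testing steps were roughly the following (a sketch; the pool name &lt;tt&gt;ost0pool&lt;/tt&gt; and the mountpoint are placeholders):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# let the pool&apos;s root dataset be mounted directly
zfs set canmount=on ost0pool
zfs set mountpoint=/mnt/ost0pool ost0pool
zfs mount ost0pool
# then run fio against files under /mnt/ost0pool
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;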

&lt;p&gt;Given that we have another file system running with the exact same tunings and general layout, we also checked that file system in the same manner, with much the same results. Since we have past benchmarking results from that file system, we&apos;re fairly confident that at some point in the past ZFS was functioning correctly. With that knowledge (and after looking at various ZFS GitHub issues) we decided to roll back from ZFS 0.8.5 to 0.7.13 to test the performance there. 0.7.13 provides the same poor results.&lt;/p&gt;

&lt;p&gt;There may be value in rolling back our kernel to match the one that was running when we initialized the other file system, in case there&apos;s some odd interaction with the kernel version we&apos;re running now, but I&apos;m not sure.&lt;/p&gt;


&lt;p&gt;Here are the results of our testing on a single LUN with ZFS. Keep in mind this LUN can do ~7GB/s at the block level.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
files    | read     | write
1 file   | 396 MB/s | 4.2 GB/s
4 files  | 751 MB/s | 4.7 GB/s
12 files | 1.6 GB/s | 4.7 GB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;



&lt;p&gt;And here&apos;s the really simple fio command we&apos;re running to get these numbers:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
fio --rw=read --size=20G --bs=1M --name=something --ioengine=libaio --runtime=60s --numjobs=12
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
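
&lt;p&gt;To keep repeated reads from being served out of the ARC during these runs, data caching can be disabled for the duration. A sketch, with &lt;tt&gt;ost0pool&lt;/tt&gt; as a placeholder; this should be reverted afterwards:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
zfs set primarycache=metadata ost0pool   # cache metadata only, not file data
# ... run fio ...
zfs set primarycache=all ost0pool        # restore the default
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;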

&lt;p&gt;We&apos;re also noticing some issues where Lustre is eating into those numbers significantly when layered on top. We&apos;re going to hold off on debugging that at all until zfs is stable though, as it may just be due to the same zfs issues.&lt;/p&gt;

&lt;p&gt;Here are our current ZFS module tunings:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;metaslab_debug_unload=1&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_arc_max=150000000000&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_prefetch_disable=1&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_dirty_data_max_percent=30&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_arc_average_blocksize=1048576&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_max_recordsize=1048576&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_aggregation_limit=1048576&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_multihost_interval=10000&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_multihost_fail_intervals=0&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_async_write_active_min_dirty_percent=20&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_scheduler=deadline&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_async_write_max_active=10&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_async_write_min_active=5&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_async_read_max_active=16&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_vdev_async_read_min_active=16&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfetch_max_distance=67108864&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;dbuf_cache_max_bytes=10485760000&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;dbuf_cache_shift=3&apos;&lt;/span&gt;
  - &lt;span class=&quot;code-quote&quot;&gt;&apos;zfs_txg_timeout=60&apos;&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
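
&lt;p&gt;(These land as zfs module options; roughly, via a modprobe config like the following, where the path is just illustrative:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# /etc/modprobe.d/zfs.conf (illustrative path)
options zfs zfs_prefetch_disable=1 zfs_arc_max=150000000000 zfs_txg_timeout=60
# ...plus the remaining tunings from the list above, all on the one options line
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;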

&lt;p&gt;I&apos;ve tried with zfs checksums on and off with no real change in speed. Screen grabs of the flame graphs from those runs are attached.&lt;/p&gt;</description>
                <environment></environment>
        <key id="62300">LU-14320</key>
            <summary>Poor zfs performance (particularly reads) with ZFS 0.8.5 on RHEL 7.9</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="utopiabound">Nathaniel Clark</assignee>
                                    <reporter username="nilesj">Jeff Niles</reporter>
                        <labels>
                            <label>ORNL</label>
                            <label>ornl</label>
                    </labels>
                <created>Mon, 11 Jan 2021 14:59:36 +0000</created>
                <updated>Sat, 17 Sep 2022 19:41:07 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="289191" author="pjones" created="Mon, 11 Jan 2021 15:16:17 +0000"  >&lt;p&gt;Nathaniel&lt;/p&gt;

&lt;p&gt;What do you advise here?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="289195" author="utopiabound" created="Mon, 11 Jan 2021 15:44:30 +0000"  >&lt;p&gt;Okay, to be clear.&lt;/p&gt;

&lt;p&gt;You are running fio directly on a ZFS vdev and getting these results for both 0.8.5 and 0.7.13?&lt;/p&gt;

&lt;p&gt;Can you dump info on the format of the zpool? Is it just a single logical unit on a storage array?&lt;/p&gt;
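
&lt;p&gt;(For example, something like the following would show the layout; plain zpool commands, nothing Lustre-specific:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
zpool status      # vdev tree for each pool
zpool list -v     # pools with per-vdev sizes
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>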
                            <comment id="289207" author="nilesj" created="Mon, 11 Jan 2021 16:42:09 +0000"  >&lt;p&gt;Correct, we&apos;re running fio directly on the vdev and the results are the same for 0.8.5 and 0.7.13 (at least on this kernel version).&lt;/p&gt;

&lt;p&gt;As far as zpool layout, each OSS is primary for two zpools, each constructed from a single block device (LUN) provided by an external storage device that handles RAID (DDN). Any other info you need, let me know.&lt;/p&gt;</comment>
                            <comment id="289211" author="bzzz" created="Mon, 11 Jan 2021 17:22:10 +0000"  >&lt;p&gt;memcpy seem to contribute a lot. Lustre doesn&apos;t need extra copy and can send data from ARC directly&lt;/p&gt;</comment>
                            <comment id="289212" author="utopiabound" created="Mon, 11 Jan 2021 17:25:05 +0000"  >&lt;p&gt;would it be possible to attach an sosreport the OSS?&lt;/p&gt;</comment>
                            <comment id="289217" author="nilesj" created="Mon, 11 Jan 2021 18:08:02 +0000"  >&lt;p&gt;I sent the sosreport to your whamcloud email.&lt;/p&gt;</comment>
                            <comment id="289232" author="utopiabound" created="Mon, 11 Jan 2021 20:30:24 +0000"  >&lt;p&gt;The block scheduler for the disks and mpaths is &quot;mq-deadline&quot;. This is the system default, since &lt;tt&gt;zfs_vdev_scheduler&lt;/tt&gt; is disabled (at least in 2.0/master). I&apos;m wondering if setting the scheduler to none might help.&lt;/p&gt;

&lt;p&gt;The other oddity I found was that multipath has max_sectors_kb set to 8196 for the SFA14KX (but the current versions of the multipath.conf file I&apos;ve found do not have such a setting, and I believe the default is 32M instead of 8M). I&apos;m not sure this is affecting you, given the test blocksize is 1M.&lt;/p&gt;

&lt;p&gt;Does FIO perform better with &lt;tt&gt;zfs_prefetch_disable=0&lt;/tt&gt;?&lt;/p&gt;
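
&lt;p&gt;(Prefetch can be flipped at runtime via the module parameter, e.g.:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
echo 0 &gt; /sys/module/zfs/parameters/zfs_prefetch_disable
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;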

&lt;p&gt;There&apos;s also a small ARC fix in ZFS 0.8.6.&lt;/p&gt;</comment>
                            <comment id="289430" author="nilesj" created="Wed, 13 Jan 2021 18:02:10 +0000"  >&lt;p&gt;Sorry for the delayed response, we&apos;ve been working on testing and migrating over to a test system.&lt;/p&gt;

&lt;p&gt;Our zfs_vdev_scheduler is currently getting tuned to deadline. We tried setting it to noop, and then tried setting both it and the scheduler for the disks/mpaths to noop as well. No noticeable change in performance.&lt;/p&gt;

&lt;p&gt;We played with max_sectors_kb, and 32M doesn&apos;t seem to provide a tangible benefit either. We also tried setting nr_requests higher; same thing.&lt;/p&gt;

&lt;p&gt;We do get about a 2x speed increase (~1.3GB/s -&amp;gt; ~2.5GB/s) when enabling prefetching. While better, won&apos;t this impact smaller-file workloads in a negative way? Also, ~2.5GB/s is still way short of the mark. It &lt;b&gt;does&lt;/b&gt; prove that ZFS can push more bandwidth than it currently does.&lt;/p&gt;

&lt;p&gt;We also tried tuning the &lt;tt&gt;zfs_vdev_[async/sync]_read_[max/min]_active&lt;/tt&gt; parameters with values ranging from 1 to 256, particularly focused on &lt;tt&gt;zfs_vdev_async_read_max_active&lt;/tt&gt;. These also seemingly provided no change. It seems like we&apos;re bottlenecked somewhere else.&lt;/p&gt;
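
&lt;p&gt;(Changed at runtime through the module parameters, along the lines of:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
echo 64 &gt; /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;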

&lt;p&gt;We&apos;re also nearly set up for a test where we&apos;re going to break a LUN up into 8 smaller LUNs and then feed those into ZFS to see if it&apos;s choking on the single large block device. I don&apos;t think we really expect much out of it, but it will at least give us a data point. I&apos;ll let you know how that goes, but in the meantime do you have any more suggestions?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Jeff&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="289718" author="adilger" created="Mon, 18 Jan 2021 09:59:50 +0000"  >&lt;p&gt;A few comments here - the EDONR checksum that shows in the flame graphs seems to be consuming a &lt;b&gt;lot&lt;/b&gt; of CPU.  This checksum is new to me, so I&apos;m not sure of its performance or overhead.  Have you tried a more standard checksum (e.g. Fletcher4) which also has Intel CPU assembly optimizations that we added a few years ago?&lt;/p&gt;

&lt;p&gt;The other question of interest is what the zpool config is like (how many disks, how many VDEVs, RAID type, etc)? Definitely ZFS gets better performance driving separate zpools than having a large single zpool, since there is otherwise contention at commit time when there are many disks in the pool. On the one hand, several 8+2 RAID-Z2 pools as separate OSTs will probably give better performance, but on the other hand, there is convenience and some amount of additional robustness when having at least 3 VDEVs in the pool (it allows mirror metadata copies to be written to different disks).&lt;/p&gt;

&lt;p&gt;Finally, if you have some SSDs available and you are running ZFS 0.8+, it might be worthwhile to test with an SSD Metadata Allocation Class VDEV that is all-flash. Then ZFS could put all of the internal metadata (dnodes, indirect blocks, Merkle tree) on the SSDs and only use the HDDs for data.&lt;/p&gt;
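
&lt;p&gt;(Roughly, assuming a mirrored pair of NVMe devices; the pool and device names are placeholders:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
zpool add ost0pool special mirror nvme0n1 nvme1n1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>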
                            <comment id="289780" author="scadmin" created="Tue, 19 Jan 2021 09:04:11 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;I found this ticket by mistake, so please forgive the intrusion, but I had a thought: is your ashift=12? We tend to create zpools by hand, so I&apos;m not sure what the Lustre tools set.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# zpool get ashift
NAME                   PROPERTY  VALUE   SOURCE
arkle1-dagg-OST0-pool  ashift    12      local
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;Also, I&apos;m not sure I&apos;ve ever seen a great ZFS read speed, but we did tweak these a bit on our system:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# zfs get recordsize,dnodesize
NAME                        PROPERTY    VALUE    SOURCE
arkle1-dagg-OST0-pool       recordsize  2M       local
arkle1-dagg-OST0-pool       dnodesize   auto     local
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;with the zfs module option:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
options zfs zfs_max_recordsize=2097152
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also, FWIW, we use 12+3 RAID-Z3 vdevs with 4 vdevs per pool (i.e. each pool is 60 disks). No doubt Z2 is faster, but we use Z3 because speed isn&apos;t really our main goal.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="289813" author="nilesj" created="Tue, 19 Jan 2021 15:58:11 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;The image that has 1:12 in the title shows a later run with checksumming disabled entirely, which made no meaningful change to the outcome. I am curious about your thoughts on the checksum type though, as EDONR is set in the creation script on both this system and our other one; I think the reason we&apos;re using it has been lost to time. Should we consider changing to Fletcher4, regardless of performance impact? It would be pretty low effort.&lt;/p&gt;

&lt;p&gt;For the zpool config: Each OSS controls two zpools, each with a single VDEV created from a single ~550TB block device that&apos;s presented over IB via SRP. I believe zfs sets this up as RAID0 internally, but I&apos;m not sure.&lt;/p&gt;

&lt;p&gt;Unfortunately, I don&apos;t have the drives on hand to try that, but I think it would make a fantastic test. It might be useful to see whether it&apos;s a good idea to include SSD/NVMe in future OSS purchases to offload that VDEV onto.&lt;/p&gt;


&lt;p&gt;Robin,&lt;/p&gt;

&lt;p&gt;No worries on stopping by, we&apos;ll take all the help we can get. Yes, we currently set ashift to 12; recordsize on our systems is 1M to align with the block device, and dnodesize is set to auto.&lt;/p&gt;

&lt;p&gt;I assume your enclosures are direct attached and you let ZFS handle all the disks? I think this may be part of our problem; we&apos;re trying to offload as much of this onto the block storage as possible, and ZFS just doesn&apos;t like it.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Jeff&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="290113" author="adilger" created="Fri, 22 Jan 2021 06:32:07 +0000"  >&lt;p&gt;As I mentioned on the call today, but I&apos;ll record here as well, I don&apos;t think creating the zpool on a single large VDEV is very good for ZFS performance.  Preferably you should have 3 leaf VDEVs so that ditto blocks can be written to different devices.  Also, a single large zpool causes contention at commit time, and in the past we saw better performance with multiple smaller zpools (e.g. 2x 8+2 RAID-Z2 VDEVs per OST) to allow better parallelism.&lt;/p&gt;

&lt;p&gt;It sounds like you have a RAID controller in front of the disks?  Is it possible that the controller is interfering with the IO from ZFS?&lt;/p&gt;

&lt;p&gt;You don&apos;t need dnodesize=auto for the OSTs.  Also, depending on what ZFS version you have, there were previously problems with this feature on the MDT.&lt;/p&gt;</comment>
                            <comment id="290157" author="nilesj" created="Fri, 22 Jan 2021 18:23:59 +0000"  >&lt;p&gt;Agree that a single VDEV zpool probably isn&apos;t the best way to organize these. I think we&apos;ll try to explore some different options there in the future. On the raid controller, yes. The backend system is a DDN 14KX with DCR pools (hence the huge LUN).&lt;/p&gt;

&lt;p&gt;With that being said, we&apos;ve recently moved to testing on a development system that has a direct-attached disk enclosure, and we can reproduce the problem at a scale as low as 16 disks. We tried giving ZFS full control over the disks, putting them into a zpool with each drive as its own vdev (a more traditional setup) with no RAID, and the results were pretty bad. We then tried to replicate the production DDN case by creating a RAID0 MD device from the exact same disks, then laid ZFS on top of that. Those results were also fairly poor. Raw mdraid device performance was as expected.&lt;/p&gt;
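
&lt;p&gt;(The mdraid replication step, roughly; device names are placeholders:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
mdadm --create /dev/md0 --level=0 --raid-devices=16 /dev/sd[b-q]
zpool create testpool /dev/md0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;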

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
setup                  | disks | write     | read
raw mdraid device      | 16    | 2375 MB/s | 2850 MB/s
mdraid with zfs on top | 16    | 1700 MB/s |  950 MB/s
zfs managing drives    | 16    | 1500 MB/s | 1100 MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="347027" author="lflis" created="Sat, 17 Sep 2022 19:41:07 +0000"  >&lt;p&gt;@nilesj Out of curiosity - have you succedded with getting expected performance out of this setup?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="62202">LU-14293</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="37137" name="Screen Shot 2021-01-10 at 1.12.20 PM.png" size="134717" author="nilesj" created="Mon, 11 Jan 2021 14:59:30 +0000"/>
                            <attachment id="37136" name="Screen Shot 2021-01-10 at 12.58.44 PM.png" size="156404" author="nilesj" created="Mon, 11 Jan 2021 14:59:30 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01j2f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>