<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:33:57 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3442] MDS performance degraded by reading of ZFS spacemaps</title>
                <link>https://jira.whamcloud.com/browse/LU-3442</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;We started to experience degraded performance on our MDS with a ZFS backend.  Certain RPCs were taking many seconds or even minutes to service.  Users would accordingly see very slow interactive responsiveness. On investigation, this turned out to be due to ZFS transaction groups taking a very long time to sync, blocking request handlers that needed to write out an llog record.  This in turn was due to zio processing threads waiting in space_map_load_wait():&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[&amp;lt;ffffffffa038cdad&amp;gt;] cv_wait_common+0xed/0x100 [spl]                              
[&amp;lt;ffffffffa038ce15&amp;gt;] __cv_wait+0x15/0x20 [spl]                                    
[&amp;lt;ffffffffa0480f2f&amp;gt;] space_map_load_wait+0x2f/0x40 [zfs]                          
[&amp;lt;ffffffffa046ab47&amp;gt;] metaslab_activate+0x77/0x160 [zfs]                           
[&amp;lt;ffffffffa046b67e&amp;gt;] metaslab_alloc+0x4fe/0x950 [zfs]                             
[&amp;lt;ffffffffa04c801a&amp;gt;] zio_dva_allocate+0xaa/0x350 [zfs]                            
[&amp;lt;ffffffffa04c93e0&amp;gt;] zio_ready+0x3c0/0x460 [zfs]                                  
[&amp;lt;ffffffffa04c93e0&amp;gt;] zio_ready+0x3c0/0x460 [zfs]                                  
[&amp;lt;ffffffffa04c6293&amp;gt;] zio_execute+0xb3/0x130 [zfs]                                 
[&amp;lt;ffffffffa0389277&amp;gt;] taskq_thread+0x1e7/0x3f0 [spl]                               
[&amp;lt;ffffffff81096c76&amp;gt;] kthread+0x96/0xa0                                            
[&amp;lt;ffffffff8100c0ca&amp;gt;] child_rip+0xa/0x20                                           
[&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff          
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We are able to mitigate this problem by setting the zfs module option &lt;tt&gt;metaslab_debug=1&lt;/tt&gt;, which forces all spacemaps to stay resident in memory.  However, this solution is a bit heavy-handed, and we&apos;d like to gain a better understanding of why we&apos;re reading spacemaps from disk so often, and what should be done about it.&lt;/p&gt;
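&lt;p&gt;For reference, a minimal sketch of how such a module option might be applied on a running system. The /sys path below is the usual ZFS-on-Linux location and the modprobe.d file name is an assumption; both may differ by distribution and ZFS version.&lt;/p&gt;

```shell
# Hypothetical helper: set a ZFS module parameter at runtime, if present.
set_zfs_param() {
    path="/sys/module/zfs/parameters/$1"
    if [ -w "$path" ]; then
        echo "$2" > "$path"
    else
        echo "cannot set $1: $path is not writable"
        return 1
    fi
}

# On the MDS (not run here):
#   set_zfs_param metaslab_debug 1
# To persist across reboots (file name is an assumption):
#   options zfs metaslab_debug=1   in /etc/modprobe.d/zfs.conf
```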

&lt;p&gt;Our first thought was that pool fragmentation was the underlying cause, forcing the block allocator to search all spacemaps to find a suitable interval.  Our thinking was that llog cancellation promotes fragmentation by punching holes in otherwise contiguously allocated regions.  But I&apos;m not sure this theory is consistent with how llogs actually work, or with how the ZFS allocator works for that matter.&lt;/p&gt;

&lt;p&gt;Another idea is that a concurrent write and unlink workload could cause this behaviour, but it&apos;s all just speculation until we better understand the workload and how ZFS manages spacemaps.&lt;/p&gt;

&lt;p&gt;The most appealing approach we&apos;ve discussed so far is to modify ZFS to use the ARC to cache spacemap objects.  I believe ZFS currently only keeps one spacemap (per vdev?) active in memory at a time, and it bypasses the ARC for these objects.  Using the ARC would keep the hot spacemaps in memory, but allow them to get pitched under memory pressure.&lt;/p&gt;

&lt;p&gt;So, I&apos;m not sure there&apos;s a Lustre bug here, but it&apos;s an issue to be aware of when using ZFS backends. &lt;/p&gt;</description>
                <environment>server: lustre-2.4.0-RC2_2chaos_2.6.32_358.6.1.3chaos.ch5.1.ch5.1.x86_64&lt;br/&gt;
clients: mix of PPC/Lustre 2.4 and x86_64/Lustre 2.1</environment>
        <key id="19309">LU-3442</key>
            <summary>MDS performance degraded by reading of ZFS spacemaps</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="nedbass">Ned Bass</reporter>
                        <labels>
                            <label>performance</label>
                            <label>zfs</label>
                    </labels>
                <created>Thu, 6 Jun 2013 18:56:35 +0000</created>
                <updated>Wed, 26 Apr 2017 23:55:37 +0000</updated>
                            <resolved>Wed, 26 Apr 2017 23:55:37 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="60109" author="bzzz" created="Thu, 6 Jun 2013 19:03:22 +0000"  >&lt;p&gt;The llog for unlinks has two levels: a catalog referencing &quot;plain&quot; llogs that contain the actual records. When a record is cancelled, only the bitmap in the header of the plain llog is updated. Once all the records in a plain llog are cancelled, it is removed as a whole and the corresponding record in the catalog is cancelled (which is again a bitmap update). IOW, cancellation doesn&apos;t punch holes; it updates the same 8K header containing the bitmap again and again.&lt;/p&gt;

&lt;p&gt;Thanks for the report.&lt;/p&gt;</comment>
                            <comment id="60111" author="adilger" created="Thu, 6 Jun 2013 19:47:30 +0000"  >&lt;p&gt;This doesn&apos;t sound wholly different from the similar problems seen on ldiskfs due to bitmap loading.  It definitely makes the most sense to keep the bitmaps in cache aggressively if possible.  Unfortunately, the L2ARC won&apos;t help, since the bitmaps are only useful when they need to be written.&lt;/p&gt;

&lt;p&gt;I&apos;ve thought for a long time that it would be useful to have an SPA policy for writing metadata like this, or the metadnode, on an SSD/NVRAM device (e.g. shared with the ZIL).  However, that introduces a reliability issue: if this device fails, the whole pool is unusable unless everything written there is mirrored.  Possibly this could still be made workable by always writing a bitmap copy to the SSD device for low-latency reads, while a &quot;write-only&quot; copy is also sent to the VDEV in case the SSD fails.&lt;/p&gt;

&lt;p&gt;That said, it doesn&apos;t make sense that individual RPCs would take &lt;em&gt;minutes&lt;/em&gt; to service unless they are backed up behind a large number of other RPCs that each need to do many bitmap reads from disk.  If you have a chance, could you also try testing with &lt;tt&gt;lctl set_param ofd.*.read_cache_enable=0&lt;/tt&gt; on all the OSS nodes (instead of &lt;tt&gt;metaslab_debug=1&lt;/tt&gt;) to see if this has the same effect?  I&apos;m wondering if there is any benefit to having read cache enabled at the OFD/pagecache level if there will be duplicate caching at the ARC/L2ARC?&lt;/p&gt;</comment>
                            <comment id="60119" author="nedbass" created="Thu, 6 Jun 2013 21:00:22 +0000"  >&lt;p&gt;Alex, another factor to consider is we are using Robinhood with this filesystem.  So many changelogs are continually being created and canceled.  Incidentally, I believe this issue is probably why we had so many blocked service threads in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3379&quot; title=&quot;osd_attr_get() ASSERTION( dt_object_exists(dt) ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3379&quot;&gt;&lt;del&gt;LU-3379&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Andreas,&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;it doesn&apos;t make sense that individual RPCs would take minutes to service unless they are backed up behind a large number of other RPCs that each need to do many bitmap reads from disk.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;In severe cases I was seeing txgs take upwards of 60s to sync (this can be monitored in &lt;tt&gt;/proc/spl/kstat/zfs/txgs-poolname&lt;/tt&gt;).  In that case the currently open txg reaches its max write limit, so an RPC handler blocks in &lt;tt&gt;txg_wait_open()&lt;/tt&gt;.  In effect, the RPC handler may wait 60s for a new txg to open, then another 60s while its txg sits in the open state, and another 60s for its txg to sync.  And a request may have to wait several such iterations just to get into an open txg due to the write limit, depending on how backed up things are.  We&apos;ve seen cases of Lustre service threads blocking in the neighborhood of ten &lt;del&gt;seconds&lt;/del&gt; minutes, so this could explain why.  As to why we&apos;re reading so many spacemaps from disk, that&apos;s what I would like to understand better.&lt;/p&gt;
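&lt;p&gt;As a sketch, slow syncs can be flagged directly from that kstat file. The column layout of txgs-poolname varies across ZFS versions, so the field positions and the sample invocation below are illustrative assumptions, not the exact format.&lt;/p&gt;

```shell
# Hypothetical: report txgs whose last column (sync time, ns) exceeds a threshold.
# Field positions are an assumption; adjust for your ZFS version.
report_slow_txgs() {
    # $1 = kstat-style text, $2 = threshold in nanoseconds
    printf '%s\n' "$1" | awk -v t="$2" '
        NR > 2 { if ($NF + 0 > t) print "txg " $1 " sync took " $NF " ns" }'
}

# On a live MDS (pool name hypothetical, not run here):
#   report_slow_txgs "$(cat /proc/spl/kstat/zfs/txgs-lustre-mdt0)" 10000000000
```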

&lt;blockquote&gt;&lt;p&gt;could you also try testing with lctl set_param ofd.*.read_cache_enable=0 on all the OSS nodes&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Note this issue is about MDS performance.  We may very well be having similar issues on the OSS nodes, but I haven&apos;t investigated that yet.  Is that tunable setting obsolete for Lustre 2.4?  I can&apos;t find it under proc on our 2.4/zfs-osd servers.&lt;/p&gt;</comment>
                            <comment id="60127" author="nedbass" created="Thu, 6 Jun 2013 23:04:06 +0000"  >&lt;p&gt;Opened &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3443&quot; title=&quot;performance impact of mdc_rpc_lock serialization&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3443&quot;&gt;&lt;del&gt;LU-3443&lt;/del&gt;&lt;/a&gt; to document a client-side performance issue precipitated by this issue.&lt;/p&gt;</comment>
                            <comment id="60166" author="pjones" created="Fri, 7 Jun 2013 14:19:14 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Do you have anything to add here?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="193698" author="adilger" created="Wed, 26 Apr 2017 23:55:37 +0000"  >&lt;p&gt;This is really a ZFS problem.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="19312">LU-3443</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 22 Jul 2013 18:56:35 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvsuv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8581</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 6 Jun 2013 18:56:35 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>