<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:21:45 UTC 2024
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8927] osp-syn processes contending for osq_lock drives system cpu usage &gt; 80%</title>
                <link>https://jira.whamcloud.com/browse/LU-8927</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After running jobs that created remote (non-striped) directories and then ran mdtest within them, several MDS nodes are spending &amp;gt;80% of their CPU time in osp-syn-* processes.&lt;/p&gt;

&lt;p&gt;There are 36 osp-syn-* processes.&lt;/p&gt;

&lt;p&gt;The processes are spending almost all their time contending for osq_lock.  According to perf, the offending stack is:&lt;/p&gt;

&lt;p&gt;osq_lock&lt;br/&gt;
__mutex_lock_slowpath&lt;br/&gt;
mutex_lock&lt;br/&gt;
spa_config_enter&lt;br/&gt;
bp_get_dsize&lt;br/&gt;
dmu_tx_hold_free&lt;br/&gt;
osd_declare_object_destroy&lt;br/&gt;
llog_osd_declare_destroy&lt;br/&gt;
llog_declare_destroy&lt;br/&gt;
llog_cancel_rec&lt;br/&gt;
llog_cat_cancel_records&lt;br/&gt;
osp_sync_process_committed&lt;br/&gt;
osp_sync_process_queues&lt;br/&gt;
llog_process_thread&lt;br/&gt;
llog_process_or_fork&lt;br/&gt;
llog_cat_process_cb&lt;br/&gt;
llog_process_thread&lt;br/&gt;
llog_process_or_fork&lt;br/&gt;
llog_cat_process_or_fork&lt;br/&gt;
llog_cat_process&lt;br/&gt;
osp_sync_thread&lt;br/&gt;
kthread&lt;br/&gt;
ret_from_fork&lt;br/&gt;
osp-syn-X-Y&lt;/p&gt;

</description>
                <environment>lustre-2.8.0_5.chaos-2.ch6.x86_64&lt;br/&gt;
zfs-0.7.0-0.6llnl.ch6.x86_64&lt;br/&gt;
DNE with 16 MDTs</environment>
        <key id="42333">LU-8927</key>
            <summary>osp-syn processes contending for osq_lock drives system cpu usage &gt; 80%</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="bzzz">Alex Zhuravlev</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>zfs</label>
                    </labels>
                <created>Thu, 8 Dec 2016 23:21:54 +0000</created>
                <updated>Mon, 18 Sep 2017 21:29:26 +0000</updated>
                            <resolved>Tue, 5 Sep 2017 16:58:09 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="177167" author="ofaaland" created="Thu, 8 Dec 2016 23:23:54 +0000"  >&lt;p&gt;Our stack is available to Intel engineers via the repository named &quot;lustre-release-fe-llnl&quot; hosted on your Gerrit server.&lt;/p&gt;</comment>
                            <comment id="177169" author="ofaaland" created="Thu, 8 Dec 2016 23:27:06 +0000"  >&lt;p&gt;I see that llog_cancel_rec() contains the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        rc = llog_declare_write_rec(env, loghandle, &amp;amp;llh-&amp;gt;llh_hdr, index, th);
        if (rc &amp;lt; 0)
                GOTO(out_trans, rc);

        if ((llh-&amp;gt;llh_flags &amp;amp; LLOG_F_ZAP_WHEN_EMPTY))
                rc = llog_declare_destroy(env, loghandle, th);

        th-&amp;gt;th_wait_submit = 1;
        rc = dt_trans_start_local(env, dt, th);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it seems to declare that it will destroy the llog object every time it cancels a record, as if every record is the last one.  Why is that? Shouldn&apos;t it also depend on how many active records the llog contains? &lt;/p&gt;</comment>
                            <comment id="177201" author="bzzz" created="Fri, 9 Dec 2016 05:44:14 +0000"  >&lt;p&gt;When we declare an llog cancellation we don&apos;t know whether it will be the last one or not; otherwise we&apos;d have to lock the llog from declaration up to transaction stop, killing concurrency. Newer versions of Lustre will fix this problem in the osd-zfs module.&lt;/p&gt;</comment>
                            <comment id="177240" author="pjones" created="Fri, 9 Dec 2016 15:53:39 +0000"  >&lt;p&gt;Alex&lt;/p&gt;

&lt;p&gt;Could you please elaborate about the work underway in this area?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="177340" author="ofaaland" created="Fri, 9 Dec 2016 23:04:44 +0000"  >&lt;p&gt;Yes, please elaborate.  I know there are many ways to work on this and it would be great to know the nature and scope of the fix you have in mind.&lt;/p&gt;

&lt;p&gt;I looked again and the MDTs are still working to clear llog records from jobs run about 45 hours ago (contended the entire time).  I don&apos;t think we can go into production without a fix for this.&lt;/p&gt;</comment>
                            <comment id="177540" author="bzzz" created="Tue, 13 Dec 2016 07:01:22 +0000"  >&lt;p&gt;In very few words: I&apos;ve been working to make declarations with ZFS cheap. Right now they are quite expensive because the DMU API works with dnode numbers, so every call has to translate the dnode number into a dnode structure using the global hash table. A few patches have already landed on the master branch and were released as part of 2.9 (e.g. &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7898&quot; title=&quot;remove unnecessary declarations from osd-zfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7898&quot;&gt;&lt;del&gt;LU-7898&lt;/del&gt;&lt;/a&gt; osd: remove unnecessary declarations). More improvements are expected once the following patches land:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8882&quot; title=&quot;osd-zfs to use bynode methods&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8882&quot;&gt;&lt;del&gt;LU-8882&lt;/del&gt;&lt;/a&gt; osd: use bydnode methods to access ZAP&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8928&quot; title=&quot;osd-zfs should use dnode_t instead of dbuf&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8928&quot;&gt;&lt;del&gt;LU-8928&lt;/del&gt;&lt;/a&gt; osd: convert osd-zfs to reference dnode, not db&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8873&quot; title=&quot;use sa_handle_get_from_db()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8873&quot;&gt;&lt;del&gt;LU-8873&lt;/del&gt;&lt;/a&gt; osd: use sa_handle_get_from_db()&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2435&quot; title=&quot;inode accounting in osd-zfs is racy&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2435&quot;&gt;&lt;del&gt;LU-2435&lt;/del&gt;&lt;/a&gt; osd-zfs: use zfs native dnode accounting&lt;br/&gt;
and &lt;a href=&quot;https://github.com/zfsonlinux/zfs/pull/5464&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/zfsonlinux/zfs/pull/5464&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way the declarations should become mostly lockless and much cheaper.&lt;/p&gt;</comment>
                            <comment id="177627" author="ofaaland" created="Tue, 13 Dec 2016 19:26:40 +0000"  >&lt;p&gt;OK, thanks.&lt;/p&gt;

&lt;p&gt;In the above list of tickets, you include &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8893&quot; title=&quot;SSK - sanity test_126 cannot touch, Permission denied&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8893&quot;&gt;LU-8893&lt;/a&gt;.  Did you mean &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8873&quot; title=&quot;use sa_handle_get_from_db()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8873&quot;&gt;&lt;del&gt;LU-8873&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;I looked briefly at the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7898&quot; title=&quot;remove unnecessary declarations from osd-zfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7898&quot;&gt;&lt;del&gt;LU-7898&lt;/del&gt;&lt;/a&gt; patch to remove unnecessary declarations.  I&apos;ll see if I can apply and test it.&lt;/p&gt;</comment>
                            <comment id="177628" author="bzzz" created="Tue, 13 Dec 2016 19:32:16 +0000"  >&lt;p&gt;You&apos;re right, I meant &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8873&quot; title=&quot;use sa_handle_get_from_db()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8873&quot;&gt;&lt;del&gt;LU-8873&lt;/del&gt;&lt;/a&gt;; it is basically yet another place to save on the dnode#-&amp;gt;dnode_t lookup.&lt;/p&gt;</comment>
                            <comment id="179428" author="ofaaland" created="Tue, 3 Jan 2017 20:15:05 +0000"  >&lt;p&gt;Alex,&lt;/p&gt;

&lt;p&gt;I applied &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7898&quot; title=&quot;remove unnecessary declarations from osd-zfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7898&quot;&gt;&lt;del&gt;LU-7898&lt;/del&gt;&lt;/a&gt; on top of our 2.8.0+patch stack and see the same symptoms.   The patch didn&apos;t appear to me to change any of the functions in the contending stacks, so not surprising.&lt;/p&gt;

&lt;p&gt;The full set of patches above would be too much for a stable branch, I would think.  So I&apos;ve rewritten llog_cancel_rec() to destroy the llog in a second transaction, if it&apos;s necessary.   Maybe this is a poor approach; feedback or an alternative would be welcome.  In any case I&apos;ve pushed it to gerrit and will do local testing after it passes maloo.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/24687/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/24687/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="207374" author="pjones" created="Mon, 4 Sep 2017 14:43:46 +0000"  >&lt;p&gt;I would like a level set on this ticket. All of the planned work to improve metadata performance for ZFS has now landed to master (and b2_10). Are there any specific tasks identified and remaining beyond that?&lt;/p&gt;</comment>
                            <comment id="207425" author="ofaaland" created="Tue, 5 Sep 2017 16:57:43 +0000"  >&lt;p&gt;This lock contention has not resulted in problems in production, and there is so much related change in 2.10 and master that it&apos;s quite possible the problem does not occur there.  Closing the ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="14304">LU-2435</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="41833">LU-8873</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="41889">LU-8882</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="42348">LU-8928</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="24454" name="perf-report.txt" size="1487190" author="ofaaland" created="Thu, 8 Dec 2016 23:30:21 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyxzr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>