<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:01:21 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
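
A minimal sketch of fetching such a field-restricted view with Python's standard library,
assuming JIRA's usual XML issue-view path (/si/jira.issueviews:issue-xml/<KEY>/<KEY>.xml);
adjust the path if this instance differs:

    # Assumes JIRA's standard XML issue-view URL path; requests only key and summary.
    from urllib.request import urlopen

    url = ("https://jira.whamcloud.com/si/jira.issueviews:issue-xml/"
           "LU-13447/LU-13447.xml?field=key&field=summary")
    with urlopen(url) as resp:
        print(resp.read().decode("utf-8"))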
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13447] Sudden slow file create (MDS problem)</title>
                <link>https://jira.whamcloud.com/browse/LU-13447</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We faced a sudden MDS problem with 2.12.4, a few hours after &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13442&quot; title=&quot;OSS high load and deadlock after 20 days of run time&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13442&quot;&gt;LU-13442&lt;/a&gt;. Users reported this kind of slowness when creating new files:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;sh02-ln03:bp86 09:03:28&amp;gt; time touch $SCRATCH/asdf

real 0m26.923s
user 0m0.000s
sys 0m0.003s
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the MDS in question, we could see backtraces like these:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[1803096.937756] LNet: Service thread pid 41461 completed after 237.28s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[1805310.391443] LNet: Service thread pid 20879 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1805310.408554] Pid: 20879, comm: mdt03_010 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[1805310.418900] Call Trace:
[1805310.421543]  [&amp;lt;ffffffffc1855fa8&amp;gt;] osp_precreate_reserve+0x2e8/0x800 [osp]
[1805310.428575]  [&amp;lt;ffffffffc184a949&amp;gt;] osp_declare_create+0x199/0x5f0 [osp]
[1805310.435312]  [&amp;lt;ffffffffc179269f&amp;gt;] lod_sub_declare_create+0xdf/0x210 [lod]
[1805310.442330]  [&amp;lt;ffffffffc178a86e&amp;gt;] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[1805310.449586]  [&amp;lt;ffffffffc178d80e&amp;gt;] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
[1805310.457012]  [&amp;lt;ffffffffc179192d&amp;gt;] lod_qos_prep_create+0x12fd/0x1890 [lod]
[1805310.464007]  [&amp;lt;ffffffffc177296a&amp;gt;] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
[1805310.472042]  [&amp;lt;ffffffffc1785725&amp;gt;] lod_declare_layout_change+0xb65/0x10f0 [lod]
[1805310.479468]  [&amp;lt;ffffffffc17f7f82&amp;gt;] mdd_declare_layout_change+0x62/0x120 [mdd]
[1805310.486724]  [&amp;lt;ffffffffc1800ec6&amp;gt;] mdd_layout_change+0xb46/0x16a0 [mdd]
[1805310.493473]  [&amp;lt;ffffffffc166135f&amp;gt;] mdt_layout_change+0x2df/0x480 [mdt]
[1805310.500130]  [&amp;lt;ffffffffc16697d0&amp;gt;] mdt_intent_layout+0x8a0/0xe00 [mdt]
[1805310.506787]  [&amp;lt;ffffffffc1666d35&amp;gt;] mdt_intent_policy+0x435/0xd80 [mdt]
[1805310.513459]  [&amp;lt;ffffffffc0ffbe06&amp;gt;] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[1805310.520394]  [&amp;lt;ffffffffc10244f6&amp;gt;] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1805310.527688]  [&amp;lt;ffffffffc10acb12&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
[1805310.534034]  [&amp;lt;ffffffffc10b564a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1805310.541167]  [&amp;lt;ffffffffc105843b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1805310.549063]  [&amp;lt;ffffffffc105bda4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1805310.555596]  [&amp;lt;ffffffff9fac2e81&amp;gt;] kthread+0xd1/0xe0
[1805310.560688]  [&amp;lt;ffffffffa0177c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
[1805310.567340]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[1805310.572544] LustreError: dumping log to /tmp/lustre-log.1586359839.20879
[1805316.373786] LNet: Service thread pid 20879 completed after 206.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;We took a crash dump, which is available on the Whamcloud FTP as &lt;tt&gt;fir-md1-s3_20200408_vmcore&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;Attaching the dmesg as &lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/34666/34666_fir-md1-s3_20200408_vmcore-dmesg.txt&quot; title=&quot;fir-md1-s3_20200408_vmcore-dmesg.txt attached to LU-13447&quot;&gt;fir-md1-s3_20200408_vmcore-dmesg.txt&lt;/a&gt; and the foreach bt output of the crash dump as &lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/34665/34665_fir-md1-s3_20200408_foreach_bt.txt&quot; title=&quot;fir-md1-s3_20200408_foreach_bt.txt attached to LU-13447&quot;&gt;fir-md1-s3_20200408_foreach_bt.txt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A restart of the MDS fixed the issue for now.&lt;/p&gt;</description>
                <environment>CentOS 7.6</environment>
        <key id="58739">LU-13447</key>
            <summary>Sudden slow file create (MDS problem)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Fri, 10 Apr 2020 16:58:39 +0000</created>
                <updated>Sun, 16 Jan 2022 08:45:03 +0000</updated>
                            <resolved>Sun, 16 Jan 2022 08:45:03 +0000</resolved>
                                    <version>Lustre 2.12.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="267403" author="pjones" created="Fri, 10 Apr 2020 22:12:09 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="267451" author="tappro" created="Mon, 13 Apr 2020 06:24:26 +0000"  >&lt;p&gt;Logs shows connectivity problem with OSTs, so MDT cannot create OST objects for a file being created. I am not sure is that network problem or OST server problem, but that is not MDT issue. There are many messages like this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[1767428.569313] LNetError: 20288:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
[1767428.579482] LNetError: 20288:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.0.10.109@o2ib7 (5): c: 0, oc: 0, rc: 8
[1767428.591908] LNetError: 20288:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.0.10.53@o2ib7 added to recovery queue. Health = 900
[1767428.604016] LNetError: 20288:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 5 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and as a result:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[1767243.181693] LustreError: 20702:0:(osp_precreate.c:970:osp_precreate_cleanup_orphans()) fir-OST0034-osc-MDT0002: cannot cleanup orphans: rc = -11
[1767244.194720] LustreError: 20702:0:(osp_precreate.c:970:osp_precreate_cleanup_orphans()) fir-OST0034-osc-MDT0002: cannot cleanup orphans: rc = -11
[1767327.245811] LustreError: 20710:0:(osp_precreate.c:970:osp_precreate_cleanup_orphans()) fir-OST0038-osc-MDT0002: cannot cleanup orphans: rc = -11
[1767341.462169] LustreError: 20714:0:(osp_precreate.c:970:osp_precreate_cleanup_orphans()) fir-OST003a-osc-MDT0002: cannot cleanup orphans: rc = -11
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="267542" author="adilger" created="Tue, 14 Apr 2020 11:42:49 +0000"  >&lt;p&gt;We shouldn&apos;t really be blocking the create if an OST is unreachable unless &lt;b&gt;all&lt;/b&gt; of the OSTs are unreachable. &lt;/p&gt;</comment>
                            <comment id="268069" author="sthiell" created="Mon, 20 Apr 2020 16:15:53 +0000"  >&lt;p&gt;Thanks for taking the time to look at this! Still, to me, it&apos;s bizarre that this could be due to a network or OST specific problem, as other MDTs accessing the OSTs seemed fine at that time, and all targets are connected to the same IB switch.  Also, only one MDT was affected (1 / 4). Anyway, this is the single occurrence of this problem that we have seen so far, so I guess it&apos;s very rare. We&apos;ll report back if we see it again.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="34665" name="fir-md1-s3_20200408_foreach_bt.txt" size="876443" author="sthiell" created="Fri, 10 Apr 2020 16:57:46 +0000"/>
                            <attachment id="34666" name="fir-md1-s3_20200408_vmcore-dmesg.txt" size="518464" author="sthiell" created="Fri, 10 Apr 2020 16:57:30 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00xq7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>