<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:54:39 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5803] This server is not able to keep up with request traffic</title>
                <link>https://jira.whamcloud.com/browse/LU-5803</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The following was with our local lustre version 2.5.3-1chaos (see github.com/chaos/lustre), using the zfs osd.&lt;/p&gt;

&lt;p&gt;I am noticing that with this 2.5.3 release, every time I boot the entire server cluster, the OSS nodes all emit the following block of console noise.  We are always past the deadline by &quot;-6&quot;.&lt;/p&gt;

&lt;p&gt;I hadn&apos;t seen this noise with the 2.4 branch.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2014-10-23 18:36:30 LustreError: 13a-8: Failed to get MGS log params and no local copy.
2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8810142b3800 x1482806576283684/t0(0) o253-&amp;gt;MGC10.1.1.169@o2ib9@10.1.1.169@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 1 previous similar message
2014-10-23 18:36:42 Lustre: lcy-OST0001: Will be in recovery for at least 5:00, or until 135 clients reconnect.
2014-10-23 18:37:01 LustreError: 137-5: lcy-OST0000_UUID: not available for connect from 192.168.121.102@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
2014-10-23 18:37:01 LustreError: Skipped 27 previous similar messages
2014-10-23 18:37:33 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=2 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@ffff88081a1cb850 x1482683500276896/t0(0) o400-&amp;gt;c6423062-7cd6-6647-06fc-f8be419a9edc@192.168.121.111@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114647 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:34 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@ffff881013dab850 x1482683530936232/t0(0) o400-&amp;gt;393de519-5053-8c5e-d1f5-ab0fa701f469@192.168.121.140@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114648 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
2014-10-23 18:37:36 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:36 Lustre: Skipped 1 previous similar message
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 1 previous similar message
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@ffff88081a1c0050 x1482683492672308/t0(0) o400-&amp;gt;c3d0ea3a-8781-d75b-9934-fd211665f796@192.168.121.19@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114650 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
2014-10-23 18:37:39 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:39 Lustre: Skipped 2 previous similar messages
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 2 previous similar messages
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@ffff88081a1b5050 x1482683502391272/t0(0) o400-&amp;gt;ea71043b-0040-df9e-f053-fc396ff5a07b@192.168.121.137@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114653 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
2014-10-23 18:37:43 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:43 Lustre: Skipped 5 previous similar messages
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 5 previous similar messages
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@ffff881013d8a850 x1482683531137984/t0(0) o400-&amp;gt;84c85d3d-b1c2-dfa0-5dad-84193a0250a9@192.168.121.151@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114657 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
2014-10-23 18:37:52 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:52 Lustre: Skipped 28 previous similar messages
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=5 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 28 previous similar messages
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@ffff88081a20e050 x1482683498963872/t0(0) o400-&amp;gt;770acbfa-1b07-454e-d9b5-c1becbe0bcee@192.168.121.87@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114666 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 84 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="27291">LU-5803</key>
            <summary>This server is not able to keep up with request traffic</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="yujian">Jian Yu</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 24 Oct 2014 01:43:34 +0000</created>
                <updated>Wed, 13 Oct 2021 03:12:37 +0000</updated>
                            <resolved>Wed, 13 Oct 2021 03:12:37 +0000</resolved>
                                    <version>Lustre 2.5.3</version>
                                    <fixVersion>Lustre 2.5.4</fixVersion>
                                        <due></due>
                            <votes>1</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="97353" author="ezell" created="Fri, 24 Oct 2014 01:54:40 +0000"  >&lt;p&gt;Chris-&lt;/p&gt;

&lt;p&gt;We&apos;ve seen a lot of issues with recovery in 2.5.3 in production.  Check out &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5724&quot; title=&quot;IR recovery doesn&amp;#39;t behave properly with Lustre 2.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5724&quot;&gt;&lt;del&gt;LU-5724&lt;/del&gt;&lt;/a&gt;.  In our testbed, we pulled in the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5079&quot; title=&quot;conf-sanity test_47 timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5079&quot;&gt;&lt;del&gt;LU-5079&lt;/del&gt;&lt;/a&gt; and have seen better results, but we aren&apos;t running that in production yet.  You might give it a try to see if the problem goes away.  We&apos;d be interested in seeing your results.&lt;/p&gt;</comment>
                            <comment id="97356" author="morrone" created="Fri, 24 Oct 2014 02:10:19 +0000"  >&lt;p&gt;Thanks.  I should have time to give that a try tomorrow.&lt;/p&gt;</comment>
                            <comment id="97366" author="pjones" created="Fri, 24 Oct 2014 04:27:48 +0000"  >&lt;p&gt;Yu, Jian&lt;/p&gt;

&lt;p&gt;Do you agree that these failures seem consistent with those seen with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5079&quot; title=&quot;conf-sanity test_47 timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5079&quot;&gt;&lt;del&gt;LU-5079&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="97422" author="adilger" created="Fri, 24 Oct 2014 17:19:12 +0000"  >&lt;p&gt;I was also going to suggest &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5079&quot; title=&quot;conf-sanity test_47 timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5079&quot;&gt;&lt;del&gt;LU-5079&lt;/del&gt;&lt;/a&gt;, since that fixes a busy-wait loop on the MDS and OSS during recovery that also causes a large amount of memory consumption (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5077&quot; title=&quot;insanity test_1: out of memory on MDT in crypto_create_tfm()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5077&quot;&gt;&lt;del&gt;LU-5077&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;</comment>
                            <comment id="97440" author="morrone" created="Fri, 24 Oct 2014 18:48:24 +0000"  >&lt;p&gt;So if I understand correctly, we are talking about this patch:&lt;/p&gt;

&lt;p&gt;  master: &lt;a href=&quot;http://review.whamcloud.com/11213/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11213/&lt;/a&gt;&lt;br/&gt;
  b2_5: &lt;a href=&quot;http://review.whamcloud.com/12365&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12365&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But at least the b2_5 version has been identified as causing a new regression test failure.&lt;/p&gt;

&lt;p&gt;I am wondering if reverting 9298b92e69d095b3e7809f26d47ae22e646d55c2 , &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4578&quot; title=&quot;Early replies do not honor at_max&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4578&quot;&gt;&lt;del&gt;LU-4578&lt;/del&gt;&lt;/a&gt; ptlrpc: Early replies need to honor at_max&quot;, might not be the better way to go.&lt;/p&gt;</comment>
                            <comment id="97474" author="morrone" created="Fri, 24 Oct 2014 21:49:53 +0000"  >&lt;p&gt;I tried the &lt;a href=&quot;http://review.whamcloud.com/12365&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12365&lt;/a&gt; patch.  It looks like we swapped one set of noisy console messages for another.  See our experience in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5805&quot; title=&quot;tgt_recov blocked and &amp;quot;waking for gap in transno&amp;quot;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5805&quot;&gt;&lt;del&gt;LU-5805&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="24752">LU-5079</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 24 Oct 2014 01:43:34 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwzef:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16270</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 24 Oct 2014 01:43:34 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>