<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:52:17 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5532] LustreError: 77234:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600</title>
                <link>https://jira.whamcloud.com/browse/LU-5532</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The Lustre OSS reports Lustre is unhealthy after issuing the following sequence of messages. This is the second occurrence in the last 24 hours. &lt;/p&gt;

&lt;p&gt;Aug 21 14:00:23 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3794666.155556&amp;#93;&lt;/span&gt; Lustre: atlas1-OST0016: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting&lt;br/&gt;
Aug 21 14:00:23 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3794666.185134&amp;#93;&lt;/span&gt; Lustre: Skipped 6 previous similar messages&lt;br/&gt;
Aug 21 14:02:53 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3794816.201589&amp;#93;&lt;/span&gt; LustreError: 137-5: atlas1-OST00a2_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)&lt;br/&gt;
Aug 21 14:12:20 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3795383.544447&amp;#93;&lt;/span&gt; Lustre: atlas1-OST0016: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting&lt;br/&gt;
Aug 21 14:12:20 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3795383.567821&amp;#93;&lt;/span&gt; Lustre: Skipped 5 previous similar messages&lt;br/&gt;
Aug 21 14:19:20 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3795803.534475&amp;#93;&lt;/span&gt; LustreError: 137-5: atlas1-OST00a2_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)&lt;br/&gt;
Aug 21 14:19:20 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3795803.566134&amp;#93;&lt;/span&gt; LustreError: Skipped 3 previous similar messages&lt;br/&gt;
Aug 21 14:26:17 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3796220.710761&amp;#93;&lt;/span&gt; Lustre: atlas1-OST01c6: Client 5d5389e1-62ad-c671-5318-48ff669e4a6e (at 10.38.145.2@o2ib4) reconnecting&lt;br/&gt;
Aug 21 14:26:17 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3796220.734900&amp;#93;&lt;/span&gt; Lustre: Skipped 11 previous similar messages&lt;br/&gt;
Aug 21 14:31:17 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3796520.850175&amp;#93;&lt;/span&gt; LustreError: 137-5: atlas1-OST00a2_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)&lt;br/&gt;
Aug 21 14:31:17 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3796520.878544&amp;#93;&lt;/span&gt; LustreError: Skipped 4 previous similar messages&lt;br/&gt;
Aug 21 14:38:14 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3796937.996148&amp;#93;&lt;/span&gt; Lustre: atlas1-OST0136: Client 5d5389e1-62ad-c671-5318-48ff669e4a6e (at 10.38.145.2@o2ib4) reconnecting&lt;br/&gt;
Aug 21 14:38:14 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3796938.026732&amp;#93;&lt;/span&gt; Lustre: Skipped 6 previous similar messages&lt;br/&gt;
Aug 21 14:44:39 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797323.165332&amp;#93;&lt;/span&gt; LustreError: 33287:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 603, more than at_max 600&lt;br/&gt;
Aug 21 14:44:39 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797323.165334&amp;#93;&lt;/span&gt;  ns: filter-atlas1-OST0376_UUID lock: ffff88034b20a900/0xecd0a12120f17d4c lrc: 4/0,0 mode: PW/PW res: &lt;span class=&quot;error&quot;&gt;&amp;#91;0x5c2a0f:0x0:0x0&amp;#93;&lt;/span&gt;.0 rrc: 2 type: EXT &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;18446744073709551615&amp;#93;&lt;/span&gt; (req 0-&amp;gt;16383) flags: 0x10020 nid: 10.36.205.208@o2ib remote: 0xc39a22a84c361c61 expref: 26 pid: 14426 timeout: 8090929608 lvb_type: 0&lt;br/&gt;
Aug 21 14:44:39 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797323.292254&amp;#93;&lt;/span&gt; LustreError: 33287:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) Skipped 3 previous similar messages&lt;br/&gt;
Aug 21 14:44:57 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797340.909561&amp;#93;&lt;/span&gt; Lustre: atlas1-OST0136: Slow creates, 128/256 objects created at a rate of 2/s&lt;br/&gt;
Aug 21 14:45:53 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797397.137591&amp;#93;&lt;/span&gt; LustreError: 137-5: atlas1-OST0372_UUID: not available for connect from 10.38.145.2@o2ib4 (no target)&lt;br/&gt;
Aug 21 14:45:53 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797397.161035&amp;#93;&lt;/span&gt; LustreError: Skipped 2 previous similar messages&lt;br/&gt;
Aug 21 14:48:24 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797548.207840&amp;#93;&lt;/span&gt; Lustre: atlas1-OST01c6: Client 1942a1b8-14c2-1c85-f1cc-f5a627755ef9 (at 10.38.145.2@o2ib4) reconnecting&lt;br/&gt;
Aug 21 14:48:24 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797548.238152&amp;#93;&lt;/span&gt; Lustre: Skipped 6 previous similar messages&lt;br/&gt;
Aug 21 14:52:26 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797790.627280&amp;#93;&lt;/span&gt; LustreError: 45633:0:(service.c:3216:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 610s&lt;/p&gt;


&lt;p&gt;These were then followed by several messages like this:&lt;/p&gt;

&lt;p&gt;Aug 21 14:54:27 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797911.436996&amp;#93;&lt;/span&gt; Lustre: 33219:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/3), not sending early reply&lt;br/&gt;
Aug 21 14:54:27 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797911.436998&amp;#93;&lt;/span&gt;   req@ffff8802f640e400 x1476723491621465/t0(0) o3-&amp;gt;6ae34883-e4de-08ed-e3e9-ccdd41e9934d@9719@gni108:0/0 lens 448/432 e 0 to 0 dl 1408647272 ref 2 fl Interpret:/0/0 rc 0/0&lt;br/&gt;
Aug 21 14:54:32 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797916.390933&amp;#93;&lt;/span&gt; Lustre: atlas1-OST0136: Bulk IO read error with 711e8a57-ae3b-7204-47a9-6a996887d00c (at 7253@gni108), client will retry: rc -110&lt;br/&gt;
Aug 21 14:54:32 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797916.392844&amp;#93;&lt;/span&gt; LustreError: 33261:0:(ldlm_lib.c:2702:target_bulk_io()) @@@ timeout on bulk PUT after 0+0s  req@ffff8803bab27000 x1476723483171409/t0(0) o3-&amp;gt;711e8a57-ae3b-7204-47a9-6a996887d00c@7253@gni108:0/0 lens 448/432 e 0 to 0 dl 1408647272 ref 1 fl Interpret:/0/0 rc 0/0&lt;/p&gt;

&lt;p&gt;Full syslog to follow. &lt;/p&gt;</description>
                <environment>Lustre 2.4.3, RHEL 6.4, kernel 2.6.32-358.23.2.el6.atlas.x86_64</environment>
        <key id="26112">LU-5532</key>
            <summary>LustreError: 77234:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="hilljjornl">Jason Hill</reporter>
                        <labels>
                    </labels>
                <created>Thu, 21 Aug 2014 19:28:35 +0000</created>
                <updated>Thu, 28 Aug 2014 18:50:48 +0000</updated>
                            <resolved>Thu, 28 Aug 2014 18:50:48 +0000</resolved>
                                    <version>Lustre 2.4.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="92167" author="hilljjornl" created="Thu, 21 Aug 2014 19:39:49 +0000"  >&lt;p&gt;Lustre logs from the affected OSS and the MDS for that filesystem for the 2 occurrences of this particular issue.&lt;/p&gt;</comment>
                            <comment id="92168" author="hilljjornl" created="Thu, 21 Aug 2014 19:41:33 +0000"  >&lt;p&gt;Also of significance (I think): there is very little back-end I/O happening, system load is over 300 in all categories, and memory utilization is extremely high; less than 650MB of 64GB is free.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-oss1c7 ~&amp;#93;&lt;/span&gt;# free -m&lt;br/&gt;
             total       used       free     shared    buffers     cached&lt;br/&gt;
Mem:         64306      64038        268          0        761      51748&lt;br/&gt;
-/+ buffers/cache:      11527      52778&lt;br/&gt;
Swap:            0          0          0&lt;/p&gt;

&lt;p&gt;Yesterday I successfully unmounted all the OSTs from this OSS, removed all Lustre kernel modules, and restarted Lustre with a positive outcome. &lt;/p&gt;</comment>
                            <comment id="92169" author="hilljjornl" created="Thu, 21 Aug 2014 19:44:18 +0000"  >&lt;p&gt;Both clients are from the same cluster, and both are running:&lt;/p&gt;

&lt;p&gt;# rpm -qa | grep lustre&lt;br/&gt;
lustre-client-2.4.2-2.6.32_358.23.2.el6.x86_64_g89cc68b.x86_64&lt;br/&gt;
lustre-client-modules-2.4.2-2.6.32_358.23.2.el6.x86_64_g89cc68b.x86_64&lt;/p&gt;</comment>
                            <comment id="92170" author="hilljjornl" created="Thu, 21 Aug 2014 19:48:13 +0000"  >&lt;p&gt;Please drop the severity; that was my mistake. I tabbed through the field and had not intended to select severity 2.&lt;/p&gt;</comment>
                            <comment id="92221" author="pjones" created="Fri, 22 Aug 2014 14:22:49 +0000"  >&lt;p&gt;Oleg is looking into this one&lt;/p&gt;</comment>
                            <comment id="92227" author="green" created="Fri, 22 Aug 2014 15:25:14 +0000"  >&lt;p&gt;From the logs, it seems there is severe disk backend slowness going on.&lt;br/&gt;
We see things like &quot;Aug 21 14:52:26 atlas-oss1c7.ccs.ornl.gov kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3797790.627280&amp;#93;&lt;/span&gt; LustreError: 45633:0:(service.c:3216:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 610s&quot; and slow creates at a rate of 2/sec.&lt;/p&gt;

&lt;p&gt;I remember that one of the times we hit something like this before was soon after you restarted an OST. Did the restart you mention also happen before these problems occurred?&lt;/p&gt;

&lt;p&gt;Any interesting data from the DDN side about load on the array?&lt;/p&gt;</comment>
                            <comment id="92448" author="hilljjornl" created="Tue, 26 Aug 2014 16:12:10 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;No interesting data at the DDN level, but we were able to see some errors on the InfiniBand interface between the OSS and the DDN. We replaced the IB cable, and that seems to have quelled the issue. This was just unexpected, because I thought the only way to clear an &quot;unhealthy&quot; state in /proc/fs/lustre/health_check was to reboot, or at least unmount the device and unload the Lustre modules. &lt;/p&gt;

&lt;p&gt;I&apos;m good if we close this issue; thanks for the response and I apologize for my delay in getting back to you.&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</comment>
                            <comment id="92708" author="green" created="Thu, 28 Aug 2014 14:09:42 +0000"  >&lt;p&gt;If the &quot;unhealthy&quot; state was set due to slowness of the disk, and the disk performance then improves (i.e. requests no longer take 10 minutes to complete), the unhealthy state will clear.&lt;/p&gt;</comment>
                            <comment id="92736" author="jamesanunez" created="Thu, 28 Aug 2014 18:50:48 +0000"  >&lt;p&gt;Per ORNL, the issue is now understood and we can close the ticket.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="15565" name="atlas-mds1.20140820.kern.log" size="21380" author="hilljjornl" created="Thu, 21 Aug 2014 19:39:49 +0000"/>
                            <attachment id="15566" name="atlas-mds1.20140821.kern.log" size="21380" author="hilljjornl" created="Thu, 21 Aug 2014 19:39:49 +0000"/>
                            <attachment id="15568" name="atlas-oss1c7.20140820.kern.log" size="231057" author="hilljjornl" created="Thu, 21 Aug 2014 19:39:49 +0000"/>
                            <attachment id="15567" name="atlas-oss1c7.20140821.kern.log" size="117075" author="hilljjornl" created="Thu, 21 Aug 2014 19:39:49 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwuaf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15399</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>