<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:44:08 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4593] Lustre OSTs on one of the login nodes are evicted but never reconnect.</title>
                <link>https://jira.whamcloud.com/browse/LU-4593</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;AWE has reported the following problem with Lustre.&lt;br/&gt;
Occasionally some of the Lustre OSTs on sprig2 (one of the login nodes) are evicted but never reconnect, and the following error appears in the syslog:&lt;/p&gt;

&lt;p&gt;Jan 14 08:54:13 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680476.616831&amp;#93;&lt;/span&gt; LustreError: 809:0:(cl_lock.c:1420:cl_unuse_try()) result = -108, this is unlikely!&lt;/p&gt;

&lt;p&gt;extract of syslog&lt;br/&gt;
Jan 14 08:55:19 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680478.577953&amp;#93;&lt;/span&gt; LustreError: 829:0:(cl_lock.c:1435:cl_unuse_locked()) } lock@ffff882d00691eb8&lt;br/&gt;
Jan 14 08:55:19 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680478.577958&amp;#93;&lt;/span&gt; LustreError: 829:0:(cl_lock.c:1435:cl_unuse_locked())    3 0: &amp;#8212;&lt;br/&gt;
Jan 14 08:55:19 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680478.577963&amp;#93;&lt;/span&gt; LustreError: 829:0:(cl_lock.c:1435:cl_unuse_locked())    4 0: &amp;#8212;&lt;br/&gt;
Jan 14 08:55:19 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680478.577972&amp;#93;&lt;/span&gt; LustreError: 829:0:(cl_lock.c:1435:cl_unuse_locked())    5 0: lock@ffff8838357b34d8&lt;span class=&quot;error&quot;&gt;&amp;#91;0 5 0 0 0 00000000&amp;#93;&lt;/span&gt; R(1):&lt;span class=&quot;error&quot;&gt;&amp;#91;0, 18446744073709551615&amp;#93;&lt;/span&gt;@&lt;span class=&quot;error&quot;&gt;&amp;#91;0x100070000:0x41dd7e:0x0&amp;#93;&lt;/span&gt; {&lt;br/&gt;
Jan 14 08:55:19 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680478.577982&amp;#93;&lt;/span&gt; LustreError: 829:0:(cl_lock.c:1435:cl_unuse_locked())    lovsub@ffff882cb4efc760: [5 ffff883bbf9a88e8 P(0):[0, 1844674&lt;br/&gt;
4073709551615]@&lt;span class=&quot;error&quot;&gt;&amp;#91;0x5000013f2:0x2872:0x0&amp;#93;&lt;/span&gt;]&lt;br/&gt;
Jan 14 08:55:19 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;680478.577992&amp;#93;&lt;/span&gt; LustreError: 829:0:(cl_lock.c:1435:cl_unuse_locked())    osc@ffff883cb5239d80: ffff883c470b7b40    0x20000041001 0x54d9d8518dc4ea31&lt;/p&gt;

&lt;p&gt;This looks similar to the issue reported in &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-3889&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-3889&lt;/a&gt;.&lt;/p&gt;</description>
                <environment></environment>
        <key id="23025">LU-4593</key>
            <summary>Lustre OSTs on one of the login nodes are evicted but never reconnect.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="orentas">Oz Rentas</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Thu, 6 Feb 2014 04:58:26 +0000</created>
                <updated>Thu, 23 Nov 2017 21:39:53 +0000</updated>
                            <resolved>Wed, 14 May 2014 08:50:14 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="76335" author="pjones" created="Thu, 6 Feb 2014 12:46:52 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please help with this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="76846" author="orentas" created="Wed, 12 Feb 2014 16:42:37 +0000"  >&lt;p&gt;Sprig2 hung again on 6th Feb.  Log files from the client as well as the OSS and MDS nodes are attached.  The only user on the machine at that time said he was updating the working copy of a subversion repository and had a couple of editor sessions open.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="77286" author="orentas" created="Tue, 18 Feb 2014 20:29:51 +0000"  >&lt;p&gt;Another crash event on 2/17. Attaching a fresh set of logs.  Any feedback is appreciated.&lt;/p&gt;</comment>
                            <comment id="77317" author="green" created="Wed, 19 Feb 2014 01:41:08 +0000"  >&lt;p&gt;hm, what do you mean by another &quot;crash&quot; event? Before you only had hangs of clients with the diagnostics, right? What was the crash?&lt;/p&gt;</comment>
                            <comment id="77318" author="bobijam" created="Wed, 19 Feb 2014 01:55:38 +0000"  >&lt;p&gt;Saw several communication errors &lt;/p&gt;

&lt;p&gt;Feb  6 16:18:56 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;621533.228954&amp;#93;&lt;/span&gt; LustreError: 11-0: scratch-OST0003-osc-ffff881f4fe08400: Communicating with 11.3.0.14@o2ib, operation obd_ping failed with -107.&lt;br/&gt;
Feb  6 16:27:41 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;622057.364356&amp;#93;&lt;/span&gt; LustreError: 11-0: scratch-MDT0000-mdc-ffff881f4fe08400: Communicating with 11.3.0.13@o2ib, operation obd_ping failed with -107.&lt;br/&gt;
Feb  6 16:29:46 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;622182.158450&amp;#93;&lt;/span&gt; LustreError: 11-0: scratch-OST0002-osc-ffff881f4fe08400: Communicating with 11.3.0.14@o2ib, operation obd_ping failed with -107.&lt;br/&gt;
Feb  6 16:31:26 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;622281.993824&amp;#93;&lt;/span&gt; LustreError: 11-0: scratch-OST0004-osc-ffff881f4fe08400: Communicating with 11.3.0.14@o2ib, operation obd_ping failed with -107.&lt;br/&gt;
Feb  6 16:39:21 sprig2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;622756.211587&amp;#93;&lt;/span&gt; LustreError: 11-0: scratch-OST0009-osc-ffff881f4fe08400: Communicating with 11.3.0.15@o2ib, operation obd_ping failed with -107.&lt;/p&gt;

&lt;p&gt;oss1-messages:Feb 17 13:04:45 sprig-oss1 kernel: : LustreError: 138-a: scratch-OST0003: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;br/&gt;
oss1-messages:Feb 17 13:05:00 sprig-oss1 kernel: : LustreError: 138-a: scratch-OST0004: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;br/&gt;
oss1-messages:Feb 17 13:05:15 sprig-oss1 kernel: : LustreError: 138-a: scratch-OST0001: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;br/&gt;
oss2-messages:Feb 17 13:04:53 sprig-oss2 kernel: : LustreError: 138-a: scratch-OST0006: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;br/&gt;
oss2-messages:Feb 17 13:05:08 sprig-oss2 kernel: : LustreError: 138-a: scratch-OST0007: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;br/&gt;
oss2-messages:Feb 17 13:05:22 sprig-oss2 kernel: : LustreError: 138-a: scratch-OST000a: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;br/&gt;
oss2-messages:Feb 17 13:05:30 sprig-oss2 kernel: : LustreError: 138-a: scratch-OST000b: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107&lt;/p&gt;

&lt;p&gt;Is the Lustre network working while the client is hung?&lt;/p&gt;</comment>
                            <comment id="77320" author="bobijam" created="Wed, 19 Feb 2014 03:42:58 +0000"  >&lt;p&gt;You can use &quot;lctl ping &amp;lt;NID&amp;gt;&quot; to test the Lustre network; the NID can be an MDS or OSS NID, e.g. 11.3.0.47@o2ib.&lt;/p&gt;</comment>
                            <comment id="78255" author="orentas" created="Mon, 3 Mar 2014 18:47:23 +0000"  >&lt;p&gt;Hi, this came in from the customer on 2/26. The SR owner asked me to update the notes with the new logs, but I missed that original request. Here it is:&lt;/p&gt;

&lt;p&gt;Please find attached &#8220;lctl ping&#8221; test results as well as the dmesg, Lustre log and messages file produced by a Lustre OST eviction event that occurred this morning on sprig5.&lt;/p&gt;

&lt;p&gt;This time we were the ones who triggered the bug rather than our users. We were recursively deleting a number of directories which contained a large number of small files on sprig5, which we use as a &#8220;scheduler node&#8221;. We were therefore the only users logged in, and the deletion was performed as the root user.&lt;/p&gt;

&lt;p&gt;I have since been able to recreate the issue on a compute node using a simple bash script that creates a large number of small files and then recursively deletes them using the rm command. Unfortunately, the bug isn&#8217;t 100% reproducible, but I hope this information may help you debug the issue further.&lt;/p&gt;</comment>
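
The reproducer described above can be sketched as a small POSIX shell script (a minimal sketch only; the directory layout, file counts, and names are illustrative, not taken from the site's actual script):

```shell
#!/bin/sh
# Sketch of a small-file churn reproducer: create many small files,
# then delete them recursively with rm, as described in the comment.
churn_small_files() {
    dir=$1 ndirs=$2 nfiles=$3
    i=0
    while [ "$i" -lt "$ndirs" ]; do
        mkdir -p "$dir/d$i"
        j=0
        while [ "$j" -lt "$nfiles" ]; do
            # each small-file create (and later unlink) exercises
            # MDS and OST lock traffic on a Lustre mount
            printf 'x' > "$dir/d$i/f$j"
            j=$((j + 1))
        done
        i=$((i + 1))
    done
    rm -rf "$dir"    # the recursive delete that preceded the evictions
}

churn_small_files "$(mktemp -d)" 4 25
```

On a Lustre client the directory argument would point at the scratch filesystem; as noted above, the hang is not 100% reproducible, so the script may need to be run in a loop.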
                            <comment id="78298" author="bobijam" created="Tue, 4 Mar 2014 01:04:17 +0000"  >&lt;p&gt;We still see sprig5 losing its connection to the OSS (11.3.0.14@o2ib) for an unknown reason:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Feb 26 10:50:44 sprig5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;7247556.277424&amp;#93;&lt;/span&gt; LustreError: 11-0: scratch-OST0001-osc-ffff880800c76400: Communicating with 11.3.0.14@o2ib, operation obd_ping failed with -107.&lt;br/&gt;
Feb 26 10:50:44 sprig5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;7247556.277432&amp;#93;&lt;/span&gt; Lustre: scratch-OST0003-osc-ffff880800c76400: Connection to scratch-OST0003 (at 11.3.0.14@o2ib) was lost; in progress operations using this service will wait for recovery to complete&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;and this causes a lot of ongoing I/O to error out, e.g.&lt;/p&gt;

&lt;p&gt;Feb 26 10:51:40 sprig5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;7247612.124147&amp;#93;&lt;/span&gt; LustreError: 129994:0:(cl_lock.c:1420:cl_unuse_try()) result = -108, this is unlikely!&lt;/p&gt;


</comment>
                            <comment id="78327" author="rganesan@ddn.com" created="Tue, 4 Mar 2014 11:56:28 +0000"  >&lt;p&gt;Received the latest Logs. &lt;/p&gt;

&lt;p&gt;The eviction has happened again on a login node.  We&#8217;ve been able to get the output from both dump files, dmesg and the messages file.  We tried the lctl ping and were able to ping each of the mds and oss nodes.&lt;/p&gt;</comment>
                            <comment id="78329" author="bobijam" created="Tue, 4 Mar 2014 12:44:42 +0000"  >&lt;p&gt;from lustre-lost.1393924031.9592 and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4593&quot; title=&quot;Lustre OSTs on one of the login nodes are evicted but never reconnect.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4593&quot;&gt;&lt;del&gt;LU-4593&lt;/del&gt;&lt;/a&gt;-lustre-log.1393924031.9587&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;00000100:00020000:29.0F:1393927331.111498:0:137200:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0007_UUID: rc = -110 waiting for callback (1 != 0)&lt;br/&gt;
00000100:00020000:28.0F:1393927331.111498:0:141486:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0004_UUID: rc = -110 waiting for callback (1 != 0)&lt;br/&gt;
00000100:00020000:4.0F:1393927331.111501:0:141487:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0001_UUID: rc = -110 waiting for callback (1 != 0)&lt;/p&gt;&lt;/blockquote&gt;

&lt;blockquote&gt;
&lt;p&gt;00000100:00020000:37.0F:1393926731.111467:0:137200:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0007_UUID: rc = -110 waiting for callback (1 != 0)&lt;br/&gt;
00000100:00020000:20.0F:1393926731.111476:0:141487:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0001_UUID: rc = -110 waiting for callback (1 != 0)&lt;br/&gt;
00000100:00020000:0.0F:1393926731.111480:0:141486:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0004_UUID: rc = -110 waiting for callback (1 != 0)&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Is the OSS which holds OST0001/4/7 very busy during the eviction?&lt;/p&gt;</comment>
                            <comment id="78591" author="rganesan@ddn.com" created="Thu, 6 Mar 2014 15:31:08 +0000"  >&lt;p&gt;We have checked the system load and it looks normal; here is the update we received:&lt;/p&gt;

&lt;p&gt;OST 1 &amp;amp; 4 are hosted on oss1, OST 7 is hosted on oss2.&lt;/p&gt;



&lt;p&gt;We have sysstat installed on the OSSs.&lt;/p&gt;



&lt;p&gt;The OSS machines don&#8217;t seem heavily loaded.  Most samples report 99.90% idle using the sar command.  The eviction happened at 23:08 last night but the OSSs reported no real change to the load at that time.&lt;/p&gt;



&lt;p&gt;Looking at the load on the machine yesterday, we see peaks of SYSTEM use and IOWAIT times at 10:30 and 14:30 on both OSSs. These peaks increase from the normal 0.07% system use to 0.30%, and from 0.01% iowait to 0.27%, at 14:30. I wouldn&#8217;t imagine this is considered heavily loaded?&lt;/p&gt;

</comment>
                            <comment id="78597" author="bobijam" created="Thu, 6 Mar 2014 16:41:09 +0000"  >&lt;p&gt;Would you mind trying to reproduce the issue once more with the client/OSS/MDS debug settings set as follows:&lt;/p&gt;

&lt;p&gt;#lctl set_param debug=&quot;+rpctrace +dlmtrace&quot; &lt;/p&gt;

&lt;p&gt;and try to set up a bigger debug buffer:&lt;/p&gt;

&lt;p&gt;#lctl set_param debug_mb=&amp;lt;# of megabytes&amp;gt;&lt;/p&gt;

&lt;p&gt;after the eviction happens, &quot;lctl dk &amp;gt; log_file&quot; on client, OSS and MDS nodes.&lt;/p&gt;

&lt;p&gt;I am trying to understand where the eviction comes from; the existing log is brief and does not cover the troublesome lock&apos;s lifespan. Thanks in advance.&lt;/p&gt;</comment>
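
Gathered into one sequence, the capture steps requested above look like this (a sketch only; the buffer size and log path are illustrative examples, and the commands must be run on each of the client, OSS, and MDS nodes):

```shell
# Before reproducing, on client, OSS and MDS nodes:
lctl set_param debug="+rpctrace +dlmtrace"     # add RPC and LDLM tracing
lctl set_param debug_mb=512                    # enlarge the debug buffer (example size)

# After the eviction is observed, on each node:
lctl dk > /tmp/lustre-debug.$(hostname).log    # dump the kernel debug log
```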
                            <comment id="79005" author="rganesan@ddn.com" created="Tue, 11 Mar 2014 15:23:03 +0000"  >&lt;p&gt;I have added a new set of logs to the FTP site, ftp.whamcloud.com/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4593&quot; title=&quot;Lustre OSTs on one of the login nodes are evicted but never reconnect.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4593&quot;&gt;&lt;del&gt;LU-4593&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The filenames are&lt;br/&gt;
2014-03-10-SR29797_AWE_luster_client.tar&lt;br/&gt;
2014-03-10-SR29797_AWE_luster_mds1.tar&lt;br/&gt;
2014-03-10-SR29797_AWE_luster_mds2.tar.gz&lt;br/&gt;
2014-03-10-SR29797_AWE_luster_oss1.tar.gz&lt;br/&gt;
2014-03-10-SR29797_AWE_luster_oss2.tar.gz&lt;/p&gt;</comment>
                            <comment id="79009" author="rganesan@ddn.com" created="Tue, 11 Mar 2014 15:35:24 +0000"  >&lt;p&gt;Sprig3 has just been evicted (by user activity) &#8211; that was 07 March 2014 16:41.&lt;/p&gt;

</comment>
                            <comment id="79210" author="bobijam" created="Thu, 13 Mar 2014 03:35:56 +0000"  >&lt;p&gt;The uploaded client logs (2014-03-10-SR29797_AWE_luster_client.tar) only cover Fri Mar  7 10:08:16 to Fri Mar  7 10:08:46, while the eviction happened after Mar  7 15:41:18. I also could not find error messages in the lustre-log.xxxx.xxx files. I think you&apos;ve set /proc/sys/lustre/dump_on_eviction to 1, so that a log is dumped automatically when an eviction happens; the problem is that when evictions happen again and again, the same log file for the same thread is overwritten. Please write 0 to /proc/sys/lustre/dump_on_eviction. I&apos;m still checking the server logs.&lt;/p&gt;

&lt;p&gt;PS: Could you tell me what the &quot;cat /proc/sys/lustre/dump_on_eviction&quot; and &quot;lctl get_param debug&quot; commands output? Thanks.&lt;/p&gt;</comment>
                            <comment id="79212" author="bobijam" created="Thu, 13 Mar 2014 04:55:29 +0000"  >&lt;p&gt;Also, can you help check whether the network works when high load occurs? What is the network topology and setup here? I strongly suspect a network problem, since the logs keep showing lost connections, subsequent reconnections, and then a repeat of the cycle.&lt;/p&gt;</comment>
                            <comment id="79493" author="bobijam" created="Mon, 17 Mar 2014 03:35:43 +0000"  >&lt;p&gt;Let me clarify the situation of this issue.&lt;br/&gt;
1. The client got evicted from the OSTs and never reconnected; does the client hang? If it hangs, can you collect/upload its stack trace?&lt;br/&gt;
2. When the eviction happens, the MDS and OSSes are not highly loaded and can still serve other clients well, correct?&lt;/p&gt;</comment>
                            <comment id="79577" author="rganesan@ddn.com" created="Tue, 18 Mar 2014 16:29:06 +0000"  >&lt;p&gt;1. sprig3 was evicted and none of the OSTs reconnected. The node was not hung, as we could log into it to grab the Lustre logs etc. In the end we rebooted to get Lustre back into service on the node.&lt;br/&gt;
2. Looking at the system accounting logs for the time period, the OSS and MDS nodes show no high load.&lt;/p&gt;</comment>
                            <comment id="79657" author="bobijam" created="Wed, 19 Mar 2014 13:40:51 +0000"  >&lt;p&gt;An update on what I&apos;ve observed so far.&lt;/p&gt;

&lt;p&gt;In the evicted client&apos;s log (lustre-log.1394206866.9593.gz), from the beginning (Mar 7 08:47:46) to the end of the log (Mar 7 17:07:46), several enqueue (o101) RPCs remain in flight to OSS1 without ever getting a response from OSS1.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar  7 15:42:46 sprig3 kernel: [1562020.535941] LustreError: 27830:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0005_UUID: rc = -110 waiting for callback (2 != 0)
Mar  7 15:42:46 sprig3 kernel: [1562020.535961] LustreError: 27830:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff883ac3c15800 x1460291788607144/t0(0) o101-&amp;gt;scratch-OST0005-osc-ffff883f77c95800@11.3.0.14@o2ib:28/4 lens 328/368 e 0 to 0 dl 1394205665 ref 1 fl Interpret:RE/0/0 rc -5/0
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;sprig3 is using 11.3.0.47@o2ib:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;dmesg.sprig3&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[  222.193940] LNet: Added LNI 11.3.0.47@o2ib [8/512/0/180]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And oss1 shows the corresponding communication with sprig3:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;grep -n &quot;11.3.0.47&quot; messages.oss1&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;45:Mar  7 15:40:54 sprig-oss1 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1406s: evicting client at 11.3.0.47@o2ib  ns: filter-scratch-OST0000_UUID lock: ffff88014fbf1480/0x7385d82236277793 lrc: 3/0,0 mode: PW/PW res: [0xc943f9:0x0:0x0].0 rrc: 2 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;8191) flags: 0x10020 nid: 11.3.0.47@o2ib remote: 0x9f2b96c814304d30 expref: 75858 pid: 32611 timeout: 13979851913 lvb_type: 0
49:Mar  7 15:40:54 sprig-oss1 kernel: : LustreError: 32600:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88073901c800 x1451777805607860/t0(0) o104-&amp;gt;scratch-OST0003@11.3.0.47@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
51:Mar  7 15:40:54 sprig-oss1 kernel: : LustreError: 32600:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 11.3.0.47@o2ib) returned 0 from blocking AST ns: filter-scratch-OST0003_UUID lock: ffff8801a74f3480/0x7385d8223628a2da lrc: 4/0,0 mode: PR/PR res: [0xc94b3e:0x0:0x0].0 rrc: 2 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x10020 nid: 11.3.0.47@o2ib remote: 0x9f2b96c814336365 expref: 70975 pid: 32641 timeout: 13979952124 lvb_type: 1
...
87:Mar  7 15:41:06 sprig-oss1 kernel: : LustreError: 478:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 11.3.0.47@o2ib arrived at 1394206866 with bad export cookie 8324297127499047490
88:Mar  7 15:41:06 sprig-oss1 kernel: : LustreError: 478:0:(ldlm_lock.c:2433:ldlm_lock_dump_handle()) ### ### ns: filter-scratch-OST0005_UUID lock: ffff88059e1dab40/0x7385d82234639d0a lrc: 3/0,0 mode: PR/PR res: [0x933e33:0x0:0x0].0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x0 nid: 11.3.0.47@o2ib remote: 0x9f2b96c80fb8a3f9 expref: 41230 pid: 13975 timeout: 0 lvb_type: 1
91:Mar  7 15:44:53 sprig-oss1 kernel: : Lustre: scratch-OST0005: haven&apos;t heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880848a7d800, cur 1394207093 expire 1394206943 last 1394206866
93:Mar  7 15:44:55 sprig-oss1 kernel: : Lustre: scratch-OST0002: haven&apos;t heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff8804d2b2fc00, cur 1394207095 expire 1394206945 last 1394206868
95:Mar  7 15:45:08 sprig-oss1 kernel: : Lustre: scratch-OST0003: haven&apos;t heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880582cb6800, cur 1394207108 expire 1394206958 last 1394206881
101:Mar  7 15:45:20 sprig-oss1 kernel: : Lustre: scratch-OST0004: haven&apos;t heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880848a7d000, cur 1394207120 expire 1394206970 last 1394206893
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="79815" author="bobijam" created="Thu, 20 Mar 2014 06:39:38 +0000"  >&lt;p&gt;From the client&apos;s forever-waiting request status:&lt;/p&gt;

&lt;p&gt;Mar  7 15:42:46 sprig3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1562020.535961&amp;#93;&lt;/span&gt; LustreError: 27830:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff883ac3c15800 x1460291788607144/t0(0) o101-&amp;gt;scratch-OST0005-osc-ffff883f77c95800@11.3.0.14@o2ib:28/4 lens 328/368 e 0 to 0 dl 1394205665 ref 1 fl Interpret:RE/0/0 rc -5/0&lt;/p&gt;

&lt;p&gt;Its rq_phase is &quot;Interpret&quot; and it has been replied to with an error (&quot;RE&quot;), which drew my attention to ptlrpc_check_set():&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;ptlrpc_check_set()&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;...
        interpret:
                LASSERT(req-&amp;gt;rq_phase == RQ_PHASE_INTERPRET);

                /* This moves to &lt;span class=&quot;code-quote&quot;&gt;&quot;unregistering&quot;&lt;/span&gt; phase we need to wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt;
                 * reply unlink. */
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!unregistered &amp;amp;&amp;amp; !ptlrpc_unregister_reply(req, 1)) {
                        &lt;span class=&quot;code-comment&quot;&gt;/* start async bulk unlink too */&lt;/span&gt;
                        ptlrpc_unregister_bulk(req, 1);
                        &lt;span class=&quot;code-keyword&quot;&gt;continue&lt;/span&gt;;
                }

                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!ptlrpc_unregister_bulk(req, 1))
                        &lt;span class=&quot;code-keyword&quot;&gt;continue&lt;/span&gt;;

                /* When calling interpret receiving already should be
                 * finished. */
                LASSERT(!req-&amp;gt;rq_receiving_reply);

                ptlrpc_req_interpret(env, req, req-&amp;gt;rq_status);

                ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I suspect there is a bug here.&lt;/p&gt;</comment>
                            <comment id="79824" author="bobijam" created="Thu, 20 Mar 2014 10:42:18 +0000"  >&lt;p&gt;Hi Rajeshwaran,&lt;/p&gt;

&lt;p&gt;This is a debug patch: &lt;a href=&quot;http://review.whamcloud.com/#/c/9732/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/9732/&lt;/a&gt;; only the client needs it. Please remember to set a debug level that includes rpctrace and dlmtrace; the currently uploaded logs do not contain these debug messages. Please apply this debug patch, turn on these debug levels, and try to reproduce the issue. Thank you.&lt;/p&gt;</comment>
                            <comment id="79967" author="rganesan@ddn.com" created="Fri, 21 Mar 2014 06:55:56 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;Is it possible to get the patched Lustre source RPM? We had some issues while compiling the source RPM with the patches. It would be great if you could give us the patched Lustre client.&lt;/p&gt;


&lt;p&gt;Thanks,&lt;/p&gt;</comment>
                            <comment id="79968" author="bobijam" created="Fri, 21 Mar 2014 07:05:41 +0000"  >&lt;p&gt;&lt;a href=&quot;http://build.whamcloud.com/job/lustre-reviews/22597/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-reviews/22597/&lt;/a&gt; may contain what you need; you can find your system in the build matrix, and under &quot;Build Artifacts&quot; you can download the corresponding source RPM.&lt;/p&gt;</comment>
                            <comment id="80285" author="rganesan@ddn.com" created="Wed, 26 Mar 2014 09:56:32 +0000"  >&lt;p&gt;We have applied the patch on the client and are monitoring it for any evictions.&lt;/p&gt;</comment>
                            <comment id="80307" author="bfaccini" created="Wed, 26 Mar 2014 16:57:15 +0000"  >&lt;p&gt;Here are some updates for this ticket after the conf-call:&lt;br/&gt;
       _ an IB network audit/test (ibdiagnet from the subnet manager/node) has been run and did not show any problem.&lt;br/&gt;
       _ debug-patch #9732 has been applied on Clients.&lt;br/&gt;
       _ a known reproducer/script is currently being run, and we are waiting for a new occurrence.&lt;br/&gt;
       _ this reproducer/script will be provided, along with any details to reproduce and also the Lustre configuration (OSSs/OSTs number, ...) being used on-site.&lt;br/&gt;
       _ as already requested, customer is running with D_RPCTRACE and D_DLMTRACE traces enabled on their Clients.&lt;br/&gt;
       _ debug buffer size has also been increased (exact value to be provided).&lt;br/&gt;
       _ dump_on_timeout and dump_on_eviction have been unset.&lt;br/&gt;
       _ when the eviction re-occurs, the Lustre debug-log will be manually taken on the Client, OSS and MDS nodes.&lt;br/&gt;
       _ it would also be of interest to have the exact Lustre version being run on Clients/Servers, along with the list/detail of any additional patches applied.&lt;/p&gt;</comment>
                            <comment id="80819" author="rganesan@ddn.com" created="Wed, 2 Apr 2014 10:02:10 +0000"  >&lt;p&gt;The ibdiagnet logs are on the FTP server under the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4593&quot; title=&quot;Lustre OSTs on one of the login nodes are evicted but never reconnect.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4593&quot;&gt;&lt;del&gt;LU-4593&lt;/del&gt;&lt;/a&gt; folder.&lt;/p&gt;


&lt;p&gt;The clients are running SLES 11 SP3:&lt;br/&gt;
Lustre on clients: 2.4.3&lt;br/&gt;
Kernel: 3.0.93-0.8-default&lt;/p&gt;

&lt;p&gt;Server info:&lt;br/&gt;
MDS/OSSes run CentOS with Lustre 2.4.1&lt;br/&gt;
Kernel: kernel-2.6.32-358.18.1&lt;/p&gt;</comment>
                            <comment id="80835" author="bfaccini" created="Wed, 2 Apr 2014 12:55:59 +0000"  >&lt;p&gt;Hello Rajeshwaran,&lt;br/&gt;
Thanks for the ibdiagnet traces; is there anything else you can provide from the items I listed in my previous update?&lt;/p&gt;
</comment>
                            <comment id="80937" author="bfaccini" created="Thu, 3 Apr 2014 14:19:06 +0000"  >&lt;p&gt;After a joint conf-call with DDN/SGI/AWE, it appears that they no longer trigger evictions while running Lustre 2.4.3 (including debug patch #9732) on the clients. So we need to determine whether this new behavior is caused by the debug patch or by changes between 2.4.1 and 2.4.3.&lt;/p&gt;</comment>
                            <comment id="81284" author="bfaccini" created="Wed, 9 Apr 2014 14:37:28 +0000"  >&lt;p&gt;As requested during the last conf-call, I am adding here the commit-msg content from patch #9732, which describes the non-debug (fix) part of its change:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LU-4593 ptlrpc: req could loop forever in sending list

In ptlrpc_check_set(), when a request runs into
RQ_PHASE_UNREGISTERING phase, if the reply is successfully unlinked,
the unregistered flag should be set accordingly, otherwise the
request could possibly run into interpret section and repeat the reply
unregister.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="81737" author="bfaccini" created="Wed, 16 Apr 2014 14:47:32 +0000"  >&lt;p&gt;Rajeshwaran: I forgot to ask whether, when running with the debug version of #9732, you found any of the added &quot;Handled 1000 reqs in the set 0xxxxx during yyyy seconds.&quot; debug messages, along with a long list of ptlrpc_request(s) being dumped for the set, in the debug log (with dlmtrace+rpctrace set) of the client where you reproduced the problem?&lt;/p&gt;

&lt;p&gt;Bobi: now that you have pushed a &quot;fix only&quot; version/patch-set of #9732, do you think the same fix is required for b2_5/master too?&lt;/p&gt;</comment>
                            <comment id="81738" author="bobijam" created="Wed, 16 Apr 2014 14:52:51 +0000"  >&lt;p&gt;Yes, I&apos;ll push patches for master and b2_5.&lt;/p&gt;</comment>
                            <comment id="81837" author="bobijam" created="Thu, 17 Apr 2014 15:43:15 +0000"  >&lt;p&gt;The patch for master is tracked at &lt;a href=&quot;http://review.whamcloud.com/#/c/9975/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/9975/&lt;/a&gt;, and for b2_5 at &lt;a href=&quot;http://review.whamcloud.com/#/c/9992&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/9992&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="84074" author="rganesan@ddn.com" created="Wed, 14 May 2014 07:57:41 +0000"  >&lt;p&gt;Hello, &lt;/p&gt;

&lt;p&gt;Please close &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4593&quot; title=&quot;Lustre OSTs on one of the login nodes are evicted but never reconnect.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4593&quot;&gt;&lt;del&gt;LU-4593&lt;/del&gt;&lt;/a&gt;; the issue didn&apos;t occur again after applying the patch.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Rajesh&lt;/p&gt;</comment>
                            <comment id="84099" author="jfc" created="Wed, 14 May 2014 16:25:32 +0000"  >&lt;p&gt;Thank you for the update and confirmation, Rajesh &amp;#8211; very helpful! &lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="14219" name="LU-4593-lustre-log.1393924031.9587.gz" size="239893" author="rganesan@ddn.com" created="Tue, 4 Mar 2014 11:58:04 +0000"/>
                            <attachment id="14216" name="dmesg-20140304" size="216901" author="rganesan@ddn.com" created="Tue, 4 Mar 2014 11:56:28 +0000"/>
                            <attachment id="14203" name="lctl_ping.txt" size="12810" author="orentas" created="Mon, 3 Mar 2014 18:47:23 +0000"/>
                            <attachment id="14266" name="lustre-evict-7mar-client.tar" size="2007040" author="rganesan@ddn.com" created="Tue, 11 Mar 2014 15:24:21 +0000"/>
                            <attachment id="14267" name="lustre-evict-7mar14-mds1.tar" size="153600" author="rganesan@ddn.com" created="Tue, 11 Mar 2014 15:24:21 +0000"/>
                            <attachment id="14217" name="lustre-log.1393924031.9592" size="189151" author="rganesan@ddn.com" created="Tue, 4 Mar 2014 11:56:28 +0000"/>
                            <attachment id="14131" name="mds1-messages" size="958" author="orentas" created="Tue, 18 Feb 2014 20:29:51 +0000"/>
                            <attachment id="14132" name="mds2-messages" size="2689" author="orentas" created="Tue, 18 Feb 2014 20:29:51 +0000"/>
                            <attachment id="14218" name="messages-20140304" size="121122" author="rganesan@ddn.com" created="Tue, 4 Mar 2014 11:56:28 +0000"/>
                            <attachment id="14102" name="messages-mds1" size="4514" author="orentas" created="Wed, 12 Feb 2014 16:42:37 +0000"/>
                            <attachment id="14103" name="messages-mds2" size="4865" author="orentas" created="Wed, 12 Feb 2014 16:42:37 +0000"/>
                            <attachment id="14104" name="messages-oss1" size="7435" author="orentas" created="Wed, 12 Feb 2014 16:42:37 +0000"/>
                            <attachment id="14105" name="messages-oss2" size="7595" author="orentas" created="Wed, 12 Feb 2014 16:42:37 +0000"/>
                            <attachment id="14106" name="messages-sprig2" size="28892" author="orentas" created="Wed, 12 Feb 2014 16:42:37 +0000"/>
                            <attachment id="14133" name="messages-sprig3" size="41219" author="orentas" created="Tue, 18 Feb 2014 20:29:51 +0000"/>
                            <attachment id="14134" name="oss1-messages" size="5932" author="orentas" created="Tue, 18 Feb 2014 20:29:51 +0000"/>
                            <attachment id="14135" name="oss2-messages" size="5218" author="orentas" created="Tue, 18 Feb 2014 20:29:51 +0000"/>
                            <attachment id="14055" name="sprig-mds1-messages.txt" size="694" author="orentas" created="Thu, 6 Feb 2014 04:58:26 +0000"/>
                            <attachment id="14056" name="sprig-mds2-messages.txt" size="448" author="orentas" created="Thu, 6 Feb 2014 04:58:26 +0000"/>
                            <attachment id="14057" name="sprig-oss1-messages.txt" size="4389" author="orentas" created="Thu, 6 Feb 2014 04:58:26 +0000"/>
                            <attachment id="14058" name="sprig-oss2-messages.txt" size="4624" author="orentas" created="Thu, 6 Feb 2014 04:58:26 +0000"/>
                            <attachment id="14053" name="sprig2-lfs.txt" size="1536" author="orentas" created="Thu, 6 Feb 2014 04:58:26 +0000"/>
                            <attachment id="14054" name="sprig2-messages-20140130.txt" size="28048" author="orentas" created="Thu, 6 Feb 2014 04:58:26 +0000"/>
                            <attachment id="14204" name="sprig5-dmesg-20140226.txt" size="20787" author="orentas" created="Mon, 3 Mar 2014 18:47:23 +0000"/>
                            <attachment id="14205" name="sprig5-lustre-log.1393411844.12537" size="276847" author="orentas" created="Mon, 3 Mar 2014 18:47:23 +0000"/>
                            <attachment id="14206" name="sprig5-messages-20140226.txt" size="20526" author="orentas" created="Mon, 3 Mar 2014 18:47:23 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzweev:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12546</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>