<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:38:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3946] Frequent client eviction</title>
                <link>https://jira.whamcloud.com/browse/LU-3946</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Dear support,&lt;/p&gt;

&lt;p&gt;Our customer is experiencing an eviction problem on the login/client nodes of their Lustre cluster.&lt;br/&gt;
The evictions seem too frequent and, since there seems to be no way to recover other than rebooting the node, this interrupts the users&apos; work heavily.&lt;br/&gt;
The cause of the evictions seems to be that the login nodes may be temporarily stalled by different applications for fractions of a second, and consequently the OSS does not see the client for minutes (according to the logs).&lt;br/&gt;
The customer attempted to solve the issue by booting the login nodes with kernel options &quot;notsc&quot; and &quot;clocksource=hpet&quot; to no avail.&lt;br/&gt;
Also, they mounted lustre over tcp instead of IB, which also did not help. &lt;/p&gt;

&lt;p&gt;Infrastructure info:&lt;br/&gt;
2 nodes for MDS/MGS in a pacemaker/corosync cluster&lt;br/&gt;
8 nodes for OSS, in 2-node pacemaker/corosync cluster configurations with a DotHill storage controller per pair&lt;br/&gt;
~1000 clients&lt;/p&gt;

&lt;p&gt;We attach the messages log file covering several days.&lt;br/&gt;
For example, we found about 280 clients evicted in 5 days; is this normal?&lt;/p&gt;

&lt;p&gt;For clarity, the login nodes are named brutus&lt;span class=&quot;error&quot;&gt;&amp;#91;1-4&amp;#93;&lt;/span&gt;, with IB IPs 10.201.32.31-34 and Ethernet IPs 10.201.0.31-34.&lt;/p&gt;

&lt;p&gt;Is it possible that this issue arises from the different versions of the Lustre software between clients and servers?&lt;/p&gt;

&lt;p&gt;Many thanks in advance for your help.&lt;/p&gt;</description>
                <environment>Lustre servers:&lt;br/&gt;
CentOS 6.2 with Linux version 2.6.32-220.4.2.el6_lustre.x86_64 (&lt;a href=&apos;mailto:jenkins@client&apos;&gt;jenkins@client&lt;/a&gt; 31.lab.whamcloud.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 14 13:03:47 PDT 2012&lt;br/&gt;
&lt;br/&gt;
Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64&lt;br/&gt;
&lt;br/&gt;
HW: 2xIntel(R) Xeon(R) CPU E5645  @ 2.40GHz, RAM 48GB&lt;br/&gt;
&lt;br/&gt;
Lustre clients:&lt;br/&gt;
CentOS 6.4 with Linux version 2.6.32-358.6.2.el6.x86_64 (&lt;a href=&apos;mailto:mockbuild@c6b8.bsys.dev.centos.org&apos;&gt;mockbuild@c6b8.bsys.dev.centos.org&lt;/a&gt;) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Thu May 16 20:59:36 UTC 2013&lt;br/&gt;
&lt;br/&gt;
Build Version: 2.4.0-RC2--CHANGED-2.6.32-358.6.2.el6.x86_64&lt;br/&gt;
&lt;br/&gt;
HW: Login nodes are of type 12-core AMD Opteron 6174&lt;br/&gt;
Compute nodes are a mix of: &lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;12-core AMD Opteron 6174&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;quad-core AMD Opteron 8380 &lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;quad-core AMD Opteron 8384 &lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;8-core Intel Xeon E7-8837</environment>
        <key id="20949">LU-3946</key>
            <summary>Frequent client eviction</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="matteo.piccinini">Matteo Piccinini</reporter>
                        <labels>
                    </labels>
                <created>Fri, 13 Sep 2013 16:48:11 +0000</created>
                <updated>Mon, 28 Apr 2014 13:41:53 +0000</updated>
                            <resolved>Mon, 28 Apr 2014 13:41:21 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="66656" author="pjones" created="Fri, 13 Sep 2013 21:29:18 +0000"  >&lt;p&gt;Matteo&lt;/p&gt;

&lt;p&gt;It is certainly true that 2.2 servers with 2.4 clients is not a combination that we officially test or support. However, rather than jump to conclusions, we should have an engineer review the evidence and see if this is indeed the reason for the problems.&lt;/p&gt;

&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please review the information provided and make an assessment?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="66675" author="hongchao.zhang" created="Sat, 14 Sep 2013 06:23:49 +0000"  >&lt;p&gt;Hi Matteo&lt;/p&gt;

&lt;p&gt;What is the format of the second log file &quot;lustre_and_login_nodes_logs.tar.bz2ab&quot;? It isn&apos;t a bzip2 file like the first one.&lt;br/&gt;
The logs in the first file &quot;lustre_and_login_nodes_logs.tar.bz2aa&quot; only contain logs from the clients, and the server logs are needed to&lt;br/&gt;
determine the cause of the eviction.&lt;/p&gt;

&lt;p&gt;BTW, does the &quot;stuck&quot; state of the login nodes affect the whole system or only the applications? How long is the maximum duration of a stall?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
                            <comment id="66694" author="matteo.piccinini" created="Sun, 15 Sep 2013 20:41:56 +0000"  >&lt;p&gt;Hi Peter and Hongchao,&lt;/p&gt;

&lt;p&gt;Thanks for your further investigation.&lt;br/&gt;
I&apos;m sorry, I forgot to mention that I attached an archive split into two pieces (GNU coreutils split) because of the 10MB limit; if necessary I can upload a new one.&lt;/p&gt;

&lt;p&gt;The maximum duration of the &quot;stuck&quot; state is variable, from seconds to several minutes, and it affects only the login node itself, requiring a manual system reboot.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="66708" author="hongchao.zhang" created="Mon, 16 Sep 2013 08:54:12 +0000"  >&lt;p&gt;Hi Matteo,&lt;/p&gt;

&lt;p&gt;The eviction is related to the &quot;stuck&quot; state you mentioned above, for there is no &quot;ping&quot; request for more than 227 seconds (the ping request is sent&lt;br/&gt;
by a dedicated thread started alongside Lustre). The time interval before eviction can be extended as follows:&lt;/p&gt;

&lt;p&gt;echo &quot;newtimeout&quot; &amp;gt; /proc/sys/lustre/timeout&lt;/p&gt;

&lt;p&gt;The interval will be &quot;newtimeout * 9 / 4&quot;.&lt;/p&gt;

&lt;p&gt;If the stalls can&apos;t be avoided, how about increasing the timeout value?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
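<!--
The timeout arithmetic described in the comment above can be sketched as a small shell snippet (illustrative only; the 9/4 factor, the /proc path, and the default 100-second timeout are taken from this ticket, while the function name is made up):

```shell
# Approximate client-eviction window, per the comment above:
#   interval = timeout * 9 / 4
# With the default timeout of 100 seconds this gives 225 seconds,
# consistent with the "no ping for more than 227 seconds" observation.
eviction_window() {
    # $1: Lustre timeout in seconds (as written to /proc/sys/lustre/timeout)
    echo $(( $1 * 9 / 4 ))
}

eviction_window 100   # prints 225
eviction_window 200   # prints 450
eviction_window 300   # prints 675
```

The 450-second and 675-second windows are roughly consistent with the evictions around 457 and 667 seconds reported later in this ticket for timeout values of 200 and 300.
-->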
                            <comment id="66818" author="matteo.piccinini" created="Tue, 17 Sep 2013 08:17:21 +0000"  >&lt;p&gt;Hi Hongchao,&lt;/p&gt;

&lt;p&gt;we considered to change the timeout values.&lt;br/&gt;
We will do tests and let you know as soon as possible if it solves the problem.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;

&lt;p&gt;P.S. I don&apos;t know how to change the ticket status&lt;/p&gt;</comment>
                            <comment id="66819" author="pjones" created="Tue, 17 Sep 2013 08:30:38 +0000"  >&lt;p&gt;Hi Matteo&lt;/p&gt;

&lt;p&gt;It is ok - you do not need to change the ticket status. Please just let us know how you get on with your tests.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="68573" author="matteo.piccinini" created="Tue, 8 Oct 2013 13:52:37 +0000"  >&lt;p&gt;Log of eviction with timeout value of 200 seconds&lt;/p&gt;</comment>
                            <comment id="68574" author="matteo.piccinini" created="Tue, 8 Oct 2013 13:52:56 +0000"  >&lt;p&gt;Hi Hongchao,&lt;/p&gt;

&lt;p&gt;we configured the timeout from the default 100 seconds to 200 seconds with the following command, but the nodes were evicted after around 457 seconds.&lt;/p&gt;

&lt;p&gt;lctl conf_param nero.sys.timeout=200&lt;/p&gt;

&lt;p&gt;We then tried a timeout value of 300 seconds and again found evictions after around 667 seconds.&lt;br/&gt;
Where can I find some best practices for setting the static timeout in proportion to the number of nodes?&lt;/p&gt;

&lt;p&gt;Do you have any other useful advice to try to solve this problem?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Matteo&lt;/p&gt;

&lt;p&gt;P.S. We attached the log file with timeout value of 200s &lt;span class=&quot;error&quot;&gt;&amp;#91;logs_timeout_200s.tar.bz2&amp;#93;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="72826" author="bfaccini" created="Wed, 4 Dec 2013 17:17:30 +0000"  >&lt;p&gt;Hello Matteo,&lt;/p&gt;

&lt;p&gt;Sorry for the late update.&lt;br/&gt;
Would it be possible to boot at least one of the impacted login nodes with the &quot;nohz=off&quot; boot parameter and see how it runs?&lt;/p&gt;</comment>
                            <comment id="73347" author="gabriele.paciucci" created="Thu, 12 Dec 2013 09:31:56 +0000"  >&lt;p&gt;Comment from the customer:&lt;/p&gt;

&lt;p&gt;&amp;lt;&amp;lt;Dear Gabriele&lt;/p&gt;

&lt;p&gt;I have set nohz=off on two out of the four problematic servers on 5.12.&lt;br/&gt;
Yesterday evening, one of them was evicted again (nodename brutus2).&lt;/p&gt;

&lt;p&gt;So, this did not help. &lt;/p&gt;

&lt;p&gt;BTW: the other node has nohz=off as well as notsc and clocksource=hpet set.&amp;gt;&amp;gt;&lt;/p&gt;</comment>
                            <comment id="73349" author="bfaccini" created="Thu, 12 Dec 2013 10:36:21 +0000"  >&lt;p&gt;Thanks for attaching the update, Gabriele. I also attach both files/logs (brutus2_eviction.txt, messages_brutus2.txt) that were provided with this update.&lt;/p&gt;

&lt;p&gt;Matteo, does this mean that nohz=off helped to avoid the issue for some time?&lt;/p&gt;

&lt;p&gt;Also, I am afraid I don&apos;t fully understand what you mean by &quot;stuck&quot;; is it a scheduling issue caused by the heavy load on the login nodes?&lt;/p&gt;

&lt;p&gt;Are you running with any special ptlrpcd module multi-thread/NUMA-policy parameters?&lt;/p&gt;

&lt;p&gt;I see in the logs that the evictions almost always start from the OSS side with the &quot;LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after &#8230;.&quot; message, so can you provide the lustre debug-log (with full debug enabled and the biggest debug buffer) at the time of the evict, from both the Client and OSS sides?&lt;/p&gt;
</comment>
                            <comment id="73658" author="bfaccini" created="Tue, 17 Dec 2013 08:28:02 +0000"  >&lt;p&gt;Adding the latest info/exchanges from/with the customer:&lt;/p&gt;

&lt;p&gt;     _ Since it was never provided, I requested the Client/OSS Lustre debug log during an evict. Since the current debug mask setting was only &quot;ioctl neterror warning error emerg ha config console&quot; on all Clients/Servers, I requested &quot;+dlmtrace +rpctrace&quot;. This is now set up on one login/Client node on-site.&lt;/p&gt;

&lt;p&gt;     _ I requested the current ptlrpcd NUMA settings. On all Client/Server nodes nothing specific is actually configured: max_ptlrpcds=0 and ptlrpcd_bind_policy=3.&lt;/p&gt;

&lt;p&gt;     _ I also requested the current vm.zone_reclaim_mode setting, and since it was 1/on, I requested to have it changed to 0/off on at least one login/Client node. This has been done by the customer.&lt;/p&gt;

&lt;p&gt;I also attach the latest Lustre+NUMA configuration info provided by the customer (requested_outputs.tar.bz2).&lt;/p&gt;</comment>
                            <comment id="74176" author="bfaccini" created="Mon, 30 Dec 2013 22:12:42 +0000"  >&lt;p&gt;The customer provided more info (Client lustre debug-log, Client/OSS syslog, ...) taken during a new occurrence. It is the &quot;ETHZ_client_eviction_brutus3_n-oss07_20131220.tar.bz2&quot; attachment I just uploaded.&lt;/p&gt;

&lt;p&gt;Here are my first analysis comments on this new info.&lt;/p&gt;

&lt;p&gt;The following similar stack dumps for multiple ldlm_bl_xx threads before the evict:&lt;br/&gt;
================================================================&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: INFO: task ldlm_bl_43:12517 blocked for more than 120 seconds.&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: ldlm_bl_43    D 000000000000000c     0 12517      2 0x00000080&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: ffff880ae35a3d50 0000000000000046 0000000000000000 ffffffffa0582967&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: 0000000100000000 0000000000000000 ffff880ae35a3cf0 ffffffffa0581f22&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: ffff880ad9e7c5f8 ffff880ae35a3fd8 000000000000fb88 ffff880ad9e7c5f8&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: Call Trace:&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0582967&amp;gt;&amp;#93;&lt;/span&gt; ? cfs_hash_bd_lookup_intent+0x37/0x130 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0581f22&amp;gt;&amp;#93;&lt;/span&gt; ? cfs_hash_bd_add_locked+0x62/0x90 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8150f1ee&amp;gt;&amp;#93;&lt;/span&gt; __mutex_lock_slowpath+0x13e/0x180&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8150f08b&amp;gt;&amp;#93;&lt;/span&gt; mutex_lock+0x2b/0x50&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06d5ebf&amp;gt;&amp;#93;&lt;/span&gt; cl_lock_mutex_get+0x6f/0xd0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0a4495a&amp;gt;&amp;#93;&lt;/span&gt; osc_ldlm_blocking_ast+0x7a/0x350 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa057d2c1&amp;gt;&amp;#93;&lt;/span&gt; ? libcfs_debug_msg+0x41/0x50 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07f1f00&amp;gt;&amp;#93;&lt;/span&gt; ldlm_handle_bl_callback+0x130/0x400 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07f2451&amp;gt;&amp;#93;&lt;/span&gt; ldlm_bl_thread_main+0x281/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81063310&amp;gt;&amp;#93;&lt;/span&gt; ? default_wake_function+0x0/0x20&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07f21d0&amp;gt;&amp;#93;&lt;/span&gt; ? ldlm_bl_thread_main+0x0/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c0ca&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07f21d0&amp;gt;&amp;#93;&lt;/span&gt; ? ldlm_bl_thread_main+0x0/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07f21d0&amp;gt;&amp;#93;&lt;/span&gt; ? ldlm_bl_thread_main+0x0/0x3d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Dec 20 10:00:54 brutus3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c0c0&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20&lt;br/&gt;
================================================================&lt;br/&gt;
look like an indication of a problem/deadlock on the Client side.&lt;/p&gt;

&lt;p&gt;And the associated Lustre debug-trace sequence is also always the same, like:&lt;br/&gt;
==================================================================&lt;br/&gt;
00000100:00100000:33.0:1387529928.782898:0:3546:0:(events.c:352:request_in_callback()) peer: 12345-10.201.62.37@o2ib&lt;br/&gt;
00000100:00100000:25.0:1387529928.782936:0:3640:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1444519122204514&lt;br/&gt;
00000100:00080000:25.0:1387529928.782961:0:3640:0:(service.c:1079:ptlrpc_update_export_timer()) updating export LOV_OSC_UUID at 1387529928 exp ffff8820303d4000&lt;br/&gt;
00000100:00100000:25.0:1387529928.782985:0:3640:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.201.62.37@o2ib, seq: 15852&lt;br/&gt;
00000100:00100000:25.0:1387529928.783003:0:3640:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb03_002:LOV_OSC_UUID+4:4310:x1444519122204514:12345-10.201.62.37@o2ib:106&lt;br/&gt;
00010000:00010000:25.0:1387529928.783029:0:3640:0:(ldlm_lockd.c:1882:ldlm_handle_gl_callback()) ### client glimpse AST callback handler ns: nero-OST000e-osc-ffff88203a24a800 lock: ffff88091b327a00/0x30c32bca9aefeafc lrc: 8/0,0 mode: PW/PW res: 119352871/0 rrc: 1 type: EXT &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;18446744073709551615&amp;#93;&lt;/span&gt; (req 0-&amp;gt;18446744073709551615) flags: 0x29400000000 nid: local remote: 0x87d5d77ddb4e8fb0 expref: -99 pid: 45673 timeout: 0 lvb_type: 1&lt;br/&gt;
00000020:00010000:25.0:1387529928.783063:0:3640:0:(cl_object.c:305:cl_object_glimpse()) header@ffff88035bd6a6a0[0x0, 5, &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20001c158:0xdd90:0x0&amp;#93;&lt;/span&gt; hash]&lt;br/&gt;
00000020:00010000:25.0:1387529928.783065:0:3640:0:(cl_object.c:305:cl_object_glimpse()) size: 348 mtime: 1387455174 atime: 1387456028 ctime: 1387455174 blocks: 8&lt;br/&gt;
00000020:00010000:25.0:1387529928.783068:0:3640:0:(cl_object.c:305:cl_object_glimpse()) header@ffff880334d7e7f0[0x0, 3, &lt;span class=&quot;error&quot;&gt;&amp;#91;0x1000e0000:0x71d2e27:0x0&amp;#93;&lt;/span&gt; hash]&lt;br/&gt;
00000020:00010000:25.0:1387529928.783069:0:3640:0:(cl_object.c:305:cl_object_glimpse()) size: 348 mtime: 1387455174 atime: 1387456028 ctime: 1387455174 blocks: 8&lt;br/&gt;
00000100:00100000:25.0:1387529928.783125:0:3640:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb03_002:LOV_OSC_UUID+4:4310:x1444519122204514:12345-10.201.62.37@o2ib:106 Request procesed in 135us (231us total) trans 0 rc 0/0&lt;br/&gt;
00000100:00100000:25.0:1387529928.783131:0:3640:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.201.62.37@o2ib, seq: 15852&lt;br/&gt;
00010000:00010000:12.0:1387529928.783139:0:12517:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nero-OST000e-osc-ffff88203a24a800 lock: ffff88091b327a00/0x30c32bca9aefeafc lrc: 8/0,0 mode: PW/PW res: 119352871/0 rrc: 1 type: EXT &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;18446744073709551615&amp;#93;&lt;/span&gt; (req 0-&amp;gt;18446744073709551615) flags: 0x29400000000 nid: local remote: 0x87d5d77ddb4e8fb0 expref: -99 pid: 45673 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:12.0:1387529928.783150:0:12517:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff88091b327a00 already unused, calling callback (ffffffffa0a448e0)&lt;br/&gt;
==================================================================&lt;/p&gt;

&lt;p&gt;At first look, the problem seems related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2683&quot; title=&quot;Client deadlock in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2683&quot;&gt;&lt;del&gt;LU-2683&lt;/del&gt;&lt;/a&gt;, but all the relevant patches have landed in b2_4, so ultimately that is unlikely to be the case.&lt;br/&gt;
A good next step would be to get a crash-dump at the time of the evict, or at least full stack traces of all threads, to understand who owns the cl_lock mutex preventing the ldlm_bl_xx threads from completing/answering the blocking AST.&lt;/p&gt;

&lt;p&gt;I will try to come back with a procedure to get more info during future occurrences, or maybe with a specific debug patch.&lt;/p&gt;</comment>
                            <comment id="74187" author="bfaccini" created="Tue, 31 Dec 2013 08:40:08 +0000"  >&lt;p&gt;Hmm, in fact I wonder if this problem could be a new facet of the same problem as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2683&quot; title=&quot;Client deadlock in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2683&quot;&gt;&lt;del&gt;LU-2683&lt;/del&gt;&lt;/a&gt;, but under the untested/unsupported hybrid 2.2/2.4 configuration?&#8230;&lt;/p&gt;</comment>
                            <comment id="74401" author="bfaccini" created="Mon, 6 Jan 2014 15:15:36 +0000"  >&lt;p&gt;Would it be possible to install and run an instrumented version of the Lustre client modules, at least on one of the impacted login nodes, that will allow us to get a crash-dump (instead of a lustre-debug log dump!) upon eviction?&lt;/p&gt;</comment>
                            <comment id="74736" author="bfaccini" created="Fri, 10 Jan 2014 13:54:27 +0000"  >&lt;p&gt;I have just created a set of Client RPMs for the same Kernel 2.6.32-358.6.2.el6.x86_64 and Lustre 2.4.0-RC2 versions used on-site, allowing an LBUG() (in ptlrpc_invalidate_import_thread()) to occur instead of a debug-log dump (if obd_dump_on_eviction is set to exactly -1) upon eviction. This should allow a crash-dump to be taken (if panic_on_lbug is also != 0) upon eviction.&lt;br/&gt;
I am currently testing its functionality in-house and will provide an update soon.&lt;/p&gt;

&lt;p&gt;On the other hand, I am also trying to set up a similar Lustre-Server/Lustre-Client version (2.2.0-RC2--PRISTINE / 2.4.0-RC2--CHANGED) platform to try to reproduce the issue with the program from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="75264" author="bfaccini" created="Mon, 20 Jan 2014 11:20:02 +0000"  >&lt;p&gt;Testing of my RPMs, including the LBUG()-upon-eviction patch/instrumentation, has been successful. The instrumented RPMs have been pushed to the customer upload area for installation, in order to allow a crash-dump to be taken during an eviction.&lt;/p&gt;

&lt;p&gt;Customer will also try to run the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt; reproducer on its platform.&lt;/p&gt;</comment>
                            <comment id="75693" author="aneeser" created="Mon, 27 Jan 2014 18:10:39 +0000"  >&lt;p&gt;We have the RPMs in place on half of the login nodes, have checked our kdump config with sysrq-trigger, and are waiting for the first crash-dump during an eviction.&lt;/p&gt;

&lt;p&gt;We ran the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4112&quot; title=&quot;Random eviction of clients on lock callback timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4112&quot;&gt;&lt;del&gt;LU-4112&lt;/del&gt;&lt;/a&gt; reproducers aggressively, but no eviction nor anything else happened.&lt;/p&gt;

&lt;p&gt;Cheers&lt;br/&gt;
Allen&lt;/p&gt;</comment>
                            <comment id="75748" author="bfaccini" created="Tue, 28 Jan 2014 09:36:31 +0000"  >&lt;p&gt;Allen, thanks for helping so much.&lt;/p&gt;</comment>
                            <comment id="75838" author="eric.mueller@id.ethz.ch" created="Wed, 29 Jan 2014 06:59:21 +0000"  >&lt;p&gt;crash dump text file uploaded to jira and vmcore dump provided to Gabriele Paciucci.&lt;/p&gt;</comment>
                            <comment id="75920" author="eric.mueller@id.ethz.ch" created="Thu, 30 Jan 2014 06:18:05 +0000"  >&lt;p&gt;Another dump file for reference. &lt;br/&gt;
Full coredump uploaded to Gabriele.&lt;/p&gt;</comment>
                            <comment id="78071" author="bfaccini" created="Fri, 28 Feb 2014 01:03:11 +0000"  >&lt;p&gt;Sorry for the delay in completing the analysis of the crash-dumps from the instrumented RPMs.&lt;br/&gt;
Now, after spending some time on it, I think the problem you face is the same as the one already tracked as part of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
To confirm this, could you disable ELC (with &quot;echo 0 &amp;gt; /proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel&quot;) on one (or all?) of the login nodes that trigger the evictions?&lt;/p&gt;</comment>
                            <comment id="78081" author="eric.mueller@id.ethz.ch" created="Fri, 28 Feb 2014 07:42:56 +0000"  >&lt;p&gt;Thanks for the analysis of the crash dump, Bruno.&lt;br/&gt;
I have disabled ELC on all the involved login nodes now. Let&apos;s see what happens.&lt;/p&gt;</comment>
                            <comment id="78693" author="bfaccini" created="Fri, 7 Mar 2014 13:44:09 +0000"  >&lt;p&gt;Hello Eric, can you give some feedback on whether running with ELC disabled has helped?&lt;/p&gt;</comment>
                            <comment id="78845" author="eric.mueller@id.ethz.ch" created="Mon, 10 Mar 2014 08:32:24 +0000"  >&lt;p&gt;I cannot tell if it has been successful, as we had had no eviction for at least a fortnight prior to setting ELC=0.&lt;br/&gt;
I have tried to stress the login nodes more than usual, but I could not force an eviction. But this is not unusual,&lt;br/&gt;
as we never managed to trigger the eviction with a particular usage pattern&#8230;&lt;/p&gt;

&lt;p&gt;I will of course update as soon as we have a new eviction, although I hope that it never occurs again (best case).&lt;/p&gt;</comment>
                            <comment id="80661" author="bfaccini" created="Mon, 31 Mar 2014 21:41:22 +0000"  >&lt;p&gt;Eric, can you give us an update? Thanks in advance.&lt;/p&gt;</comment>
                            <comment id="81250" author="eric.mueller@id.ethz.ch" created="Wed, 9 Apr 2014 06:20:07 +0000"  >&lt;p&gt;We had no eviction on the 4 nodes until 02.04.2014. At that time, we rebooted one of these nodes, the one with the longest uptime of all (53 days), and left the default early_lock_cancel (1) configured. This node ran for 6 days until 08.04.2014 and then was evicted, first by a single OSS and rapidly by others. I have seen users doing many rsync operations on Lustre at that time (apart from other usual load). &lt;br/&gt;
Lustre logs showed the following:&lt;/p&gt;

&lt;p&gt;00010000:00010000:45.0:1396944724.065662:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881853511200/0x5d1861be72b7c5eb lrc: 0/0,0 mode: -&lt;del&gt;/PR res: 138708343/0 rrc: 1 type: EXT &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;18446744073709551615&amp;#93;&lt;/span&gt; (req 0&lt;/del&gt;&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5df055 expref: -99 pid: 43945 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065672:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881d59ae9a00/0x5d1861be72b7b7f2 lrc: 0/0,0 mode: --/PR res: 138708334/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5def52 expref: -99 pid: 43945 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065678:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8818726e8600/0x5d1861be72b7a728 lrc: 0/0,0 mode: --/PR res: 138708324/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5dee41 expref: -99 pid: 43945 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065684:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881a88e77400/0x5d1861be72cf6d93 lrc: 0/0,0 mode: --/PR res: 138711793/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbaff expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065703:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881e9e464c00/0x5d1861be72cfd56d lrc: 0/0,0 mode: --/PR res: 138711859/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fc1b2 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065709:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881c15923e00/0x5d1861be72cf1691 lrc: 0/0,0 mode: --/PR res: 138711753/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb5cd expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065716:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881cab0b1400/0x5d1861be72b78afe lrc: 0/0,0 mode: --/PR res: 138708312/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5debd2 expref: -99 pid: 43945 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065722:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff880b360b0c00/0x5d1861be72cfc8b6 lrc: 0/0,0 mode: --/PR res: 138711851/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fc0cb expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065728:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8815907c4200/0x5d1861be72cfac3f lrc: 0/0,0 mode: --/PR res: 138711835/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbee8 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065734:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8815399ed400/0x5d1861be72cf7232 lrc: 0/0,0 mode: --/PR res: 138711798/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbb53 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065749:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881915410400/0x5d1861be72cf90cb lrc: 0/0,0 mode: --/PR res: 138711819/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbd52 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065759:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff88146b2d0c00/0x5d1861be72cf1ef6 lrc: 0/0,0 mode: --/PR res: 138711760/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb63d expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065766:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8813f3483600/0x5d1861be72b78333 lrc: 0/0,0 mode: --/PR res: 138708303/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5deac8 expref: -99 pid: 43945 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065772:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff880725999600/0x5d1861be72cfb752 lrc: 0/0,0 mode: --/PR res: 138711843/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbfb3 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065779:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8819f2a30200/0x5d1861be72cf3084 lrc: 0/0,0 mode: --/PR res: 138711769/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb708 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065785:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8814e99b5600/0x5d1861be72cf69bf lrc: 0/0,0 mode: --/PR res: 138711791/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fba9d expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065791:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8804d6b30200/0x5d1861be72d291cf lrc: 0/0,0 mode: --/PW res: 138711739/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869480000000 nid: local remote: 0x35306a32ed604737 expref: -99 pid: 5386 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065801:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881902ad2200/0x5d1861be72cf58f5 lrc: 0/0,0 mode: --/PR res: 138711785/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb99a expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065808:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff88201e820600/0x5d1861be72cf4449 lrc: 0/0,0 mode: --/PR res: 138711776/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb85f expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;br/&gt;
00010000:00010000:45.0:1396944724.065818:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881108e46000/0x5d1861be72cf6e34 lrc: 0/0,0 mode: --/PW res: 138711795/0 rrc: 1 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;4095) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbb06 expref: -99 pid: 44920 timeout: 0 lvb_type: 1&lt;/p&gt;

&lt;p&gt;We have now set ELC on all the nodes again. &lt;br/&gt;
So there is a reasonable chance that ELC will resolve the evictions (hopefully!).&lt;/p&gt;</comment>
                            <comment id="81593" author="bfaccini" created="Tue, 15 Apr 2014 09:35:04 +0000"  >&lt;p&gt;Hello Eric,&lt;br/&gt;
Can you give us some feedback on how ELC has worked? I hope we can confirm that this ticket is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt; and close it accordingly.&lt;/p&gt;</comment>
                            <comment id="82263" author="eric.mueller@id.ethz.ch" created="Wed, 23 Apr 2014 12:42:29 +0000"  >&lt;p&gt;Hello Bruno&lt;br/&gt;
All nodes have been up for at least 45 days now without any new evictions. I think we can say that ELC works, and you may close this ticket.&lt;br/&gt;
Thanks for the collaboration!&lt;/p&gt;</comment>
                            <comment id="82595" author="pjones" created="Mon, 28 Apr 2014 13:41:21 +0000"  >&lt;p&gt;Thanks Matteo!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="13913" name="brutus2_eviction.txt" size="3164" author="bfaccini" created="Thu, 12 Dec 2013 10:39:19 +0000"/>
                            <attachment id="13597" name="logs_timeout_200s.tar.bz2" size="361545" author="matteo.piccinini" created="Tue, 8 Oct 2013 13:52:37 +0000"/>
                            <attachment id="13464" name="lustre_and_login_nodes_logs.tar.bz2aa" size="275" author="matteo.piccinini" created="Fri, 13 Sep 2013 16:48:11 +0000"/>
                            <attachment id="13465" name="lustre_and_login_nodes_logs.tar.bz2ab" size="275" author="matteo.piccinini" created="Fri, 13 Sep 2013 16:48:11 +0000"/>
                            <attachment id="13914" name="messages_brutus2.txt" size="105485" author="bfaccini" created="Thu, 12 Dec 2013 10:39:19 +0000"/>
                            <attachment id="13926" name="requested_outputs.tar.bz2" size="1268" author="bfaccini" created="Tue, 17 Dec 2013 08:28:26 +0000"/>
                            <attachment id="14038" name="vmcore-dmesg.txt" size="141056" author="eric.mueller@id.ethz.ch" created="Thu, 30 Jan 2014 06:18:05 +0000"/>
                            <attachment id="14030" name="vmcore-dmesg.txt" size="136930" author="eric.mueller@id.ethz.ch" created="Wed, 29 Jan 2014 06:59:21 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw2db:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10466</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>