<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:43:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4570] Metadata slowdowns on production filesystem at ORNL</title>
                <link>https://jira.whamcloud.com/browse/LU-4570</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Clients:&lt;br/&gt;
 Some 1.8.6, 1.8.9 clients&lt;br/&gt;
 Majority 2.4.X clients&lt;/p&gt;

&lt;p&gt;Server (noted above)&lt;/p&gt;

&lt;p&gt;Users are seeing metadata operation times of several minutes (interactive performance is usually the canary in the coal mine). We are unable to enable DNE due to 1.8 client interactions and the need to have the data available. &lt;/p&gt;

&lt;p&gt;We run periodic lctl dk with +rpctrace and I can provide information. &lt;/p&gt;

&lt;p&gt;To date we&apos;ve found one application that was issuing ~30k LDLM_ENQUEUE RPC&apos;s in a 30-45s range (lctl dk overflowed, so only got 45/60s of trace data). We&apos;ve put that application on hold, but are still seeing some slowdowns. &lt;/p&gt;

&lt;p&gt;We have some &quot;ls timers&quot; scattered around compute clusters and have been seeing some responses in the 60-80s ranges for an ls of a pre-determined directory. &lt;/p&gt;

&lt;p&gt;Questions:&lt;/p&gt;

&lt;p&gt;1. My current guess is that we are not getting any of the metadata improvements in 2.4. Can you confirm?&lt;/p&gt;

&lt;p&gt;2. We are running with options mdt mds_num_threads=1024. We are thinking of changing that to 2048. Recommendations?&lt;/p&gt;

&lt;p&gt;3. What further data would be of interest in tracking this down? I have the raw rpctrace data. Would it be useful to add +nettrace or +dlmtrace and send that along? (I think +dlmtrace, but others may be good too)&lt;/p&gt;

&lt;p&gt;4. If we were to get rid of all the 1.8 clients would the 2.4 metadata improvements be realized without a reboot of the servers? We are waiting on Cray to provide a software stack for some of our login servers that supports 2.4 clients, and are highly blocked on this. &lt;/p&gt;

&lt;p&gt;5. If we cannot get rid of the 1.8 clients, are there any other suggestions for increasing the metadata throughput? We are not bottlenecked on host IO to the MDT device (less than 6MB/s as recorded by iostat); we have had some past memory pressure on the MDS (we recently reduced ldlm.namespaces.*.lru_size on cluster (non-Cray) clients from the default of unlimited to 128 on compute nodes and 2000 on login nodes), but in general the MDS host seems responsive and &quot;normal&quot;. &lt;/p&gt;
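For context on the lru_size change in (5): on clients of this vintage the cap is set through the per-namespace lru_size files. A minimal sketch; the proc path and echo-based interface are assumptions based on the 1.8/2.x layout, and the 128/2000 values are the ones quoted above:

```shell
# Sketch: cap cached LDLM locks on a client node (assumed 1.8/2.x proc layout).
# Per the ticket: 128 on compute nodes, 2000 on login nodes.
for ns in /proc/fs/lustre/ldlm/namespaces/*; do
    echo 128 > "$ns/lru_size"    # use 2000 on login nodes
done
```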

&lt;p&gt;Happy to provide more information, just ask and we&apos;ll get it to you. &lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</description>
                <environment>RHEL 6.4, 2.6.32-358.18.1.el6.x86_64 kernel</environment>
        <key id="22966">LU-4570</key>
            <summary>Metadata slowdowns on production filesystem at ORNL</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="hilljjornl">Jason Hill</reporter>
                        <labels>
                    </labels>
                <created>Sat, 1 Feb 2014 03:15:50 +0000</created>
                <updated>Wed, 9 Jul 2014 22:25:24 +0000</updated>
                            <resolved>Wed, 9 Jul 2014 22:25:24 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>25</watches>
                                                                            <comments>
                            <comment id="76028" author="pjones" created="Sat, 1 Feb 2014 04:10:48 +0000"  >&lt;p&gt;Oleg&lt;/p&gt;

&lt;p&gt;Please can you advise how to proceed here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="76029" author="green" created="Sat, 1 Feb 2014 04:27:34 +0000"  >&lt;p&gt;I guess the most important question here would be - when you experience these issues, are all MDT threads busy (and most likely blocked on something at the same time)?&lt;br/&gt;
Like the big job that is now disabled, for example - did it happen to open a single file (or a small number of files) from a lot of clients with O_CREAT? &lt;/p&gt;</comment>
                            <comment id="76030" author="hilljjornl" created="Sat, 1 Feb 2014 04:38:05 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;I&apos;ve not been on the system during one of these events. I can work on getting another job to run tomorrow sometime, or maybe Monday. Currently on the MDS there are no processes (mdt*) in any state other than sleeping or running. I do see a high number of kiblnd processes (high both in count and CPU utilization; that&apos;s from memory, nothing concrete). We&apos;ve put the offending user/code into scheduler jail, so they can&apos;t run, and we&apos;re seeing if the pitchforks and torches go away for the weekend. We started seeing issues on Wednesday, and our traditional tools identified a particular code that was issuing ~50k LDLM_ENQUEUE&apos;s every 30 seconds. The app developer changed some IO and dropped it to ~25K LDLM_ENQUEUE&apos;s every 30s, but we were still getting complaints. Not much interactive use at midnight on a Friday, so I don&apos;t have a good measuring stick just yet. Working on automating some more of that. &lt;/p&gt;

&lt;p&gt;We don&apos;t know much about what the application is doing, just what we see from our homegrown tools. Will try to find out more about the app over the weekend.&lt;/p&gt;

&lt;p&gt;Will keep you posted. Thanks to you and Peter for commenting at this hour on a Friday. &lt;/p&gt;</comment>
                            <comment id="76031" author="green" created="Sat, 1 Feb 2014 05:13:33 +0000"  >&lt;p&gt;So those 25K enqueues - what are they for? Opens? what mode, same file/different files?&lt;br/&gt;
Specific details here would help&lt;/p&gt;</comment>
                            <comment id="76032" author="hilljjornl" created="Sat, 1 Feb 2014 05:23:35 +0000"  >&lt;p&gt;Specifics. Here&apos;s the data I have from during a period when the application we identified was running:&lt;/p&gt;

&lt;p&gt;00000100:00100000:1.0:1391197263.389662:0:12671:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn02_007:69482ed8-e0da-f1ce-52fa-1f94fb17f79b+248664:16552:x1457850727013322:12345-10.36.207.211@o2ib:103&lt;br/&gt;
00000100:00100000:1.0:1391197263.389676:0:12671:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn02_007:69482ed8-e0da-f1ce-52fa-1f94fb17f79b+248663:16552:x1457850727013322:12345-10.36.207.211@o2ib:103 Request procesed in 17us (37us total) trans 0 rc 0/0&lt;br/&gt;
00010000:00080000:6.0:1391197263.391911:0:20306:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately&lt;br/&gt;
00010000:00080000:6.0:1391197263.394777:0:20306:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately&lt;br/&gt;
00000100:00100000:10.0:1391197263.399170:0:12670:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn01_006:7affdafd-903d-c4e0-829e-93e7deb30326+236769:19570:x1457850685779849:12345-10.36.207.212@o2ib:103&lt;br/&gt;
00000100:00100000:10.0:1391197263.399185:0:12670:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn01_006:7affdafd-903d-c4e0-829e-93e7deb30326+236768:19570:x1457850685779849:12345-10.36.207.212@o2ib:103 Request procesed in 18us (39us total) trans 0 rc 0/0&lt;br/&gt;
00010000:00080000:20.0:1391197263.399332:0:20469:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately&lt;br/&gt;
00010000:00080000:6.0:1391197263.401981:0:20306:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately&lt;br/&gt;
00000100:00100000:2.0:1391197263.405698:0:19999:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn00_000:f77d78c9-ff95-ba2a-857d-91ec128ac2b0+732343:10304:x1457850698872414:12345-10.36.207.210@o2ib:103&lt;br/&gt;
00000100:00100000:2.0:1391197263.405712:0:19999:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn00_000:f77d78c9-ff95-ba2a-857d-91ec128ac2b0+732342:10304:x1457850698872414:12345-10.36.207.210@o2ib:103 Request procesed in 18us (37us total) trans 0 rc 0/0&lt;br/&gt;
00010000:00080000:20.0:1391197263.406692:0:20469:0:(ldlm_lib.c:2515:target_send_reply()) Schedule reply immediately&lt;/p&gt;
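As an aside on reading this trace: the "Handled RPC" lines carry per-request service times, which can be pulled out with a small pipeline. A sketch against a synthetic sample line (note the kernel log itself spells "procesed", so the pattern matches that spelling):

```shell
# Sketch: extract per-RPC service times (us) from "Handled RPC" trace lines.
# The sample line below is synthetic; the log's own spelling is "procesed".
printf '%s\n' \
  'x:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC ... Request procesed in 17us (37us total) trans 0 rc 0/0' \
  > rpctrace-sample
sed -n 's/.*Request procesed in \([0-9]*\)us (\([0-9]*\)us total).*/\1 \2/p' rpctrace-sample
```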

&lt;p&gt;I don&apos;t have specific knowledge of mapping those sorts of requests back to operations (open, close, stat, unlink, etc.). Happy to get educated more on this. Again, the data was generated with +rpctrace (no dlmtrace), and I just grepped for ldlm from that output to get the snippet above. Snippet not guaranteed to be coherent or consistent. Uploading the entire rpctrace now. &lt;/p&gt;</comment>
                            <comment id="76037" author="green" created="Sat, 1 Feb 2014 06:02:58 +0000"  >&lt;p&gt;opc 103 is cancel, so a lot of cancel activity.&lt;/p&gt;

&lt;p&gt;There&apos;s no easy way to see file modes and operation details from this sort of trace, but since you isolated the app, you can just ask the app author what sort of things he does &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="76038" author="ezell" created="Sat, 1 Feb 2014 06:15:39 +0000"  >&lt;p&gt;$ grep &quot;Handling RPC&quot; rpctrace-mds1 | awk -F&apos;:&apos; &apos;{print $20}&apos; | sort -n | uniq -c | sort -nr&lt;br/&gt;
52301 101&lt;br/&gt;
8037 35&lt;br/&gt;
6469 36&lt;br/&gt;
1255 103&lt;br/&gt;
 245 37&lt;br/&gt;
 181 400&lt;br/&gt;
  25 49&lt;br/&gt;
  11 41&lt;br/&gt;
   1 44&lt;/p&gt;

&lt;p&gt;101 = LDLM_ENQUEUE&lt;br/&gt;
35 = MDS_CLOSE&lt;br/&gt;
36 = MDS_REINT&lt;br/&gt;
103 = LDLM_CANCEL&lt;br/&gt;
37 = MDS_READPAGE&lt;br/&gt;
400 = OBD_PING&lt;/p&gt;
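A variant of the counting pipeline above that labels the known opcodes inline might look like this (a sketch: it keys on the last colon-delimited field rather than the fixed $20, and the lookup table only covers the opcodes listed above). Shown against a synthetic two-opcode sample:

```shell
# Sketch: count "Handling RPC" opcodes and label the known ones.
# Synthetic sample in the colon-delimited rpctrace shape (opcode is the last field).
printf '%s\n' \
  'a:b:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:uuid+1:1:x1:12345-10.0.0.1@o2ib:101' \
  'a:b:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:uuid+1:1:x1:12345-10.0.0.1@o2ib:101' \
  'a:b:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:uuid+1:1:x1:12345-10.0.0.1@o2ib:103' \
  > rpctrace-sample
awk -F: '/Handling RPC/ { n[$NF]++ }
END {
  name[101]="LDLM_ENQUEUE"; name[103]="LDLM_CANCEL"; name[35]="MDS_CLOSE"
  name[36]="MDS_REINT";     name[37]="MDS_READPAGE"; name[400]="OBD_PING"
  for (op in n) printf "%7d %s %s\n", n[op], op, ((op in name) ? name[op] : "?")
}' rpctrace-sample | sort -nr
```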

&lt;p&gt;$ grep &quot;Handling RPC&quot; rpctrace-mds1 | awk -F&apos;:&apos; &apos;{print $19}&apos; | sort | uniq -c | sort -nr | head -10&lt;br/&gt;
5298 12345-10.36.205.208@o2ib&lt;br/&gt;
5241 12345-10.36.205.201@o2ib&lt;br/&gt;
3151 12345-10.36.205.207@o2ib&lt;br/&gt;
2618 12345-10.36.205.200@o2ib&lt;br/&gt;
2324 12345-10.36.207.213@o2ib&lt;br/&gt;
1583 12345-10.36.202.130@o2ib&lt;br/&gt;
1169 12345-10.36.207.211@o2ib&lt;br/&gt;
1104 12345-10.36.207.210@o2ib&lt;br/&gt;
 589 12345-10.36.207.212@o2ib&lt;br/&gt;
 411 12345-10.36.205.198@o2ib&lt;/p&gt;

&lt;p&gt;10.36.205.208@o2ib = titan-ext7&lt;br/&gt;
10.36.205.201@o2ib = titan-ext4&lt;br/&gt;
10.36.205.207@o2ib = titan-ext6&lt;br/&gt;
10.36.205.200@o2ib = titan-ext3&lt;br/&gt;
10.36.207.213@o2ib = lens38&lt;br/&gt;
10.36.202.130@o2ib = dtn03&lt;br/&gt;
10.36.207.211@o2ib = lens36&lt;br/&gt;
10.36.207.210@o2ib = lens35&lt;br/&gt;
10.36.207.212@o2ib = lens37&lt;br/&gt;
10.36.205.198@o2ib = titan-ext1&lt;/p&gt;

&lt;p&gt;Looks like this is coming mostly from interactive use on the external login nodes.&lt;/p&gt;</comment>
                            <comment id="76040" author="adilger" created="Sat, 1 Feb 2014 08:36:50 +0000"  >&lt;p&gt;Is this filesystem upgraded from 1.x or is it newly formatted with 2.x?  Please check what the output from the following command is:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mds# dumpe2fs -h /dev/&amp;lt;mdt-device&amp;gt; | grep feature
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The reason I ask is because the dir_data feature (which should be in the above list) is not automatically enabled on upgraded filesystems (it would cause problems if the filesystem needed to downgrade) but improves 2.x performance significantly.  Note that this affects files created with 2.x most significantly, and particularly &quot;ls -l&quot; and other readdir() workloads. &lt;/p&gt;

&lt;p&gt;Please see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1248&quot; title=&quot;all mdt_rdpg_* threads busy in osd_ea_fid_get()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1248&quot;&gt;&lt;del&gt;LU-1248&lt;/del&gt;&lt;/a&gt; for more details. Not sure if this is the case, but it was my first thought when I read this bug. &lt;/p&gt;</comment>
                            <comment id="76046" author="ezell" created="Sat, 1 Feb 2014 23:27:31 +0000"  >&lt;p&gt;It was formatted under 2.4.&lt;/p&gt;

&lt;p&gt;We have some clients that are acting up. I tried to return all unused locks (by echo&apos;ing clear into lru_size)&lt;/p&gt;
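For posterity, the "echo'ing clear" step spelled out - a minimal sketch, assuming the /proc layout used by 1.8/2.x clients:

```shell
# Sketch: return all unused DLM locks from a client by clearing each namespace's LRU.
# Assumes the 1.8/2.x /proc layout; run on the client node.
for ns in /proc/fs/lustre/ldlm/namespaces/*; do
    echo clear > "$ns/lru_size"
done
```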

&lt;p&gt;I see the following messages:&lt;br/&gt;
2014-02-01T18:21:06.280508-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:324:ptlrpc_invalidate_import()) atlas2-MDT0000_UUID: rc = -110 waiting for callback (4 != 0)&lt;br/&gt;
2014-02-01T18:21:06.310918-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880306cacc00 x1458495644509504/t0(0) o101-&amp;gt;atlas2-MDT0000-mdc-ffff88033e1f1800@10.36.226.74@o2ib200:12/10 lens 592/25176 e 0 to 0 dl 1391272868 ref 1 fl Rpc:RE/0/ffffffff rc -5/-1&lt;br/&gt;
2014-02-01T18:21:06.310968-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:350:ptlrpc_invalidate_import()) Skipped 3 previous similar messages&lt;br/&gt;
2014-02-01T18:21:06.310984-05:00 c24-0c0s6n0 LustreError: 11222:0:(import.c:366:ptlrpc_invalidate_import()) atlas2-MDT0000_UUID: RPCs in &quot;Unregistering&quot; phase found (0). Network is sluggish? Waiting them to error out.&lt;/p&gt;</comment>
                            <comment id="76049" author="hilljjornl" created="Sun, 2 Feb 2014 03:32:03 +0000"  >&lt;p&gt;Answering Andreas for completeness:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-mds1 ~&amp;#93;&lt;/span&gt;# dumpe2fs -h /dev/mapper/atlas1-mdt1 | grep feature&lt;br/&gt;
dumpe2fs 1.42.7.wc1 (12-Apr-2013)&lt;br/&gt;
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg ea_inode dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota&lt;/p&gt;

&lt;p&gt;Tonight I have seen two messages in the logs that make me think there is something definitely wrong; the first on the MDS:&lt;/p&gt;


&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;962845.511495&amp;#93;&lt;/span&gt; LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 5269s: evicting client at 9307@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881a2c51fd80/0x62d1aa23e9a027df lrc: 3/0,0 mode: PR/PR res: &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20005c272:0x2:0x0&amp;#93;&lt;/span&gt;.0 bits 0x1b rrc: 2 type: IBT flags: 0x200000000020 nid: 9307@gni100 remote: 0xca436432b3a155cc expref: 32 pid: 20112 timeout: 5258943031 lvb_type: 0&lt;/p&gt;

&lt;p&gt;So it took us 5269 seconds to try to cancel a lock, and then we evicted the client (am I reading that correctly?)&lt;/p&gt;

&lt;p&gt;Next on an OSS:&lt;/p&gt;

&lt;p&gt;Feb  1 22:28:52 atlas-oss2g3 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1881119.843613&amp;#93;&lt;/span&gt; LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 10034s: evicting client at 9307@gni102  ns: filter-atlas1-OST02ba_UUID lock: ffff88064926b480/0x6acfe1d5cba7b850 lrc: 3/0,0 mode: PR/PR res: &lt;span class=&quot;error&quot;&gt;&amp;#91;0x111e85:0x0:0x0&amp;#93;&lt;/span&gt;.0 rrc: 2 type: EXT &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;18446744073709551615&amp;#93;&lt;/span&gt; (req 0-&amp;gt;18446744073709551615) flags: 0x10020 nid: 9307@gni102 remote: 0xca436432b3a15682 expref: 4 pid: 26076 timeout: 6179874642 lvb_type: 1&lt;/p&gt;

&lt;p&gt;Whoa. 10,034s. &lt;/p&gt;

&lt;p&gt;Any thoughts on where to go digging in greater detail to help guide us more?&lt;/p&gt;</comment>
                            <comment id="76050" author="hilljjornl" created="Sun, 2 Feb 2014 03:37:52 +0000"  >&lt;p&gt;Also of use from Matt&apos;s analysis last night (sorry, just reading this now), those top 10 clients issuing RPC&apos;s are all Lustre 1.8.X clients (titan-ext* is Cray 1.8.6, lens* and dtn03 are Intel 1.8.9 compiled @ ORNL). Thanks for the analysis Matt.&lt;/p&gt;</comment>
                            <comment id="76051" author="hilljjornl" created="Sun, 2 Feb 2014 05:01:49 +0000"  >&lt;p&gt;Just for posterity, it looks like Cray has not put &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-532&quot; title=&quot;2.1 does not filter extended attributes list based on permissions, showing entire list at all times&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-532&quot;&gt;&lt;del&gt;LU-532&lt;/del&gt;&lt;/a&gt; into their base 1.8.6 client as we see messages like:&lt;/p&gt;

&lt;p&gt;Feb  1 23:53:56 atlas-mds1 kernel: [ 4357.493520] Lustre: 14082:0:(mdt_xattr.c:363:mdt_reint_setxattr()) atlas1-MDT0000: client miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20005b506:0x19b79:0x0&amp;#93;&lt;/span&gt;] &lt;span class=&quot;error&quot;&gt;&amp;#91;valid 68719476736&amp;#93;&lt;/span&gt;&lt;br/&gt;
Feb  1 23:53:56 atlas-mds1 kernel: [ 4357.514700] Lustre: 14082:0:(mdt_xattr.c:363:mdt_reint_setxattr()) Skipped 204 previous similar messages&lt;/p&gt;

&lt;p&gt;A lot. From reading the bug it&apos;s a client issue, and my guess is that our non-Intel clients don&apos;t have the patch. &lt;/p&gt;</comment>
                            <comment id="76055" author="simmonsja" created="Sun, 2 Feb 2014 15:42:29 +0000"  >&lt;p&gt;A new ticket was opened for the port of the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-532&quot; title=&quot;2.1 does not filter extended attributes list based on permissions, showing entire list at all times&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-532&quot;&gt;&lt;del&gt;LU-532&lt;/del&gt;&lt;/a&gt; patches to b1_8. That ticket is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1838&quot; title=&quot;1.8 client does not filter 2.x server extended attributes list based on permissions&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1838&quot;&gt;&lt;del&gt;LU-1838&lt;/del&gt;&lt;/a&gt; and those patches have landed to 1.8.9.&lt;/p&gt;</comment>
                            <comment id="76126" author="hilljjornl" created="Mon, 3 Feb 2014 20:49:24 +0000"  >&lt;p&gt;I wrote some bash to go look for the longest RPC from each event where we noted that it took longer than 1s to perform a metadata operation since 02-01-2014. In almost every case the operation that took the longest was a lock enqueue:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@lustre-mgmt1&amp;#93;&lt;/span&gt;# /tmp/mdstrace-parse.sh &amp;amp;&amp;gt; /tmp/rpclist.txt &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@lustre-mgmt1&amp;#93;&lt;/span&gt;# cat /tmp/rpclist.txt | awk -F\: &apos;{print $21}&apos; | awk &apos;{print $1}&apos; | sort -u&lt;br/&gt;
101&lt;br/&gt;
36&lt;/p&gt;
</comment>
                            <comment id="76138" author="green" created="Mon, 3 Feb 2014 21:46:19 +0000"  >&lt;p&gt;If you have lock timeouts in thousands of seconds, that&apos;s sure to block any competing lock enqueues for just as long.&lt;/p&gt;

&lt;p&gt;So I wonder if these events are correlated?&lt;/p&gt;</comment>
                            <comment id="76145" author="jamesanunez" created="Tue, 4 Feb 2014 00:08:41 +0000"  >&lt;p&gt;Jason, &lt;/p&gt;

&lt;p&gt;I&apos;ve spent the day researching and speaking with our engineers about this issue. We think the most fruitful approach is to start by understanding the workload that experienced the slowdown. Do you have any information on the application, the one that was issuing the 25-30K LDLM_ENQUEUE RPCs, and what it is doing in terms of metadata? For example, is it opening a single file or a small number of files from many processes, opening multiple files, what modes, etc.?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
James&lt;/p&gt;</comment>
                            <comment id="76152" author="hilljjornl" created="Tue, 4 Feb 2014 02:26:28 +0000"  >&lt;p&gt;James,&lt;/p&gt;

&lt;p&gt;The code that was mentioned on Friday has been held since we discovered that the change the developer made did not result in lower impact. We are continuing to see long response times for metadata operations in the absence of that code &amp;#8211; so I think we have something more systemic going on. &lt;/p&gt;

&lt;p&gt;Cray has an open bug (I mentioned this in the email to you and Peter) for their client (in Cray&apos;s bugzilla), and also thinks that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4509&quot; title=&quot;clio can be stuck in osc_extent_wait&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4509&quot;&gt;&lt;del&gt;LU-4509&lt;/del&gt;&lt;/a&gt; might be something to look at for their client. We are actively working to get the 1.8 clients off the filesystem but that will take some time still. &lt;/p&gt;

&lt;p&gt;-Jason&lt;/p&gt;</comment>
                            <comment id="76154" author="hilljjornl" created="Tue, 4 Feb 2014 02:59:18 +0000"  >&lt;p&gt;On the advice of Oleg I&apos;m going to spend some time tonight trying to correlate the long response times to lock timeout and eviction notices on the clients. This will take some time as I&apos;ll need to basically create a coherent stream of log data from disparate sources and then try to follow messages. Hoping to have something by morning.&lt;/p&gt;</comment>
                            <comment id="76159" author="ezell" created="Tue, 4 Feb 2014 04:00:30 +0000"  >&lt;p&gt;After reading &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4579&quot; title=&quot;Timeout system horribly broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4579&quot;&gt;&lt;del&gt;LU-4579&lt;/del&gt;&lt;/a&gt;, I think our issues may be similar.&lt;/p&gt;

&lt;p&gt;Jason, I look forward to seeing your analysis - I think it will be really helpful.&lt;/p&gt;

&lt;p&gt;We have lots of threads being hung:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;$ grep -c &lt;span class=&quot;code-quote&quot;&gt;&quot;was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt;&quot;&lt;/span&gt; kern.log 
47
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Feb  3 22:40:17 atlas-mds1 kernel: [172488.242493] LNet: Service thread pid 15401 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
Feb  3 22:40:17 atlas-mds1 kernel: [172488.261927] LNet: Skipped 4 previous similar messages
Feb  3 22:40:17 atlas-mds1 kernel: [172488.267771] Pid: 15401, comm: mdt03_237
Feb  3 22:40:17 atlas-mds1 kernel: [172488.272252] 
Feb  3 22:40:17 atlas-mds1 kernel: [172488.272253] Call Trace:
Feb  3 22:40:17 atlas-mds1 kernel: [172488.277084]  [&amp;lt;ffffffffa06948b5&amp;gt;] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.281355] Lustre: lock timed out (enqueued at 1391484017, 1200s ago)
Feb  3 22:40:17 atlas-mds1 kernel: [172488.281357] Lustre: Skipped 5 previous similar messages
Feb  3 22:40:17 atlas-mds1 kernel: [172488.298432]  [&amp;lt;ffffffff8150f120&amp;gt;] ? _cond_resched+0x30/0x40
Feb  3 22:40:17 atlas-mds1 kernel: [172488.304870]  [&amp;lt;ffffffffa03b265e&amp;gt;] ? cfs_cond_resched+0xe/0x10 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.312380]  [&amp;lt;ffffffffa03c85bf&amp;gt;] ? cfs_hash_for_each_relax+0x17f/0x360 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.320973]  [&amp;lt;ffffffffa069d850&amp;gt;] ? ldlm_res_hash_dump+0x0/0x60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.328685]  [&amp;lt;ffffffffa069d850&amp;gt;] ? ldlm_res_hash_dump+0x0/0x60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.336381]  [&amp;lt;ffffffffa03c9f6f&amp;gt;] ? cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.344975]  [&amp;lt;ffffffffa069da6c&amp;gt;] ? ldlm_namespace_dump+0xec/0x1c0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.352978]  [&amp;lt;ffffffffa06b4a3f&amp;gt;] ? ldlm_expired_completion_wait+0x2af/0x390 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.362048]  [&amp;lt;ffffffffa06b9038&amp;gt;] ? ldlm_completion_ast+0x508/0x960 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.370144]  [&amp;lt;ffffffffa06b4790&amp;gt;] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.378995]  [&amp;lt;ffffffff81063600&amp;gt;] ? default_wake_function+0x0/0x20
Feb  3 22:40:17 atlas-mds1 kernel: [172488.386125]  [&amp;lt;ffffffffa06b8758&amp;gt;] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.394614]  [&amp;lt;ffffffffa06b8b30&amp;gt;] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.402507]  [&amp;lt;ffffffffa0d11a90&amp;gt;] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.409813]  [&amp;lt;ffffffffa0d17c7b&amp;gt;] ? mdt_object_lock0+0x33b/0xaf0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.417312]  [&amp;lt;ffffffffa0d11a90&amp;gt;] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.424640]  [&amp;lt;ffffffffa06b8b30&amp;gt;] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.432533]  [&amp;lt;ffffffffa0d184f4&amp;gt;] ? mdt_object_lock+0x14/0x20 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.439741]  [&amp;lt;ffffffffa0d275a9&amp;gt;] ? mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.447861]  [&amp;lt;ffffffffa06e1135&amp;gt;] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.455290]  [&amp;lt;ffffffffa0709646&amp;gt;] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.463197]  [&amp;lt;ffffffffa06e33c4&amp;gt;] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.471184]  [&amp;lt;ffffffffa0d283ad&amp;gt;] ? mdt_intent_getattr+0x29d/0x490 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.478879]  [&amp;lt;ffffffffa0d14f1e&amp;gt;] ? mdt_intent_policy+0x39e/0x720 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.486491]  [&amp;lt;ffffffffa0699831&amp;gt;] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.494398]  [&amp;lt;ffffffffa06c01ef&amp;gt;] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.502771]  [&amp;lt;ffffffffa0d153a6&amp;gt;] ? mdt_enqueue+0x46/0xe0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.509582]  [&amp;lt;ffffffffa0d1ba97&amp;gt;] ? mdt_handle_common+0x647/0x16d0 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.517282]  [&amp;lt;ffffffffa0d553f5&amp;gt;] ? mds_regular_handle+0x15/0x20 [mdt]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.524805]  [&amp;lt;ffffffffa06f23c8&amp;gt;] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.533864]  [&amp;lt;ffffffffa03b25de&amp;gt;] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.541076]  [&amp;lt;ffffffffa03c3d9f&amp;gt;] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.548893]  [&amp;lt;ffffffffa06e9729&amp;gt;] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.556704]  [&amp;lt;ffffffffa06f375e&amp;gt;] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.564121]  [&amp;lt;ffffffffa06f2c90&amp;gt;] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.571324]  [&amp;lt;ffffffff8100c0ca&amp;gt;] ? child_rip+0xa/0x20
Feb  3 22:40:17 atlas-mds1 kernel: [172488.577289]  [&amp;lt;ffffffffa06f2c90&amp;gt;] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.584510]  [&amp;lt;ffffffffa06f2c90&amp;gt;] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb  3 22:40:17 atlas-mds1 kernel: [172488.591709]  [&amp;lt;ffffffff8100c0c0&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We are getting evictions due to lock callback timers expiring&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;$ grep -c &lt;span class=&quot;code-quote&quot;&gt;&quot;lock callback timer expired&quot;&lt;/span&gt; kern.log 
52
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I understand these big numbers below may be bogus. Some of these are showing over 7 hours.  Does it even make sense for a client to keep a metadata lock that long?&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Feb  3 22:09:31 atlas-mds1 kernel: [170644.624185] LustreError: 13226:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 25769s: evicting client at 108@gni3  ns: mdt-atlas1-MDT0000_UUID lock: ffff883f0ca506c0/0xce18f06f09f073de lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 409 type: IBT flags: 0x200400000020 nid: 108@gni3 remote: 0x115740db98f79fb1 expref: 117 pid: 14806 timeout: 4465563044 lvb_type: 0
Feb  3 22:20:01 atlas-mds1 kernel: [171273.688384] LustreError: 14898:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 26399s: evicting client at 181@gni3  ns: mdt-atlas1-MDT0000_UUID lock: ffff883ae3910480/0xce18f06f09f0741d lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 404 type: IBT flags: 0x200400000020 nid: 181@gni3 remote: 0xddf404e7006e98bb expref: 117 pid: 15497 timeout: 4466193960 lvb_type: 0
Feb  3 22:30:51 atlas-mds1 kernel: [171922.722297] LustreError: 15176:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 27049s: evicting client at 179@gni3  ns: mdt-atlas1-MDT0000_UUID lock: ffff88356b380900/0xce18f06f09f07582 lrc: 3/0,0 mode: PR/PR res: [0x200044aee:0x52d6:0x0].0 bits 0x13 rrc: 401 type: IBT flags: 0x200400000020 nid: 179@gni3 remote: 0x4c53c41d524a00b6 expref: 117 pid: 14857 timeout: 4466843371 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
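For reference, the reported wait and the evicted client NID can be pulled out of lines like the ones above with a short awk pass. A sketch, where the here-doc stands in for kern.log and the sample lines are abbreviated copies of the eviction messages above:

```shell
# Extract the reported wait (in seconds) and the evicted client NID from
# "lock callback timer expired" lines; the here-doc stands in for kern.log.
out=$(awk '/lock callback timer expired/ {
    for (i = 2; i < NF; i++) {
        if ($i == "after") { w = $(i+1); sub(/s:.*/, "", w) }   # "after 25769s:"
        if ($(i-1) == "client" && $i == "at") nid = $(i+1)      # "client at <nid>"
    }
    print w, nid
}' <<'EOF'
Feb  3 22:09:31 atlas-mds1 kernel: LustreError: ### lock callback timer expired after 25769s: evicting client at 108@gni3  ns: mdt-atlas1-MDT0000_UUID
Feb  3 22:20:01 atlas-mds1 kernel: LustreError: ### lock callback timer expired after 26399s: evicting client at 181@gni3  ns: mdt-atlas1-MDT0000_UUID
EOF
)
echo "$out"    # one "<seconds> <nid>" pair per eviction line
```

Against a real kern.log this makes it easy to spot which NIDs account for the 52 expirations counted above.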

&lt;p&gt;Many of these are from gni3, which is our XC30 that uses Lustre 2.3 clients.&lt;/p&gt;</comment>
                            <comment id="76160" author="green" created="Tue, 4 Feb 2014 04:35:49 +0000"  >&lt;p&gt;Clients hold locks as long as they wish if there is no contention, since it&apos;s essentially free, but saves an RPC if it&apos;s actually needed again. Cache is tied to locks.&lt;br/&gt;
I believe the long timeouts printed are bogus - it&apos;s time since successful enqueue, as I discovered. We probably want to fix it to have a better idea of how long the waiting actually was.&lt;/p&gt;

&lt;p&gt;The thread sitting there idle for 1200 seconds IS real time. The stack trace does indicate it was waiting for some lock to be canceled. What&apos;s your at_max?&lt;/p&gt;</comment>
                            <comment id="76161" author="ezell" created="Tue, 4 Feb 2014 04:52:18 +0000"  >&lt;p&gt;All of our clients should be setting at_max to 600 seconds.  This also appears to be the default, so it&apos;s probably safe to say that&apos;s a global setting.&lt;/p&gt;</comment>
                            <comment id="76162" author="ezell" created="Tue, 4 Feb 2014 05:06:50 +0000"  >&lt;p&gt;&amp;gt; Clients hold locks as long as they wish if there is no contention, since it&apos;s essentially free, but saves an RPC if it&apos;s actually needed again.&lt;/p&gt;

&lt;p&gt;Sure, if the same node needs it again.  But if a &lt;b&gt;different&lt;/b&gt; node needs the lock, it has to wait on the blocking ast.  Our Cray clients have lru_max_age set to 9000000 (104 days) which is better than the default value 36000000 (416 days!).  We might want to look into reducing this, but really they will be returned earlier because of lru_size.  We set the MDT lru_size to 2000 to (try to) minimize memory usage on the MDS.&lt;/p&gt;</comment>
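The day figures quoted above follow from straight division, treating lru_max_age as a count of seconds (the assumption here); a quick sanity check:

```shell
# lru_max_age values quoted above, treated as seconds (the assumption here)
cray_lru_max_age=9000000       # setting on the Cray clients
default_lru_max_age=36000000   # default value being compared against
echo "$((cray_lru_max_age / 86400)) days"      # ~104 days
echo "$((default_lru_max_age / 86400)) days"   # ~416 days
```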
                            <comment id="76163" author="green" created="Tue, 4 Feb 2014 05:11:20 +0000"  >&lt;p&gt;I see. Enqueue timeout is defined as ldlm_get_enq_timeout(lock) * 2. ldlm_get_enq_timeout would likely return at_max in your case, hence 1200 seconds.&lt;br/&gt;
It&apos;s probable that the lock it was trying to get conflicted on some other lock held by some other server thread too and you might have gotten a whole lot of traces like that in the end.&lt;br/&gt;
Now, we should have evicted a client much sooner, at around 600 seconds after we sent the blocking callback to the client (you can probably see an eviction about -600 seconds from this dump, I&apos;d hazard to guess), but if there were several locks like that we&apos;d exceed the 1200 seconds, like what happened here, I guess.&lt;/p&gt;
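That arithmetic, spelled out (a sketch assuming ldlm_get_enq_timeout() comes back clamped at at_max, as suggested above):

```shell
at_max=600                        # the at_max reported for these clients
enq_timeout=$at_max               # assume ldlm_get_enq_timeout() returns at_max here
wait_limit=$((enq_timeout * 2))   # enqueue timeout is ldlm_get_enq_timeout * 2
echo "${wait_limit}s"             # matches the 1200 seconds the thread sat waiting
```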

&lt;p&gt;Understanding why the clients don&apos;t send the cancels in (or they also might be just dropped somewhere along the way) might be the key here, I guess.&lt;br/&gt;
Stuff should not be stuck in ELC cancels either since then we&apos;d have ASTs return with EINVAL and server would just drop such locks effortlessly.&lt;/p&gt;</comment>
                            <comment id="76164" author="green" created="Tue, 4 Feb 2014 05:13:13 +0000"  >&lt;p&gt;If a different node needs the lock, that&apos;s ok; it needs to send an RPC anyway, and that&apos;s not fast either, so we might as well allow it to hold the lock as long as it needs.&lt;/p&gt;

&lt;p&gt;Currently the whole heuristic assumes the resources are not really contended. What we need to do better eventually is see which resources are actually highly contended and only release those locks (possibly as soon as we are done with them, too).&lt;/p&gt;</comment>
                            <comment id="76186" author="hilljjornl" created="Tue, 4 Feb 2014 17:13:32 +0000"  >&lt;p&gt;Tangential to the current analysis of timeouts, evictions, and long responses: the application we identified had ~2000 MPI ranks finishing at the same time and each opening a file (in the same directory, not pre-created files) for writing output. The mitigation we have put in place is to only allow 100 ranks to write at a time, iterating until all ranks have written. There are other tactics to advise the user to take, but this is one we&apos;re grabbing for a short-term test.&lt;/p&gt;

&lt;p&gt;This by no means resolves the issues (as I stated above, the user has been in scheduler jail since Friday and we continued to see issues), but I did want to follow up, as some were interested in how the application was able to generate ~50K lock enqueues in 30s.&lt;/p&gt;</comment>
                            <comment id="76248" author="green" created="Wed, 5 Feb 2014 04:50:42 +0000"  >&lt;p&gt;Copying my comment from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4579&quot; title=&quot;Timeout system horribly broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4579&quot;&gt;&lt;del&gt;LU-4579&lt;/del&gt;&lt;/a&gt; here as well, since what you have here seems to be very related and it might make sense to run with the same patch:&lt;/p&gt;

&lt;p&gt;Ok, I uploaded &lt;a href=&quot;http://review.whamcloud.com/9127&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9127&lt;/a&gt; that I tested lightly here.&lt;/p&gt;

&lt;p&gt;It fixes the lock eviction message to actually specify how long the wait was since sending the RPC.&lt;br/&gt;
Adds a few timeout checks just in case (I see some refresh calls from e.g. ofd may not use at_max).&lt;br/&gt;
Prints the number of busy lock prolongations (a bit hackish, only for ibit locks, i.e. the MDS).&lt;/p&gt;

&lt;p&gt;This only makes sense to install on servers, primarily the MDS. Though I tested it, and clients don&apos;t die either in case you install it there by mistake.&lt;/p&gt;

&lt;p&gt;I imagine it would be interesting to run with this and, once the first 1200s timeout hits, have a log of this timeout plus the preceding hour of messages.&lt;/p&gt;</comment>
                            <comment id="76250" author="ezell" created="Wed, 5 Feb 2014 05:54:04 +0000"  >&lt;p&gt;Thanks Oleg.&lt;/p&gt;

&lt;p&gt;Right now, our theory is that clients are getting stuck with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4300&quot; title=&quot;ptlrpcd threads deadlocked in cl_lock_mutex_get&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4300&quot;&gt;&lt;del&gt;LU-4300&lt;/del&gt;&lt;/a&gt;.  So clients are actually holding locks for an extended period of time and fail to respond to blocking ASTs.&lt;/p&gt;

&lt;p&gt;We have two possible workarounds that we are trying; one today and the second if the first doesn&apos;t help:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Enabling DLMTRACE to perturb timing. Even if it doesn&apos;t prevent the problem, at least we get some good logs to look at&lt;/li&gt;
	&lt;li&gt;Disable early_lock_cancel.  This was recommended by DDN, but I don&apos;t yet understand how it affects the situation&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We hope Alexey will have a client-side patch we can try in the next day or two.&lt;/p&gt;

&lt;p&gt;Jason Hill: I think it makes sense to watch for a couple days with the above workarounds before we consider taking a MDS patch.  Do you agree?&lt;/p&gt;</comment>
                            <comment id="76269" author="hilljjornl" created="Wed, 5 Feb 2014 15:56:34 +0000"  >&lt;p&gt;All,&lt;/p&gt;

&lt;p&gt;Based on the findings by Cray in their bugs for the client and Matt&apos;s information above, I highly recommend changing the state from Major to Minor. Thank you all for your detailed attention. We have talked about possibly enabling the debugging patches when Titan gets an updated client from Cray/Xyratex; we will keep you informed.&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</comment>
                            <comment id="76270" author="jamesanunez" created="Wed, 5 Feb 2014 16:00:01 +0000"  >&lt;p&gt;Jason, &lt;/p&gt;

&lt;p&gt;Thanks for the update. I&apos;ve changed the priority of this ticket at your request.&lt;/p&gt;

&lt;p&gt;James&lt;/p&gt;</comment>
                            <comment id="76275" author="green" created="Wed, 5 Feb 2014 16:47:21 +0000"  >&lt;p&gt;Matt, the 4300 might explain long lock cancel times for OSTs, but not for MDTs, so I&apos;d still want to better understand what goes on with MDT locks here.&lt;/p&gt;</comment>
                            <comment id="76534" author="hilljjornl" created="Sat, 8 Feb 2014 03:51:54 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;We are rolling titan&apos;s clients back to 1.8.6 from Cray on Monday. If we integrate your debugging patch and roll the client back, will you get the data you need to analyze?&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</comment>
                            <comment id="76536" author="green" created="Sat, 8 Feb 2014 04:06:25 +0000"  >&lt;p&gt;Yes.&lt;br/&gt;
The patch is server side.&lt;br/&gt;
As long as you still have the same problem, the data with the patch should be more useful than not having it.&lt;/p&gt;</comment>
                            <comment id="76723" author="ezell" created="Tue, 11 Feb 2014 15:07:53 +0000"  >&lt;p&gt;Debug patch was installed during maintenance yesterday.  Many hits of the new &quot;Long-waiting lock&quot; message.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Feb 11 05:34:45 atlas-mds1 kernel: [66331.830426] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 172 sec
Feb 11 05:44:05 atlas-mds1 kernel: [66892.365498] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 213 sec
Feb 11 05:54:12 atlas-mds1 kernel: [67499.945964] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 213 sec
Feb 11 06:16:29 atlas-mds1 kernel: [68838.222714] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 78 sec
Feb 11 06:24:04 atlas-mds1 kernel: [69293.656216] LustreError: 20107:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 224 sec
Feb 11 06:34:05 atlas-mds1 kernel: [69895.230302] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 224 sec
Feb 11 06:44:06 atlas-mds1 kernel: [70496.803999] LustreError: 0:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 224 sec
Feb 11 06:54:21 atlas-mds1 kernel: [71112.391365] LustreError: 18817:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 210 sec
Feb 11 07:04:28 atlas-mds1 kernel: [71719.971048] LustreError: 18852:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 204 sec
Feb 11 07:15:48 atlas-mds1 kernel: [72400.620205] LustreError: 19317:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 125 sec
Feb 11 07:24:43 atlas-mds1 kernel: [72936.132036] LustreError: 18544:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 191 sec
Feb 11 07:34:41 atlas-mds1 kernel: [73534.702299] LustreError: 20151:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 195 sec
Feb 11 07:44:15 atlas-mds1 kernel: [74109.251117] LustreError: 20446:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 222 sec
Feb 11 07:54:16 atlas-mds1 kernel: [74710.824274] LustreError: 18539:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 222 sec
Feb 11 08:04:39 atlas-mds1 kernel: [75334.420048] LustreError: 20352:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 219 sec
Feb 11 08:15:47 atlas-mds1 kernel: [76003.057160] LustreError: 19125:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 161 sec
Feb 11 08:26:26 atlas-mds1 kernel: [76642.667237] LustreError: 18540:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 123 sec
Feb 11 08:34:55 atlas-mds1 kernel: [77152.154357] LustreError: 19405:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 216 sec
Feb 11 08:45:50 atlas-mds1 kernel: [77807.778867] LustreError: 19036:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 162 sec
Feb 11 08:55:17 atlas-mds1 kernel: [78375.320158] LustreError: 25518:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 209 sec
Feb 11 09:05:03 atlas-mds1 kernel: [78961.879719] LustreError: 19181:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 224 sec
Feb 11 09:15:04 atlas-mds1 kernel: [79563.453708] LustreError: 19691:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 224 sec
Feb 11 09:25:12 atlas-mds1 kernel: [80172.034167] LustreError: 25536:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 217 sec
Feb 11 09:35:07 atlas-mds1 kernel: [80767.602473] LustreError: 12379:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 223 sec
Feb 11 09:45:07 atlas-mds1 kernel: [81368.175242] LustreError: 19396:0:(ldlm_lockd.c:318:waiting_locks_callback()) ### &lt;span class=&quot;code-object&quot;&gt;Long&lt;/span&gt;-waiting lock 224 sec
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;No hits of &quot;requested timeout %d, more than at_max&quot; or &quot;Adding a lock, but the front position is scheduled in %d seconds&quot;.&lt;/p&gt;

&lt;p&gt;Also seeing messages like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Feb 11 09:48:52 atlas-mds1 kernel: [81593.435989] LNet: Service thread pid 19926 completed after 16665.22s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Feb 11 09:56:36 atlas-mds1 kernel: [82058.639480] Lustre: atlas1-MDT0000: Client c7c41ec7-373b-1a4a-8f79-d741cf79e22a (at 12589@gni100) refused reconnection, still busy with 2 active RPCs
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="76745" author="green" created="Tue, 11 Feb 2014 17:13:18 +0000"  >&lt;p&gt;Hm, so only the extra waiting, huh. And the numbers are not all that far off.&lt;/p&gt;

&lt;p&gt;The slow service thread and the still-busy messages cannot be new ones; you should have gotten them in the past too, right?&lt;br/&gt;
Also, I imagine you still get lock timeouts on the MDT with corresponding client evictions; can you paste a few here, so that we can see what the actual timeout time is now?&lt;/p&gt;</comment>
                            <comment id="76752" author="ezell" created="Tue, 11 Feb 2014 17:52:18 +0000"  >&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Feb 11 10:18:55 atlas-mds1 kernel: [83398.111851] LustreError: 17613:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 18469s: evicting client at 14346@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881463276000/0xd030cd07cae2a16b lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1847 type: IBT flags: 0x200400000020 nid: 14346@gni100 remote: 0xc3ce954c318ff7ca expref: 37 pid: 18996 timeout: 4377983092 lvb_type: 0 used 0
Feb 11 10:28:56 atlas-mds1 kernel: [83999.685683] LustreError: 19235:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19070s: evicting client at 14349@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881f796cb000/0xd030cd07cae2a205 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1833 type: IBT flags: 0x200400000020 nid: 14349@gni100 remote: 0x4c3f68e1e5538822 expref: 36 pid: 19097 timeout: 4378584127 lvb_type: 0 used 0
Feb 11 10:38:57 atlas-mds1 kernel: [84601.260349] LustreError: 17617:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19671s: evicting client at 16378@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff88155ad90480/0xd030cd07cae2a259 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1828 type: IBT flags: 0x200400000020 nid: 16378@gni100 remote: 0xc7d8717e54253dc2 expref: 36 pid: 19226 timeout: 4379185137 lvb_type: 0 used 0
Feb 11 10:48:58 atlas-mds1 kernel: [85202.833436] LustreError: 20236:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 20272s: evicting client at 16310@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881a9a7d46c0/0xd030cd07cae2a2b4 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1824 type: IBT flags: 0x200400000020 nid: 16310@gni100 remote: 0xbd5c384bff16401 expref: 36 pid: 19337 timeout: 4379786098 lvb_type: 0 used 0
Feb 11 10:58:59 atlas-mds1 kernel: [85804.407052] LustreError: 19336:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 20873s: evicting client at 13309@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fb36c7480/0xd030cd07cae2a3e8 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1820 type: IBT flags: 0x200400000020 nid: 13309@gni100 remote: 0x33c7e22dc8d25f25 expref: 36 pid: 19745 timeout: 4380387245 lvb_type: 0 used 0
Feb 11 11:09:00 atlas-mds1 kernel: [86405.980993] LustreError: 19342:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 21474s: evicting client at 16366@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883ba9de5240/0xd030cd07cae2a489 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1811 type: IBT flags: 0x200400000020 nid: 16366@gni100 remote: 0x1e95102493397b80 expref: 37 pid: 20353 timeout: 4380988521 lvb_type: 0 used 0
Feb 11 11:19:02 atlas-mds1 kernel: [87008.555529] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 22076s: evicting client at 10380@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881f7f49f900/0xd030cd07cae2a577 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1788 type: IBT flags: 0x200400000020 nid: 10380@gni100 remote: 0x758d3352cfc91da4 expref: 36 pid: 19368 timeout: 4381590703 lvb_type: 0 used 0
Feb 11 11:29:04 atlas-mds1 kernel: [87611.130362] LustreError: 17619:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 22678s: evicting client at 11922@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881448cdad80/0xd030cd07cae2a61f lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1780 type: IBT flags: 0x200400000020 nid: 11922@gni100 remote: 0x9f490f3250dcfa26 expref: 40 pid: 19083 timeout: 4382192237 lvb_type: 0 used 0
Feb 11 11:39:05 atlas-mds1 kernel: [88212.705133] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 23279s: evicting client at 16399@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881a19db7d80/0xd030cd07cae2a657 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1775 type: IBT flags: 0x200400000020 nid: 16399@gni100 remote: 0xae3b55b3b0e8fcaf expref: 37 pid: 19294 timeout: 4382793129 lvb_type: 0 used 0
Feb 11 11:49:06 atlas-mds1 kernel: [88814.277996] LustreError: 18622:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 23880s: evicting client at 13288@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fb1568b40/0xd030cd07cae2a6a4 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1772 type: IBT flags: 0x200400000020 nid: 13288@gni100 remote: 0xbc7520cd6a29b2c2 expref: 37 pid: 19662 timeout: 4383394123 lvb_type: 0 used 0
Feb 11 11:59:07 atlas-mds1 kernel: [89415.852733] LustreError: 25491:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 24481s: evicting client at 12596@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fac92e900/0xd030cd07cae2a745 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1769 type: IBT flags: 0x200400000020 nid: 12596@gni100 remote: 0xb235a780d20e5531 expref: 38 pid: 20296 timeout: 4383995131 lvb_type: 0 used 0
Feb 11 12:09:08 atlas-mds1 kernel: [90017.426664] LustreError: 20036:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 25082s: evicting client at 11054@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff883fa39f1d80/0xd030cd07cae2a895 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1755 type: IBT flags: 0x200400000020 nid: 11054@gni100 remote: 0x39056ad1e7dcf9ff expref: 39 pid: 20374 timeout: 4384596118 lvb_type: 0 used 0
Feb 11 12:19:11 atlas-mds1 kernel: [90621.002347] LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 25684s: evicting client at 12598@gni100  ns: mdt-atlas1-MDT0000_UUID lock: ffff881aac5e0480/0xd030cd07cae2ab12 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13 rrc: 1742 type: IBT flags: 0x200400000020 nid: 12598@gni100 remote: 0xbc132fc49289524 expref: 40 pid: 18871 timeout: 4385199738 lvb_type: 0 used 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="76755" author="ezell" created="Tue, 11 Feb 2014 18:24:15 +0000"  >&lt;p&gt;Of the 48 evictions I&apos;ve seen today, most of them refer to the same FID:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;$ grep &quot;lock callback&quot; kern.log | awk &apos;{print $29}&apos; | sort | uniq -c | sort -nr
     46 [0x20007d38d:0xf5:0x0].0
      1 [0x200064129:0x1e493:0x0].0
      1 [0x200064129:0x1e45f:0x0].0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
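The same tally can be reproduced offline; a sketch with two of the eviction lines from the earlier paste standing in for kern.log (in this log format, awk field $29 is the res FID):

```shell
# Count evictions per locked resource (awk field 29 is the "res:" FID
# in the eviction lines); the here-doc stands in for kern.log.
counts=$(awk '/lock callback timer expired/ {print $29}' <<'EOF' | sort | uniq -c | sort -nr
Feb 11 10:18:55 atlas-mds1 kernel: [83398.111851] LustreError: 17613:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 18469s: evicting client at 14346@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881463276000/0xd030cd07cae2a16b lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13
Feb 11 10:28:56 atlas-mds1 kernel: [83999.685683] LustreError: 19235:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 19070s: evicting client at 14349@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881f796cb000/0xd030cd07cae2a205 lrc: 3/0,0 mode: PR/PR res: [0x20007d38d:0xf5:0x0].0 bits 0x13
EOF
)
echo "$counts"
```

With both sample lines naming the same resource, the pipeline reports a single FID with a count of 2, the same shape as the 46/1/1 breakdown above.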

&lt;p&gt;Using &apos;lfs fid2path&apos;, all 48 of these correspond to a directory. &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20007d38d:0xf5:0x0&amp;#93;&lt;/span&gt; is a directory with 1262 files in it.&lt;/p&gt;</comment>
                            <comment id="76760" author="green" created="Tue, 11 Feb 2014 18:55:07 +0000"  >&lt;p&gt;How long is the timeout listed in the lock eviction message?&lt;/p&gt;

&lt;p&gt;1262 files is not all that many, either.&lt;/p&gt;</comment>
                            <comment id="76761" author="green" created="Tue, 11 Feb 2014 19:01:53 +0000"  >&lt;p&gt;Oh, I see you pasted the eviction messages too. The time is still in the 20k-second range, which cannot be right, sigh.&lt;/p&gt;</comment>
                            <comment id="76762" author="ezell" created="Tue, 11 Feb 2014 19:05:53 +0000"  >&lt;p&gt;It&apos;s logging every 10 minutes.  The time listed increases by about 600s (10 minutes) each time.  It seems like the lock is bouncing between clients.  If a client gets evicted due to a callback timer expiring, and that lock is re-granted to the next waiting client, do you update l_last_activity? &lt;/p&gt;</comment>
                            <comment id="76766" author="green" created="Tue, 11 Feb 2014 19:19:24 +0000"  >&lt;p&gt;Yes, the re-granting is actually an entirely new enqueue, so it gets counted from zero, of course.&lt;br/&gt;
It&apos;s almost like the last_activity is somehow zeroed-out instead.&lt;/p&gt;

&lt;p&gt;I also note that all the expired locks listed are of PR mode. Since this is a directory, this could only be a result of readdir.&lt;br/&gt;
I wonder if the lock is held up because the client that got this lock tries to obtain the directory pages but cannot for some reason, hence the lock is never released. Do you happen to get messages like &quot;error reading dir ...&quot; on clients, or any other errors from llite/dir.c?&lt;/p&gt;
</comment>
                            <comment id="76775" author="ezell" created="Tue, 11 Feb 2014 20:40:43 +0000"  >&lt;p&gt;Here&apos;s the logs from dir.c from the Cray clients.  c0-0c0s2n2 is 90@gni100 and c24-0c0s6n0 is 9228@gni100.  Both of those are login nodes.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;console-20140130:2014-01-30T18:23:55.673423-05:00 c0-0c0s2n2 LustreError: 4961:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200037b5b:0x307f:0x0] at 0: rc -108
console-20140130:2014-01-30T18:23:55.673430-05:00 c0-0c0s2n2 LustreError: 4961:0:(dir.c:595:ll_dir_read()) error reading dir [0x200037b5b:0x307f:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:11.186576-05:00 c24-0c0s6n0 LustreError: 14610:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:11.186640-05:00 c24-0c0s6n0 LustreError: 14610:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:13.317079-05:00 c24-0c0s6n0 LustreError: 14611:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:13.317127-05:00 c24-0c0s6n0 LustreError: 14611:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:16.408287-05:00 c24-0c0s6n0 LustreError: 14612:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at 0: rc -108
console-20140201:2014-02-01T18:10:16.408328-05:00 c24-0c0s6n0 LustreError: 14612:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -108
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here are some client logs from 14346@gni100 (the first client evicted in the log above).  10 minutes before the eviction, it dropped a PUT from the MDS.  How do I debug why it was dropped?&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;console-20140211:2014-02-11T09:08:07.891822-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:08:07.891830-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T09:26:32.479117-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:26:32.479133-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T09:44:52.493914-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T09:44:52.493935-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T10:03:13.998980-05:00 c13-5c0s5n0 LustreError: 11-0: an error occurred &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; communicating with 10.36.226.72@o2ib200. The mds_connect operation failed with -16
console-20140211:2014-02-11T10:03:13.999004-05:00 c13-5c0s5n0 LustreError: Skipped 1 previous similar message
console-20140211:2014-02-11T10:08:54.976373-05:00 c13-5c0s5n0 LNet: 15569:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.226.72@o2ib200 portal 10 match 1459678641612461 offset
 192 length 536: 2
console-20140211:2014-02-11T10:21:38.458816-05:00 c13-5c0s5n0 LustreError: 167-0: This client was evicted by atlas1-MDT0000; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail.
console-20140211:2014-02-11T10:21:38.458837-05:00 c13-5c0s5n0 LustreError: Skipped 10 previous similar messages
console-20140211:2014-02-11T10:21:38.458844-05:00 c13-5c0s5n0 Lustre: Server atlas1-MDT0000_UUID version (2.4.2.0) is much newer than client version (1.8.6)
console-20140211:2014-02-11T10:21:38.458851-05:00 c13-5c0s5n0 Lustre: Skipped 10 previous similar messages
console-20140211:2014-02-11T10:21:38.458859-05:00 c13-5c0s5n0 Lustre: atlas1-MDT0000-mdc-ffff88080dd5c800: Connection restored to service atlas1-MDT0000 using nid 10.36.226.72@o2ib200.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="76803" author="green" created="Wed, 12 Feb 2014 03:24:30 +0000"  >&lt;p&gt;Hm, the mds_connect failing with -16 before the dropped put means the client is already disconnected from MDS and MDS does not let it in because it believes it is already serving a request from this client (server side messages are &quot;refused reconnection, still busy with X active RPCs&quot;).&lt;/p&gt;

&lt;p&gt;Dropping PUT is because something was sent to a client that the client did not expect. Portal 10 is MDC_REPLY. What likely happened here is that the request on the MDS took a really, really long time, enough for the client to time out its original request and try to reconnect (only to be rejected by the server as above); then eventually the MDS thread completed (visible as a &quot;This indicates the system was overloaded&quot; message) and sent a response, alas the client is no longer listening for that response. But because the server is no longer processing anything from this client, the next reconnect succeeds (and the client notices it actually was evicted).&lt;/p&gt;

&lt;p&gt;Earlier in that same log you&apos;ll see there was a request-timed-out message of some sort; what was it, and when (or, if there was more than one, all of them)? How synced up are the clocks on the servers and clients (for log purposes)? 10:18:55 is before the dropped put reply. Could it be that we had a deadlock-like situation where a client holds the lock, and then readdir (or whatever the request was, but it must have had the lock held to prevent cancellation) had the MDS perform an operation that wanted that lock canceled?&lt;br/&gt;
In the MDS log, before the 10:08 &quot;This indicates the system was overloaded&quot; time stamp, there was likely a watchdog trigger for the same pid; what was the stacktrace?&lt;/p&gt;

&lt;p&gt;Additionally, this seems to be a fairly recent problem; I believe ORNL was running 2.4 servers + 1.8 clients for quite a while before, so can you briefly describe what changed before the problems appeared? Just an fs reformat?&lt;/p&gt;

&lt;p&gt;If you can dump a sysrq-t from the first node that develops the &apos;The mds_connect operation failed with -16&apos; symptom, it would be interesting to see what else was happening there, though I suspect the problem might arise from interactions between several nodes.&lt;/p&gt;

&lt;p&gt;I wonder if something like this is a possibility:&lt;br/&gt;
thread 1: take the lock, issue readdir&lt;br/&gt;
thread 2: do some sort of chmod/create that would invalidate the directory lock.&lt;br/&gt;
....&lt;br/&gt;
On the server: the create arrived and was executed to the point of taking locks first, and a blocking AST was sent back to the client that is waiting for the readdir to finish.&lt;br/&gt;
The readdir arrived and blocked on something from the create (the readdir RPC itself does not take any ldlm locks). (Another option: the readdir is just stuck in the queue waiting to be serviced, but cannot be due to a lack of free threads? That should not be possible because readdir goes to a separate portal.)&lt;br/&gt;
Can you also grab a sysrq-t from the MDT when this happens (the &quot;refused reconnection, still busy with X active RPCs&quot;)?&lt;/p&gt;</comment>
                            <comment id="76879" author="ezell" created="Wed, 12 Feb 2014 19:41:24 +0000"  >&lt;p&gt;I&apos;ve been having problems stitching together the logs.  The MDS rate-limits stack trace dumps to 3 every 300 seconds, so I&apos;m missing different pieces for different occurrences at different times.  I&apos;ll look into it some more tonight.&lt;/p&gt;

&lt;p&gt;Interestingly, it looks like this behavior has calmed down today.  I&apos;m guessing it&apos;s workload-related.&lt;/p&gt;

&lt;p&gt;Time should be pretty well synchronized.  Everything uses NTP.&lt;/p&gt;

&lt;p&gt;ORNL was running with (Cray) 1.8.6 clients against multiple file systems.  Most users were still on Widow (1.8.x based) and we brought in our new Atlas system (2.4.x).  In December and January, users migrated from Widow to Atlas.  As more users moved to Atlas, it seemed like our slowdowns got more common/worse.  A couple weeks ago we decommissioned Widow and moved most clients to 2.4 (since we only had 2.4 servers).  We hit some client-side issues and decided to downgrade clients to (Cray) 1.8.6 this Monday.&lt;/p&gt;</comment>
                            <comment id="76925" author="green" created="Thu, 13 Feb 2014 00:54:13 +0000"  >&lt;p&gt;It could well be lock-related.&lt;br/&gt;
If the MDC readpage/readdir RPC is slowed down for some reason, it can lead to locks timing out. This feeds on itself: while a client holds that readdir lock, other RPCs (like creates in that same dir) are blocked and, at the very least, clog the incoming request queue for the normal RPC portal.&lt;/p&gt;

&lt;p&gt;If you can do sysrq-t when you are in this situation, it should be useful, at least we should be able to get some idea of what&apos;s waiting where (and esp. where do readdir threads wait).&lt;/p&gt;</comment>
                            <comment id="77111" author="hilljjornl" created="Fri, 14 Feb 2014 20:06:01 +0000"  >&lt;p&gt;We attempted to crash dump the MDS today (memory pressure was indicated in the logs by messages like &quot;page allocation failure&quot;). Last time I looked there was 175GB in slabs; I did not grab a slabtop to see where we were holding memory. The crash dump process was stuck for ~20 minutes and the admin decided to kill the crash dump (ipmi power off), so we don&apos;t have additional data. Metadata operation performance continues to be a large complaint from our user base, so we&apos;d like to raise the priority back up. We ran with the 1.8.6 client for 1 week and the logs on titan are much quieter. Please indicate the next steps we should take to gather data. Current status:&lt;/p&gt;

&lt;p&gt;Server: Lustre 2.4.2 + &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4006&quot; title=&quot;LNET Messages staying in Queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4006&quot;&gt;&lt;del&gt;LU-4006&lt;/del&gt;&lt;/a&gt; + &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4597&quot; title=&quot;inconsistent file size&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4597&quot;&gt;&lt;del&gt;LU-4597&lt;/del&gt;&lt;/a&gt; (patch 9127)&lt;br/&gt;
Clients:&lt;br/&gt;
   18,696 Cray Lustre 1.8.6&lt;br/&gt;
      747 Cray Lustre 2.3.?&lt;br/&gt;
      316 Intel Lustre 2.4.2&lt;br/&gt;
       12 Intel Lustre 1.8.9&lt;br/&gt;
        1 Intel Lustre 2.4.1&lt;/p&gt;

&lt;p&gt;Can we raise the priority?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
-Jason&lt;/p&gt;
</comment>
                            <comment id="77115" author="jamesanunez" created="Fri, 14 Feb 2014 20:32:22 +0000"  >&lt;p&gt;Jason, &lt;/p&gt;

&lt;p&gt;I&apos;ve returned the priority of this ticket to Major. &lt;/p&gt;

&lt;p&gt;James&lt;/p&gt;</comment>
                            <comment id="77120" author="green" created="Fri, 14 Feb 2014 20:48:30 +0000"  >&lt;p&gt;Jason: &lt;br/&gt;
So, a sysrq-t from the MDT when you have lock timeouts would be great so we know what waits where. This should also be safer for your MDT than a crashdump.&lt;/p&gt;

&lt;p&gt;The previous analysis, in which most of the lock timeouts were attributed to 2.3 clients, was also interesting. Is that still the pattern?&lt;/p&gt;

&lt;p&gt;Was that page allocation failure a &quot;tage&quot; allocation failure by any chance? If not, do you have a sample?&lt;/p&gt;</comment>
                            <comment id="77187" author="hilljjornl" created="Mon, 17 Feb 2014 16:44:25 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;We&apos;ve not seen the &quot;refused connection&quot; messages since 02/11/14. However, since introducing the patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4579&quot; title=&quot;Timeout system horribly broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4579&quot;&gt;&lt;del&gt;LU-4579&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4006&quot; title=&quot;LNET Messages staying in Queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4006&quot;&gt;&lt;del&gt;LU-4006&lt;/del&gt;&lt;/a&gt; (along with a minor kernel upgrade in the RHEL 6.4 stream), we&apos;ve been having memory issues. Nothing like &quot;tage&quot; allocation failures, but we&apos;re consuming ~100GB (of 256GB) in ldiskfs_inode_cache. The MDS was rebooted on Friday (noted above), and since then we&apos;ve consumed 95GB in ldiskfs_inode_cache. Output from slabtop is below. Is there anything in those two patches that could have started this behavior?&lt;/p&gt;
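&lt;p&gt;For what it&apos;s worth, the ~95GB figure is consistent with the slabtop output below: the CACHE SIZE column is just the slab count times the slab size. A quick cross-check (the 4 KiB slab size, one page per slab, is an assumption, but typical for this cache on a RHEL6 kernel):&lt;/p&gt;

```python
# Cross-check of the ldiskfs_inode_cache line from the slabtop report.
# The slab count is copied from the output; the 4 KiB slab size
# (one page per slab) is an assumption.
slabs = 23861683              # SLABS column for ldiskfs_inode_cache
slab_kib = 4                  # assumed slab size in KiB
cache_kib = slabs * slab_kib
assert cache_kib == 95446732  # matches the CACHE SIZE column (~95GB)
```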


&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-mds1 ~&amp;#93;&lt;/span&gt;# uname -a &lt;br/&gt;
Linux atlas-mds1 2.6.32-358.23.2.el6.atlas.x86_64 #1 SMP Thu Jan 23 08:28:06 EST 2014 x86_64 x86_64 x86_64 GNU/Linux&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;slabtop&amp;#93;&lt;/span&gt;&lt;br/&gt;
  Active / Total Objects (% used)    : 464456162 / 554753575 (83.7%)&lt;br/&gt;
  Active / Total Slabs (% used)      : 47894092 / 47894105 (100.0%)&lt;br/&gt;
  Active / Total Caches (% used)     : 134 / 238 (56.3%)&lt;br/&gt;
  Active / Total Size (% used)       : 154353997.52K / 165485911.12K (93.3%)&lt;br/&gt;
  Minimum / Average / Maximum Object : 0.02K / 0.30K / 4096.00K&lt;/p&gt;

&lt;p&gt;   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME&lt;br/&gt;
71585049 71500585  39%    1.03K 23861683        3  95446732K ldiskfs_inode_cache&lt;br/&gt;
258255738 257781953   0%    0.21K 14347541       18  57390164K osp_obj&lt;br/&gt;
33404544 19776814  59%    0.24K 2087784       16   8351136K mdt_obj&lt;br/&gt;
34789660 20228836  58%    0.19K 1739483       20   6957932K size-192&lt;br/&gt;
11802420 11362818  96%    0.56K 1686060        7   6744240K ldlm_locks&lt;br/&gt;
36565504 19776806  54%    0.12K 1142672       32   4570688K lod_obj&lt;br/&gt;
38187240 19776783  51%    0.09K 954681       40   3818724K mdd_obj&lt;br/&gt;
10246056 9800319  95%    0.31K 853838       12   3415352K ldlm_resources&lt;br/&gt;
15167743 15146729  99%    0.10K 409939       37   1639756K buffer_head&lt;br/&gt;
40185488 15123748  37%    0.03K 358799      112   1435196K size-32&lt;br/&gt;
  61638  61626  99%    8.00K  61638        1    493104K size-8192&lt;br/&gt;
2075220 1957851  94%    0.12K  69174       30    276696K size-128&lt;br/&gt;
  99638  97555  97%    2.00K  49819        2    199276K size-2048&lt;br/&gt;
347039 346726  99%    0.55K  49577        7    198308K radix_tree_node&lt;br/&gt;
274722 274712  99%    0.62K  45787        6    183148K proc_inode_cache&lt;br/&gt;
318336 310874  97%    0.50K  39792        8    159168K size-512&lt;br/&gt;
102492 102443  99%    1.02K  34164        3    136656K nfs_inode_cache&lt;br/&gt;
  98316  91419  92%    1.00K  24579        4     98316K size-1024&lt;br/&gt;
  22746  22746 100%    4.00K  22746        1     90984K size-4096&lt;br/&gt;
396660 395855  99%    0.19K  19833       20     79332K dentry&lt;br/&gt;
   2354   2354 100%   16.00K   2354        1     37664K size-16384&lt;br/&gt;
    280    280 100%  128.00K    280        1     35840K size-131072&lt;br/&gt;
492414 279596  56%    0.06K   8346       59     33384K size-64&lt;/p&gt;</comment>
                            <comment id="77188" author="hilljjornl" created="Mon, 17 Feb 2014 16:45:51 +0000"  >&lt;p&gt;For completeness:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-mds1 ~&amp;#93;&lt;/span&gt;# rpm -qi lustre&lt;br/&gt;
Name        : lustre                       Relocations: (not relocatable)&lt;br/&gt;
Version     : 2.4.2                             Vendor: (none)&lt;br/&gt;
Release     : 2.6.32_358.23.2.el6.atlas.x86_64   Build Date: Thu Feb  6 15:15:56 2014&lt;br/&gt;
Install Date: Mon Feb 10 09:39:10 2014         Build Host: lgmgmt1.ccs.ornl.gov&lt;br/&gt;
Group       : Utilities/System              Source RPM: lustre-2.4.2-2.6.32_358.23.2.el6.atlas.x86_64.src.rpm&lt;br/&gt;
Size        : 4062709                          License: GPL&lt;br/&gt;
Signature   : (none)&lt;br/&gt;
URL         : &lt;a href=&quot;http://wiki.whamcloud.com/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://wiki.whamcloud.com/&lt;/a&gt;&lt;br/&gt;
Summary     : Lustre File System&lt;br/&gt;
Description :&lt;br/&gt;
Userspace tools and files for the Lustre file system.&lt;/p&gt;</comment>
                            <comment id="77190" author="hilljjornl" created="Mon, 17 Feb 2014 17:05:42 +0000"  >&lt;p&gt;Also to correlate, we&apos;ve consumed a large number of MDT inodes for atlas1:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@lfs-mgmt1 scratch&amp;#93;&lt;/span&gt;# df -i /lustre/atlas1&lt;br/&gt;
Filesystem            Inodes   IUsed   IFree IUse% Mounted on&lt;br/&gt;
10.36.226.77@o2ib:/atlas1&lt;br/&gt;
                     1073741824 427809242 645932582   40% /lustre/atlas1&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@lfs-mgmt1 scratch&amp;#93;&lt;/span&gt;# df -i /lustre/atlas2&lt;br/&gt;
Filesystem            Inodes   IUsed   IFree IUse% Mounted on&lt;br/&gt;
10.36.226.77@o2ib:/atlas2&lt;br/&gt;
                     1073741824 88963619 984778205    9% /lustre/atlas2&lt;/p&gt;
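&lt;p&gt;As a quick consistency check on the df output above (values copied verbatim): both MDTs report exactly 2^30 total inodes, and atlas1&apos;s 40% IUse% matches the used/total ratio:&lt;/p&gt;

```python
# Consistency check of the df -i output; all numbers copied verbatim.
total = 1073741824                       # 2**30 inodes on each MDT
atlas1_used, atlas1_free = 427809242, 645932582
atlas2_used, atlas2_free = 88963619, 984778205
assert atlas1_used + atlas1_free == total == 2 ** 30
assert atlas2_used + atlas2_free == total
assert round(100 * atlas1_used / total) == 40   # the reported IUse%
```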

&lt;p&gt;Current slab usage for atlas-mds3 (atlas2&apos;s MDS) is ~30GB.&lt;/p&gt;

&lt;p&gt;The next test is to fire up the TDS and create ~500M files to see if the slab usage tracks the number of files in the filesystem.&lt;/p&gt;</comment>
                            <comment id="77252" author="ezell" created="Tue, 18 Feb 2014 16:20:12 +0000"  >&lt;p&gt;We ran oprofile on the mds and got the following top-consumers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;samples  %        image name               app name symbol name
4753117  40.5645  vmlinux                  vmlinux __write_lock_failed
1276800  10.8966  libcfs.ko                libcfs.ko cfs_percpt_lock
869346    7.4193  oprofiled                oprofiled /usr/bin/oprofiled
470403    4.0146  vmlinux                  vmlinux __vmalloc_node
411024    3.5078  vmlinux                  vmlinux rb_get_reader_page
277246    2.3661  vmlinux                  vmlinux ring_buffer_consume
276266    2.3577  vmlinux                  vmlinux poll_idle
239955    2.0478  lod.ko                   lod.ko lod_alloc_qos.clone.0
211548    1.8054  vmlinux                  vmlinux intel_idle
198649    1.6953  vmlinux                  vmlinux remove_vm_area
146320    1.2487  vmlinux                  vmlinux smp_call_function_many
141106    1.2042  vmlinux                  vmlinux generic_smp_call_function_interrupt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That looks a lot like the results from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4008&quot; title=&quot;Reduce vmalloc usage on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4008&quot;&gt;&lt;del&gt;LU-4008&lt;/del&gt;&lt;/a&gt;, which focused on vmalloc.  I know vmalloc has a lock; is the __write_lock_failed indicating lock contention in vmalloc?&lt;/p&gt;</comment>
                            <comment id="77306" author="green" created="Tue, 18 Feb 2014 23:40:37 +0000"  >&lt;p&gt;Yes, you are definitely seeing lock contention on some lock, but it could be any lock that&apos;s compiled into the kernel. Without call-trace analysis it&apos;s hard to tell more.&lt;/p&gt;

&lt;p&gt;The patch from LU-4579 should not really affect any memory allocations whatsoever.&lt;/p&gt;

&lt;p&gt;As for the slab usage, are you sure this did not happen before? I would imagine the cache will grow to fill all available memory as you access more and more files; that is the purpose of the cache.&lt;/p&gt;</comment>
                            <comment id="77310" author="ezell" created="Wed, 19 Feb 2014 00:09:53 +0000"  >&lt;p&gt;We&apos;ll do some more analysis tomorrow to try to get call graphs to see what locks we are contending on.  We think it&apos;s vmalloc at this point.  We plan to try the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3338&quot; title=&quot;IOC_MDC_GETFILESTRIPE can abuse vmalloc()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3338&quot;&gt;&lt;del&gt;LU-3338&lt;/del&gt;&lt;/a&gt; patch (&lt;a href=&quot;http://review.whamcloud.com/#/c/6339&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/6339&lt;/a&gt;) in the testbed to see how it impacts behavior.  We&apos;ll also turn on +malloc to watch things being allocated and freed.&lt;/p&gt;

&lt;p&gt;What are the implications of increasing OBD_ALLOC_BIG?  Just more likely to hit fragmentation/allocation failures/slowdowns?  Is the following a really bad idea:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
index 24671c3..83d3c62 100644
--- a/lustre/include/obd_support.h
+++ b/lustre/include/obd_support.h
@@ -704,7 +704,8 @@ &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; {                                                                              \
  * since vmalloc in Linux doesn&apos;t perform well on multi-cores system, calling
  * vmalloc in critical path would hurt peformance badly. See LU-66.
  */
-#define OBD_ALLOC_BIG (4 * PAGE_CACHE_SIZE)
+#define OBD_ALLOC_BIG_ORDER 4
+#define OBD_ALLOC_BIG (PAGE_CACHE_SIZE &amp;lt;&amp;lt; OBD_ALLOC_BIG_ORDER)
 
 #define OBD_ALLOC_LARGE(ptr, size)                                            \
 &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; {                                                                          \
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
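&lt;p&gt;One note on the diff above: with a 4 KiB PAGE_CACHE_SIZE the two definitions are not value-equivalent. The original threshold is 16 KiB, while an order-4 shift gives 64 KiB, so this patch would also raise the kmalloc/vmalloc switch-over point fourfold (sketch below; the 4 KiB page size is an assumption):&lt;/p&gt;

```python
# Threshold arithmetic for the proposed OBD_ALLOC_BIG change,
# assuming a 4 KiB PAGE_CACHE_SIZE (typical x86_64 value).
PAGE_CACHE_SIZE = 4096
old_big = 4 * PAGE_CACHE_SIZE                         # 16384 bytes (16 KiB)
OBD_ALLOC_BIG_ORDER = 4
new_big = PAGE_CACHE_SIZE * 2 ** OBD_ALLOC_BIG_ORDER  # 65536 bytes (64 KiB)
assert new_big == 4 * old_big   # the patch quadruples the threshold
```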

&lt;p&gt;Or instead try kmalloc without GFP_WAIT and fall back to vmalloc if it fails?  Then we can compare the pointer to see if it&apos;s between VMALLOC_START and VMALLOC_END to know which deallocation function to use (except it looks like those are being deprecated in the kernel).  We might only want to do that if the size is over 4*PAGE_CACHE_SIZE, to avoid extra checks in the fast path for small allocations.&lt;/p&gt;</comment>
                            <comment id="77324" author="hilljjornl" created="Wed, 19 Feb 2014 05:54:04 +0000"  >&lt;p&gt;Oleg &amp;#8211; the slab usage got a separate issue: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4641&quot; title=&quot;ldiskfs_inode_cache slab high usage&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4641&quot;&gt;&lt;del&gt;LU-4641&lt;/del&gt;&lt;/a&gt;. We think we have it tracked down to a handful of nodes running mlocate inside of Lustre. As Matt said, we should have some more data for you in the morning.&lt;/p&gt;</comment>
                            <comment id="77326" author="hilljjornl" created="Wed, 19 Feb 2014 06:40:31 +0000"  >&lt;p&gt;Oleg:&lt;/p&gt;

&lt;p&gt;We did get a few trips of your debug patch in the last two days. I failed to look for them yesterday; sorry about that. Here are the messages, and I will work on getting you more info from around these times:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;grep &quot;Adding a lock, but the front position is scheduled in&quot; lustrekernel&lt;br/&gt;
Feb 18 15:43:54 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708001.133886&amp;#93;&lt;/span&gt; LustreError: 30384:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 15:43:55 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708001.265487&amp;#93;&lt;/span&gt; LustreError: 24590:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 15:44:11 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708017.670420&amp;#93;&lt;/span&gt; LustreError: 86222:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:12:36 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709728.444442&amp;#93;&lt;/span&gt; LustreError: 27317:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:12:38 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709730.285546&amp;#93;&lt;/span&gt; LustreError: 27317:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:20:00 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710167.876157&amp;#93;&lt;/span&gt; LustreError: 69582:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:20:00 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710168.014783&amp;#93;&lt;/span&gt; LustreError: 85586:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 16:20:09 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710175.926957&amp;#93;&lt;/span&gt; LustreError: 30398:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:20:23 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710190.363610&amp;#93;&lt;/span&gt; LustreError: 25260:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:40:10 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711378.566730&amp;#93;&lt;/span&gt; LustreError: 70567:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:40:52 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711423.942634&amp;#93;&lt;/span&gt; LustreError: 24667:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:40:52 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711424.073057&amp;#93;&lt;/span&gt; LustreError: 21012:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 16:41:01 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711430.285640&amp;#93;&lt;/span&gt; LustreError: 20357:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:41:08 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711440.138592&amp;#93;&lt;/span&gt; LustreError: 21114:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:41:10 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711438.859467&amp;#93;&lt;/span&gt; LustreError: 21096:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:41:15 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711443.961101&amp;#93;&lt;/span&gt; LustreError: 20357:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:44:54 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711666.257479&amp;#93;&lt;/span&gt; LustreError: 27271:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:45:01 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711672.979860&amp;#93;&lt;/span&gt; LustreError: 27246:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:45:04 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711673.026049&amp;#93;&lt;/span&gt; LustreError: 26061:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:45:04 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711676.191623&amp;#93;&lt;/span&gt; LustreError: 23342:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 16:45:17 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711688.908916&amp;#93;&lt;/span&gt; LustreError: 56747:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 16:45:32 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711701.477851&amp;#93;&lt;/span&gt; LustreError: 26410:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 17:02:02 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;712041.677197&amp;#93;&lt;/span&gt; LustreError: 20599:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 17:16:59 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713596.079487&amp;#93;&lt;/span&gt; LustreError: 25774:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 17:17:00 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713596.219754&amp;#93;&lt;/span&gt; LustreError: 87438:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:09:59 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716774.401901&amp;#93;&lt;/span&gt; LustreError: 23027:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:09:59 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716774.571574&amp;#93;&lt;/span&gt; LustreError: 23068:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:10:05 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716126.241513&amp;#93;&lt;/span&gt; LustreError: 24944:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:10:05 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716126.372856&amp;#93;&lt;/span&gt; LustreError: 24732:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:10:06 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716127.799754&amp;#93;&lt;/span&gt; LustreError: 20009:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:10:07 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716777.904758&amp;#93;&lt;/span&gt; LustreError: 3700:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:10:13 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716787.846448&amp;#93;&lt;/span&gt; LustreError: 27319:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:10:13 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716787.372402&amp;#93;&lt;/span&gt; LustreError: 23223:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:10:14 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716135.339937&amp;#93;&lt;/span&gt; LustreError: 21260:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:10:20 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716793.891680&amp;#93;&lt;/span&gt; LustreError: 24121:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:10:21 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716795.708684&amp;#93;&lt;/span&gt; LustreError: 23071:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:10:28 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716149.801314&amp;#93;&lt;/span&gt; LustreError: 20529:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:10:29 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716802.920675&amp;#93;&lt;/span&gt; LustreError: 23217:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:10:30 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716805.563884&amp;#93;&lt;/span&gt; LustreError: 23266:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:11:01 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716835.642508&amp;#93;&lt;/span&gt; LustreError: 56747:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:11:16 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716849.686177&amp;#93;&lt;/span&gt; LustreError: 22306:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:11:31 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716865.522612&amp;#93;&lt;/span&gt; LustreError: 70981:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:11:32 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716862.873513&amp;#93;&lt;/span&gt; LustreError: 26426:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:11:33 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716868.256716&amp;#93;&lt;/span&gt; LustreError: 23017:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:11:40 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716221.117935&amp;#93;&lt;/span&gt; LustreError: 20598:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:11:47 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716880.889169&amp;#93;&lt;/span&gt; LustreError: 39989:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:12:18 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716912.213317&amp;#93;&lt;/span&gt; LustreError: 24105:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:44:56 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718874.544200&amp;#93;&lt;/span&gt; LustreError: 25633:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:44:56 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718874.674722&amp;#93;&lt;/span&gt; LustreError: 25737:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 18:45:02 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718880.495318&amp;#93;&lt;/span&gt; LustreError: 26485:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:45:04 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718882.996827&amp;#93;&lt;/span&gt; LustreError: 26485:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 18:45:40 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718919.070183&amp;#93;&lt;/span&gt; LustreError: 25664:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 21:45:19 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729697.635596&amp;#93;&lt;/span&gt; LustreError: 70926:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 21:45:26 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729704.727897&amp;#93;&lt;/span&gt; LustreError: 56654:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 21:45:36 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729715.381797&amp;#93;&lt;/span&gt; LustreError: 24092:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 22:29:39 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732361.024093&amp;#93;&lt;/span&gt; LustreError: 24155:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 22:29:47 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732368.605072&amp;#93;&lt;/span&gt; LustreError: 30981:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 22:30:02 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732384.000878&amp;#93;&lt;/span&gt; LustreError: 30989:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 22:43:57 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733219.848018&amp;#93;&lt;/span&gt; LustreError: 25273:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 22:43:58 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733220.017588&amp;#93;&lt;/span&gt; LustreError: 29909:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 22:44:00 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733222.307693&amp;#93;&lt;/span&gt; LustreError: 30989:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 22:44:23 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733245.326600&amp;#93;&lt;/span&gt; LustreError: 24198:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:21:33 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;735476.569174&amp;#93;&lt;/span&gt; LustreError: 30970:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:21:35 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;735478.669772&amp;#93;&lt;/span&gt; LustreError: 30985:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:21:42 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;735485.443866&amp;#93;&lt;/span&gt; LustreError: 24422:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:52:02 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737305.746429&amp;#93;&lt;/span&gt; LustreError: 30970:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 23:52:02 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737305.882243&amp;#93;&lt;/span&gt; LustreError: 84274:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 18 23:52:12 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737316.153863&amp;#93;&lt;/span&gt; LustreError: 24458:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:52:20 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737323.616670&amp;#93;&lt;/span&gt; LustreError: 30098:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:52:28 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737332.071392&amp;#93;&lt;/span&gt; LustreError: 29882:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:56:00 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737544.361153&amp;#93;&lt;/span&gt; LustreError: 24445:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 18 23:56:16 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737559.817256&amp;#93;&lt;/span&gt; LustreError: 24228:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 19 00:22:15 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739119.386062&amp;#93;&lt;/span&gt; LustreError: 30683:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 19 00:22:23 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739127.548279&amp;#93;&lt;/span&gt; LustreError: 70024:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 755 seconds&lt;br/&gt;
Feb 19 00:22:30 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739134.417534&amp;#93;&lt;/span&gt; LustreError: 30853:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 754 seconds&lt;br/&gt;
Feb 19 01:11:26 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742068.910961&amp;#93;&lt;/span&gt; LustreError: 44224:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 673 seconds&lt;br/&gt;
Feb 19 01:11:27 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742069.285045&amp;#93;&lt;/span&gt; LustreError: 26648:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 673 seconds&lt;br/&gt;
Feb 19 01:11:28 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742070.594347&amp;#93;&lt;/span&gt; LustreError: 28404:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 671 seconds&lt;br/&gt;
Feb 19 01:11:29 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742071.713368&amp;#93;&lt;/span&gt; LustreError: 25522:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 670 seconds&lt;br/&gt;
Feb 19 01:11:31 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742073.793994&amp;#93;&lt;/span&gt; LustreError: 25522:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 668 seconds&lt;br/&gt;
Feb 19 01:26:13 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742955.826446&amp;#93;&lt;/span&gt; LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 752 seconds&lt;br/&gt;
Feb 19 01:26:18 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742960.947155&amp;#93;&lt;/span&gt; LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 747 seconds&lt;br/&gt;
Feb 19 01:26:19 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742962.066280&amp;#93;&lt;/span&gt; LustreError: 27899:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 746 seconds&lt;br/&gt;
Feb 19 01:26:21 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742964.134639&amp;#93;&lt;/span&gt; LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 744 seconds&lt;br/&gt;
Feb 19 01:26:29 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742971.895279&amp;#93;&lt;/span&gt; LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 736 seconds&lt;br/&gt;
Feb 19 01:26:37 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742980.276772&amp;#93;&lt;/span&gt; LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 728 seconds&lt;br/&gt;
Feb 19 01:26:58 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;743000.696535&amp;#93;&lt;/span&gt; LustreError: 25531:0:(ldlm_lockd.c:484:__ldlm_add_waiting_lock()) ### Adding a lock, but the front position is scheduled in 708 seconds&lt;/li&gt;
&lt;/ol&gt;


&lt;ol&gt;
	&lt;li&gt;grep &quot;more than at_max&quot; lustrekernel&lt;br/&gt;
Feb 18 15:43:46 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;707992.551979&amp;#93;&lt;/span&gt; LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:43:46 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;707992.687373&amp;#93;&lt;/span&gt; LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:43:48 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;707994.249556&amp;#93;&lt;/span&gt; LustreError: 69726:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:43:54 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708000.981415&amp;#93;&lt;/span&gt; LustreError: 49076:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:43:57 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708003.467150&amp;#93;&lt;/span&gt; LustreError: 49076:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:44:11 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708017.513082&amp;#93;&lt;/span&gt; LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:51:17 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708448.700510&amp;#93;&lt;/span&gt; LustreError: 117487:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:54:23 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708634.942513&amp;#93;&lt;/span&gt; LustreError: 117487:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:55:43 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708715.089797&amp;#93;&lt;/span&gt; LustreError: 56030:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 15:57:43 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;708834.978812&amp;#93;&lt;/span&gt; LustreError: 56028:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:05:22 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709289.540122&amp;#93;&lt;/span&gt; LustreError: 69652:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:12:30 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709722.059256&amp;#93;&lt;/span&gt; LustreError: 68203:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:12:36 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709728.287823&amp;#93;&lt;/span&gt; LustreError: 47664:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:12:38 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709730.128986&amp;#93;&lt;/span&gt; LustreError: 47664:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:16:35 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;709967.619831&amp;#93;&lt;/span&gt; LustreError: 42467:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:17:54 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710041.296582&amp;#93;&lt;/span&gt; LustreError: 49907:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:17:54 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710041.480052&amp;#93;&lt;/span&gt; LustreError: 49907:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:18:56 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710108.933421&amp;#93;&lt;/span&gt; LustreError: 42467:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:20:00 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710167.759432&amp;#93;&lt;/span&gt; LustreError: 69701:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:20:08 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710175.774219&amp;#93;&lt;/span&gt; LustreError: 69620:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:20:13 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710180.416937&amp;#93;&lt;/span&gt; LustreError: 69701:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:20:23 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710190.212341&amp;#93;&lt;/span&gt; LustreError: 69701:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:23:23 atlas-oss2b2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710370.936123&amp;#93;&lt;/span&gt; LustreError: 41623:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:27:25 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;710617.716497&amp;#93;&lt;/span&gt; LustreError: 68321:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:40:10 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711378.435362&amp;#93;&lt;/span&gt; LustreError: 70646:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:40:10 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711378.706590&amp;#93;&lt;/span&gt; LustreError: 70646:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:40:52 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711423.821480&amp;#93;&lt;/span&gt; LustreError: 75044:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:40:52 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711420.949475&amp;#93;&lt;/span&gt; LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:40:52 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711424.203728&amp;#93;&lt;/span&gt; LustreError: 75044:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:41:00 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711431.432813&amp;#93;&lt;/span&gt; LustreError: 75057:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:41:01 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711430.139913&amp;#93;&lt;/span&gt; LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:41:08 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711439.988737&amp;#93;&lt;/span&gt; LustreError: 45021:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:41:10 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711438.714549&amp;#93;&lt;/span&gt; LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:41:15 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711443.816261&amp;#93;&lt;/span&gt; LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:41:15 atlas-oss2b1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711447.327967&amp;#93;&lt;/span&gt; LustreError: 45021:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:44:48 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711660.584115&amp;#93;&lt;/span&gt; LustreError: 70981:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:44:48 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711660.707221&amp;#93;&lt;/span&gt; LustreError: 56619:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:44:54 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711666.099490&amp;#93;&lt;/span&gt; LustreError: 70928:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:45:01 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711672.820288&amp;#93;&lt;/span&gt; LustreError: 70919:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:45:04 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711676.039009&amp;#93;&lt;/span&gt; LustreError: 70889:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:45:04 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711672.871923&amp;#93;&lt;/span&gt; LustreError: 70502:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:45:11 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711683.037316&amp;#93;&lt;/span&gt; LustreError: 70889:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:45:19 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711691.220293&amp;#93;&lt;/span&gt; LustreError: 56624:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 16:45:32 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;711701.329287&amp;#93;&lt;/span&gt; LustreError: 70502:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:01:25 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;712004.769014&amp;#93;&lt;/span&gt; LustreError: 66807:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:01:25 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;712004.906977&amp;#93;&lt;/span&gt; LustreError: 66729:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:02:02 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;712041.523613&amp;#93;&lt;/span&gt; LustreError: 66798:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:15:25 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713502.163436&amp;#93;&lt;/span&gt; LustreError: 48723:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:15:26 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713502.305135&amp;#93;&lt;/span&gt; LustreError: 48723:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:16:14 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713550.619589&amp;#93;&lt;/span&gt; LustreError: 87465:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:16:59 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713595.933761&amp;#93;&lt;/span&gt; LustreError: 87465:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:18:14 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713671.242717&amp;#93;&lt;/span&gt; LustreError: 87465:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:22:40 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;713933.735218&amp;#93;&lt;/span&gt; LustreError: 40709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:47:35 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;715429.087633&amp;#93;&lt;/span&gt; LustreError: 68321:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 17:47:50 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;715444.189061&amp;#93;&lt;/span&gt; LustreError: 68321:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:51 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716112.054344&amp;#93;&lt;/span&gt; LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:51 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716112.222414&amp;#93;&lt;/span&gt; LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:51 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716762.131266&amp;#93;&lt;/span&gt; LustreError: 39146:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:51 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716765.381385&amp;#93;&lt;/span&gt; LustreError: 22345:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:57 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716767.800955&amp;#93;&lt;/span&gt; LustreError: 70437:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:59 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716120.700564&amp;#93;&lt;/span&gt; LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:09:59 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716774.249011&amp;#93;&lt;/span&gt; LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:00 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716774.711337&amp;#93;&lt;/span&gt; LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:05 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716126.087293&amp;#93;&lt;/span&gt; LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:05 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716778.716316&amp;#93;&lt;/span&gt; LustreError: 56669:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:07 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716781.898424&amp;#93;&lt;/span&gt; LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:07 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716777.756610&amp;#93;&lt;/span&gt; LustreError: 36709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:07 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716128.449354&amp;#93;&lt;/span&gt; LustreError: 44839:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:12 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716133.872074&amp;#93;&lt;/span&gt; LustreError: 66779:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:13 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716787.693697&amp;#93;&lt;/span&gt; LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:13 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716787.212697&amp;#93;&lt;/span&gt; LustreError: 56669:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:15 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716790.056572&amp;#93;&lt;/span&gt; LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:20 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716795.549134&amp;#93;&lt;/span&gt; LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:22 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716795.736584&amp;#93;&lt;/span&gt; LustreError: 56659:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:23 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716793.876972&amp;#93;&lt;/span&gt; LustreError: 36709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:28 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716149.640965&amp;#93;&lt;/span&gt; LustreError: 44839:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:30 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716805.411905&amp;#93;&lt;/span&gt; LustreError: 68339:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:59 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716833.350545&amp;#93;&lt;/span&gt; LustreError: 70955:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:01 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716835.801342&amp;#93;&lt;/span&gt; LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:01 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716182.847151&amp;#93;&lt;/span&gt; LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:08 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716838.563276&amp;#93;&lt;/span&gt; LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:31 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716865.369423&amp;#93;&lt;/span&gt; LustreError: 70955:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:33 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716868.103793&amp;#93;&lt;/span&gt; LustreError: 68339:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:39 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716220.959921&amp;#93;&lt;/span&gt; LustreError: 66779:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:20:59 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;717437.327217&amp;#93;&lt;/span&gt; LustreError: 44941:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:44:56 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718874.423889&amp;#93;&lt;/span&gt; LustreError: 59983:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:44:56 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718874.813740&amp;#93;&lt;/span&gt; LustreError: 59983:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:45:02 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718880.336502&amp;#93;&lt;/span&gt; LustreError: 87429:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:45:04 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718882.844594&amp;#93;&lt;/span&gt; LustreError: 87429:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:45:10 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718889.094591&amp;#93;&lt;/span&gt; LustreError: 87520:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:45:25 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718903.807626&amp;#93;&lt;/span&gt; LustreError: 58796:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:45:40 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718918.917446&amp;#93;&lt;/span&gt; LustreError: 87396:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 19:29:02 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;721522.017991&amp;#93;&lt;/span&gt; LustreError: 87301:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 21:24:02 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;728420.566978&amp;#93;&lt;/span&gt; LustreError: 70979:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 21:45:18 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729697.474808&amp;#93;&lt;/span&gt; LustreError: 70959:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 21:45:26 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729704.569173&amp;#93;&lt;/span&gt; LustreError: 43863:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 21:45:28 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729707.282184&amp;#93;&lt;/span&gt; LustreError: 70959:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 21:45:36 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729715.224382&amp;#93;&lt;/span&gt; LustreError: 43863:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 21:45:51 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;729730.293744&amp;#93;&lt;/span&gt; LustreError: 70959:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:29:30 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732351.785694&amp;#93;&lt;/span&gt; LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:29:30 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732351.917490&amp;#93;&lt;/span&gt; LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:29:39 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732360.901707&amp;#93;&lt;/span&gt; LustreError: 49635:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:29:46 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732368.452562&amp;#93;&lt;/span&gt; LustreError: 49635:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:30:02 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732383.839945&amp;#93;&lt;/span&gt; LustreError: 49635:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:30:09 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;732391.346122&amp;#93;&lt;/span&gt; LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:41:28 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733069.735891&amp;#93;&lt;/span&gt; LustreError: 84264:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:43:52 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733213.899990&amp;#93;&lt;/span&gt; LustreError: 70179:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:43:57 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733219.697281&amp;#93;&lt;/span&gt; LustreError: 70111:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:44:14 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733236.420909&amp;#93;&lt;/span&gt; LustreError: 70127:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 22:44:23 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;733245.170739&amp;#93;&lt;/span&gt; LustreError: 70111:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:21:33 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;735476.415442&amp;#93;&lt;/span&gt; LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:21:35 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;735478.508845&amp;#93;&lt;/span&gt; LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:21:42 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;735485.290958&amp;#93;&lt;/span&gt; LustreError: 70171:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:52:02 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737305.577723&amp;#93;&lt;/span&gt; LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:15 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716790.056572&amp;#93;&lt;/span&gt; LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:20 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716795.549134&amp;#93;&lt;/span&gt; LustreError: 68227:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:22 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716795.736584&amp;#93;&lt;/span&gt; LustreError: 56659:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:23 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716793.876972&amp;#93;&lt;/span&gt; LustreError: 36709:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:28 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716149.640965&amp;#93;&lt;/span&gt; LustreError: 44839:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:30 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716805.411905&amp;#93;&lt;/span&gt; LustreError: 68339:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:10:59 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716833.350545&amp;#93;&lt;/span&gt; LustreError: 70955:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:01 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716835.801342&amp;#93;&lt;/span&gt; LustreError: 68308:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:01 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716182.847151&amp;#93;&lt;/span&gt; LustreError: 41362:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:08 atlas-oss2i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716838.563276&amp;#93;&lt;/span&gt; LustreError: 70662:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:31 atlas-oss1c2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716865.369423&amp;#93;&lt;/span&gt; LustreError: 70955:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:33 atlas-oss2a1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716868.103793&amp;#93;&lt;/span&gt; LustreError: 68339:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:11:39 atlas-oss1c1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;716220.959921&amp;#93;&lt;/span&gt; LustreError: 66779:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:20:59 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;717437.327217&amp;#93;&lt;/span&gt; LustreError: 44941:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:44:56 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718874.423889&amp;#93;&lt;/span&gt; LustreError: 59983:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:44:56 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718874.813740&amp;#93;&lt;/span&gt; LustreError: 59983:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 18:45:02 atlas-oss1i7 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;718880.336502&amp;#93;&lt;/span&gt; LustreError: 87429:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:52:12 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737315.993998&amp;#93;&lt;/span&gt; LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:52:20 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737323.457799&amp;#93;&lt;/span&gt; LustreError: 84264:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:52:28 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737331.921511&amp;#93;&lt;/span&gt; LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:56:00 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737544.208456&amp;#93;&lt;/span&gt; LustreError: 70127:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 18 23:56:18 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;737562.320020&amp;#93;&lt;/span&gt; LustreError: 70140:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 00:22:15 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739119.233426&amp;#93;&lt;/span&gt; LustreError: 70171:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 00:22:23 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739127.393845&amp;#93;&lt;/span&gt; LustreError: 70013:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 00:22:31 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739135.399082&amp;#93;&lt;/span&gt; LustreError: 23539:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 00:26:56 atlas-oss1g5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;739400.814619&amp;#93;&lt;/span&gt; LustreError: 70102:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 01:10:05 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;741987.229563&amp;#93;&lt;/span&gt; LustreError: 96360:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 01:11:26 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742068.781671&amp;#93;&lt;/span&gt; LustreError: 44224:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 01:15:45 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742327.452668&amp;#93;&lt;/span&gt; LustreError: 37185:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 01:26:11 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;742953.606805&amp;#93;&lt;/span&gt; LustreError: 10346:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;br/&gt;
Feb 19 01:32:19 atlas-oss3h4 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;743322.403606&amp;#93;&lt;/span&gt; LustreError: 10346:0:(ldlm_lockd.c:460:__ldlm_add_waiting_lock()) ### requested timeout 755, more than at_max 600&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="77376" author="adilger" created="Wed, 19 Feb 2014 17:28:00 +0000"  >&lt;p&gt;For OBD_FREE_LARGE() it can use is_vmalloc_addr() to determine if a passed pointer was allocated via kmalloc() or vmalloc().  That would allow OBD_ALLOC_LARGE() to try kmalloc() first (maybe without __GFP_WAIT), and only fall back to vmalloc() in case of failure. &lt;/p&gt;</comment>
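Andreas's suggestion above can be sketched in kernel-style C. This is an illustrative sketch of the idea only, not the actual Lustre implementation; the lowercase helper names are hypothetical stand-ins for the OBD_ALLOC_LARGE()/OBD_FREE_LARGE() macros:

```c
/* Sketch of the suggested allocation strategy (kernel-style pseudocode,
 * not actual Lustre code): try kmalloc() first, masking out __GFP_WAIT
 * so it fails fast rather than stalling under memory pressure, and only
 * fall back to vmalloc() on failure. */
static void *obd_alloc_large(size_t size)
{
	void *ptr = kmalloc(size, GFP_NOFS & ~__GFP_WAIT);

	if (ptr == NULL)
		ptr = vmalloc(size);
	return ptr;
}

static void obd_free_large(void *ptr)
{
	/* is_vmalloc_addr() detects which allocator produced the pointer,
	 * so the free path needs no extra bookkeeping from callers. */
	if (is_vmalloc_addr(ptr))
		vfree(ptr);
	else
		kfree(ptr);
}
```

The trade-off simmonsja raises below is the real cost of this design: leaning harder on kmalloc() for large buffers exposes the allocation to physical-memory fragmentation.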
                            <comment id="77380" author="simmonsja" created="Wed, 19 Feb 2014 17:44:44 +0000"  >&lt;p&gt;I was just discussing that on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4008&quot; title=&quot;Reduce vmalloc usage on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4008&quot;&gt;&lt;del&gt;LU-4008&lt;/del&gt;&lt;/a&gt;. I have some data to show what exactly is causing the vmalloc lock contention. The only draw back to that approach is memory fragmentation.&lt;/p&gt;</comment>
                            <comment id="77391" author="hilljjornl" created="Wed, 19 Feb 2014 18:12:34 +0000"  >&lt;p&gt;Oleg, I&apos;m attaching the Lustre logs from atlas and Titan during the time ranges from above (we also hit some more around 9:40 this morning). Curiously there are no titan entries for most of that time. I will look at our smaller Cray XC30 (no Lustre messages during any of these time frames), and also our 200 node linux cluster &amp;#8211; those are our three biggest consumers of Lustre &amp;#8211; to get their log data from these time frames.  &lt;/p&gt;

&lt;p&gt;hilljj@titan-smw0:/tmp$ for i in $(ls titan-lustre-2014-02-1*); do wc -l $i; done&lt;br/&gt;
114989 titan-lustre-2014-02-18-hr10.txt&lt;br/&gt;
0 titan-lustre-2014-02-18-hr15.txt&lt;br/&gt;
8 titan-lustre-2014-02-18-hr16.txt&lt;br/&gt;
0 titan-lustre-2014-02-18-hr17.txt&lt;br/&gt;
45 titan-lustre-2014-02-18-hr18.txt&lt;br/&gt;
0 titan-lustre-2014-02-18-hr21.txt&lt;br/&gt;
0 titan-lustre-2014-02-18-hr22.txt&lt;br/&gt;
0 titan-lustre-2014-02-18-hr23.txt&lt;br/&gt;
0 titan-lustre-2014-02-19-hr00.txt&lt;br/&gt;
0 titan-lustre-2014-02-19-hr01.txt&lt;br/&gt;
4 titan-lustre-2014-02-19-hr09.txt&lt;br/&gt;
hilljj@titan-smw0:/tmp$ md5sum titan-lustre-logs-LU4570.tgz &lt;br/&gt;
79571cb30d28a42e969461eeed947279  titan-lustre-logs-LU4570.tgz&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@rhea-mgmt2 tmp&amp;#93;&lt;/span&gt;# wc -l rhea-lustre-2014-02-1*&lt;br/&gt;
      0 rhea-lustre-2014-02-18-hr10.txt&lt;br/&gt;
    714 rhea-lustre-2014-02-18-hr15.txt&lt;br/&gt;
   1347 rhea-lustre-2014-02-18-hr16.txt&lt;br/&gt;
   1179 rhea-lustre-2014-02-18-hr17.txt&lt;br/&gt;
   1605 rhea-lustre-2014-02-18-hr18.txt&lt;br/&gt;
    318 rhea-lustre-2014-02-18-hr21.txt&lt;br/&gt;
    337 rhea-lustre-2014-02-18-hr22.txt&lt;br/&gt;
    364 rhea-lustre-2014-02-18-hr23.txt&lt;br/&gt;
    293 rhea-lustre-2014-02-19-hr00.txt&lt;br/&gt;
    200 rhea-lustre-2014-02-19-hr01.txt&lt;br/&gt;
    787 rhea-lustre-2014-02-19-hr09.txt&lt;br/&gt;
   7144 total&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@rhea-mgmt2 tmp&amp;#93;&lt;/span&gt;# md5sum rhea-lustre-logs-LU4570.tgz &lt;br/&gt;
cfdf017d0d5b4b7ef813b74f21f687cb  rhea-lustre-logs-LU4570.tgz&lt;/p&gt;

&lt;p&gt;hilljj@hill-desktop:~$ wc -l ccs_homedir/ldlm_lockd.c.lustrekernel.txt &lt;br/&gt;
397 ccs_homedir/ldlm_lockd.c.lustrekernel.txt&lt;/p&gt;</comment>
                            <comment id="77451" author="green" created="Thu, 20 Feb 2014 02:59:48 +0000"  >&lt;p&gt;Hm, so all the trips of my at_max message seem to be at exactly 755 seconds.&lt;br/&gt;
This seems to be coming from at_est2timeout(), which can increase the maximum possible timeout to 125% + 5 seconds. I guess clients won&apos;t really appreciate that certain lock operations would take longer than at_max when such a conflicting lock is encountered.&lt;br/&gt;
But this only happens on OSTs, so it would not explain any timeouts on the MDS.&lt;/p&gt;

&lt;p&gt;In the client logs we see timeouts both to OSTs and MDTs, but sadly the reduced printing only shows the xid, not the opc, so there&apos;s no way to know which requests are failing. Interestingly, quite a bunch of the timeouts are in the low range - 17s, 19s, 31s - and could not be explained by overly long timeouts or anything like that. I wonder why those were not replied to by the server in time, if only with an early reply?&lt;br/&gt;
Timeouts go as high as 829 seconds and more, which is interestingly way higher than at_max (is a high network latency setting at play?)&lt;/p&gt;</comment>
                            <comment id="77685" author="green" created="Sat, 22 Feb 2014 05:35:11 +0000"  >&lt;p&gt;So, I&apos;ve been digging around the code and it looks like we send close RPCs to the readdir portal now, so if you have heavy close traffic, and close is affected by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4008&quot; title=&quot;Reduce vmalloc usage on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4008&quot;&gt;&lt;del&gt;LU-4008&lt;/del&gt;&lt;/a&gt; making it use tons of CPU time, it is probably quite feasible that all of those threads are consumed.&lt;br/&gt;
This would probably also mean that you have quite a ton of them present once the problem hit (and it only exacerbates the lock contention on vmalloc).&lt;br/&gt;
Hopefully with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4008&quot; title=&quot;Reduce vmalloc usage on MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4008&quot;&gt;&lt;del&gt;LU-4008&lt;/del&gt;&lt;/a&gt; we are going to kill two birds with one stone: general close processing time should improve, while readdir processing time improves at the same time, which should free readdir locks much faster.&lt;/p&gt;</comment>
                            <comment id="78090" author="hilljjornl" created="Fri, 28 Feb 2014 13:55:12 +0000"  >&lt;p&gt;console output of echo t &amp;gt; /proc/sysrq-trigger under load.&lt;/p&gt;</comment>
                            <comment id="78092" author="simmonsja" created="Fri, 28 Feb 2014 14:15:03 +0000"  >&lt;p&gt;I uploaded the MDS stack traces to ftp.whamcloud.com/uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4570&quot; title=&quot;Metadata slowdowns on production filesystem at ORNL&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4570&quot;&gt;&lt;del&gt;LU-4570&lt;/del&gt;&lt;/a&gt;/mds1.20140227.console.log.bz2&lt;/p&gt;</comment>
                            <comment id="78249" author="green" created="Mon, 3 Mar 2014 17:46:26 +0000"  >&lt;p&gt;Hm, looking at the traces I see 134 threads waiting for object allocations from OSTs.&lt;br/&gt;
355 threads are waiting for locks to be cancelled.&lt;/p&gt;

&lt;p&gt;This is nothing too critical I imagine.&lt;br/&gt;
There&apos;s nothing waiting in the readdir, so that angle did not pan out, but all the evicted locks are still PR locks pointing at this readdir lock somehow.&lt;/p&gt;

&lt;p&gt;If you can manage to dump a sysrq-t trace from a node that&apos;s trying to reconnect and gets refused with &apos;refused reconnection, still busy with X active RPCs&apos; while this is ongoing, I would be interested in seeing that.&lt;/p&gt;</comment>
                            <comment id="78263" author="green" created="Mon, 3 Mar 2014 19:53:08 +0000"  >&lt;p&gt;I guess another bit of useful info would be if you can cross reference jobs run on the nodes that enter this stuck state and could not reconnect due to BUSY on MDT.&lt;br/&gt;
Is it by any chance the same sort of job?&lt;/p&gt;</comment>
                            <comment id="79081" author="ezell" created="Wed, 12 Mar 2014 01:14:15 +0000"  >&lt;p&gt;We had two &quot;still busy&quot; messages since we last patched, but there were special circumstances:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar  8 09:47:54 atlas-mds1 kernel: [159420.993577] Lustre: atlas1-MDT0000: Client b4ea3026-13aa-b280-3f2e-94a4bab53a72 (at 95@gni100) refused reconnection, still busy with 1 active RPCs
Mar  8 09:47:55 atlas-mds1 kernel: [159421.712317] Lustre: atlas1-MDT0000: Client 9fc952d9-a299-68de-a80e-21913bc6e90d (at 9311@gni100) refused reconnection, still busy with 1 active RPCs
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Around the same time, a Gemini network error on our Cray caused the network to quiesce and re-route.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2014-03-08T09:47:27.602931-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) @@@ Request x1459682730665633 sent from MGC10.36.226.77@o2ib200 to NID 10.36.226.77@o2ib200 17s ago has timed out (17s prior to deadline).
2014-03-08T09:47:27.602945-05:00 c24-0c0s0n3  req@ffff8803ff9e3000 x1459682730665633/t0 o400-&amp;gt;MGS@MGC10.36.226.77@o2ib200_0:26/25 lens 192/384 e 0 to 1 dl 1394290046 ref 2 fl Rpc:N/0/0 rc 0/0
2014-03-08T09:47:27.602952-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
2014-03-08T09:47:27.602961-05:00 c24-0c0s0n3 LustreError: 166-1: MGC10.36.226.77@o2ib200: Connection to service MGS via nid 10.36.226.77@o2ib200 was lost; in progress operations using this service will fail.
2014-03-08T09:47:27.602970-05:00 c24-0c0s0n3 Lustre: linkfarm-MDT0000-mdc-ffff880407978c00: Connection to service linkfarm-MDT0000 via nid 10.36.226.83@o2ib200 was lost; in progress operations using this service will wait for recovery to complete.
2014-03-08T09:47:33.651947-05:00 c24-0c0s0n3 LNet: Waiting for hardware quiesce flag to clear
2014-03-08T09:47:57.909338-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.169@o2ib218 portal 4 match 1459682730665920 offset 192 length 192: 2
2014-03-08T09:47:57.909351-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1 previous similar message
2014-03-08T09:47:57.909358-05:00 c24-0c0s0n3 Lustre: atlas2-MDT0000-mdc-ffff8802ad97c400: Connection restored to service atlas2-MDT0000 using nid 10.36.226.74@o2ib200.
2014-03-08T09:47:57.909366-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.245@o2ib219 portal 4 match 1459682730666861 offset 192 length 192: 2
2014-03-08T09:47:57.909425-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.42@o2ib203 portal 4 match 1459682730665937 offset 192 length 192: 2
2014-03-08T09:47:57.909432-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1516 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I think the clients sent requests, but the responses didn&apos;t get back to them before the network timed them out.  I&apos;m not sure why it timed out; I thought gnilnd would extend timeouts during a quiesce situation.  When the client timed out the request, it unlinked the MD, so the responses from the MDS were dropped.  The MDS didn&apos;t get a NACK, so it thinks it&apos;s still waiting on that request to be completed.&lt;/p&gt;

&lt;p&gt;This is probably a special situation, and likely cannot explain all the occurrences we&apos;ve seen of this previously.&lt;/p&gt;</comment>
                            <comment id="79839" author="green" created="Thu, 20 Mar 2014 12:54:45 +0000"  >&lt;p&gt;I think you are right and this instance is different from before due to network issues.&lt;/p&gt;</comment>
                            <comment id="88049" author="jamesanunez" created="Wed, 2 Jul 2014 22:15:35 +0000"  >&lt;p&gt;Jason, &lt;/p&gt;

&lt;p&gt;The two patches referenced in this ticket have landed: &lt;a href=&quot;http://review.whamcloud.com/#/c/10601/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10601/&lt;/a&gt; and &lt;a href=&quot;http://review.whamcloud.com/#/c/6339&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/6339&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Are you still experiencing this problem? If not, should we close this ticket?&lt;/p&gt;</comment>
                            <comment id="88267" author="hilljjornl" created="Mon, 7 Jul 2014 14:49:35 +0000"  >&lt;p&gt;James:&lt;/p&gt;

&lt;p&gt;yes, let&apos;s go ahead and resolve this issue.&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;br/&gt;
-Jason&lt;/p&gt;</comment>
                            <comment id="88271" author="simmonsja" created="Mon, 7 Jul 2014 14:58:18 +0000"  >&lt;p&gt;As a note, I back-ported both of the above patches to 2.5. They are at:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/10962&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10962&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/10997&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10997&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="88660" author="jamesanunez" created="Wed, 9 Jul 2014 22:25:24 +0000"  >&lt;p&gt;We will track the b2_5 patches separately.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="22982">LU-4579</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="14187" name="atlas-mds1-sysrq-t.tgz" size="1300562" author="hilljjornl" created="Fri, 28 Feb 2014 13:55:12 +0000"/>
                            <attachment id="14140" name="ldlm_lockd.c.lustrekernel.txt" size="70201" author="hilljjornl" created="Wed, 19 Feb 2014 18:12:51 +0000"/>
                            <attachment id="14142" name="rhea-lustre-logs-LU4570.tgz" size="104332" author="hilljjornl" created="Wed, 19 Feb 2014 18:14:17 +0000"/>
                            <attachment id="14045" name="rpctrace-mds1.bz2" size="229" author="hilljjornl" created="Sat, 1 Feb 2014 05:26:12 +0000"/>
                            <attachment id="14141" name="titan-lustre-logs-LU4570.tgz" size="1706288" author="hilljjornl" created="Wed, 19 Feb 2014 18:14:06 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwe2n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12490</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>