<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:49:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12096] ldlm_run_ast_work call traces and network errors on overloaded OSS</title>
                <link>https://jira.whamcloud.com/browse/LU-12096</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Our OSS servers running 2.12.0 on Fir had been running fine until this morning. We are now seeing network errors and call traces, and all servers seem overloaded. The filesystem is still responsive. I wanted to share the following logs with you just in case you see anything wrong. This really looks like a network issue, but we spent some time investigating and didn&apos;t find any problems on our different IB fabrics; however, lnet shows dropped packets.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 21 09:44:43 fir-io1-s1 kernel: LNet: Skipped 2 previous similar messages
Mar 21 09:44:43 fir-io1-s1 kernel: Pid: 96368, comm: ll_ost01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 21 09:44:44 fir-io1-s1 kernel: Call Trace:
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0dcd890&amp;gt;] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0d8b185&amp;gt;] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0dac86b&amp;gt;] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc166b10b&amp;gt;] ofd_intent_policy+0x69b/0x920 [ofd]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0d8bec6&amp;gt;] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0db48a7&amp;gt;] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0e3b302&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0e4235a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0de692b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffc0dea25c&amp;gt;] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffff850c1c31&amp;gt;] kthread+0xd1/0xe0
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffff85774c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
Mar 21 09:44:44 fir-io1-s1 kernel:  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
Mar 21 09:44:44 fir-io1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1553186684.96368
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-io1-s1 ~]# lnetctl stats show
statistics:
    msgs_alloc: 0
    msgs_max: 16371
    rst_alloc: 283239
    errors: 0
    send_count: 2387805172
    resend_count: 0
    response_timeout_count: 203807
    local_interrupt_count: 0
    local_dropped_count: 33
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 961
    local_error_count: 13
    remote_dropped_count: 3
    remote_error_count: 0
    remote_timeout_count: 12
    network_timeout_count: 0
    recv_count: 2387455644
    route_count: 0
    drop_count: 2971
    send_length: 871207195166809
    recv_length: 477340920770381
    route_length: 0
    drop_length: 1291240
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
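
To put the counters above in perspective (not part of the original report), here is a minimal sketch in plain Python, using values copied from the lnetctl output above, that computes the drop and timeout ratios:

```python
# Values copied from the `lnetctl stats show` output above.
stats = {
    "send_count": 2387805172,
    "recv_count": 2387455644,
    "drop_count": 2971,
    "local_timeout_count": 961,
    "response_timeout_count": 203807,
}

# Fraction of received messages that were dropped, and fraction of
# sent messages that hit a local or response timeout.
drop_ratio = stats["drop_count"] / stats["recv_count"]
timeout_ratio = (stats["local_timeout_count"]
                 + stats["response_timeout_count"]) / stats["send_count"]

print(f"drop ratio:    {drop_ratio:.2e}")
print(f"timeout ratio: {timeout_ratio:.2e}")
```

Even tiny ratios like these can matter, since each dropped or timed-out message can stall an RPC and trigger the reconnect cycles seen in the logs.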

&lt;p&gt;I was able to dump the kernel tasks using sysrq, attaching that as fir-io1-s1-sysrq-t.log&lt;br/&gt;
Also attaching full kernel logs as fir-io1-s1-kern.log&lt;/p&gt;

&lt;p&gt;We use DNE, PFL and DOM. The OST backend is ldiskfs on mdraid.&lt;/p&gt;

&lt;p&gt;Meanwhile, we&apos;ll keep investigating a possible network issue.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;br/&gt;
Stephane&lt;/p&gt;</description>
                <environment>Clients: 2.12.0+&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11964&quot; title=&quot;Heavy load and soft lockups on MDS with DOM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11964&quot;&gt;&lt;strike&gt;LU-11964&lt;/strike&gt;&lt;/a&gt;, Servers: 2.12.0+&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;strike&gt;LU-12037&lt;/strike&gt;&lt;/a&gt; (3.10.0-957.1.3.el7_lustre.x86_64), CentOS 7.6</environment>
        <key id="55216">LU-12096</key>
            <summary>ldlm_run_ast_work call traces and network errors on overloaded OSS</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Thu, 21 Mar 2019 18:07:22 +0000</created>
                <updated>Fri, 22 Mar 2019 17:35:50 +0000</updated>
                                            <version>Lustre 2.12.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="244442" author="sthiell" created="Thu, 21 Mar 2019 18:09:10 +0000"  >&lt;p&gt;Additional info: all OSSs are running stock 2.12.0; only the MDS servers have been patched with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;del&gt;LU-12037&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="244443" author="pfarrell" created="Thu, 21 Mar 2019 18:15:04 +0000"  >&lt;p&gt;Can you try to dump the ldlm namespace (instructions in previous tickets) on an affected OSS?&lt;/p&gt;</comment>
                            <comment id="244444" author="pfarrell" created="Thu, 21 Mar 2019 18:17:17 +0000"  >&lt;p&gt;Also, there are a lot of evictions here &#8211; some of them are for lock callback timeouts, but a lot are just clients showing up as unreachable.&#160; Any odd reboots or similar?&lt;/p&gt;</comment>
                            <comment id="244445" author="pfarrell" created="Thu, 21 Mar 2019 18:23:18 +0000"  >&lt;p&gt;Ah, nevermind, the evictions are largely from previous days.&#160; Didn&apos;t notice how far back the kernel log goes...&lt;/p&gt;

&lt;p&gt;Looking at today.&lt;/p&gt;

&lt;p&gt;I am seeing some network errors in your logs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 21 09:41:30 fir-io1-s1 kernel: LNetError: 91376:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Mar 21 09:41:30 fir-io1-s1 kernel: LNetError: 91376:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.10.204@o2ib7 (0): c: 0, oc: 0, rc: 6
Mar 21 09:41:30 fir-io1-s1 kernel: LustreError: 91376:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff985178673600
Mar 21 09:41:30 fir-io1-s1 kernel: LustreError: 91376:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff985005f54800
Mar 21 09:41:36 fir-io1-s1 kernel: Lustre: fir-OST0000: Client 16c53570-4c33-b669-c4e4-d9850974da88 (at 10.8.21.11@o2ib6) reconnecting &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Have you seen errors on other OSSes, or are they limited to this one?&#160; It had those errors right around when things started, at ~9:41 this morning (in the kern log).&lt;/p&gt;

&lt;p&gt;That&apos;s followed shortly after by the stack traces you highlighted, and lots of delayed/lost messages (ptlrpc errors)...&lt;/p&gt;

&lt;p&gt;Then things quiet down a bit, and then we see:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 21 10:15:31 fir-io1-s1 kernel: Lustre: 96903:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1553188524/real 1553188524]  req@ffff98381887cb00 x1625508947852160/t0(0) o106-&amp;gt;fir-OST0008@10.9.104.52@o2ib4:15/16 lens 296/280 e 0 to 1 dl 1553188531 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Mar 21 10:15:31 fir-io1-s1 kernel: Lustre: 96903:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1910 previous similar messages&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;A little later, we suddenly spike and lose literally thousands of messages, then we have those glimpse callback timeouts again.&lt;/p&gt;</comment>
                            <comment id="244446" author="pfarrell" created="Thu, 21 Mar 2019 18:24:28 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ashehata&quot; class=&quot;user-hover&quot; rel=&quot;ashehata&quot;&gt;ashehata&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;Do these messages look like the LNet/o2ib bugs we&apos;ve been chasing recently?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 21 09:41:30 fir-io1-s1 kernel: LNetError: 91376:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Mar 21 09:41:30 fir-io1-s1 kernel: LNetError: 91376:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.10.204@o2ib7 (0): c: 0, oc: 0, rc: 6
Mar 21 09:41:30 fir-io1-s1 kernel: LustreError: 91376:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff985178673600
Mar 21 09:41:30 fir-io1-s1 kernel: LustreError: 91376:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff985005f54800 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;They&apos;re reported right as the OSS starts spamming timeout and connection lost messages.&lt;/p&gt;

&lt;p&gt;Stephane,&lt;/p&gt;

&lt;p&gt;I asked this already, but just to be 100% clear &#8211; are the errors limited to this OSS?&lt;/p&gt;</comment>
                            <comment id="244447" author="sthiell" created="Thu, 21 Mar 2019 18:25:43 +0000"  >&lt;p&gt;Hi Patrick, thanks for checking!!&lt;br/&gt;
We have been continuously redeploying compute nodes for ~2 days now due to a new CentOS kernel release for security updates.&lt;br/&gt;
I&apos;m attaching a dlmtrace dk output with the ldlm namespace dump (hopefully) for fir-io1-s1. The other OSSs appear to be in the same state.&lt;/p&gt;</comment>
                            <comment id="244448" author="sthiell" created="Thu, 21 Mar 2019 18:29:21 +0000"  >&lt;p&gt;Also, just for the sake of completeness, we&apos;re using MLNX_OFED_LINUX-4.5-1.0.1.0 on both servers (Fir) and clients (Sherlock).&lt;/p&gt;</comment>
                            <comment id="244449" author="sthiell" created="Thu, 21 Mar 2019 18:38:02 +0000"  >&lt;p&gt;In &lt;tt&gt;LNetError: 91376:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.10.204@o2ib7&lt;/tt&gt;,&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;10.0.10.204@o2ib7&lt;/tt&gt; is an LNet router to Sherlock&lt;/p&gt;</comment>
                            <comment id="244453" author="sthiell" created="Thu, 21 Mar 2019 19:19:28 +0000"  >&lt;p&gt;Patrick, things are a bit better now, logs are relatively quiet again on all OSS. But I&apos;m told that we&apos;re still redeploying compute nodes.&lt;/p&gt;

&lt;p&gt;Yes, these logs go far back because the OSSs on 2.12 have been very stable; basically, they have been up since we started production on 2.12 (early Feb).&lt;/p&gt;

&lt;p&gt;Now the thing is that we&apos;re also seeing &quot;timeout on bulk READ&quot; errors on our old scratch, Regal, running Lustre 2.8. It&apos;s a system that does 99% reads (quotas have been reduced on this system, which will be decommissioned this year). It&apos;s a bit weird that similar errors appear on different systems (Regal and Fir) at the same time, but both are mounted on Sherlock, which runs 2.12 clients. I&apos;ll keep investigating this afternoon. But if there is an LNet patch that could help on 2.12.0, could you please let me know? Thanks!&lt;/p&gt;</comment>
                            <comment id="244458" author="pfarrell" created="Thu, 21 Mar 2019 21:01:28 +0000"  >&lt;p&gt;I&apos;ll let Amir weigh in on possible issues and patches.&#160; There may be something, but Amir is much better qualified to evaluate if it&apos;s a possible match.&lt;/p&gt;

&lt;p&gt;Looking at your lock tables, there&apos;s nothing unusual currently, though it could be that it was cleaned up by the evictions.&lt;/p&gt;

&lt;p&gt;The fact that you&apos;re just now also seeing timeouts on your 2.8 system is interesting.&#160; I&apos;m still not enough of an o2ib expert to do anything with the knowledge, but it might be relevant.&lt;/p&gt;</comment>
                            <comment id="244461" author="ashehata" created="Thu, 21 Mar 2019 21:37:11 +0000"  >&lt;p&gt;We have a couple of in-flight patches that are supposed to address RDMA timeouts:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/34396&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34396&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/34474/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34474/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Details in:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-11931&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12065&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-12065&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;respectively.&lt;/p&gt;</comment>
                            <comment id="244462" author="sthiell" created="Thu, 21 Mar 2019 21:50:18 +0000"  >&lt;p&gt;Thanks Amir!! I guess we&apos;re affected by both issues at the moment.&lt;/p&gt;

&lt;p&gt;Looks like this ticket could be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12065&quot; title=&quot;Client got evicted when  lock callback timer expired  on OSS &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12065&quot;&gt;&lt;del&gt;LU-12065&lt;/del&gt;&lt;/a&gt; then (our errors are mostly on READ rather than WRITE, though, but I think that&apos;s because of our overall read-oriented workload):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-hn01 sthiell.root]# clush -w@oss &apos;journalctl -n 10000 | grep &quot;network error&quot;&apos;
fir-io1-s1: Mar 21 10:30:45 fir-io1-s1 kernel: LustreError: 96626:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff9838134df050 x1628556138350512/t0(0) o3-&amp;gt;a8709ec9-480f-5069-1ed2-459
fir-io4-s2: Mar 21 09:41:10 fir-io4-s2 kernel: LustreError: 11649:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff9f624be54450 x1628592064536240/t0(0) o3-&amp;gt;f9709d24-98d8-0cac-f2ff-a99
fir-io3-s2: Mar 21 09:42:53 fir-io3-s2 kernel: LustreError: 89904:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff8f64b9d83050 x1628590311939072/t0(0) o3-&amp;gt;48668d7e-41a9-a375-0254-f12
fir-io3-s2: Mar 21 10:31:13 fir-io3-s2 kernel: LustreError: 61500:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff8f45aaeab050 x1628590295062864/t0(0) o3-&amp;gt;539dad34-3729-b804-7e3e-a23
fir-io2-s2: Mar 21 09:42:42 fir-io2-s2 kernel: LustreError: 41983:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff939c4e0b4450 x1628574482967280/t0(0) o3-&amp;gt;0528e199-3990-4d89-45f8-08e
fir-io2-s2: Mar 21 10:31:03 fir-io2-s2 kernel: LustreError: 47646:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff937f2c7fec50 x1628593644009328/t0(0) o3-&amp;gt;df3d4e84-b52f-56d8-5336-c34
fir-io1-s2: Mar 21 09:41:53 fir-io1-s2 kernel: LustreError: 125651:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff900db6796850 x1628587142580752/t0(0) o3-&amp;gt;fdfab198-010f-1b84-313f-f3
fir-io1-s2: Mar 21 10:30:14 fir-io1-s2 kernel: LustreError: 125778:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff901c24c78c50 x1628593670375840/t0(0) o3-&amp;gt;b10270ab-2dff-3a94-2c78-6a
fir-io1-s2: Mar 21 10:30:39 fir-io1-s2 kernel: LustreError: 105343:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff900db6796850 x1628592114787056/t0(0) o3-&amp;gt;1699114c-d7be-fde3-c1c0-12
fir-io2-s1: Mar 21 09:41:43 fir-io2-s1 kernel: LustreError: 69998:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff986ba1925050 x1628592081804192/t0(0) o3-&amp;gt;71bd1389-41d9-d380-693e-d73
fir-io4-s1: Mar 21 09:42:51 fir-io4-s1 kernel: LustreError: 31413:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffffa0facc9ccc50 x1628593652265184/t0(0) o3-&amp;gt;b10270ab-2dff-3a94-2c78-6a8
fir-io3-s1: Mar 21 09:41:33 fir-io3-s1 kernel: LustreError: 30747:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff898e09130050 x1628587142580656/t0(0) o3-&amp;gt;fdfab198-010f-1b84-313f-f38
fir-io3-s1: Mar 21 10:28:55 fir-io3-s1 kernel: Lustre: 130578:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1553189329/real 1553189335]  req@ffff896e1c4da1
fir-io3-s1: Mar 21 10:29:53 fir-io3-s1 kernel: LustreError: 24124:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk READ  req@ffff898404cd6c50 x1628556112999824/t0(0) o3-&amp;gt;e93665fa-c928-d716-64a5-338
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Re: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11931&quot; title=&quot;RDMA packets sent from client to MGS are timing out &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11931&quot;&gt;&lt;del&gt;LU-11931&lt;/del&gt;&lt;/a&gt;, we have 2.12.0 clients that constantly log the following messages (MGS is 2.8 in that case):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 21 13:50:57 sh-25-08.int kernel: LustreError: 128406:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1980) refcount nonzer
Mar 21 13:50:57 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
Mar 21 14:01:14 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
Mar 21 14:01:14 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553201774, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
Mar 21 14:01:14 sh-25-08.int kernel: LustreError: 129134:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb0840) refcount nonzer
Mar 21 14:01:14 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
Mar 21 14:11:31 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
Mar 21 14:11:31 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553202391, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
Mar 21 14:11:31 sh-25-08.int kernel: LustreError: 129843:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb0600) refcount nonzer
Mar 21 14:11:31 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
Mar 21 14:21:45 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
Mar 21 14:21:45 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553203005, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
Mar 21 14:21:45 sh-25-08.int kernel: LustreError: 130529:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1800) refcount nonzer
Mar 21 14:21:45 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
Mar 21 14:32:01 sh-25-08.int kernel: LustreError: 166-1: MGC10.210.34.201@o2ib1: Connection to MGS (at 10.210.34.201@o2ib1) was lost; in progress operations using this service will fail
Mar 21 14:32:01 sh-25-08.int kernel: LustreError: 91224:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1553203621, 300s ago), entering recovery for MGS@MGC10.210.34.201@o21
Mar 21 14:32:01 sh-25-08.int kernel: LustreError: 131236:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb0f00) refcount nonzer
Mar 21 14:32:01 sh-25-08.int kernel: Lustre: MGC10.210.34.201@o2ib1: Connection restored to MGC10.210.34.201@o2ib1_0 (at 10.210.34.201@o2ib1)
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="244463" author="sthiell" created="Thu, 21 Mar 2019 22:00:28 +0000"  >&lt;p&gt;Hey, I just noticed that Andreas landed&#160;&lt;a href=&quot;https://review.whamcloud.com/34474/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34474/&lt;/a&gt;&#160;on b2_12, so I guess I can add it to our 2.12 clients to try to mitigate this issue. Let me know if you think it&apos;s a bad idea.&lt;/p&gt;

&lt;p&gt;The issue with the MGS is not as critical as it doesn&apos;t seem to have an impact on production.&lt;/p&gt;</comment>
                            <comment id="244489" author="sthiell" created="Fri, 22 Mar 2019 05:29:19 +0000"  >&lt;p&gt;We have started an emergency rolling update of our 2.12 lustre clients on Sherlock with the patch &lt;a href=&quot;https://review.whamcloud.com/34474/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34474/&lt;/a&gt;&#160; &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12065&quot; title=&quot;Client got evicted when  lock callback timer expired  on OSS &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12065&quot;&gt;&lt;del&gt;LU-12065&lt;/del&gt;&lt;/a&gt; lnd: increase CQ entries&quot;. I hope this will fix the bulk read timeout that we see on both 2.8 and 2.12 servers.&lt;/p&gt;</comment>
                            <comment id="244490" author="ashehata" created="Fri, 22 Mar 2019 05:32:39 +0000"  >&lt;p&gt;The RDMA timeouts are error level output. Regarding:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
 Mar 21 13:50:57 sh-25-08.&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; kernel: LustreError: 128406:0:(ldlm_resource.c:1146:ldlm_resource_complain()) MGC10.210.34.201@o2ib1: namespace resource [0x6c61676572:0x2:0x0].0x0 (ffff8b03bceb1980) refcount nonzer
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If you don&apos;t see RDMA timeouts, this could be an unrelated problem.&lt;/p&gt;

&lt;p&gt;Either way, I think it&apos;ll be good to try out both of those patches; the CQ entries one has landed, and I think the other one makes sense to apply as well.&lt;/p&gt;</comment>
                            <comment id="244491" author="sthiell" created="Fri, 22 Mar 2019 05:36:54 +0000"  >&lt;p&gt;Hi Amir &#8211; you&apos;re absolutely right about the MGS; this is very likely another issue after all. We restarted this MGS today and I think things are better now. It&apos;s on our old 2.8 systems anyway, so we don&apos;t want to spend too much time on that right now.&lt;/p&gt;

&lt;p&gt;But ok for the recommendation re: the second patch, thanks much!&lt;/p&gt;</comment>
                            <comment id="244543" author="sthiell" created="Fri, 22 Mar 2019 17:32:12 +0000"  >&lt;p&gt;Are CQ entries also used on LNet routers? I assume they are? All of our routers (FDR/EDR and EDR/EDR) are running 2.12.0; maybe I need to update those too.&lt;/p&gt;</comment>
                            <comment id="244544" author="pfarrell" created="Fri, 22 Mar 2019 17:34:10 +0000"  >&lt;p&gt;Yes, absolutely.&#160; Any Lustre node using an IB connection.&lt;/p&gt;</comment>
                            <comment id="244545" author="sthiell" created="Fri, 22 Mar 2019 17:35:50 +0000"  >&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="32286" name="fir-io1-s1-kern.log" size="2488293" author="sthiell" created="Thu, 21 Mar 2019 18:06:52 +0000"/>
                            <attachment id="32285" name="fir-io1-s1-sysrq-t.log" size="4072004" author="sthiell" created="Thu, 21 Mar 2019 18:07:11 +0000"/>
                            <attachment id="32287" name="fir-io1-s1.dk-dlmtrace.log.gz" size="4346346" author="sthiell" created="Thu, 21 Mar 2019 18:25:56 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00don:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>