<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:50:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12215] OSS network errors 2.12</title>
                <link>https://jira.whamcloud.com/browse/LU-12215</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I recently commented in several tickets regarding OSS issues. I think this is some kind of deadlock like we had for Oak in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12162&quot; title=&quot;Major issue on OSS after upgrading Oak to 2.10.7&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12162&quot;&gt;LU-12162&lt;/a&gt; or &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12018&quot; title=&quot;deadlock on OSS: quota reintegration vs memory release&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12018&quot;&gt;&lt;del&gt;LU-12018&lt;/del&gt;&lt;/a&gt;. But we run 2.12.0 with the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12018&quot; title=&quot;deadlock on OSS: quota reintegration vs memory release&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12018&quot;&gt;&lt;del&gt;LU-12018&lt;/del&gt;&lt;/a&gt; (quota: do not start a thread under memory pressure), so I don&apos;t think this is the same issue. I tried a SysRq+t and I see a high number of threads blocked in ldiskfs. Because the dump took a SUPER long time and we couldn&apos;t wait any longer, I tried a crash dump, but it failed. So I have logs and a partial sysrq-t.&lt;/p&gt;

&lt;p&gt;What makes me think of a new deadlock is that the load keeps increasing if the server doesn&apos;t crash:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;fir-io2-s2:  19:09:23 up  8:45,  1 user,  load average: 797.66, 750.86, 519.06
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Symptoms can be either this trace like in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11644&quot; title=&quot;LNet: Service thread inactive for 300  causes client evictions &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11644&quot;&gt;LU-11644&lt;/a&gt; reported by NASA:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 22 17:12:27 fir-io2-s2 kernel: Pid: 83769, comm: ll_ost01_036 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 22 17:12:27 fir-io2-s2 kernel: Call Trace:
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc1232640&amp;gt;] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc11effe5&amp;gt;] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc121169b&amp;gt;] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc181310b&amp;gt;] ofd_intent_policy+0x69b/0x920 [ofd]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc11f0d26&amp;gt;] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc12196d7&amp;gt;] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc12a00b2&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc12a710a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc124b6db&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffc124f00c&amp;gt;] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffa90c1c31&amp;gt;] kthread+0xd1/0xe0
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffa9774c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
Apr 22 17:12:27 fir-io2-s2 kernel: [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
Apr 22 17:12:27 fir-io2-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555978347.83769
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;or these network-related messages:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 22 17:12:47 fir-io2-s2 kernel: LustreError: 81272:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90da74f1d200
Apr 22 17:12:53 fir-io2-s2 kernel: LustreError: 81276:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90fedd133c00
Apr 22 17:12:53 fir-io2-s2 kernel: LustreError: 81276:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90fedd133c00

Apr 22 18:48:53 fir-io2-s2 kernel: LustreError: 81277:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90e22fd35000
Apr 22 18:48:53 fir-io2-s2 kernel: LustreError: 38519:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff90e109b76850 x1631329926812992/t0(0) o4-&amp;gt;95e6fd6a-706d-ff18-fa02-0b0e9d53d014@10.8.19.8@o2ib6:301/0 lens 488/448 e 0 to 0 dl 1555984331 ref 1 fl Interpret:/0/0 rc 0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but clearly it&apos;s not a network issue, just a server deadlock. Again, very much like the issue on Oak from&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12162&quot; title=&quot;Major issue on OSS after upgrading Oak to 2.10.7&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12162&quot;&gt;LU-12162&lt;/a&gt;.&#160;I&apos;m not sure the list of tasks provided here is enough to troubleshoot this, but it would be great if you could take a look. This should be considered at least a Sev 2 as we have been pretty much down lately due to this. At the moment, I&apos;m trying to restart with OST quota disabled to see if that&apos;s better. Thanks much.&lt;/p&gt;

&lt;p&gt;NOTE: we tried to run with &lt;tt&gt;net&lt;/tt&gt; enabled but this doesn&apos;t seem to help. Also, we see a spike in msgs_alloc like NASA, but I think it&apos;s just due to the deadlock.&lt;/p&gt;

&lt;p&gt;Attaching kernel logs + sysrq-t (PARTIAL) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32453/32453_fir-io2-s2_kernel%2Bsysrq_PARTIAL_20190422.log&quot; title=&quot;fir-io2-s2_kernel+sysrq_PARTIAL_20190422.log attached to LU-12215&quot;&gt;fir-io2-s2_kernel+sysrq_PARTIAL_20190422.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</description>
                <environment>CentOS 7.6, Servers and clients 2.12.0+patches</environment>
        <key id="55467">LU-12215</key>
            <summary>OSS network errors 2.12</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="pfarrell">Patrick Farrell</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Tue, 23 Apr 2019 02:40:43 +0000</created>
                <updated>Sat, 13 Jul 2019 15:44:46 +0000</updated>
                            <resolved>Fri, 12 Jul 2019 21:12:01 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="246189" author="sthiell" created="Tue, 23 Apr 2019 03:03:33 +0000"  >&lt;p&gt;Sorry, I forgot to say an important thing regarding another symptom: when this happens, after some time, it is sometimes followed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12120&quot; title=&quot;LustreError: 15069:0:(tgt_grant.c:561:tgt_grant_incoming()) LBUG &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12120&quot;&gt;&lt;del&gt;LU-12120&lt;/del&gt;&lt;/a&gt; (LBUG tgt_grant_incoming). I have several crash dumps of these, so maybe the answer is in the crashdumps, not sure.&lt;/p&gt;</comment>
                            <comment id="246199" author="sthiell" created="Tue, 23 Apr 2019 13:07:16 +0000"  >&lt;p&gt;The issue seems te be resolved now after we decided to mount all targets in &lt;tt&gt;abort_recov&lt;/tt&gt;..., we probably lost most of the jobs on the cluster. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; But the filesystem has run all night. The only remaining suspicious messages I can see now are these:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;fir-io3-s2: Apr 23 05:32:41 fir-io3-s2 kernel: LustreError: 123049:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0019: cli 0cb811d4-9df8-f3cc-1525-eefdbc079d76 claims 3481600 GRANT, real grant 0
fir-io3-s2: Apr 23 05:32:41 fir-io3-s2 kernel: LustreError: 123049:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 87 previous similar messages
fir-io2-s2: Apr 23 05:33:03 fir-io2-s2 kernel: LustreError: 3155:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0015: cli 7ed43e69-e1c8-0e51-0bfe-f44bf39fd025 claims 36864 GRANT, real grant 0
fir-io2-s2: Apr 23 05:33:03 fir-io2-s2 kernel: LustreError: 3155:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 30 previous similar messages
fir-io2-s1: Apr 23 05:33:47 fir-io2-s1 kernel: LustreError: 2374:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0016: cli 5f0dd240-53c1-516b-7224-272e9211f8ae claims 32768 GRANT, real grant 0
fir-io2-s1: Apr 23 05:33:47 fir-io2-s1 kernel: LustreError: 2374:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 17 previous similar messages
fir-io4-s2: Apr 23 05:34:04 fir-io4-s2 kernel: LustreError: 102956:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0025: cli a0307bc6-c839-435b-9342-1c622269d753 claims 53248 GRANT, real grant 0
fir-io4-s2: Apr 23 05:34:04 fir-io4-s2 kernel: LustreError: 102956:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 124 previous similar messages
fir-io3-s1: Apr 23 05:34:33 fir-io3-s1 kernel: LustreError: 6026:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST0022: cli a7e75d11-51ad-9d9c-ce73-18ed393f55b8 claims 2203648 GRANT, real grant 0
fir-io3-s1: Apr 23 05:34:33 fir-io3-s1 kernel: LustreError: 6026:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 47 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Also saw these messages shortly after bringing the filesystem back up:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;fir-io4-s1: Apr 22 22:59:34 fir-io4-s1 kernel: LustreError: 79918:0:(tgt_grant.c:248:tgt_grant_sanity_check()) ofd_obd_disconnect: tot_granted 58720256 != fo_tot_granted 67108864
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I added another, more complete &quot;foreach bt&quot; output from a voluntary crash dump we took shortly after seeing the problem (lots of &quot;event type -5&quot; messages). Please see &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32455/32455_fir-io3-s2_foreach_bt_20190422_204054.txt&quot; title=&quot;fir-io3-s2_foreach_bt_20190422_204054.txt attached to LU-12215&quot;&gt;fir-io3-s2_foreach_bt_20190422_204054.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; with full logs in &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32454/32454_fir-io3-s2-vmcore-dmesg-20190422_204054.log&quot; title=&quot;fir-io3-s2-vmcore-dmesg-20190422_204054.log attached to LU-12215&quot;&gt;fir-io3-s2-vmcore-dmesg-20190422_204054.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;. To me, this looks very much like another instance of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12162&quot; title=&quot;Major issue on OSS after upgrading Oak to 2.10.7&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12162&quot;&gt;LU-12162&lt;/a&gt; or &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12018&quot; title=&quot;deadlock on OSS: quota reintegration vs memory release&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12018&quot;&gt;&lt;del&gt;LU-12018&lt;/del&gt;&lt;/a&gt; as I said originally. So perhaps the fix doesn&apos;t really work (since we do have it applied)? 
&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=bzzz&quot; class=&quot;user-hover&quot; rel=&quot;bzzz&quot;&gt;bzzz&lt;/a&gt; what do you think?&lt;/p&gt;

&lt;p&gt;Those backtraces in the crash dump look familiar:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 77085  TASK: ffff97fdb4fd30c0  CPU: 29  COMMAND: &quot;ll_ost_io01_001&quot;
 #0 [ffff982d2a1af748] __schedule at ffffffff8bf67747
 #1 [ffff982d2a1af7d8] schedule at ffffffff8bf67c49
 #2 [ffff982d2a1af7e8] schedule_timeout at ffffffff8bf65721
 #3 [ffff982d2a1af898] io_schedule_timeout at ffffffff8bf672ed
 #4 [ffff982d2a1af8c8] io_schedule at ffffffff8bf67388
 #5 [ffff982d2a1af8d8] bit_wait_io at ffffffff8bf65d71
 #6 [ffff982d2a1af8f0] __wait_on_bit_lock at ffffffff8bf65921
 #7 [ffff982d2a1af930] __lock_page at ffffffff8b9b5b44
 #8 [ffff982d2a1af988] __find_lock_page at ffffffff8b9b6844
 #9 [ffff982d2a1af9b0] find_or_create_page at ffffffff8b9b74b4
#10 [ffff982d2a1af9f0] osd_bufs_get at ffffffffc14d7ea7 [osd_ldiskfs]
#11 [ffff982d2a1afa48] ofd_preprw at ffffffffc1624e67 [ofd]
#12 [ffff982d2a1afae8] tgt_brw_read at ffffffffc101157b [ptlrpc]
#13 [ffff982d2a1afcc8] tgt_request_handle at ffffffffc101010a [ptlrpc]
#14 [ffff982d2a1afd50] ptlrpc_server_handle_request at ffffffffc0fb46db [ptlrpc]
#15 [ffff982d2a1afdf0] ptlrpc_main at ffffffffc0fb800c [ptlrpc]
#16 [ffff982d2a1afec8] kthread at ffffffff8b8c1c31
#17 [ffff982d2a1aff50] ret_from_fork_nospec_begin at ffffffff8bf74c24
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Before performing the global remount in &lt;tt&gt;abort_recov&lt;/tt&gt;, we tried to bring the filesystem back up many times (for several hours), but either we had these event type -5 messages + spikes in msgs_alloc etc. (= OSS deadlock for me), or a clear LBUG, like this one (looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12120&quot; title=&quot;LustreError: 15069:0:(tgt_grant.c:561:tgt_grant_incoming()) LBUG &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12120&quot;&gt;&lt;del&gt;LU-12120&lt;/del&gt;&lt;/a&gt;):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 5351.683710] Lustre: 70393:0:(ldlm_lib.c:1771:extend_recovery_timer()) fir-OST002e: extended recovery timer reaching hard limit: 900, extend: 1
[ 5351.696503] Lustre: 70393:0:(ldlm_lib.c:1771:extend_recovery_timer()) Skipped 1 previous similar message
[ 5351.706002] Lustre: 70393:0:(ldlm_lib.c:2048:target_recovery_overseer()) fir-OST002e recovery is aborted by hard timeout
[ 5351.716900] Lustre: 70393:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 5352.056860] Lustre: fir-OST002e: deleting orphan objects from 0x0:2804316 to 0x0:2804385
[ 5352.077834] Lustre: 70393:0:(ldlm_lib.c:2554:target_recovery_thread()) too long recovery - read logs
[ 5352.082207] LustreError: 84286:0:(tgt_grant.c:563:tgt_grant_incoming()) fir-OST002e: cli ecaf3ea1-3d24-ab9f-6856-0bf7294c7cf4/ffff8dca263c2400 dirty 0 pend 0 grant -18752512
[ 5352.082210] LustreError: 84286:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
[ 5352.082212] Pid: 84286, comm: ll_ost_io00_063 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[ 5352.082212] Call Trace:
[ 5352.082235]  [&amp;lt;ffffffffc0a097cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5352.082243]  [&amp;lt;ffffffffc0a0987c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5352.082304]  [&amp;lt;ffffffffc0e5b6f0&amp;gt;] tgt_grant_prepare_read+0x0/0x3b0 [ptlrpc]
[ 5352.082352]  [&amp;lt;ffffffffc0e5b7fb&amp;gt;] tgt_grant_prepare_read+0x10b/0x3b0 [ptlrpc]
[ 5352.082372]  [&amp;lt;ffffffffc1413c00&amp;gt;] ofd_preprw+0x450/0x1160 [ofd]
[ 5352.082419]  [&amp;lt;ffffffffc0e3f57b&amp;gt;] tgt_brw_read+0x9db/0x1e50 [ptlrpc]
[ 5352.082465]  [&amp;lt;ffffffffc0e3e10a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 5352.082509]  [&amp;lt;ffffffffc0de26db&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5352.082548]  [&amp;lt;ffffffffc0de600c&amp;gt;] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[ 5352.082553]  [&amp;lt;ffffffffbc8c1c31&amp;gt;] kthread+0xd1/0xe0
[ 5352.082558]  [&amp;lt;ffffffffbcf74c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
[ 5352.082581]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="246217" author="pfarrell" created="Tue, 23 Apr 2019 15:57:25 +0000"  >&lt;p&gt;Hmm, after looking over the logs, there&apos;s no evidence of deadlock - it looks like a network issue.&#160; &lt;b&gt;Possibly&lt;/b&gt; the same as reported by NASA, but the signature is a bit different.&lt;/p&gt;

&lt;p&gt;It&apos;s good to distinguish between osd_ldiskfs and ldiskfs itself, by the way.&#160; You&apos;re seeing a lot of threads stuck in osd_ldiskfs, which is a layer up from ldiskfs itself.&lt;/p&gt;

&lt;p&gt;Having dug through the traces, etc, I don&apos;t see any evidence of deadlock.&#160; I see a node that is having an enormous amount of trouble doing network communication, and suffering because of it.&lt;/p&gt;

&lt;p&gt;We see thousands of this message:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[45514.433468] LustreError: 74198:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff980a1b566600 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And these:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[45513.596412] LNetError: 74206:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
[45513.608927] LNetError: 74206:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 148 previous similar messages &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We&apos;re seeing a lot of evictions, apparently from the same cause.&#160; All of this chaos is affecting various worker threads, which over time have become stuck (this is your increasing load average, which reports waiting threads).&#160; They&apos;re all stuck in pretty &quot;normal&quot; places, waiting for I/O or communication, so if they are indeed stuck (I suspect some are truly stuck and some would make progress given time), it&apos;s because they&apos;ve become confused by repeated communication failures and evictions.&lt;/p&gt;

&lt;p&gt;At first glance, these messages are &lt;b&gt;not&lt;/b&gt; the same as those in the NASA bug.&#160; I&apos;ll look around and see what I can find, but I&apos;m also going to see if Amir can take a look.&lt;/p&gt;</comment>
                            <comment id="246218" author="pfarrell" created="Tue, 23 Apr 2019 16:12:40 +0000"  >&lt;p&gt;Amir,&lt;/p&gt;

&lt;p&gt;Are you able to take a look at this one?&#160; The main messages of interest seem to be the bulk callback error and the lnet_is_health_check errors I highlighted.&lt;/p&gt;</comment>
                            <comment id="246219" author="sthiell" created="Tue, 23 Apr 2019 16:12:45 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Thanks SO much for having a look at this so quickly! Ok, I may have been confused with the previous deadlock that we have seen. And duly noted for osd_ldiskfs/ldiskfs, thanks!&lt;/p&gt;

&lt;p&gt;I&apos;ve looked a bit further at the logs from last night, and I see one occurrence of event type 5 (multiple messages on each OSS, but nothing like before), and it happened on all OSSs at the same time. But it looks like it&apos;s just a client timing out:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-hn01 sthiell.root]# clush -w@oss journalctl -kn 10000 \| grep server_bulk_callback \| tail -1
fir-io3-s2: Apr 23 00:07:24 fir-io3-s2 kernel: LustreError: 95122:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef1f90a4e00
fir-io3-s1: Apr 23 00:07:24 fir-io3-s1 kernel: LustreError: 108580:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8969ed6f0200
fir-io4-s2: Apr 23 00:07:25 fir-io4-s2 kernel: LustreError: 74909:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9e65a0062600
fir-io2-s2: Apr 23 00:07:24 fir-io2-s2 kernel: LustreError: 100670:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9c2062801e00
fir-io4-s1: Apr 23 00:07:25 fir-io4-s1 kernel: LustreError: 76336:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b56df6bca00
fir-io1-s2: Apr 23 00:07:24 fir-io1-s2 kernel: LustreError: 108542:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff98ba981c0200
fir-io2-s1: Apr 23 00:07:24 fir-io2-s1 kernel: LustreError: 99911:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9a85ce362800
fir-io1-s1: Apr 23 00:07:24 fir-io1-s1 kernel: LustreError: 102249:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9416e9788800



Logs on one OSS:

Apr 23 00:06:20 fir-io3-s2 kernel: Lustre: fir-OST001d: Client 6ea99810-c4ef-751c-68b4-b60bb649210c (at 10.8.8.35@o2ib6) reconnecting
Apr 23 00:06:20 fir-io3-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 23 00:06:21 fir-io3-s2 kernel: Lustre: 123046:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1556003174/real 0]  req@ffff8f14d87f3c00 x1631583327882336/t0(0) o104-&amp;gt;fir-OST0023@10.8.1.26@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1556003181 ref 2 fl Rpc:X/0/ffffffff r
Apr 23 00:06:21 fir-io3-s2 kernel: Lustre: 123046:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Apr 23 00:06:22 fir-io3-s2 kernel: LNetError: 95119:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 23 00:06:22 fir-io3-s2 kernel: LustreError: 95120:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef2ceb77800
Apr 23 00:06:22 fir-io3-s2 kernel: LustreError: 95120:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef2ceb77800
Apr 23 00:06:22 fir-io3-s2 kernel: LustreError: 95120:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef2ceb77800
Apr 23 00:06:22 fir-io3-s2 kernel: LustreError: 95120:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ed50ad20600
Apr 23 00:06:22 fir-io3-s2 kernel: LustreError: 95120:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ed50ad20600
Apr 23 00:06:22 fir-io3-s2 kernel: Lustre: fir-OST001d: Bulk IO write error with f2a4d35e-05ff-fe02-d6c7-6c183d27b8a1 (at 10.8.8.29@o2ib6), client will retry: rc = -110
Apr 23 00:06:22 fir-io3-s2 kernel: LNetError: 95119:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 5 previous similar messages
Apr 23 00:06:22 fir-io3-s2 kernel: LustreError: 95119:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef2ceb77800
Apr 23 00:06:23 fir-io3-s2 kernel: Lustre: fir-OST0019: Client 057a35ee-7eb3-cd47-7142-9e6ee9c8aa59 (at 10.8.13.7@o2ib6) reconnecting
Apr 23 00:06:23 fir-io3-s2 kernel: Lustre: Skipped 4 previous similar messages
Apr 23 00:06:27 fir-io3-s2 kernel: LNetError: 95120:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 23 00:06:27 fir-io3-s2 kernel: LustreError: 95117:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef08388e800
Apr 23 00:06:27 fir-io3-s2 kernel: LustreError: 95117:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8eeaf3f75800
Apr 23 00:06:27 fir-io3-s2 kernel: LustreError: 95117:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8eeaf3f75800
Apr 23 00:06:27 fir-io3-s2 kernel: LustreError: 95117:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8eeaf3f75800
Apr 23 00:06:27 fir-io3-s2 kernel: LustreError: 95117:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8eeaf3f75800
Apr 23 00:06:27 fir-io3-s2 kernel: Lustre: fir-OST001d: Bulk IO write error with f2a4d35e-05ff-fe02-d6c7-6c183d27b8a1 (at 10.8.8.29@o2ib6), client will retry: rc = -110
Apr 23 00:06:27 fir-io3-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 23 00:06:27 fir-io3-s2 kernel: LNetError: 95120:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 5 previous similar messages
Apr 23 00:06:27 fir-io3-s2 kernel: LustreError: 95120:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff8ef08388e800
Apr 23 00:06:27 fir-io3-s2 kernel: Lustre: fir-OST001d: Client 43d6af20-5e3c-1ef3-577c-0e2086a05c21 (at 10.8.10.14@o2ib6) reconnecting
Apr 23 00:06:27 fir-io3-s2 kernel: Lustre: Skipped 2 previous similar messages
Apr 23 00:06:40 fir-io3-s2 kernel: Lustre: fir-OST0023: Client 38fc721f-2581-5cc7-2331-7b71af28244a (at 10.8.7.30@o2ib6) reconnecting
Apr 23 00:06:40 fir-io3-s2 kernel: Lustre: Skipped 4 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Also, I have looked at past OSS logs, and the event type 5 is actually present pretty often when a client times out.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Lastly, I&apos;m attaching two graphs, just because I wanted to show you the mess of last evening. The first one is the overall OSS I/O bandwidth and the second one is the rate of msgs_alloc (all servers). The small peak around midnight actually matches the event type -5 errors that I&apos;ve seen above.&lt;/p&gt;</comment>
                            <comment id="246221" author="pfarrell" created="Tue, 23 Apr 2019 16:23:18 +0000"  >&lt;p&gt;Can you help interpret the bandwidth graph?&#160; I see positive and negative numbers but I&apos;m not sure what they mean.&#160; Is one read and the other write?&lt;/p&gt;

&lt;p&gt;About the msg_allocs: While it would be good for Amir to weigh in, I think those are associated with resends due to message failures.&lt;/p&gt;

&lt;p&gt;About the -5:&lt;br/&gt;
Yes, -5 isn&apos;t that rare, but I think the -125 error code is?&#160; (It would be great to know if I&apos;m wrong about that and it is in fact common.)&lt;/p&gt;</comment>
                            <comment id="246224" author="sthiell" created="Tue, 23 Apr 2019 16:34:42 +0000"  >&lt;p&gt;Oops, sorry about that, missing legend... that&apos;s bad!&#160;&lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;. Overall read I/O from all OSS&apos;s are in blue (positive) and write in red (using negative values so it&apos;s more readable). They are taken from&#160;&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;/proc/fs/lustre/obdfilter/&lt;b&gt;-OST&lt;/b&gt;/stats&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;Curious to know more about the &lt;tt&gt;server_bulk_callback&lt;/tt&gt;&#160;errors too.&lt;/p&gt;

&lt;p&gt;And you said there was a node &quot;having an enormous amount of trouble doing network communication&quot;, are you able to tell me which one, or is this information missing from the logs?&lt;/p&gt;

&lt;p&gt;Also just to add more context, we&apos;re currently completing a client upgrade to include the patch from &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt;&#160;mdt: fix mdt_dom_discard_data() timeouts&quot;, so we&apos;re rebooting a lot of clients. Perhaps this was a contributing factor too...&lt;/p&gt;</comment>
                            <comment id="246225" author="pfarrell" created="Tue, 23 Apr 2019 16:48:01 +0000"  >&lt;p&gt;That&apos;s possible.&#160; To clarify an earlier question:&lt;br/&gt;
 Do you see the &quot;-125&quot; error code associated with those messages in earlier instances?&lt;/p&gt;

&lt;p&gt;By the way, if you are rebooting clients, is the dip in activity perhaps from the MDS-related hang on client loss that we&apos;ve discussed previously and that you recently reminded me is still an issue? (Are you still not unmounting on reboot? &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&#160;)&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Edit:&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Actually, the length of the gap mostly rules that out.&lt;/p&gt;</comment>
                            <comment id="246228" author="sthiell" created="Tue, 23 Apr 2019 16:58:14 +0000"  >&lt;p&gt;Yes, I can see many of them actually. An example is attached in&#160;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32458/32458_fir-io1-s1-previous_bulk.log&quot; title=&quot;fir-io1-s1-previous_bulk.log attached to LU-12215&quot;&gt;fir-io1-s1-previous_bulk.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;; just scroll a bit and you&apos;ll see them. We can also see a few of the same backtraces that NASA is seeing. Looking at the I/O graph, I see a small drop in I/O at the time of the event, and then it recovers.&lt;/p&gt;</comment>
                            <comment id="246230" author="sthiell" created="Tue, 23 Apr 2019 17:36:52 +0000"  >&lt;p&gt;Please note that we&apos;re also using the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12065&quot; title=&quot;Client got evicted when  lock callback timer expired  on OSS &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12065&quot;&gt;&lt;del&gt;LU-12065&lt;/del&gt;&lt;/a&gt; (lnd: increase CQ entries) everywhere now (servers, routers, clients).&#160; This one was added following some networking issues on OSS that were tracked in&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12096&quot; title=&quot;ldlm_run_ast_work call traces and network errors on overloaded OSS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12096&quot;&gt;LU-12096&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="246232" author="sthiell" created="Tue, 23 Apr 2019 18:04:28 +0000"  >&lt;p&gt;Perhaps something is wrong with our lnet config? I attached the output of &lt;tt&gt;lnetctl net show -v&lt;/tt&gt; on:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Fir servers (1 x IB EDR) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32459/32459_lnet-fir-oss.txt&quot; title=&quot;lnet-fir-oss.txt attached to LU-12215&quot;&gt;lnet-fir-oss.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;Fir(EDR)-Sherlock1(FDR) lnet routers (EDR-FDR) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32460/32460_lnet-sh-rtr-fir-1.txt&quot; title=&quot;lnet-sh-rtr-fir-1.txt attached to LU-12215&quot;&gt;lnet-sh-rtr-fir-1.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;Fir(EDR)-Sherlock2(EDR) lnet routers (EDR-EDR) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32461/32461_lnet-sh-rtr-fir-2.txt&quot; title=&quot;lnet-sh-rtr-fir-2.txt attached to LU-12215&quot;&gt;lnet-sh-rtr-fir-2.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;Sherlock1 client (1xFDR) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32462/32462_lnet-sh-1-fdr.txt&quot; title=&quot;lnet-sh-1-fdr.txt attached to LU-12215&quot;&gt;lnet-sh-1-fdr.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;Sherlock2 client (1xEDR) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/32463/32463_lnet-sh-2-edr.txt&quot; title=&quot;lnet-sh-2-edr.txt attached to LU-12215&quot;&gt;lnet-sh-2-edr.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We usually don&apos;t tune LNet much and try to stick to the default values.&lt;/p&gt;</comment>
                            <comment id="246470" author="sthiell" created="Mon, 29 Apr 2019 18:50:44 +0000"  >&lt;p&gt;I see quite a lot of these messages this morning, on the OSSs but also on one of our MDSs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
Apr 29 09:29:55 fir-md1-s2 kernel: LNetError: 105174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
Apr 29 10:01:32 fir-md1-s2 kernel: LNetError: 105171:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
Apr 29 11:35:44 fir-md1-s2 kernel: LNetError: 105174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 29 11:35:59 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 29 11:38:34 fir-md1-s2 kernel: LNetError: 105165:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
Apr 29 11:38:52 fir-md1-s2 kernel: LNetError: 105177:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 29 11:40:57 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 29 11:41:43 fir-md1-s2 kernel: LNetError: 105167:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
Apr 29 11:42:53 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-125, 0)
Apr 29 11:44:39 fir-md1-s2 kernel: LNetError: 105172:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With +net logging enabled, I see messages like the following (the NIDs referenced are our Lustre routers):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-md1-s2 tmp]# strings dk.1556563279 | grep &apos;no credits&apos; | tail
00000800:00000200:7.0:1556563272.841468:0:105176:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits
00000800:00000200:21.0:1556563272.841474:0:105168:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:29.0:1556563272.841478:0:105169:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:31.0:1556563272.841481:0:105177:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits
00000800:00000200:13.0:1556563272.841481:0:105170:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:45.0:1556563272.841484:0:105171:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:47.0:1556563272.841486:0:105178:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits
00000800:00000200:6.0:1556563272.841486:0:105173:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.211@o2ib7: no credits
00000800:00000200:21.0:1556563272.841487:0:105168:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:29.0:1556563272.841491:0:105169:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Below, o2ib7 is Fir&apos;s IB EDR fabric,&lt;br/&gt;
10.0.10.5x@o2ib7 are MDS,&lt;br/&gt;
10.0.10.1xx@o2ib7 are OSS&lt;br/&gt;
10.0.10.2xx@o2ib7 are routers&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-md1-s2 tmp]# cat /sys/kernel/debug/lnet/nis
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.0.10.52@o2ib7             up    -1   24    8    0    64    56    39
10.0.10.52@o2ib7             up    -1   22    8    0    64    56    24
10.0.10.52@o2ib7             up    -1   23    8    0    64    56    39
10.0.10.52@o2ib7             up    -1   24    8    0    64    56    40

[root@fir-md1-s2 tmp]# cat /sys/kernel/debug/lnet/peers  | grep o2ib7
10.0.10.202@o2ib7           4    up    -1     8     8     8     8 -2451 0
10.0.10.3@o2ib7             1    NA    -1     8     8     8     8     4 0
10.0.10.105@o2ib7           1    NA    -1     8     8     8     8   -39 0
10.0.10.212@o2ib7           8    up    -1     8     8     8     4 -1376 2520
10.0.10.102@o2ib7           1    NA    -1     8     8     8     8   -32 0
10.0.10.204@o2ib7           4    up    -1     8     8     8     8 -3054 0
10.0.10.107@o2ib7           1    NA    -1     8     8     8     8   -40 0
10.0.10.209@o2ib7           5    up    -1     8     8     8     7 -1386 496
10.0.10.201@o2ib7           4    up    -1     8     8     8     8 -2971 0
10.0.10.104@o2ib7           1    NA    -1     8     8     8     8   -40 0
10.0.10.211@o2ib7           8    up    -1     8     8     8     4 -1168 1520
10.0.10.101@o2ib7           1    NA    -1     8     8     8     8   -30 0
10.0.10.203@o2ib7           4    up    -1     8     8     8     8 -2430 0
10.0.10.106@o2ib7           1    NA    -1     8     8     8     8   -36 0
10.0.10.51@o2ib7            1    NA    -1     8     8     8     8  -437 0
10.0.10.103@o2ib7           1    NA    -1     8     8     8     8   -32 0
10.0.10.108@o2ib7           1    NA    -1     8     8     8     8   -33 0
10.0.10.210@o2ib7           5    up    -1     8     8     8     7 -1166 65608
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Perhaps we should increase peer_credits so that our MDS can handle more load to Sherlock&apos;s LNet routers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;    - net type: o2ib7
      local NI(s):
        - nid: 10.0.10.52@o2ib7
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 911564827
              recv_count: 916989206
              drop_count: 16536
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 0
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 2
          tcp bonding: 0
          CPT: &quot;[0,1,2,3]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But I assume that means we would have to change it on the routers too.&lt;/p&gt;</comment>
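For reference, peer_credits for the o2iblnd LND is typically raised via module options, and since credits are negotiated per connection, the change has to be applied consistently on the nodes at both ends (here, the MDS/OSS and the routers). A minimal sketch of such a fragment; the specific values below are illustrative assumptions, not recommendations from this ticket:

```shell
# /etc/modprobe.d/ko2iblnd.conf -- sketch only; values are illustrative
# peer_credits is negotiated per connection, so routers and servers
# talking to each other should carry the same setting.
options ko2iblnd peer_credits=32 peer_credits_hiw=16
```

A node reboot (or LNet module reload) is needed for the new module options to take effect.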
                            <comment id="246573" author="sthiell" created="Wed, 1 May 2019 05:44:35 +0000"  >&lt;p&gt;The lack of peer credits seems to have been a consequence of enabling LNet debugging. We did some tests with &lt;tt&gt;+net&lt;/tt&gt;&#160;at some point, but that was a bad idea I guess. Setting debug and subsystem_debug back to &lt;tt&gt;-all&lt;/tt&gt; seems to have fixed it. We&apos;re still seeing the following messages, but it&apos;s hard to tell what could be wrong:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;fir-md1-s2: Apr 30 20:23:26 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-md1-s2: Apr 30 20:23:26 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
fir-md1-s2: Apr 30 20:33:38 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-md1-s2: Apr 30 20:51:04 fir-md1-s2 kernel: LNetError: 121172:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-md1-s2: Apr 30 21:06:09 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-md1-s2: Apr 30 21:06:09 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 3 previous similar messages
fir-md1-s2: Apr 30 21:18:03 fir-md1-s2 kernel: LNetError: 121176:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-md1-s2: Apr 30 21:18:03 fir-md1-s2 kernel: LNetError: 121176:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
fir-md1-s2: Apr 30 21:47:15 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-md1-s2: Apr 30 22:25:11 fir-md1-s2 kernel: LNetError: 121175:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 19:51:58 fir-io1-s2 kernel: LNetError: 108541:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 19:52:03 fir-io1-s2 kernel: LNetError: 108539:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 20:08:25 fir-io1-s2 kernel: LNetError: 108535:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 20:09:26 fir-io1-s2 kernel: LNetError: 108545:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 20:10:03 fir-io1-s2 kernel: LNetError: 108541:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 20:13:34 fir-io1-s2 kernel: LNetError: 108535:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
fir-io1-s2: Apr 30 20:17:04 fir-io1-s2 kernel: LNetError: 108533:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (0, 5)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="251299" author="pfarrell" created="Fri, 12 Jul 2019 20:44:20 +0000"  >&lt;p&gt;Stephane,&lt;/p&gt;

&lt;p&gt;I&apos;m kind of thinking this specific bug has been overcome by events and is maybe captured elsewhere, as we&apos;ve worked through various issues recently - Are you still seeing the issues described here?&lt;/p&gt;</comment>
                            <comment id="251306" author="sthiell" created="Fri, 12 Jul 2019 21:06:25 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;No, not recently, at least not with these specific errors. The last problem we had on Fir is described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12451&quot; title=&quot;Out of router peer credits with DoM?&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12451&quot;&gt;LU-12451&lt;/a&gt;. I&apos;m not sure why we had some routers running out of rtr credits.&lt;/p&gt;</comment>
                            <comment id="251308" author="pfarrell" created="Fri, 12 Jul 2019 21:11:05 +0000"  >&lt;p&gt;OK - Rather than keep an old ticket open, we&apos;ll focus on the new ones.&#160; Thanks, Stephane.&lt;/p&gt;</comment>
                            <comment id="251310" author="pfarrell" created="Fri, 12 Jul 2019 21:12:01 +0000"  >&lt;p&gt;This particular issue has not recurred recently, and there is another bug (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12451&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-12451&lt;/a&gt;)&#160;capturing similar issues, so we will focus on that and re-open this if necessary.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="32458" name="fir-io1-s1-previous_bulk.log" size="110129" author="sthiell" created="Tue, 23 Apr 2019 16:56:32 +0000"/>
                            <attachment id="32453" name="fir-io2-s2_kernel+sysrq_PARTIAL_20190422.log" size="2835711" author="sthiell" created="Tue, 23 Apr 2019 02:37:58 +0000"/>
                            <attachment id="32454" name="fir-io3-s2-vmcore-dmesg-20190422_204054.log" size="845274" author="sthiell" created="Tue, 23 Apr 2019 06:59:40 +0000"/>
                            <attachment id="32455" name="fir-io3-s2_foreach_bt_20190422_204054.txt" size="1155136" author="sthiell" created="Tue, 23 Apr 2019 06:59:44 +0000"/>
                            <attachment id="32457" name="fir-lnet-msgs_alloc-20190423.png" size="264416" author="sthiell" created="Tue, 23 Apr 2019 16:12:53 +0000"/>
                            <attachment id="32456" name="fir-ossbw-20190423.png" size="309692" author="sthiell" created="Tue, 23 Apr 2019 16:12:51 +0000"/>
                            <attachment id="32459" name="lnet-fir-oss.txt" size="1204" author="sthiell" created="Tue, 23 Apr 2019 18:01:01 +0000"/>
                            <attachment id="32462" name="lnet-sh-1-fdr.txt" size="1188" author="sthiell" created="Tue, 23 Apr 2019 18:01:11 +0000"/>
                            <attachment id="32463" name="lnet-sh-2-edr.txt" size="1191" author="sthiell" created="Tue, 23 Apr 2019 18:01:14 +0000"/>
                            <attachment id="32460" name="lnet-sh-rtr-fir-1.txt" size="1957" author="sthiell" created="Tue, 23 Apr 2019 18:01:05 +0000"/>
                            <attachment id="32461" name="lnet-sh-rtr-fir-2.txt" size="1957" author="sthiell" created="Tue, 23 Apr 2019 18:01:08 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00f7z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>