<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:43:33 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4531] frequent evictions and timeouts on routed lnet</title>
                <link>https://jira.whamcloud.com/browse/LU-4531</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;IU&apos;s dc2 filesystem is primarily served out via IB, but there is also a tcp connected cluster routed through 4 LNET routers (dc2xfer&lt;span class=&quot;error&quot;&gt;&amp;#91;01-04&amp;#93;&lt;/span&gt;). &lt;/p&gt;

&lt;p&gt;We&apos;ve been getting a lot of client evictions recently and aren&apos;t sure the best way to troubleshoot them. &lt;/p&gt;

&lt;p&gt;Here&apos;s an example of one of the evictions that happened on Jan 19:&lt;br/&gt;
Client &lt;span class=&quot;error&quot;&gt;&amp;#91;149.165.226.203@tcp&amp;#93;&lt;/span&gt; logs from Jan 19:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 19 03:33:40 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection to service dc2-OST0038 via nid 10.10.0.6@o2ib was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete.
Jan 19 03:33:40 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Jan 19 03:33:43 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0038; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail.
Jan 19 03:33:43 c3 kernel: Lustre: Server dc2-OST0038_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0038-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887793d31200 (12440460/0/0/0) (rc: 2)
Jan 19 03:33:43 c3 kernel: LustreError: 38780:0:(llite_mmap.c:210:ll_tree_unlock()) couldn&apos;t unlock -5
Jan 19 03:33:43 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection restored to service dc2-OST0038 using nid 10.10.0.6@o2ib.
Jan 19 03:33:43 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Jan 19 04:12:18 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection to service dc2-OST0054 via nid 10.10.0.9@o2ib was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete.
Jan 19 04:12:21 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Jan 19 04:12:21 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0054; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail.
Jan 19 04:12:21 c3 kernel: Lustre: Server dc2-OST0054_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0054-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887fd34beb40 (12159828/0/0/0) (rc: 2)
Jan 19 04:12:21 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection restored to service dc2-OST0054 using nid 10.10.0.9@o2ib.
Jan 19 04:12:21 c3 kernel: LustreError: 46324:0:(llite_mmap.c:210:ll_tree_unlock()) couldn&apos;t unlock -5
Jan 19 04:12:22 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Jan 19 08:30:58 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786163853766 sent from dc2-OST0095-osc-ffff8860122a1000 to NID 10.10.0.15@o2ib 22s ago has timed out (22s prior to deadline).
Jan 19 08:30:58 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection to service dc2-OST0095 via nid 10.10.0.15@o2ib was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete.
Jan 19 08:31:00 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection restored to service dc2-OST0095 using nid 10.10.0.15@o2ib.
Jan 19 08:31:00 c3 kernel: Lustre: Server dc2-OST0095_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 15:10:06 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984600 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 22s ago has timed out (22s prior to deadline).
Jan 19 15:10:06 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection to service dc2-OST003f via nid 10.10.0.7@o2ib was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete.
Jan 19 15:10:07 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984602 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 24s ago has timed out (22s prior to deadline).
Jan 19 15:10:09 c3 kernel: LustreError: 11-0: an error occurred &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; communicating with 10.10.0.7@o2ib. The ost_connect operation failed with -16
Jan 19 15:10:09 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984604 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 26s ago has timed out (22s prior to deadline).
Jan 19 15:10:11 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984607 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 28s ago has timed out (22s prior to deadline).
Jan 19 15:10:15 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection restored to service dc2-OST003f using nid 10.10.0.7@o2ib.
Jan 19 15:10:15 c3 kernel: Lustre: Server dc2-OST003f_UUID version (2.1.6.0) is much newer than client version (1.8.9)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Router logs&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;dc2xfer01: Jan 19 02:19:47 dc2xfer01 kernel: : LustreError: 3529:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; GET from 12345-149.165.226.207@tcp: -113
dc2xfer01: Jan 19 02:20:46 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; peer 149.165.226.205@tcp detected; the oldest(ffff880ffee94000) timed out 3 secs ago, resid: 0, wmem: 8020736
dc2xfer01: Jan 19 02:21:12 dc2xfer01 kernel: : LustreError: 3530:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; GET from 12345-149.165.226.205@tcp: -113
dc2xfer01: Jan 19 08:02:42 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; peer 149.165.226.205@tcp detected; the oldest(ffff880fcd326000) timed out 8 secs ago, resid: 0, wmem: 7883680
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;ll attach the OSS logs. It looks like there are some Bulk IO errors, but does that indicate an issue between the OSS and the router, or the router and the client? In the past when I&apos;ve seen those errors, there hasn&apos;t been a router in between to confuse the issue. &lt;/p&gt;</description>
                <environment></environment>
        <key id="22857">LU-4531</key>
            <summary>frequent evictions and timeouts on routed lnet</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="isaac">Isaac Huang</assignee>
                                    <reporter username="manish">Manish Patel</reporter>
                        <labels>
                    </labels>
                <created>Thu, 23 Jan 2014 18:58:24 +0000</created>
                <updated>Fri, 18 Jul 2014 17:54:25 +0000</updated>
                            <resolved>Fri, 18 Jul 2014 17:54:25 +0000</resolved>
                                    <version>Lustre 2.1.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="75550" author="pjones" created="Fri, 24 Jan 2014 13:37:22 +0000"  >&lt;p&gt;Thanks for the submission&lt;/p&gt;</comment>
                            <comment id="75554" author="bfaccini" created="Fri, 24 Jan 2014 14:31:44 +0000"  >&lt;p&gt;Hello Kit,&lt;br/&gt;
Are all the impacted clients running 1.8.9?&lt;/p&gt;</comment>
                            <comment id="75685" author="kitwestneat" created="Mon, 27 Jan 2014 17:01:24 +0000"  >&lt;p&gt;They are all at 1.8.x. The majority are at 1.8.9 I think, but we have seen it on older versions as well.&lt;/p&gt;</comment>
                            <comment id="76318" author="manish" created="Wed, 5 Feb 2014 23:19:45 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Upon further investigation on the network side, we are seeing negative min values in the rtr min and tx min columns for TCP connections.&lt;/p&gt;

&lt;p&gt;Here is the output of&lt;/p&gt;

&lt;p&gt;cat /proc/sys/lnet/peers&lt;/p&gt;

&lt;p&gt;nid                      refs state  last   max   rtr   min    tx   min queue&lt;br/&gt;
10.10.10.154@o2ib            3    up    -1     8     8     8     8  -176 0&lt;br/&gt;
10.10.10.156@o2ib            1    NA    -1     8     8     8     8 -1742 0&lt;br/&gt;
10.10.10.157@o2ib            1    NA    -1     8     8     8     8 -1610 0&lt;br/&gt;
10.10.10.158@o2ib            1    NA    -1     8     8     8     8 -32916 0&lt;br/&gt;
10.10.10.162@tcp             3    up   173     8     8    -2     6  -111 1241232&lt;/p&gt;


&lt;p&gt;cat /proc/sys/lnet/peers&lt;br/&gt;
nid                      refs state  last   max   rtr   min    tx   min queue&lt;br/&gt;
172.16.32.14@tcp          1    up    25     8     8    -1     8    -7 0&lt;br/&gt;
172.16.32.15@tcp          1    up   148     8     8    -2     8  -110 0&lt;br/&gt;
172.16.32.16@tcp          4    up    73     8     8    -2     5  -119 2101464&lt;br/&gt;
172.16.32.17@tcp          1    up     5     8     8    -2     8  -110 0&lt;br/&gt;
172.16.32.18@tcp          1    up    43     8     8    -2     8  -116 0&lt;br/&gt;
172.16.32.19@tcp          3    up   173     8     8    -2     6  -111 1241232&lt;/p&gt;

&lt;p&gt;Do those negative &quot;min&quot; values mean we need to tune &quot;peer_buffer_credits&quot;?&lt;/p&gt;</comment>
                            <comment id="76781" author="manish" created="Tue, 11 Feb 2014 22:04:57 +0000"  >&lt;p&gt;Here are the router logs, in case they are required.&lt;/p&gt;</comment>
                            <comment id="77258" author="pjones" created="Tue, 18 Feb 2014 16:49:34 +0000"  >&lt;p&gt;Isaac will help with this one&lt;/p&gt;</comment>
                            <comment id="77275" author="isaac" created="Tue, 18 Feb 2014 18:18:02 +0000"  >&lt;p&gt;I&apos;ve checked the router log, and there are several errors that indicate possible configuration problems:&lt;/p&gt;

&lt;p&gt;(lib-move.c:1957:lnet_parse_get()) 10.10.0.151@o2ib: Unable to send REPLY for GET from 12345-149.165.235.17@tcp: -22&lt;br/&gt;
No route to 149.165.235.17@tcp via from 10.10.0.151@o2ib&lt;br/&gt;
(lib-move.c:2272:lnet_parse()) 149.165.235.17@tcp, src 149.165.235.17@tcp: Bad dest nid 10.10.0.151@o2ib (it&apos;s my nid but on a different network)&lt;/p&gt;

&lt;p&gt;These were caused either by bad routing configurations or by an admin doing the wrong thing, e.g. pinging the router&apos;s IB NID from a TCP client.&lt;/p&gt;

&lt;p&gt;Oct 16 13:42:42 dc2xfer03 kernel: : Lustre: Added LNI 149.165.235.153@tcp &lt;span class=&quot;error&quot;&gt;&amp;#91;8/256/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
Oct 16 13:43:39 dc2xfer03 kernel: : LustreError: 2703:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.153@tcp: Unable to send REPLY for GET from 12345-149.165.228.160@tcp: -113&lt;/p&gt;

&lt;p&gt;So LNET was started at 13:42:42, and then less than a minute later 149.165.228.160@tcp was considered dead. This often indicates that peer_timeout was configured with too small a value.&lt;/p&gt;

&lt;p&gt;I think the first troubleshooting step would be to make sure all LNET/LND options are configured properly. Please give me the module options for lnet, ksocklnd, and ko2iblnd on clients, routers, and servers. I&apos;ll double-check all the configurations.&lt;/p&gt;</comment>
                            <comment id="77280" author="manish" created="Tue, 18 Feb 2014 19:27:01 +0000"  >&lt;p&gt;Hi Isaac,&lt;/p&gt;

&lt;p&gt;Thank you. Here are the details about the LNET/LND options you requested.&lt;/p&gt;


&lt;p&gt;------------------&lt;br/&gt;
Lustre Servers&lt;br/&gt;
------------------&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet ip2nets=&quot;o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;&quot;
options lnet routes=&quot;tcp0 10.10.0.[151-154]@o2ib0; o2ib0 149.165.235.[151-154]@tcp0; gni0 10.10.0.[51-72]@o2ib0; gni1 10.10.0.[100-101]@o2ib0;&quot;
options lnet live_router_check_interval=&quot;60&quot;
options lnet dead_router_check_interval=&quot;60&quot;
options lnet check_routers_before_use=&quot;1&quot;
options lnet auto_down=&quot;1&quot;
options lnet avoid_asym_router_failure=&quot;1&quot;
options lnet router_ping_timeout=50

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;------------------&lt;br/&gt;
Lustre Clients &lt;br/&gt;
------------------&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;tcp0(eth1)&quot;
options libcfs libcfs_panic_on_lbug=1
options lnet routes=&quot;o2ib 149.165.235.[151-154]@tcp0&quot;
options lnet check_routers_before_use=1
options lnet router_ping_timeout=50
options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;--------------------------------------&lt;br/&gt;
Router Nodes &apos;dc2xfer01 to dc2xfer04&apos;&lt;br/&gt;
-------------------------------------- &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;tcp(eth7),o2ib(ib0)&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;--------------------------------------&lt;br/&gt;
Router Nodes &apos;dc2xfer05 to dc2xfer08&apos;&lt;br/&gt;
--------------------------------------&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;o2ib0(ib0), tcp0(eth7)&quot;
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;Let me know if you need any more details on it or anything is missing.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="77622" author="manish" created="Fri, 21 Feb 2014 17:45:23 +0000"  >&lt;p&gt;Hi Isaac,&lt;/p&gt;

&lt;p&gt;Thank you. Here are the details about the LNET/LND options you requested.&lt;/p&gt;


&lt;p&gt;------------------&lt;br/&gt;
Lustre Servers&lt;br/&gt;
------------------&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet ip2nets=&quot;o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;&quot;
options lnet routes=&quot;tcp0 10.10.0.[151-154]@o2ib0; o2ib0 149.165.235.[151-154]@tcp0; gni0 10.10.0.[51-72]@o2ib0; gni1 10.10.0.[100-101]@o2ib0;&quot;
options lnet live_router_check_interval=&quot;60&quot;
options lnet dead_router_check_interval=&quot;60&quot;
options lnet check_routers_before_use=&quot;1&quot;
options lnet auto_down=&quot;1&quot;
options lnet avoid_asym_router_failure=&quot;1&quot;
options lnet router_ping_timeout=50

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;------------------&lt;br/&gt;
Lustre Clients &lt;br/&gt;
------------------&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;tcp0(eth1)&quot;
options libcfs libcfs_panic_on_lbug=1
options lnet routes=&quot;o2ib 149.165.235.[151-154]@tcp0&quot;
options lnet check_routers_before_use=1
options lnet router_ping_timeout=50
options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;--------------------------------------&lt;br/&gt;
Router Nodes &apos;dc2xfer01 to dc2xfer04&apos;&lt;br/&gt;
-------------------------------------- &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;tcp(eth7),o2ib(ib0)&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;--------------------------------------&lt;br/&gt;
Router Nodes &apos;dc2xfer05 to dc2xfer08&apos;&lt;br/&gt;
--------------------------------------&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;o2ib0(ib0), tcp0(eth7)&quot;
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have a new set of router logs, which I have attached here. Also, IU reported that we should ignore the incidents that occurred on 2/4, as they are related to a configuration error made on dc2mds01 when bringing up its new public interface. To correct the error, they modified dc2mds01&apos;s lnet options to exclude eth2, which is its new public interface.&lt;/p&gt;

&lt;p&gt;From:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet ip2nets=&quot;o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;To:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet ip2nets=&quot;o2ib0 10.10.*.*;&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also, please ignore any LNet routing issues reported between 10/31/2013 and 11/1/2013, as they too are related to a configuration error made on their side.&lt;/p&gt;



&lt;p&gt;Let me know if you need any more details on it or anything is missing.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="77627" author="isaac" created="Fri, 21 Feb 2014 18:16:01 +0000"  >&lt;p&gt;Several things not quite right in the configurations:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;For avoid_asym_router_failure to work properly, routers need to know how often clients/servers are supposed to ping them. So on all routers please add:&lt;br/&gt;
  options lnet dead_router_check_interval=60&lt;br/&gt;
  options lnet live_router_check_interval=60&lt;br/&gt;
  The value for the check interval options on routers should be the smallest of the corresponding values on clients and servers, e.g. dead_router_check_interval(router) = min(dead_router_check_interval(clients), dead_router_check_interval(servers)).&lt;/li&gt;
	&lt;li&gt;The peer_timeout should ALWAYS be disabled on clients and servers. So on all clients and servers, please add:&lt;br/&gt;
  options ko2iblnd peer_timeout=0&lt;br/&gt;
  options ksocklnd peer_timeout=0&lt;/li&gt;
	&lt;li&gt;On router nodes &apos;dc2xfer05 to dc2xfer08&apos;, ko2iblnd has peer_timeout=0. Why was it disabled on these routers but not on &apos;dc2xfer01 to dc2xfer04&apos;? And why disable it only for IB but not for TCP? It doesn&apos;t make much sense to me.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="77631" author="isaac" created="Fri, 21 Feb 2014 18:34:29 +0000"  >&lt;p&gt;Lots of errors like:&lt;br/&gt;
lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.229.156@tcp: -113&lt;/p&gt;

&lt;p&gt;So routers were dropping pings from clients because they thought the clients were dead. The clients were apparently not dead, because pings had just arrived from them. Something could be wrong with the socklnd peer_timeout mechanism. To narrow it down, please disable peer timeout completely on all routers:&lt;br/&gt;
options ko2iblnd peer_timeout=0&lt;br/&gt;
options ksocklnd peer_timeout=0&lt;/p&gt;

&lt;p&gt;Also, have you ever done an LNet selftest stress test between clients and servers?&lt;/p&gt;</comment>
                            <comment id="77715" author="manish" created="Mon, 24 Feb 2014 15:04:25 +0000"  >&lt;p&gt;Hi Isaac,&lt;/p&gt;

&lt;p&gt;I have checked with IU about the nodes &apos;dc2xfer05 to dc2xfer08&apos;, and they responded that they initially intended to use them as router nodes, but they were never configured or used as routers; they are normal Lustre client nodes, which is why &quot;peer_timeout&quot; is disabled. Here are a few questions:&lt;/p&gt;

&lt;p&gt;1. Do you still suggest that we disable &quot;peer_timeout&quot; on router nodes too?&lt;br/&gt;
2. What is the recommended setting when clients need to mount both routed and non-routed Lustre file systems? In that scenario, what should the &quot;peer_timeout&quot; setting be?&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish &lt;/p&gt;</comment>
                            <comment id="77869" author="isaac" created="Tue, 25 Feb 2014 22:31:04 +0000"  >&lt;p&gt;The only nodes where peer_timeout is supposed to work are routers, so it should be disabled everywhere except on routers. It was a common mistake to enable it on clients and servers, so some time ago we added a patch that simply ignores the peer_timeout setting and always disables it except on routers. But I can&apos;t remember which Lustre version that patch landed in. So it&apos;s a good idea to always explicitly disable peer_timeout on clients and servers.&lt;/p&gt;

&lt;p&gt;The peer_timeout mechanism works on routers for certain error scenarios, and normally it should always be ON on routers. Here I&apos;m suggesting disabling it on routers because I suspect there&apos;s a bug in the peer_timeout mechanism that could cause routers to treat good clients/servers as dead and hence drop messages unnecessarily. We&apos;d be able to narrow it down by disabling peer_timeout on routers. This is for troubleshooting only.&lt;/p&gt;

&lt;p&gt;Whether it&apos;s routed or not, peer_timeout should be always off on both clients and servers.&lt;/p&gt;</comment>
                            <comment id="84567" author="morrone" created="Wed, 21 May 2014 00:45:02 +0000"  >&lt;p&gt;peer_timeout was enabled by default in the socklnd and o2iblnd back in 2009.  If that should not have been set, that was a bug in lustre/lnet, not a &quot;common mistake&quot; in configuration.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;so some time ago we added a patch that&apos;d simply ignore peer_timeout setting and always disable it except on routers. But I can&apos;t remember what Lustre version that patch was landed on.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Could you please track that down?  It would be very helpful to note which commit addressed this.&lt;/p&gt;</comment>
                            <comment id="84618" author="morrone" created="Wed, 21 May 2014 17:29:17 +0000"  >&lt;p&gt;I did some legwork and it looks to me like the peer_timeouts-should-only-be-used-on-router-nodes bug was addressed in the following commit:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;commit 00243fccc1977e4dee8041f4c0f9854373598dc2
Author:     Isaac Huang &amp;lt;he.huang@intel.com&amp;gt;
AuthorDate: Tue Mar 19 13:20:53 2013 -0600
Commit:     Oleg Drokin &amp;lt;oleg.drokin@intel.com&amp;gt;
CommitDate: Fri Apr 12 21:31:56 2013 -0400

    LU-2133 lnet: wrong peer state reported
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The commit description should have made this change far clearer. The state of the code surrounding peer_timeout is pretty subtle and not adequately documented either. We have some technical debt here.&lt;/p&gt;

&lt;p&gt;But it looks to me like Lustre 2.4 and later do not require the work-around of manually setting peer_timeout=0 on non-router nodes.  Do you agree?&lt;/p&gt;</comment>
                            <comment id="85049" author="hornc" created="Wed, 28 May 2014 18:20:32 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Isaac Huang added a comment - 21/Feb/14 6:16 PM&lt;br/&gt;
The value for the check interval options on routers should be the smallest of the corresponding values on clients and servers, e.g. dead_router_check_interval(router) = min(dead_router_check_interval(clients), dead_router_check_interval(servers)).&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Hi Isaac,&lt;/p&gt;

&lt;p&gt;This advice for the dead/live router check interval is not consistent with what is written in the Lustre OPs manual. The OPs manual states that the maximum corresponding value on clients and servers should be used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The following router checker parameters must be set to the maximum value of the corresponding setting for this option on any client or server:&lt;/p&gt;

&lt;p&gt;dead_router_check_interval&lt;/p&gt;

&lt;p&gt;live_router_check_interval&lt;/p&gt;

&lt;p&gt;router_ping_timeout&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Can you please clarify which is correct? I suspect your version is correct, in which case we should open a ticket to fix the OPs manual.&lt;/p&gt;

&lt;p&gt;Note, also, that the OPs manual mentions setting router_ping_timeout on the router, but you didn&apos;t mention it in your earlier comment. Is it necessary to set the router_ping_timeout on a router?&lt;/p&gt;</comment>
                            <comment id="86478" author="morrone" created="Thu, 12 Jun 2014 20:42:40 +0000"  >&lt;blockquote&gt;&lt;p&gt;But it looks to me like Lustre 2.4 and later do not require the work-around of manually setting peer_timeout=0 on non-router nodes. Do you agree?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;We could use an answer to this question.&lt;/p&gt;</comment>
                            <comment id="86624" author="isaac" created="Fri, 13 Jun 2014 21:56:31 +0000"  >&lt;p&gt;The bug was &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-630&quot; title=&quot;mount failure after MGS connection lost and file system is unmounted&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-630&quot;&gt;&lt;del&gt;LU-630&lt;/del&gt;&lt;/a&gt;, fixed by &lt;a href=&quot;http://review.whamcloud.com/#/c/2646/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/2646/&lt;/a&gt;, which was then back ported to b2_1 &lt;a href=&quot;http://review.whamcloud.com/#/c/1797/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/1797/&lt;/a&gt; and landed for 2.1.2 and 2.3 and later.&lt;/p&gt;</comment>
                            <comment id="86625" author="isaac" created="Fri, 13 Jun 2014 21:58:20 +0000"  >&lt;p&gt;The OPs manual was correct; my previous comment was wrong.&lt;/p&gt;</comment>
                            <comment id="89513" author="pjones" created="Fri, 18 Jul 2014 17:54:25 +0000"  >&lt;p&gt;As per DDN ok to close&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="14017" name="2014-01-23-SR29545-jan19.errs" size="517460" author="kitwestneat" created="Thu, 23 Jan 2014 22:03:13 +0000"/>
                            <attachment id="14092" name="dc2_lnet_router.log.gz" size="84942" author="manish" created="Tue, 11 Feb 2014 22:04:57 +0000"/>
                            <attachment id="14145" name="lnet_router_logs_02_21_2014.gz" size="86303" author="manish" created="Fri, 21 Feb 2014 17:43:23 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwdhj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12392</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>