<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:25:04 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16220] lnet recovery_interval setting</title>
                <link>https://jira.whamcloud.com/browse/LU-16220</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;If health_sensitivity=0 and the peer is&#160;offline, does recovery_interval play any role in how often the client pings the peer? &lt;br/&gt;
I ask because, when a filesystem is down, the clients trying to connect to the servers are causing path lookup storms on our InfiniBand fabric. Too many path lookups cause our subnet manager to lock up and it requires a restart. &lt;/p&gt;
</description>
                <environment></environment>
        <key id="72693">LU-16220</key>
            <summary>lnet recovery_interval setting</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="mhanafi">Mahmoud Hanafi</reporter>
                        <labels>
                    </labels>
                <created>Thu, 6 Oct 2022 16:27:47 +0000</created>
                <updated>Wed, 26 Oct 2022 20:49:07 +0000</updated>
                                            <version>Lustre 2.12.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="348963" author="pjones" created="Fri, 7 Oct 2022 03:19:26 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="348994" author="ssmirnov" created="Fri, 7 Oct 2022 15:45:12 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Peer recovery shouldn&apos;t be happening if the health feature is disabled. You should be able to verify this with&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl debug recovery --peer &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It&apos;s possible that the client&apos;s pings are driven by the higher-level Lustre keepalive mechanism, which, if I remember correctly, pings the server several times within the obd_timeout period if there&apos;s no other traffic. Another possibility is the LND trying to reconnect on its own.&lt;/p&gt;

&lt;p&gt;If you could share a net debug log from the client trying to reconnect, we could clarify what is going on in your case.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;

</comment>
                            <comment id="349570" author="mhanafi" created="Thu, 13 Oct 2022 23:34:32 +0000"  >&lt;p&gt;Got some debugging info.&lt;br/&gt;
With the servers shut down, the clients are trying to reconnect. With our new filesystem, each target has 4 IP addresses to try. They keep trying over and over.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1665694986.626629:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) recovery of nbptest4-OST0064_UUID on 10.151.27.139@o2ib failed (-110)
00000100:00080000:13.0:1665694986.627063:0:234:0:(pinger.c:247:ptlrpc_pinger_process_import()) efda8ea5-1d5f-3073-1623-a227220573a0-&amp;gt;nbptest4-OST0064_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:13.0:1665694986.627065:0:234:0:(recover.c:58:ptlrpc_initiate_recovery()) nbptest4-OST0064_UUID: starting recovery
00000100:00080000:13.0:1665694986.627065:0:234:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:13.0:1665694986.627066:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.138@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.627067:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.139@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.627068:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.141@o2ib last attempt 1897590
00000100:00080000:13.0:1665694986.627070:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.140@o2ib last attempt 1897591
00000100:00080000:13.0:1665694986.627071:0:234:0:(import.c:616:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: Connection changing to nbptest4-OST0064 (at 10.151.27.141@o2ib)
00000100:00080000:13.0:1665694986.627073:0:234:0:(import.c:624:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: import ffff990a2dd1a800 using connection 10.151.27.141@o2ib/10.151.27.141@o2ib
00000100:00100000:13.0:1665694986.627076:0:234:0:(import.c:817:ptlrpc_connect_import_locked()) @@@ (re)connect request (timeout 5) &#160;req@ffff99040756b600 x1744618326358592/t0(0) o8-&amp;gt;nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:13.0:1665694986.632012:0:3025:0:(client.c:2188:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1665694986/real 1665694986] &#160;req@ffff99040756b600 x1744618326358592/t0(0) o8-&amp;gt;nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 1 dl 1665695266 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
00000100:00100000:13.0:1665694986.632016:0:3025:0:(client.c:2217:ptlrpc_expire_one_request()) @@@ err -110, sent_state=CONNECTING (now=CONNECTING) &#160;req@ffff99040756b600 x1744618326358592/t0(0) o8-&amp;gt;nbptest4-OST0064-osc-ffff990a2dd1a000@10.151.27.141@o2ib:28/4 lens 520/544 e 0 to 1 dl 1665695266 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
00000100:00080000:13.0:1665694986.632019:0:3025:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:13.0:1665694986.632020:0:3025:0:(import.c:1422:ptlrpc_connect_interpret()) recovery of nbptest4-OST0064_UUID on 10.151.27.141@o2ib failed (-110)
00000100:00080000:13.0:1665694986.632471:0:234:0:(pinger.c:247:ptlrpc_pinger_process_import()) efda8ea5-1d5f-3073-1623-a227220573a0-&amp;gt;nbptest4-OST0064_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:13.0:1665694986.632473:0:234:0:(recover.c:58:ptlrpc_initiate_recovery()) nbptest4-OST0064_UUID: starting recovery
00000100:00080000:13.0:1665694986.632474:0:234:0:(import.c:86:import_set_state_nolock()) ffff990a2dd1a800 nbptest4-OST0064_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:13.0:1665694986.632476:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.138@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632477:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.139@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632479:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.141@o2ib last attempt 1897596
00000100:00080000:13.0:1665694986.632481:0:234:0:(import.c:544:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: connect to NID 10.151.27.140@o2ib last attempt 1897591
00000100:00080000:13.0:1665694986.632483:0:234:0:(import.c:616:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: Connection changing to nbptest4-OST0064 (at 10.151.27.140@o2ib)
00000100:00080000:13.0:1665694986.632485:0:234:0:(import.c:624:import_select_connection()) nbptest4-OST0064-osc-ffff990a2dd1a000: import ffff990a2dd1a800 using connection 10.151.27.140@o2ib/10.151.27.140@o2ib &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="349571" author="mhanafi" created="Thu, 13 Oct 2022 23:50:18 +0000"  >&lt;p&gt;Is there a way to pause/stop all client reconnect attempts?&#160;&lt;/p&gt;</comment>
                            <comment id="350057" author="ssmirnov" created="Tue, 18 Oct 2022 21:15:17 +0000"  >&lt;p&gt;Checked with &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=adilger&quot; class=&quot;user-hover&quot; rel=&quot;adilger&quot;&gt;adilger&lt;/a&gt;&#160;about this. Here&apos;s the summary.&lt;/p&gt;

&lt;p&gt;If the filesystem stays mounted on the clients:&lt;/p&gt;

&lt;p&gt;For OST connections, setting &quot;osc.*.idle_timeout&quot; before the servers go down should prevent reconnect attempts.&lt;/p&gt;

&lt;p&gt;Setting &quot;lctl set_param timeout=3600&quot; will reduce the ping frequency to once every 15 min (the ping interval is a quarter of the timeout). A larger value can be used if needed.&lt;/p&gt;

&lt;p&gt;To avoid client evictions, both settings need to be restored before the servers come back up.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="350890" author="mhanafi" created="Wed, 26 Oct 2022 20:49:07 +0000"  >&lt;p&gt;Thank you, I will test this setting.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="46049" name="dk.rmmod3.r435i0n15.bz2" size="445453" author="mhanafi" created="Thu, 13 Oct 2022 23:35:26 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0328v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>