<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:03:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
<title>[LU-33] client can&apos;t recover on N-hop router configuration</title>
                <link>https://jira.whamcloud.com/browse/LU-33</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;I&apos;m testing Lustre N-hop routing (e.g. o2ib0 &amp;lt;&amp;gt; tcp &amp;lt;&amp;gt; o2ib1), as shown below.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
MDS/OSS  &amp;lt;-- IB (o2ib0) --&amp;gt; Router1 &amp;lt;-- TCP (tcp0) --&amp;gt; Router2 &amp;lt;-- IB (o2ib1) --&amp;gt; Client

- Network configuration -
There are two IB fabrics and 1GbE connects both fabrics with LNET routers.

MDS/OSS IP address: 192.168.100.120@o2ib0
options lnet networks=o2ib0 routes=&quot;tcp0 192.168.100.121@o2ib0; o2ib1 192.168.100.121@o2ib0&quot;

Router1 IP address: 192.168.100.121@o2ib0, 192.168.10.121@tcp0
options lnet ip2nets=&quot;tcp0 192.168.20.*; o2ib0(ib0) 192.168.100.*&quot; routes=&quot;o2ib1 192.168.20.122@tcp0&quot; forwarding=&quot;enabled&quot;

Router2 IP address: 192.168.200.122@o2ib1, 192.168.10.122@tcp0
options lnet ip2nets=&quot;tcp0 192.168.20.*; o2ib1(ib0) 192.168.200.*&quot; routes=&quot;o2ib0 192.168.20.121@tcp0&quot; forwarding=&quot;enabled&quot;

Client IP address: 192.168.200.123@o2ib1
options lnet networks=o2ib1(ib0) routes=&quot;o2ib0 192.168.200.122@o2ib1&quot;

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It worked with the above configuration, but there seems to be an issue if Router2 goes down (e.g. &apos;lctl net down&apos;) and is then restarted. The problem is that the client can&apos;t recover unless the filesystem is unmounted and remounted on it.&lt;/p&gt;</description>
                <environment></environment>
        <key id="10143">LU-33</key>
<summary>client can&apos;t recover on N-hop router configuration</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="ihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Tue, 28 Dec 2010 01:56:31 +0000</created>
                <updated>Tue, 28 Jun 2011 15:01:39 +0000</updated>
                            <resolved>Fri, 28 Jan 2011 00:45:30 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>1</watches>
                                                                            <comments>
                            <comment id="10355" author="liang" created="Tue, 28 Dec 2010 06:58:53 +0000"  >&lt;p&gt;Ihara,&lt;/p&gt;

&lt;p&gt;could you help me collect a few things after restarting the router in the test:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;cat /proc/sys/lnet/peers on router2&lt;/li&gt;
	&lt;li&gt;cat /proc/sys/lnet/peers and /proc/sys/lnet/routes on client&lt;/li&gt;
	&lt;li&gt;lctl ping router2 on the client to see if it works; if it doesn&apos;t, please collect the error messages on the client&lt;/li&gt;
	&lt;li&gt;lctl ping the client on router2 to see if it works; if it doesn&apos;t, please collect the error messages on router2&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;NB: also, could you tell me the exact version of your Lustre? &lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10356" author="ihara" created="Tue, 28 Dec 2010 22:02:54 +0000"  >&lt;p&gt;Liang,&lt;/p&gt;

&lt;p&gt;here are the results you requested. I just remounted Lustre on the clients. Do you need the following results collected again when the problem happens?&lt;/p&gt;

&lt;p&gt;1. /proc/sys/lnet/peers on router2&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;cat /proc/sys/lnet/peers&lt;br/&gt;
nid                      refs state   max   rtr   min    tx   min queue&lt;br/&gt;
192.168.200.123@o2ib1       1    up     8     8     5     8     5 0&lt;br/&gt;
192.168.20.121@tcp          3    up     8     8     5     8 -1029 0&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;2. /proc/sys/lnet/peers and /proc/sys/lnet/routes on the client&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;cat /proc/sys/lnet/peers&lt;br/&gt;
nid                      refs state   max   rtr   min    tx   min queue&lt;br/&gt;
192.168.200.122@o2ib1       3    up     8     8     8     8     5 0&lt;/li&gt;
&lt;/ol&gt;


&lt;ol&gt;
	&lt;li&gt;cat /proc/sys/lnet/routes&lt;br/&gt;
Routing disabled&lt;br/&gt;
net      hops   state router&lt;br/&gt;
o2ib        1      up 192.168.200.122@o2ib1&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;3. lctl ping router2 on the client&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;lctl ping 192.168.200.122@o2ib1&lt;br/&gt;
12345-0@lo&lt;br/&gt;
12345-192.168.20.122@tcp&lt;br/&gt;
12345-192.168.200.122@o2ib1&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;4. lctl ping the client on router2&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;lctl ping 192.168.200.123@o2ib1&lt;br/&gt;
12345-0@lo&lt;br/&gt;
12345-192.168.200.123@o2ib1&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;I&apos;m using the lustre-1.8.4.ddn2 which is based on lustre-1.8.4 and backported patches mostly from 1.8.5.&lt;/p&gt;</comment>
                            <comment id="10357" author="liang" created="Tue, 28 Dec 2010 23:08:55 +0000"  >&lt;p&gt;Ihara, yes, please collect this information when the problem happens, and if possible, please attach a source tarball of LNet here.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10372" author="ihara" created="Thu, 30 Dec 2010 05:48:19 +0000"  >&lt;p&gt;Here are results when the problem happened.&lt;/p&gt;

&lt;p&gt;1. /proc/sys/lnet/peers on router2&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@r03 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/lnet/peers &lt;br/&gt;
nid                      refs state   max   rtr   min    tx   min queue&lt;br/&gt;
192.168.20.121@tcp          3    up     8     8     8     8     5 0&lt;/p&gt;

&lt;p&gt;2. /proc/sys/lnet/peers and /proc/sys/lnet/routes on client&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@r04 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/lnet/peers &lt;br/&gt;
nid                      refs state   max   rtr   min    tx   min queue&lt;br/&gt;
192.168.200.122@o2ib1       3  down     8     8     8     8     5 0&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@r04 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/lnet/routes&lt;br/&gt;
Routing disabled&lt;br/&gt;
net      hops   state router&lt;br/&gt;
o2ib        1    down 192.168.200.122@o2ib1&lt;/p&gt;

&lt;p&gt;3. lctl ping router2 on the client&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@r04 ~&amp;#93;&lt;/span&gt;# lctl ping 192.168.200.122@o2ib1&lt;br/&gt;
failed to ping 192.168.200.122@o2ib1: Input/output error&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@r04 ~&amp;#93;&lt;/span&gt;# lctl ping 192.168.200.122@o2ib1&lt;br/&gt;
12345-0@lo&lt;br/&gt;
12345-192.168.20.122@tcp&lt;br/&gt;
12345-192.168.200.122@o2ib1&lt;/p&gt;

&lt;p&gt;4. lctl ping the client on router2&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@r03 ~&amp;#93;&lt;/span&gt;# lctl ping 192.168.200.123@o2ib1&lt;br/&gt;
12345-0@lo&lt;br/&gt;
12345-192.168.200.123@o2ib1&lt;/p&gt;

&lt;p&gt;When I ran &quot;lctl ping router2&quot; on the client, I got an Input/output error, but when I tried again the connection was restored; the client then connected to the MGS and recovered the filesystem correctly.&lt;/p&gt;

&lt;p&gt;Running &quot;lctl ping client&quot; on router2 also restored the connection.&lt;br/&gt;
So, we can get the connection back by pinging the client from router2, or router2 from the client.&lt;/p&gt;</comment>
                            <comment id="10373" author="ihara" created="Thu, 30 Dec 2010 05:51:09 +0000"  >&lt;p&gt;Attached is the relevant part of the LNet code from lustre-1.8.4.ddn2.&lt;/p&gt;</comment>
                            <comment id="10375" author="liang" created="Thu, 30 Dec 2010 19:24:28 +0000"  >&lt;p&gt;Ihara,&lt;/p&gt;

&lt;p&gt;I think the reason is:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;the client can only connect to one router (router2), so the client marks router2 as DOWN when you take it down.&lt;/li&gt;
	&lt;li&gt;it never comes up again because:
	&lt;ul&gt;
		&lt;li&gt;no message from the upper layer will be sent, because the only router is down&lt;/li&gt;
		&lt;li&gt;LNet itself will not send anything to check the router because dead_router_check_interval is not enabled&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So I think you can add this to your modprobe.conf (on clients and servers):&lt;br/&gt;
&quot;dead_router_check_interval=5&quot;. In the worst case, you will then see the router come up again within 10 seconds (two pings).&lt;/p&gt;

&lt;p&gt;You may also want to add &quot;live_router_check_interval=60&quot;, so the client/server will check each live router once a minute.&lt;/p&gt;
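For concreteness, a sketch of what the combined options could look like on the client from the description above (the networks/routes values are copied from that description; putting everything on one &quot;options lnet&quot; line is an assumption, not something stated in this ticket):

```
# Client-side sketch (assumption): existing options from the description
# plus the two router-check intervals suggested above.
options lnet networks=o2ib1(ib0) routes="o2ib0 192.168.200.122@o2ib1" dead_router_check_interval=5 live_router_check_interval=60
```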

&lt;p&gt;Liang&lt;/p&gt;</comment>
                            <comment id="10380" author="ihara" created="Wed, 5 Jan 2011 05:42:42 +0000"  >&lt;p&gt;Liang,&lt;/p&gt;

&lt;p&gt;Thanks. It was fixed by your advice, but I needed to add the two parameters (dead_router_check_interval=5 and live_router_check_interval=60) not only on the servers/clients, but also on both routers.&lt;br/&gt;
Router2 needs to detect when router1 fails (and router1 likewise has to detect when router2 fails); otherwise the clients can&apos;t restore their connections, because the clients are not checking router1 and the servers also can&apos;t detect that router2 has failed.&lt;/p&gt;
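Following the same logic, a sketch for one of the routers (Router2), combining the forwarding options from the description with the check intervals (again assuming a single &quot;options lnet&quot; line; the ip2nets/routes values are copied from the description):

```
# Router2 sketch (assumption): options from the description plus the
# check intervals, since the routers must detect each other's failures.
options lnet ip2nets="tcp0 192.168.20.*; o2ib1(ib0) 192.168.200.*" routes="o2ib0 192.168.20.121@tcp0" forwarding="enabled" dead_router_check_interval=5 live_router_check_interval=60
```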

&lt;p&gt;I just wonder if we could see any notices on the routers when the connection comes back.&lt;/p&gt;

&lt;p&gt;When router1 fails entirely, we can see the following error messages on router2.&lt;/p&gt;

&lt;p&gt;Jan  5 22:23:14 r03 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;127541.771483&amp;#93;&lt;/span&gt; Lustre: No route to 12345-192.168.100.120@o2ib via LNET_NID_ANY (all routers down)&lt;br/&gt;
Jan  5 22:23:14 r03 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;127541.780334&amp;#93;&lt;/span&gt; Lustre: Skipped 3 previous similar messages&lt;/p&gt;

&lt;p&gt;However, even after router1 comes back and all connections are restored on the clients, there are no messages at all on router2.&lt;br/&gt;
On the clients, we can see the following messages indicating that the connection is restored.&lt;/p&gt;

&lt;p&gt;Jan  5 22:22:34 r11 kernel: [ 3202.159898] Lustre: MGC192.168.100.120@o2ib: Connection restored to service MGS using nid 192.168.100.120@o2ib.&lt;br/&gt;
Jan  5 22:22:34 r11 kernel: [ 3202.159903] Lustre: Skipped 1 previous similar message&lt;br/&gt;
Jan  5 22:22:40 r11 kernel: [ 3208.029050] Lustre: 12298:0:(import.c:517:import_select_connection()) lustre-OST0000-osc-ffff810319b57400: tried all connections, increasing latency to 6s&lt;/p&gt;

&lt;p&gt;Ihara&lt;/p&gt;
</comment>
                            <comment id="10483" author="liang" created="Tue, 25 Jan 2011 00:59:38 +0000"  >&lt;p&gt;Ihara,&lt;br/&gt;
I think the most reliable information indicating the status of the routers is in the files under /proc; we can&apos;t rely on console output anyway.&lt;br/&gt;
What do you think about changing the status of this issue to &quot;resolved&quot;?&lt;/p&gt;
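Since the /proc files are the reliable source, here is a minimal sketch of how their output could be checked from a script. The sample text is copied from the routes output earlier in this ticket; the parsing itself is an illustration and not part of the ticket (on a real node you would read /proc/sys/lnet/routes instead of the embedded sample):

```shell
#!/bin/sh
# Sketch: parse /proc/sys/lnet/routes-style output and list routes whose
# state column is "down". Sample mirrors the output pasted above.
routes_output='net      hops   state router
o2ib        1    down 192.168.200.122@o2ib1'

# Column 3 is the state; print the net (column 1) and router NID (column 4).
down_routes=$(printf '%s\n' "$routes_output" | awk '$3 == "down" { print $1, $4 }')
echo "$down_routes"
```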

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10487" author="ihara" created="Tue, 25 Jan 2011 08:12:34 +0000"  >&lt;p&gt;Liang, that&apos;s enough for now. And yes, we do get information from /proc as much as possible. Please close this ticket.&lt;/p&gt;

&lt;p&gt;Many thanks!&lt;/p&gt;

&lt;p&gt;Ihara&lt;/p&gt;</comment>
                            <comment id="10502" author="liang" created="Fri, 28 Jan 2011 00:45:30 +0000"  >&lt;p&gt;Marking it as resolved.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10070" name="client-messages" size="195292" author="ihara" created="Tue, 28 Dec 2010 01:56:31 +0000"/>
                            <attachment id="10080" name="lnet-1.8.4-ddn.tar.gz" size="976749" author="ihara" created="Thu, 30 Dec 2010 05:51:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw0lj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10177</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>