<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:51:28 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5435] Simulate message loss and high latency in LNet</title>
                <link>https://jira.whamcloud.com/browse/LU-5435</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Although Lustre has OBD_FAILs that can simulate the loss of RPC requests or replies, they are mostly used for unit tests and are not flexible enough to inject random message loss while a workload is running.&lt;/p&gt;

&lt;p&gt;Combining OBD_FAIL_PTLRPC_DROP_RPC with CFS_FAIL_RAND can randomly drop RPCs. However, it always drops the request before it is sent and leaves the RPC in the same state, so it cannot simulate LNet message loss, which may trigger more complex RPC states, for example the loss of the LNet ACK/REPLY of a ptlrpc bulk request, or the loss of a ptlrpc reply.&lt;/p&gt;

&lt;p&gt;So we need a new mechanism that supports random, silent message loss in the network of small test systems. A straightforward solution is to let the user control message drops in LNet. This needs new user interfaces to add or remove message Drop Rules, and internal handlers for these rules in core LNet.&lt;/p&gt;

&lt;p&gt;To simplify the implementation, LNet Drop Rules should only be applied on the receive side of a connection (this can still cover all message paths). Each Drop Rule contains a few attributes:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Source NID&lt;/li&gt;
	&lt;li&gt;Destination NID&lt;/li&gt;
	&lt;li&gt;Drop Rate Factor: if the factor is N, LNet randomly drops one out of every N incoming messages that match this rule.&lt;/li&gt;
&lt;/ul&gt;
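&lt;p&gt;A minimal Python sketch of the matching and drop-rate semantics listed above, not the LNet implementation. The names nid_matches, DropRule and should_drop are hypothetical, and picking one random victim per window of N matching messages is only one assumed reading of &quot;randomly drop one of them&quot;.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import random
from dataclasses import dataclass, field


def nid_matches(pattern, nid):
    """Match a NID such as '192.168.1.100@tcp0' against '*', '*@tcp0' or an exact NID.

    Assumption: network names are compared literally (no alias handling).
    """
    if pattern == "*":
        return True
    pat_addr, _, pat_net = pattern.partition("@")
    addr, _, net = nid.partition("@")
    return pat_net == net and pat_addr in ("*", addr)


@dataclass
class DropRule:
    """Hypothetical receive-side rule: drop one of every `rate` matching messages."""
    source: str
    dest: str
    rate: int
    rng: random.Random = field(default_factory=random.Random)
    seen: int = 0    # position inside the current window of `rate` messages
    victim: int = 0  # position in the window that will be dropped

    def __post_init__(self):
        self.victim = self.rng.randrange(self.rate)

    def should_drop(self, src_nid, dst_nid):
        """Return True if this incoming message should be silently dropped."""
        if not (nid_matches(self.source, src_nid) and nid_matches(self.dest, dst_nid)):
            return False  # non-matching messages do not advance the window
        drop = self.seen == self.victim
        self.seen += 1
        if self.seen == self.rate:
            # Window finished: start a new one with a fresh random victim.
            self.seen = 0
            self.victim = self.rng.randrange(self.rate)
        return drop
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;With this sketch, every complete window of N matching messages loses exactly one message, while the position of the lost message inside each window is random.&lt;/p&gt;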


&lt;p&gt;A user can add a new Drop Rule by running the command:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl net_drop_add --source SOURCE_NID --dest DESTINATION_NID --rate DROP_RATE
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here are some examples:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ lctl net_drop_add --source *@o2ib0 --dest *@tcp2 --rate 1000
  Randomly drop 1 message in every 1000 messages from o2ib0 to tcp2.

$ lctl net_drop_add --source 192.168.1.100@tcp0 --dest *@o2ib3 --rate 500
  Randomly drop 1 message in every 500 messages from 192.168.1.100@tcp0 to any node on o2ib3.

$ lctl net_drop_add --source *@o2ib2 --dest * --rate 2000
  Randomly drop 1 message in every 2000 incoming messages from o2ib2.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A user can remove a Drop Rule by running the command:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;  lctl net_drop_del --source SOURCE_NID --dest DESTINATION_NID
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;All rules are removed if the user simply runs &#8220;lctl net_drop_del --all&#8221;.&lt;/p&gt;

&lt;p&gt;To show all LNet Drop Rules, run the command:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl net_drop_list
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With LNet Drop Rules, we can simulate an unreliable network in a simple environment with a small number of machines. A user can add Drop Rules on either endpoint of the cluster (client or server), or on LNet routers.&lt;/p&gt;

&lt;p&gt;The major benefit of adding Drop Rules only on LNet routers is that the same router pool can be used to test any Lustre version, because a router only needs LNet, which does not have compatibility issues. It also means this feature does not need to be backported.&lt;/p&gt;</description>
                <environment></environment>
        <key id="25825">LU-5435</key>
            <summary>Simulate message loss and high latency in LNet</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="liang">Liang Zhen</reporter>
                        <labels>
                    </labels>
                <created>Thu, 31 Jul 2014 07:40:48 +0000</created>
                <updated>Mon, 27 Apr 2015 20:42:46 +0000</updated>
                            <resolved>Tue, 4 Nov 2014 16:04:54 +0000</resolved>
                                                    <fixVersion>Lustre 2.7.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>14</watches>
                                                                            <comments>
                            <comment id="90519" author="liang" created="Thu, 31 Jul 2014 07:45:16 +0000"  >&lt;p&gt;Links to patches:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/11313&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/11313&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/11314&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/11314&lt;/a&gt; &lt;/p&gt;</comment>
                            <comment id="90690" author="liang" created="Mon, 4 Aug 2014 13:26:33 +0000"  >&lt;p&gt;I&apos;m working on another mechanism, a &quot;Packet Delay Simulator&quot;. It randomly chooses some messages and delays sending them for an arbitrary but finite number of seconds. This way we can simulate network congestion and see how Lustre reacts.&lt;br/&gt;
Unlike the &quot;Drop Rule&quot;, this sub-module only functions on an LNet router. The message sender sees the send completion event immediately (no latency for the sender), the message is then held on the router for N seconds, and the receiver sees the delayed message after N seconds.&lt;/p&gt;

&lt;p&gt;To guarantee that delayed messages are eventually sent out, the router_checker thread will be reused for sending them. LND threads can send delayed messages as well, but there is no guarantee that an LND thread will be woken up to send a delayed message.&lt;/p&gt;
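&lt;p&gt;A minimal Python sketch of the hold-and-release idea described above, not the LNet implementation. DelayQueue, delay and pop_due are hypothetical names, the min-heap ordering is an assumption, and pop_due stands in for whatever periodic thread flushes messages whose delay has expired.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import heapq
import itertools


class DelayQueue:
    """Hold messages until their deadline; the earliest deadline is released first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so messages with equal deadlines leave in arrival order.
        self._order = itertools.count()

    def delay(self, msg, now, latency):
        """Queue `msg` so that it becomes due `latency` seconds after `now`."""
        heapq.heappush(self._heap, (now + latency, next(self._order), msg))

    def pop_due(self, now):
        """Return every message whose deadline has passed (called periodically)."""
        due = []
        while self._heap and now >= self._heap[0][0]:
            _deadline, _seq, msg = heapq.heappop(self._heap)
            due.append(msg)
        return due
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Because release depends only on the periodic call to pop_due, a queued message is eventually sent even if no other traffic arrives to trigger it.&lt;/p&gt;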

&lt;p&gt;Although the implementation is going to be different from Drop Rule, the lctl commands will be similar:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl net_delay_add --source NID --dest NID --latency SECONDS
lctl net_delay_del --source NID --dest NID
lctl net_delay_list
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="91074" author="doug" created="Thu, 7 Aug 2014 18:06:15 +0000"  >&lt;p&gt;If the delays and drops are &quot;random&quot;, could not a TCP/IP network impairment solution be used here rather than changing LNet source?&lt;/p&gt;</comment>
                            <comment id="91076" author="johann" created="Thu, 7 Aug 2014 18:14:44 +0000"  >&lt;p&gt;We need something that can work with any supported lnd and more particularly IB.&lt;/p&gt;</comment>
                            <comment id="91079" author="doug" created="Thu, 7 Aug 2014 18:30:56 +0000"  >&lt;p&gt;I thought the objective was to determine how LNet (and Lustre) reacts to delays and drops.  The underlying transport, be it IB or TCP, should not matter at that point.&lt;/p&gt;</comment>
                            <comment id="91083" author="johann" created="Thu, 7 Aug 2014 18:58:10 +0000"  >&lt;p&gt;Our intent is to run soak testing with fault injection (i.e. message drop/delay, failover, ...) on a configuration which is as close as possible to a real-life Lustre installation. We are thus going to use IB and LNET routing.&lt;br/&gt;
FYI, I worked on a Lustre bug (a race) a couple of years ago that could only be reproduced from client nodes connected through IB.&lt;/p&gt;</comment>
                            <comment id="91399" author="liang" created="Tue, 12 Aug 2014 15:25:17 +0000"  >&lt;p&gt;Hi Doug, as Johann said, this is actually for testing Lustre against a simulated unreliable network. There are several benefits to having this in LNet instead of using a mechanism of the underlying network stack:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;we can use the same approach for different network types (IB, TCP, even Gemini)&lt;/li&gt;
	&lt;li&gt;we have more control over fault simulation, for example, dropping or delaying messages only for a specific portal or a specific LNet message type between two nodes.&lt;/li&gt;
	&lt;li&gt;communication between applications is not affected.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Also, this is a new sub-module of LNet that makes very few changes to LNet itself, so it&apos;s a low-risk feature.&lt;/p&gt;</comment>
                            <comment id="91400" author="liang" created="Tue, 12 Aug 2014 15:31:43 +0000"  >&lt;p&gt;The third patch is ready; it implements latency simulation: &lt;a href=&quot;http://review.whamcloud.com/#/c/11409/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/11409/&lt;/a&gt;&lt;br/&gt;
In the current implementation, delayed messages on a &quot;Delay Rule&quot; always have a fixed latency. We should be able to have a random latency in (0, latency), which means we need to sort the delayed messages somehow. We could use a binheap to sort them, which is very scalable and already implemented in libcfs.&lt;/p&gt;</comment>
                            <comment id="91430" author="doug" created="Tue, 12 Aug 2014 17:34:46 +0000"  >&lt;p&gt;This makes sense as long as it is understood that this approach is only verifying what Lustre does with dropped and delayed messages.  It does not examine what our LNDs do to detect and respond to network faults from the underlying fabric (IB, TCP, or Gemini).  Dropping/delaying messages intentionally from LNet is not a network fault and will not trigger any sort of switch/router behaviour. &lt;/p&gt;</comment>
                            <comment id="91502" author="liang" created="Wed, 13 Aug 2014 01:46:57 +0000"  >&lt;p&gt;True, &quot;network&quot; here is from the perspective of the upper-layer Lustre stack, so it only simulates, e.g., LNet router failure or buffer congestion in the Lustre network. As you said, it does not examine how LNet/LND behave on underlying network failure. &lt;/p&gt;</comment>
                            <comment id="98283" author="jlevi" created="Tue, 4 Nov 2014 16:04:54 +0000"  >&lt;p&gt;Patches landed to Master.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwspj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15136</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>