<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:43:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11422] Make LNet Selftest post Health backward compatible</title>
                <link>https://jira.whamcloud.com/browse/LU-11422</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Since the LNet Health feature landed, LNet Selftest has lost backward compatibility: lnet-selftest cannot be run between peers of different versions (Lustre 2.12 and pre-2.12 Lustre). We should fix that.&lt;/p&gt;

&lt;p&gt;The LNet Health feature added new health-related stats, which changed &lt;tt&gt;struct lnet_counters&lt;/tt&gt; (patch &lt;a href=&quot;https://review.whamcloud.com/32949&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32949&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9120&quot; title=&quot;LNet Network Health Feature&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9120&quot;&gt;&lt;del&gt;LU-9120&lt;/del&gt;&lt;/a&gt; lnet: add global health statistics&lt;/tt&gt;&quot;). Because &lt;tt&gt;struct srpc_stat_reply&lt;/tt&gt; embeds &lt;tt&gt;struct lnet_counters&lt;/tt&gt;, its layout changed as well. The two structures look like this (lines marked &lt;tt&gt;+&lt;/tt&gt; are the added fields):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
struct srpc_stat_reply {
        __u32                   str_status;
        struct lst_sid          str_sid;
        struct sfw_counters     str_fw;
        struct srpc_counters    str_rpc;
        struct lnet_counters    str_lnet;
} WIRE_ATTR;

struct lnet_counters {
        __u32   msgs_alloc;
        __u32   msgs_max;
+       __u32   rst_alloc;
        __u32   errors;
        __u32   send_count;
        __u32   recv_count;
        __u32   route_count;
        __u32   drop_count;
+       __u32   resend_count;
+       __u32   response_timeout_count;
+       __u32   local_interrupt_count;
+       __u32   local_dropped_count;
+       __u32   local_aborted_count;
+       __u32   local_no_route_count;
+       __u32   local_timeout_count;
+       __u32   local_error_count;
+       __u32   remote_dropped_count;
+       __u32   remote_error_count;
+       __u32   remote_timeout_count;
+       __u32   network_timeout_count;
        __u64   send_length;
        __u64   recv_length;
        __u64   route_length;
        __u64   drop_length; 
} WIRE_ATTR;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Amir&apos;s idea -&#160;&lt;br/&gt;
 &quot;What we can do is make a copy of the structure which is similar to the older one. And in the post health selftest we can have a translation function which takes the new structure and copies the relevant fields to the old one. This way selftest remains backwards compatible&quot;&lt;/p&gt;</description>
                <environment></environment>
        <key id="53403">LU-11422</key>
            <summary>Make LNet Selftest post Health backward compatible</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="sharmaso">Sonia Sharma</assignee>
                                    <reporter username="sharmaso">Sonia Sharma</reporter>
                        <labels>
                    </labels>
                <created>Mon, 24 Sep 2018 19:28:17 +0000</created>
                <updated>Wed, 10 Oct 2018 02:21:29 +0000</updated>
                            <resolved>Wed, 10 Oct 2018 02:21:29 +0000</resolved>
                                                    <fixVersion>Lustre 2.12.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="233931" author="adilger" created="Mon, 24 Sep 2018 22:00:56 +0000"  >&lt;p&gt;To understand this issue better, it would be useful to first describe which structure is affected, and whether the incompatibility is between userspace and the kernel or between peers over the network. Please link this ticket to the ticket/patch that introduced the compatibility issue. That would make it clearer between which versions the incompatibility exists, and how much effort is needed to maintain compatibility.&lt;/p&gt;</comment>
                            <comment id="233932" author="adilger" created="Mon, 24 Sep 2018 22:03:15 +0000"  >&lt;p&gt;Also, please fill in the Affects Version and Fix Version fields. If this incompatibility is introduced in current master, and we don&apos;t land the fix for this issue before 2.12 is released, does that mean cross-version LST usage is broken? That would make this issue a must-fix for 2.12 (i.e. Critical or Blocker).&lt;/p&gt;</comment>
                            <comment id="233936" author="sharmaso" created="Tue, 25 Sep 2018 00:48:11 +0000"  >&lt;p&gt;Updated&lt;/p&gt;</comment>
                            <comment id="233938" author="adilger" created="Tue, 25 Sep 2018 02:33:08 +0000"  >&lt;p&gt;Having looked at this change, I&apos;d say the current implementation is quite problematic. It will require field-by-field data copying, and there is no easy way for older tools/clients to handle this. Conversely, if all of the new fields were added to the end of the data structure, it would be trivial for old clients to ignore the remaining fields, and for new clients to just assume the new fields are zero if they are not present in the old struct. Is there any reason these fields were added in the middle of the struct, and even in two disjoint places?&lt;/p&gt;

&lt;p&gt;I see that in this patch the code in &lt;tt&gt;lnet_selftest_structure_assertion()&lt;/tt&gt; that was telling you this change should not be done in this manner was commented out. The whole point of these assertions is to warn the developer that they are making a change that will break the userspace API or the network protocol, or both. That kind of change might be temporarily OK during testing, but should basically never be landed.&lt;/p&gt;

&lt;p&gt;It appears that &lt;tt&gt;struct srpc_msg&lt;/tt&gt; already has a mechanism to handle changes in a better manner - it has both a &lt;tt&gt;msg_magic&lt;/tt&gt; and a &lt;tt&gt;msg_version&lt;/tt&gt; field in the structure, which could be changed to indicate the presence of additional fields in the struct. New clients/tools would understand both the old and new versions of the struct, and be able to (mostly) ignore the differences, except when accessing the &lt;tt&gt;stat_reply&lt;/tt&gt; member. In the latter case, the new client would just avoid accessing the added fields when &lt;tt&gt;msg_version = 0x3&lt;/tt&gt;[*], and when communicating with an old client/tool it would just copy the smaller struct and drop the rest of the stats.&lt;/p&gt;

&lt;p&gt;[*] NB - for the &lt;tt&gt;msg_version&lt;/tt&gt; field, it is &lt;b&gt;much&lt;/b&gt; better to make this a bitmap of features rather than a strict enumeration of versions. Having 2^32 versions is not useful as you will never have so many different features, and it is not clear to anyone what the difference between &quot;v4&quot; and &quot;v7&quot; of a struct/protocol would be. Conversely, having a &lt;tt&gt;SRPC_MSG_FEAT_LOCAL_COUNTER = 0x0002&lt;/tt&gt; (or similar) feature flag is easy to check, and it makes it clear to any reader what feature is present in the struct. If this flag is not set (for unrelated messages), or there is no need to send the extra stats (for older clients/tools), then there is no reason to break compatibility. If some other flag is set that the tool doesn&apos;t understand, then either it can return an error, or (better) parse the parts of the struct it understands, and ignore the rest.&lt;/p&gt;

&lt;p&gt;The best mechanism is if there are &quot;compat&quot; and &quot;incompat&quot; regions of the version (e.g. &lt;tt&gt;__u16 compat; __u16 incompat&lt;/tt&gt;) and where a particular feature flag goes depends on what kind of changes are being made.  Just having additional fields added to the end of the struct is typically a harmless &quot;compat&quot; change and is preferable.  Changing the meaning of the fields (as was done in the original patch) would be an &quot;incompat&quot; change and should be avoided if at all possible (as it likely can be in this case).&lt;/p&gt;

&lt;p&gt;I now also see that there is a &lt;tt&gt;msg_ses_feats&lt;/tt&gt; field that appears to allow peers to negotiate which mutually compatible features they support, but I&apos;m not sure if this would allow anything other than returning an error if an unknown feature is advertised. For Lustre RPC connections, the client sends a bitmask of features that it supports, and the server masks out any feature bits that it doesn&apos;t understand (using its own &quot;feature supported&quot; bitmask) and returns it to the client rather than returning an error. This feature mask mechanism has served us very well for about 15 years, and it was inherited from ext2/3/4 which had another 10 years of use for managing on-disk vs. user tools vs. kernel compatibility. It would be much better if this were changed to have a permissive connection that accepted any features in the request, but masked off the unknown ones and returned the result to the client. Otherwise, we just have a gratuitous &quot;request with all known features, get an error reply with supported features, retry with supported features, continue as if it had done it right the first time&quot; dance.&lt;/p&gt;

&lt;p&gt;As for fixing this particular problem, I see some possible solutions, depending on how this is handled in the code:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;if the addition of the extra fields to &lt;tt&gt;lnet_counters&lt;/tt&gt; makes the message size grow (i.e. it is (now) the largest member of &lt;tt&gt;msg_body&lt;/tt&gt;) then this might need additional handling to maintain compatibility
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;if the tools/peer return an error for larger &lt;tt&gt;srpc_msg&lt;/tt&gt; then that should be fixed in its own right
		&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
			&lt;li&gt;allowing a message size &amp;gt;= the current &lt;tt&gt;sizeof(srpc_msg)&lt;/tt&gt;, at least to avoid such issues in the future&lt;/li&gt;
			&lt;li&gt;split the stats message into &quot;old_stats&quot; and &quot;local_stats&quot; and send them in two RPCs (assuming they are not constantly used)&lt;/li&gt;
		&lt;/ul&gt;
		&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;if this is accepted, then these flags should only be set in stats messages when they are needed&lt;/li&gt;
	&lt;li&gt;if it is just a matter of determining whether the extra stat fields are present, then the &lt;tt&gt;srpc_msg&lt;/tt&gt; version == feature field may be enough&lt;/li&gt;
&lt;/ul&gt;
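&lt;p&gt;For comparison, the translation approach quoted in the ticket description (keep a frozen copy of the old counters layout and copy the relevant fields into it before sending stats) could look roughly like the sketch below. This is only an illustration, not the patch that landed: &lt;tt&gt;wire_u32&lt;/tt&gt;, &lt;tt&gt;wire_u64&lt;/tt&gt;, &lt;tt&gt;struct lnet_counters_legacy&lt;/tt&gt;, &lt;tt&gt;struct lnet_counters_health&lt;/tt&gt; and &lt;tt&gt;lnet_counters_to_legacy()&lt;/tt&gt; are hypothetical names, and &lt;tt&gt;WIRE_ATTR&lt;/tt&gt; is assumed to mean a packed layout.&lt;/p&gt;

```c
/* Illustration only: stand-ins for the kernel's __u32 and __u64 types. */
typedef unsigned int       wire_u32;
typedef unsigned long long wire_u64;

/* Hypothetical name: the pre-health layout, frozen for the selftest wire format. */
struct lnet_counters_legacy {
	wire_u32 msgs_alloc;
	wire_u32 msgs_max;
	wire_u32 errors;
	wire_u32 send_count;
	wire_u32 recv_count;
	wire_u32 route_count;
	wire_u32 drop_count;
	wire_u64 send_length;
	wire_u64 recv_length;
	wire_u64 route_length;
	wire_u64 drop_length;
} __attribute__((packed));	/* assumed equivalent of WIRE_ATTR */

/* Hypothetical name: the post-health layout, fields in the order shown in the ticket. */
struct lnet_counters_health {
	wire_u32 msgs_alloc;
	wire_u32 msgs_max;
	wire_u32 rst_alloc;
	wire_u32 errors;
	wire_u32 send_count;
	wire_u32 recv_count;
	wire_u32 route_count;
	wire_u32 drop_count;
	wire_u32 resend_count;
	wire_u32 response_timeout_count;
	wire_u32 local_interrupt_count;
	wire_u32 local_dropped_count;
	wire_u32 local_aborted_count;
	wire_u32 local_no_route_count;
	wire_u32 local_timeout_count;
	wire_u32 local_error_count;
	wire_u32 remote_dropped_count;
	wire_u32 remote_error_count;
	wire_u32 remote_timeout_count;
	wire_u32 network_timeout_count;
	wire_u64 send_length;
	wire_u64 recv_length;
	wire_u64 route_length;
	wire_u64 drop_length;
} __attribute__((packed));

/*
 * Hypothetical helper: copy only the fields that exist in the old layout.
 * The health counters are dropped on purpose, since old peers cannot parse them.
 */
static void lnet_counters_to_legacy(const struct lnet_counters_health *src,
				    struct lnet_counters_legacy *dst)
{
	dst->msgs_alloc   = src->msgs_alloc;
	dst->msgs_max     = src->msgs_max;
	dst->errors       = src->errors;
	dst->send_count   = src->send_count;
	dst->recv_count   = src->recv_count;
	dst->route_count  = src->route_count;
	dst->drop_count   = src->drop_count;
	dst->send_length  = src->send_length;
	dst->recv_length  = src->recv_length;
	dst->route_length = src->route_length;
	dst->drop_length  = src->drop_length;
}
```

&lt;p&gt;With a helper like this, the stats reply would keep carrying the legacy layout, so its size and field offsets would stay as older peers expect; the health counters would simply not be sent over the selftest wire protocol.&lt;/p&gt;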
</comment>
                            <comment id="233940" author="ashehata" created="Tue, 25 Sep 2018 04:06:17 +0000"  >&lt;p&gt;My main problem with lnet_selftest&apos;s use of the lnet_counters structure is that it creates a coupling where none should exist. We&apos;re taking an LNet internal structure and creating a wire protocol dependency. Now, when we want to add extra statistics, we have to jump through unnecessary hoops. IMO, there shouldn&apos;t have been this coupling between a test tool and LNet. That&apos;s why I&apos;m proposing we break up this dependency in the first place. The selftest wire protocol shouldn&apos;t depend on LNet&apos;s internal structure.&lt;/p&gt;

&lt;p&gt;It makes more sense to me to separate selftest from LNet by the use of APIs. selftest shouldn&apos;t be accessing LNet structures directly. There should be an API that pulls whatever information selftest needs. And then selftest can store this data in whatever form it needs.&lt;/p&gt;</comment>
                            <comment id="234029" author="gerrit" created="Wed, 26 Sep 2018 17:36:59 +0000"  >&lt;p&gt;Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/33242&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33242&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11422&quot; title=&quot;Make LNet Selftest post Health backward compatible&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11422&quot;&gt;&lt;del&gt;LU-11422&lt;/del&gt;&lt;/a&gt; lnet: Fix selftest backward compatibility post health&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: a1dae49acc92b48b1b3c463065a8d8ed0f30b155&lt;/p&gt;</comment>
                            <comment id="234684" author="gerrit" created="Wed, 10 Oct 2018 01:51:19 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/33242/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33242/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11422&quot; title=&quot;Make LNet Selftest post Health backward compatible&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11422&quot;&gt;&lt;del&gt;LU-11422&lt;/del&gt;&lt;/a&gt; lnet: Fix selftest backward compatibility post health&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 60f6f2b480b482f2022cbea416d8bea87f848bec&lt;/p&gt;</comment>
                            <comment id="234692" author="pjones" created="Wed, 10 Oct 2018 02:21:29 +0000"  >&lt;p&gt;Landed for 2.12&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="43816">LU-9120</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i002zj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>