<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:35:45 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-17476] lnet: only report mismatched nid in ME if bits match</title>
                <link>https://jira.whamcloud.com/browse/LU-17476</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In rare cases, a client-to-server AST &lt;b&gt;reply&lt;/b&gt; is dropped by the server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
:
request_out_callback()) @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104-&amp;gt;lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:&apos;&apos;
lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don&apos;t perform health checking (-2, 0)
lnet_is_health_check()) health check = 0, status = -2, hstatus = 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As part of MD matching for an incoming GET or PUT from a peer with multiple NIDs, match on &quot;matchbits&quot; alone when they are available, and only report an error on a NID/PID mismatch. If &quot;matchbits&quot; cannot be used for matching, fail on a NID/PID mismatch as before.&lt;/p&gt;
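&lt;p&gt;A minimal Python sketch of the matching rule described above, not the actual LNet code. The names MatchEntry, Incoming, try_match and the bits_usable flag are hypothetical, the wildcards only stand in for LNET_NID_ANY and LNET_PID_ANY, and ignore bits are left out. When the bits are usable, the entry is matched on them alone and a NID/PID mismatch is only logged; otherwise the NID/PID comparison decides the match, as before.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass

NID_ANY = "any"  # hypothetical wildcard, standing in for LNET_NID_ANY
PID_ANY = -1     # hypothetical wildcard, standing in for LNET_PID_ANY


@dataclass
class MatchEntry:
    """Simplified ME: the peer it expects and the bits it matches on."""
    nid: str
    pid: int
    match_bits: int


@dataclass
class Incoming:
    """Simplified header of an incoming GET or PUT."""
    nid: str
    pid: int
    match_bits: int
    bits_usable: bool = True  # assumption: models "matchbits are available"


def id_matches(me, msg):
    """Compare NID and PID, honouring the wildcards."""
    nid_ok = me.nid == NID_ANY or me.nid == msg.nid
    pid_ok = me.pid == PID_ANY or me.pid == msg.pid
    return nid_ok and pid_ok


def try_match(me, msg, log):
    """Return True if msg matches me; append a note to log on NID/PID mismatch."""
    if msg.bits_usable:
        # Bits decide the match; ignore bits are omitted in this sketch.
        if me.match_bits != msg.match_bits:
            return False
        if not id_matches(me, msg):
            log.append("NID/PID mismatch: expected {}-{}, got {}-{}".format(
                me.pid, me.nid, msg.pid, msg.nid))
        return True
    # Bits cannot be used: fall back to failing on NID/PID mismatch.
    return id_matches(me, msg)


if __name__ == "__main__":
    log = []
    me = MatchEntry("10.31.3.109@tcp", 12345, 1788044801687552)
    msg = Incoming("10.31.3.108@tcp", 12345, 1788044801687552)
    print(try_match(me, msg, log), log)
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the example, the xid and NIDs are copied from the log above; which NID the ME actually expected is an assumption made for illustration.&lt;/p&gt;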

</description>
                <environment></environment>
        <key id="80384">LU-17476</key>
            <summary>lnet: only report mismatched nid in ME if bits match</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="ssmirnov">Serguei Smirnov</reporter>
                        <labels>
                    </labels>
                <created>Fri, 26 Jan 2024 19:42:45 +0000</created>
                <updated>Thu, 8 Feb 2024 08:53:52 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="401537" author="gerrit" created="Sat, 27 Jan 2024 20:22:27 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53843&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53843&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17476&quot; title=&quot;lnet: only report mismatched nid in ME if bits match&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17476&quot;&gt;LU-17476&lt;/a&gt; lnet: prefer to use bits only to match ME&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 6087ac44e0eb29c6129ee59e97363c33575a9cfb&lt;/p&gt;</comment>
                            <comment id="401705" author="hornc" created="Mon, 29 Jan 2024 19:13:42 +0000"  >&lt;p&gt;Can you say more about the cases where this issue occurs? I&apos;ve seen it when there is some sort of mismatch between expected and actual primary NID of a peer.&lt;/p&gt;</comment>
                            <comment id="401720" author="gerrit" created="Mon, 29 Jan 2024 21:31:12 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53851&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53851&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17476&quot; title=&quot;lnet: only report mismatched nid in ME if bits match&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17476&quot;&gt;LU-17476&lt;/a&gt; lnet: force non-primary NID in LNetPut&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 953ff57510b1a6f10175f08bedc7a63b67fedebe&lt;/p&gt;</comment>
                            <comment id="402071" author="gerrit" created="Wed, 31 Jan 2024 22:28:42 +0000"  >&lt;p&gt;&lt;del&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch:&lt;/del&gt; &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53872&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53872&lt;/a&gt;&lt;br/&gt;
&lt;del&gt;Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17476&quot; title=&quot;lnet: only report mismatched nid in ME if bits match&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17476&quot;&gt;LU-17476&lt;/a&gt; tests: add test for MR match NIDs code&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Project: fs/lustre-release&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Branch: master&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Current Patch Set: 1&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Commit: 878db375153071006346ea132d452f759cb14bd7&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;Merged into &lt;a href=&quot;https://review.whamcloud.com/53843&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/53843&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="402083" author="adilger" created="Thu, 1 Feb 2024 01:25:05 +0000"  >&lt;p&gt;Chris, this issue has been observed in the case of a Lustre server-to-client blocking AST request that cannot be replied to by the client (neither client nor server has patch &lt;a href=&quot;https://review.whamcloud.com/50530&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/50530&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16709&quot; title=&quot;LNet: locking multiple NIDs of the same MR peer as primary results in incorrect representation&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16709&quot;&gt;&lt;del&gt;LU-16709&lt;/del&gt;&lt;/a&gt; lnet: fix locking multiple peer NIDs&lt;/tt&gt;&quot;).&lt;/p&gt;

&lt;p&gt;The following is Oleg&apos;s analysis of the kernel debug logs on the server (with &quot;&lt;tt&gt;+rpctrace+dlmtrace+lnet&lt;/tt&gt;&quot; enabled):&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;ugh, so the network seems to deliver everything fast, but lnet just does not like it?&lt;/p&gt;

&lt;p&gt;Client side, speedy handling:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:84.0:1706140870.571121:0:340548:0:(service.c:2136:ptlrpc_server_handle_req_in())
     got req x1788044801687552
00000100:00100000:84.0:1706140870.571138:0:340548:0:(nrs_fifo.c:177:nrs_fifo_req_get())
     NRS start fifo request from 12345-10.85.153.100@tcp, seq: 155892
00000100:00100000:84.0:1706140870.571140:0:340548:0:(service.c:2283:ptlrpc_server_handle_request())
     Handling RPC req@00000000a35a640a pname:cluuid+ref:pid:xid:nid:opc:job
     ldlm_cb01_010:c63d2a6a-1533-4998-b176-4f630f667a59+5:3678513:x1788044801687552:12345-10.85.153.100@tcp:104:
00000100:00000200:84.0:1706140870.571142:0:340548:0:(service.c:2298:ptlrpc_server_handle_request())
     got req 1788044801687552
00000100:00000200:84.0:1706140870.571151:0:340548:0:(niobuf.c:86:ptl_send_buf())
     Sending 224 bytes to portal 16, xid 1788044801687552, offset 224
00000400:00000200:84.0:1706140870.571153:0:340548:0:(lib-move.c:4904:LNetPut()) LNetPut -&amp;gt; 12345-10.85.153.101@tcp
00000400:00000200:84.0:1706140870.571154:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
     Match nid 10.85.153.101@tcp to cpt 1
00000400:00000200:84.0:1706140870.571156:0:340548:0:(lib-move.c:2731:lnet_handle_send_case_locked())
     Source Specified: 10.31.3.108@tcp to MR:  10.85.153.101@tcp local destination
00000400:00000200:84.0:1706140870.571157:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
     Match nid 10.85.153.101@tcp to cpt 1
00000400:00000200:84.0:1706140870.571160:0:340548:0:(lib-move.c:1884:lnet_handle_send())
     TRACE: 10.31.3.108@tcp(10.31.3.108@tcp:10.31.3.108@tcp) -&amp;gt;
     10.85.153.101@tcp(10.85.153.101@tcp:10.85.153.101@tcp) &amp;lt;?&amp;gt; : PUT try# 0
00000800:00000200:84.0:1706140870.571161:0:340548:0:(socklnd_cb.c:992:ksocknal_send())
     sending 224 bytes in 1 frags to 12345-10.85.153.101@tcp
00000800:00000200:84.0:1706140870.571162:0:340548:0:(socklnd.c:231:ksocknal_find_peer_locked())
     got peer_ni [000000007915095f] -&amp;gt; 12345-10.85.153.101@tcp (5)
00000800:00000200:84.0:1706140870.571166:0:340548:0:(socklnd_cb.c:758:ksocknal_queue_tx_locked())
     Sending to 12345-10.85.153.101@tcp ip 10.85.153.101:988
00000800:00000200:84.0:1706140870.571167:0:340548:0:(socklnd_cb.c:777:ksocknal_queue_tx_locked())
     Packet 000000005a9eb1fb type 1, nob 320 niov 1 nkiov 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Server side, starting with the send completion (clocks are in sync between client and server):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104-&amp;gt;lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:&apos;&apos;
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I think this is the first (instant) attempt at receiving the response on the server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000100:19.0:1706140870.571652:0:18948:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But somehow it does not really lead to anything useful server side, as we see this message arriving repeatedly at increasing intervals:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:17.0:1706140908.945592:0:18949:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0)
     o104-&amp;gt;lfs00-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946
     ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:&apos;&apos;
00000400:00000100:16.0:1706140908.945848:0:18950:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1038:lnet_is_health_check())
     Msg 00000000a906b193 is in inconsistent state, don&apos;t perform health checking (-2, 0)
00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1043:lnet_is_health_check())
     health check = 0, status = -2, hstatus = 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;inconsistent state? because we already received it once?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:18.0:1706140946.314936:0:18950:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104-&amp;gt;lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 0 dl 1706140984 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:&apos;&apos;
00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:17.0:1706140984.714683:0:18948:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104-&amp;gt;lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 0 dl 1706141022 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:&apos;&apos;
00000400:00000200:17.0:1706140984.715079:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
00000400:00000200:17.0:1706140984.715079:0:18948:0:(lib-ptl.c:571:lnet_ptl_match_md())
     Request from 12345-10.31.3.108@tcp of length 224 into portal 16 MB=0x65a379f3f4000
00000400:00000200:17.0:1706140984.715081:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
00000400:00000100:17.0:1706140984.715082:0:18948:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
00000400:00000200:17.0:1706140984.715083:0:18948:0:(lib-msg.c:1038:lnet_is_health_check())
     Msg 0000000051e9a6cd is in inconsistent state, don&apos;t perform health checking (-2, 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and so on until the request timed out&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jan 25 00:01:48 mds000 kernel: Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
     @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
      req@00000000a8fbe768 x1788044801687552/t0(0) o104-&amp;gt;lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:&apos;&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So with this in mind I think we might need somebody with good LNet understanding to tell us what is going on and why the seemingly received message does not get processed and passed up to Lustre?&lt;/p&gt;&lt;/blockquote&gt;
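&lt;p&gt;To check the spacing of the drops, the four lnet_parse_put() header lines quoted above can be parsed with a short Python sketch. It assumes the fourth colon-separated field is the timestamp in seconds and the last field is (file:line:function()); parse_header and drop_intervals are hypothetical helper names. For these four excerpts the gaps come out as 38.374196, 37.369496 and 38.399738 seconds, and the match bits 0x65a379f3f4000 printed by lnet_ptl_match_md() are the xid 1788044801687552 in hexadecimal.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
from decimal import Decimal

# Header lines copied from the lnet_parse_put() excerpts quoted above.
DROP_HEADERS = [
    "00000400:00000100:19.0:1706140870.571652:0:18948:0:(lib-move.c:4092:lnet_parse_put())",
    "00000400:00000100:16.0:1706140908.945848:0:18950:0:(lib-move.c:4092:lnet_parse_put())",
    "00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())",
    "00000400:00000100:17.0:1706140984.715082:0:18948:0:(lib-move.c:4092:lnet_parse_put())",
]


def parse_header(line):
    """Return (timestamp, file, line_number, function) for one header line."""
    # Assumption: seven colon-separated fields precede the source location.
    parts = line.split(":", 7)
    stamp = Decimal(parts[3])  # Decimal keeps the microseconds exact
    location = parts[7].split(")")[0].strip("(")
    source, lineno, func = location.split(":")
    return stamp, source, int(lineno), func


def drop_intervals(lines):
    """Return the gaps, in seconds, between consecutive header lines."""
    stamps = [parse_header(line)[0] for line in lines]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in drop_intervals(DROP_HEADERS):
        print(gap)
    # The xid from the "Dropping PUT" lines, shown as hexadecimal match bits.
    print(hex(1788044801687552))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;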

&lt;p&gt;It is definitely possible that the blocking AST request is generated with the wrong NID for the reply buffer; we haven&apos;t yet looked into that code to confirm.&lt;/p&gt;</comment>
                            <comment id="402768" author="gerrit" created="Mon, 5 Feb 2024 22:43:14 +0000"  >&lt;p&gt;&lt;del&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch:&lt;/del&gt; &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/53929&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/53929&lt;/a&gt;&lt;br/&gt;
&lt;del&gt;Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-17476&quot; title=&quot;lnet: only report mismatched nid in ME if bits match&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-17476&quot;&gt;LU-17476&lt;/a&gt; tests: check for sanity/13a failures&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Project: fs/lustre-release&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Branch: master&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Current Patch Set: 1&lt;/del&gt;&lt;br/&gt;
&lt;del&gt;Commit: b411c5c7b817e4914ecfaa596c5e0b82106fc23c&lt;/del&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="79660">LU-17379</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i04993:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>