<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:34:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3458] OST not able to register at MGS with predefined index.</title>
                <link>https://jira.whamcloud.com/browse/LU-3458</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The client may lose the reply to a target registration request. The MGS assumes the reply was delivered and marks the target index as used, but the client never accepted a reply, so after reconnecting it restarts registration from the beginning. &lt;br/&gt;
The MGS then responds that the index is already in use. &lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[ 2619.730706] Lustre: MGC172.18.1.2@tcp: Reactivating &lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt;
[ 2626.816706] Lustre: 56551:0:(client.c:1819:ptlrpc_expire_one_request()) @@@ Request  sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1370729827/real 1370729827]  req@ffff8806014d5c00 x1437314393833474/t0(0) o253-&amp;gt;MGC172.18.1.2@tcp@172.18.1.2@tcp:26/25 lens 4736/4736 e 0 to 1 dl 1370729834 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 2626.844904] LustreError: 166-1: MGC172.18.1.2@tcp: Connection to MGS (at 172.18.1.2@tcp) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
[ 2626.896533] Lustre: MGC172.18.1.2@tcp: Reactivating &lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt;
[ 2626.902111] Lustre: MGC172.18.1.2@tcp: Connection restored to MGS (at 172.18.1.2@tcp)
[ 2629.380926] Lustre: MGC172.18.1.2@tcp: Reactivating &lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt;
[ 2632.367220] LustreError: 15f-b: Communication to the MGS &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; error -98. Is the MGS running?
[ 2632.376077] LustreError: 58337:0:(obd_mount.c:1834:server_fill_super()) Unable to start targets: -98
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The attached logs describe the bug in detail (log1 is from the MGS side, log2 is from the OSS side; the XID of the initial registration request is x1437620344193026).&lt;/p&gt;

&lt;p&gt;The bug occurs because the MGS does not schedule a reply to the target register command and assumes the client always receives the reply. The bug was originally hit on the Xyratex b_neo_stable branch (mostly the 2.1 codebase), but a quick look suggests it also exists in 2.4.&lt;/p&gt;</description>
                <environment>Any Lustre version from 1.6.0 onward with mountconf, where the OST was prepared with the --index option.</environment>
        <key id="19377">LU-3458</key>
            <summary>OST not able to register at MGS with predefined index.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="shadow">Alexey Lyashkov</reporter>
                        <labels>
                    </labels>
                <created>Wed, 12 Jun 2013 09:55:48 +0000</created>
                <updated>Wed, 23 Dec 2015 02:36:20 +0000</updated>
                                            <version>Lustre 2.0.0</version>
                    <version>Lustre 2.1.0</version>
                    <version>Lustre 2.2.0</version>
                    <version>Lustre 2.3.0</version>
                    <version>Lustre 2.4.0</version>
                    <version>Lustre 1.8.x (1.8.0 - 1.8.5)</version>
                    <version>Lustre 2.5.0</version>
                                                        <due></due>
                            <votes>1</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="60426" author="shadow" created="Wed, 12 Jun 2013 09:58:29 +0000"  >&lt;p&gt;Xyratex-bug MRP-1111&lt;/p&gt;</comment>
                            <comment id="60433" author="bzzz" created="Wed, 12 Jun 2013 13:02:27 +0000"  >&lt;p&gt;I think the MGS could reconstruct the reply using last_rcvd, but that doesn&apos;t cover the case where the MGS crashes and the reply is lost. I guess in the latter case the only way is to remove the new profile with writeconf.&lt;/p&gt;
</comment>
                            <comment id="60435" author="shadow" created="Wed, 12 Jun 2013 13:18:02 +0000"  >&lt;p&gt;It looks like we need to implement the long-standing task of simplifying recovery for the MGS.&lt;br/&gt;
add_target would send a transaction number to the target, and the request would be resent to the MGS if the MGS crashes. If the request is resent but was already committed on the MGS, the reply is reconstructed from the on-disk state. However, this may be too hard with dynamic index allocation from the same NID, because the MGS does not know the mapping and this produces a bitmap leak (though it looks like that bug exists today as well).&lt;/p&gt;</comment>
                            <comment id="60533" author="adilger" created="Thu, 13 Jun 2013 11:39:36 +0000"  >&lt;p&gt;There is no longer dynamic index mapping in Lustre 2.4+ because this was never used by real systems where the administrator wants to know which OST index is on a particular node.&lt;/p&gt;

&lt;p&gt;I agree with Alex that last_rcvd would handle the reply reconstruction, but it also means the MGS would need to &quot;recover&quot; if all clients are in the last_rcvd file (which would be bad).  The clients should NOT be added to the last_rcvd, only the new servers.  Ideally, there should be some way for the OST to remove itself from the last_rcvd file after it has registered and gotten a reply?&lt;/p&gt;

&lt;p&gt;One danger (and the reason this is an error in the first place) is that you don&apos;t want multiple OSTs accidentally trying to claim that they are the same OST index.  That would cause serious filesystem corruption.  That means there should be some way to determine the OST is the right one (e.g. RPC XID) before sending the reconstructed reply.  This wouldn&apos;t work if both the MGS and OSS rebooted, but is probably better than today.&lt;/p&gt;

&lt;p&gt;I was also wondering whether there should be some way to add a newly-formatted OST to replace an old OST with a &quot;--replace&quot; option to mkfs.lustre, which removes the &quot;LDD_F_VIRGIN&quot; flag and allows it to take over the old OST slot.  This isn&apos;t directly related, but it is a similar problem.  That would mean at least the administrator has to understand what is happening before adding the OST in an existing index.&lt;/p&gt;</comment>
                            <comment id="60764" author="adilger" created="Mon, 17 Jun 2013 15:51:20 +0000"  >&lt;p&gt;So, to clarify, I&apos;m not against fixing the MGS/MGC recovery code, but this needs to be done carefully to avoid the problem of complex recovery on the MGS.  It should only be done for OSTs connecting during initial registration.&lt;/p&gt;

&lt;p&gt;Shadow, are you planning on working on this problem?&lt;/p&gt;</comment>
                            <comment id="66848" author="shadow" created="Tue, 17 Sep 2013 16:57:19 +0000"  >&lt;p&gt;Not currently. I am more interested in fixing LNet issues right now, and a workaround exists. But I will be happy to discuss generic recovery for the MGS, as it is a long-standing task.&lt;/p&gt;</comment>
                            <comment id="66879" author="pjones" created="Tue, 17 Sep 2013 22:20:57 +0000"  >&lt;p&gt;ok. Thanks Alexey!&lt;/p&gt;</comment>
                            <comment id="67222" author="shadow" created="Mon, 23 Sep 2013 09:36:55 +0000"  >&lt;p&gt;One more problem in this area: there is a single last_rcvd file per disk storage, so the MGS and MDT would need to share one last_rcvd file, or the file would need to be renamed to make it possible to run several recovery services on a single disk partition. &lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="10110">LU-14</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="13040" name="log1" size="6651431" author="shadow" created="Wed, 12 Jun 2013 09:55:48 +0000"/>
                            <attachment id="13041" name="log2" size="6056411" author="shadow" created="Wed, 12 Jun 2013 09:55:48 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvt3r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8645</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>