<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:27:20 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16475] Reusing OST indexes after lctl del_ost</title>
                <link>https://jira.whamcloud.com/browse/LU-16475</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I have been investigating the possibility of reusing OST indexes after&#160;&lt;tt&gt;lctl del_ost&lt;/tt&gt; and I wanted to describe the current known issues and ideas for improvements in a ticket to get some feedback.&lt;/p&gt;

&lt;p&gt;We have been using &lt;tt&gt;lctl del_ost&lt;/tt&gt; in production (backported to 2.12) on two different systems, and it worked great, as long as one doesn&apos;t intend to reuse the indexes of the deleted OSTs.&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;lctl del_ost&lt;/tt&gt; removes the OST&apos;s llog entries on the MGS in &lt;tt&gt;CONFIGS/fsname-MDT*&lt;/tt&gt; and &lt;tt&gt;CONFIGS/fsname-client&lt;/tt&gt;. The MGS propagates those changes to the MDTs and clients. However, as long as the MDTs and clients are not restarted, they keep in-memory references to the deleted OSTs.&lt;/p&gt;

&lt;p&gt;Let&apos;s test after removing an OST as follows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; # lctl conf_param newfir-OST0000.osc.active=0 # deactivate the OST
 # lctl --device MGS del_ost --target newfir-OST0000 # remove OST from the config
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
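&lt;p&gt;As a sanity check, the remaining config records can be dumped directly on the MGS with &lt;tt&gt;lctl llog_print&lt;/tt&gt;; after &lt;tt&gt;del_ost&lt;/tt&gt;, the client and MDT config logs should no longer show active records for the deleted OST (log names below are for this test filesystem):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; # lctl --device MGS llog_print newfir-client | grep OST0000
 # lctl --device MGS llog_print newfir-MDT0000 | grep OST0000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;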
&lt;p&gt;Using the following command on the MDS, we can see that the deleted OST (here OST0000) is still referenced:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mds# lctl get_param osc.{*}OST{*}.prealloc_status
osc.newfir-OST0000-osc-MDT0000.prealloc_status=-108
osc.newfir-OST0001-osc-MDT0000.prealloc_status=0
osc.newfir-OST0002-osc-MDT0000.prealloc_status=0
osc.newfir-OST0003-osc-MDT0000.prealloc_status=0
osc.newfir-OST0004-osc-MDT0000.prealloc_status=0
osc.newfir-OST0005-osc-MDT0000.prealloc_status=0
osc.newfir-OST0006-osc-MDT0000.prealloc_status=0
osc.newfir-OST0007-osc-MDT0000.prealloc_status=0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;On the client, the deleted OST shows up as an &quot;inactive device&quot; in &lt;tt&gt;lfs df -v&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-rbh03 ~]# lfs df -v /newfir
UUID 1K-blocks Used Available Use% Mounted on
newfir-MDT0000_UUID 9056940 5548 8233744 1% /newfir[MDT:0]
OST0000 : inactive device
newfir-OST0001_UUID 148751801588 240244 147251465344 1% /newfir[OST:1] f
newfir-OST0002_UUID 148751801588 175304 147251530284 1% /newfir[OST:2] f
newfir-OST0003_UUID 148751801588 292072 147251413516 1% /newfir[OST:3] f
newfir-OST0004_UUID 148751801588 299544 147251406044 1% /newfir[OST:4] f
newfir-OST0005_UUID 148751801588 323452 147251382136 1% /newfir[OST:5] f
newfir-OST0006_UUID 148751801588 226332 147251479256 1% /newfir[OST:6] f
newfir-OST0007_UUID 148751801588 274664 147251430924 1% /newfir[OST:7] f

filesystem_summary: 1041262611116 1831612 1030760107504 1% /newfir
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Ideally, all references to the OST should be removed after &lt;tt&gt;lctl del_ost&lt;/tt&gt;, so that we can reuse the OST index as if it had never been used before. But that seems like quite a big endeavour.&lt;/p&gt;

&lt;p&gt;Now, if we remount the MDTs and clients, things are much better: there is no trace of the deleted OST in memory anymore. In theory, we should be able to reuse the OST index in that case. However, I found a problem with the current implementation of &lt;tt&gt;lctl del_ost&lt;/tt&gt;: it keeps a configuration file for the deleted OST under &lt;tt&gt;CONFIGS/fsname-OST0000&lt;/tt&gt; on the MGS. This probably should be fixed (I will try to submit a patch for that). Indeed, if we try to start a fresh OST with the same index after &lt;tt&gt;del_ost&lt;/tt&gt; and a full restart of all targets, we still get the following error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jan 13 15:10:55 fir-io1-s1 kernel: LustreError: 141-4: The config log for newfir-OST0000 already exists, yet the server claims it never registered. It may have been reformatted, or the index changed. writeconf the MDT to regenerate all logs.
Jan 13 15:10:55 fir-io1-s1 kernel: LustreError: 55854:0:(mgs_llog.c:4351:mgs_write_log_target()) Can&apos;t write logs for newfir-OST0000 (-114)
Jan 13 15:10:55 fir-io1-s1 kernel: LustreError: 55854:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write newfir-OST0000 log (-114)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This error comes from &lt;tt&gt;mgs_write_log_ost()&lt;/tt&gt; in &lt;tt&gt;mgs/mgs_llog.c&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        /* If the ost log already exists, that means that someone reformatted
           the ost and it called target_add again. */
        if (!mgs_log_is_empty(env, mgs, mti-&amp;gt;mti_svname)) {
                LCONSOLE_ERROR_MSG(0x141, &quot;The config log for %s already &quot;
                                   &quot;exists, yet the server claims it never &quot;
                                   &quot;registered. It may have been reformatted, &quot;
                                   &quot;or the index changed. writeconf the MDT to &quot;
                                   &quot;regenerate all logs.\n&quot;, mti-&amp;gt;mti_svname);
                RETURN(-EALREADY);
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;After manually removing &lt;tt&gt;CONFIGS/newfir-OST0000&lt;/tt&gt; from the MGS following &lt;tt&gt;del_ost&lt;/tt&gt;, the error goes away, and mounting a freshly formatted OST with the same index seems to work.&lt;/p&gt;
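&lt;p&gt;One way to do this manual removal, assuming an ldiskfs MGS backend (the device and mount point names below are just examples):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mgs# umount /mnt/mgs                        # stop the MGS
mgs# mount -t ldiskfs /dev/mgsdev /mnt/mgs  # mount the backing fs directly
mgs# rm /mnt/mgs/CONFIGS/newfir-OST0000     # drop the stale config log
mgs# umount /mnt/mgs
mgs# mount -t lustre /dev/mgsdev /mnt/mgs   # restart the MGS
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;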

&lt;p&gt;The last thing I haven&apos;t heavily tested is the LAST_ID issue. In my tests, with only a few files, this doesn&apos;t seem to be a problem (it does not trigger an lfsck layout check to try to repair it), but I wonder if it could be one in production, when LAST_ID is very high: it doesn&apos;t seem to be reset to 0 in &lt;tt&gt;osc.newfir-OST0000-osc-MDT0000.prealloc_last_id&lt;/tt&gt; when the replacement OST registers. Is there a way to ensure it is reset on &lt;tt&gt;del_ost&lt;/tt&gt;? (Where is it stored? Perhaps this is also something we can clean up on &lt;tt&gt;del_ost&lt;/tt&gt;?)&lt;/p&gt;

&lt;p&gt;Just to clarify, note that we have never used &lt;tt&gt;mkfs.lustre --replace&lt;/tt&gt; here, as we do actually want the new OST to register to the MGS+MDTs.&lt;/p&gt;</description>
                <environment></environment>
        <key id="74002">LU-16475</key>
            <summary>Reusing OST indexes after lctl del_ost</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="sthiell">Stephane Thiell</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Sat, 14 Jan 2023 04:53:48 +0000</created>
                <updated>Wed, 10 May 2023 17:57:45 +0000</updated>
                                            <version>Lustre 2.16.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="359094" author="adilger" created="Sat, 14 Jan 2023 12:28:39 +0000"  >&lt;p&gt;Two things here:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;the config llogs are meant to be read in ascending order, and the clients just keep a cursor of where they are in the log. If new config records are added at the end (new OST or new parameter) the clients will read from the cursor to the current end of the log. With del_ost the clients have no way to know that records earlier in the llog were cancelled. It was expected that clients would have remounted long before there is a need to reuse the index.&lt;/li&gt;
	&lt;li&gt;the LAST_ID value is intentionally not reset, to avoid giving out the same OST object number/FID+ost_index for two different files. The newly formatted OST will understand that the MDS has used objects up to N and start at N+1. There should not be any concern with this numbering. I&apos;m not totally sure this is 100% fixed in 2.12.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="34119">LU-7668</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03a1j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>