<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:50:16 UTC 2024

It is possible to restrict the fields returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5298] The lwp device cannot be started when we migrate from Lustre 2.1 to Lustre 2.4</title>
                <link>https://jira.whamcloud.com/browse/LU-5298</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have an issue with the quotas on our filesystems after the upgrade from&lt;br/&gt;
Lustre 2.1.6 to Lustre 2.4.3.&lt;/p&gt;

&lt;p&gt;The quotas have been successfully enabled on all target devices using&lt;br/&gt;
&lt;em&gt;&apos;tunefs.lustre --quota $device&apos;&lt;/em&gt;. The enforcement has been enabled with &lt;em&gt;&apos;lctl conf_param scratch.quota.mdt=ug&apos;&lt;/em&gt; and &lt;em&gt;&apos;lctl conf_param scratch.quota.ost=ug&apos;&lt;/em&gt; on&lt;br/&gt;
the MGS. However, the enforcement does not work and users can exceed their&lt;br/&gt;
quota limits.&lt;/p&gt;

&lt;p&gt;Check of the quota_slave.info on the MDT:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.scratch-MDT0000.quota_slave.info=
target name:    scratch-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: not setup yet
space acct:     ug
user uptodate:  glb[0],slv[0],reint[1]
group uptodate: glb[0],slv[0],reint[1]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can see that the connection to the QMT is not set up yet. I also noticed that&lt;br/&gt;
the lwp device is not started, so no callback can be sent to the QMT.&lt;/p&gt;

&lt;p&gt;By looking at the code, it seems that the lwp device cannot be started when we&lt;br/&gt;
migrate from Lustre 2.1 to Lustre 2.4.&lt;/p&gt;

&lt;p&gt;In lustre/obdclass/obd_mount_server.c:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt; 717 /**
 718  * Retrieve MDT nids from the client log, then start the lwp device.
 719  * there are only two scenarios which would include mdt nid.
 720  * 1.
 721  * marker   5 (flags=0x01, v2.1.54.0) lustre-MDT0000  &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; xxx-
 722  * add_uuid  nid=192.168.122.162@tcp(0x20000c0a87aa2)  0:  1:192.168.122.162@tcp
 723  * attach    0:lustre-MDT0000-mdc  1:mdc  2:lustre-clilmv_UUID
 724  * setup     0:lustre-MDT0000-mdc  1:lustre-MDT0000_UUID  2:192.168.122.162@tcp
 725  * add_uuid  nid=192.168.172.1@tcp(0x20000c0a8ac01)  0:  1:192.168.172.1@tcp
 726  * add_conn  0:lustre-MDT0000-mdc  1:192.168.172.1@tcp
 727  * modify_mdc_tgts add 0:lustre-clilmv  1:lustre-MDT0000_UUID xxxx
 728  * marker   5 (flags=0x02, v2.1.54.0) lustre-MDT0000  &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; xxxx-
 729  * 2.
 730  * marker   7 (flags=0x01, v2.1.54.0) lustre-MDT0000  &lt;span class=&quot;code-quote&quot;&gt;&apos;add failnid&apos;&lt;/span&gt; xxxx-
 731  * add_uuid  nid=192.168.122.2@tcp(0x20000c0a87a02)  0:  1:192.168.122.2@tcp
 732  * add_conn  0:lustre-MDT0000-mdc  1:192.168.122.2@tcp
 733  * marker   7 (flags=0x02, v2.1.54.0) lustre-MDT0000  &lt;span class=&quot;code-quote&quot;&gt;&apos;add failnid&apos;&lt;/span&gt; xxxx-
 734  **/
 735 &lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; client_lwp_config_process(&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct lu_env *env,
 736                      struct llog_handle *handle,
 737                      struct llog_rec_hdr *rec, void *data)
 738 {
[...]
 779         /* Don&apos;t &lt;span class=&quot;code-keyword&quot;&gt;try&lt;/span&gt; to connect old MDT server without LWP support,
 780          * otherwise, the old MDT could regard &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; LWP client as
 781          * a normal client and save the export on disk &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery.
 782          *
 783          * This usually happen when rolling upgrade. LU-3929 */
 784         &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (marker-&amp;gt;cm_vers &amp;lt; OBD_OCD_VERSION(2, 3, 60, 0))
 785             GOTO(out, rc = 0);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The function checks the MDT server version recorded in the llog. I checked on the MGS of&lt;br/&gt;
our Lustre 2.4 filesystem and the version recorded for the device scratch-MDT0000-mdc is&lt;br/&gt;
2.1.6.0.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#09 (224)marker   5 (flags=0x01, v2.1.6.0) scratch-MDT0000 &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; Sat Jul  5 14:40:44 2014-
#10 (088)add_uuid  nid=192.168.122.41@tcp(0x20000c0a87a29)  0:  1:192.168.122.41@tcp
#11 (128)attach    0:scratch-MDT0000-mdc  1:mdc  2:scratch-clilmv_UUID
#12 (144)setup     0:scratch-MDT0000-mdc  1:scratch-MDT0000_UUID  2:192.168.122.41@tcp
#13 (168)modify_mdc_tgts add 0:scratch-clilmv  1:scratch-MDT0000_UUID  2:0  3:1  4:scratch-MDT0000-mdc_UUID
#14 (224)marker   5 (flags=0x02, v2.1.6.0) scratch-MDT0000 &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; Sat Jul  5 14:40:44 2014-
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After a writeconf on the filesystem, the llog has been updated and the device scratch-MDT0000-mdc is now registered with version 2.4.3.0.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#09 (224)marker   6 (flags=0x01, v2.4.3.0) scratch-MDT0000 &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; Sat Jul  5 15:19:27 2014-
#10 (088)add_uuid  nid=192.168.122.41@tcp(0x20000c0a87a29)  0:  1:192.168.122.41@tcp
#11 (128)attach    0:scratch-MDT0000-mdc  1:mdc  2:scratch-clilmv_UUID
#12 (144)setup     0:scratch-MDT0000-mdc  1:scratch-MDT0000_UUID  2:192.168.122.41@tcp
#13 (168)modify_mdc_tgts add 0:scratch-clilmv  1:scratch-MDT0000_UUID  2:0  3:1  4:scratch-MDT0000-mdc_UUID
#14 (224)marker   6 (flags=0x02, v2.4.3.0) scratch-MDT0000 &lt;span class=&quot;code-quote&quot;&gt;&apos;add mdc&apos;&lt;/span&gt; Sat Jul  5 15:19:27 2014-
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Check of the quota_slave.info:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.scratch-MDT0000.quota_slave.info=
target name:    scratch-MDT0000
pool ID:        0
type:           md
quota enabled:  none
conn to master: setup
space acct:     ug
user uptodate:  glb[0],slv[0],reint[0]
group uptodate: glb[0],slv[0],reint[0]
# lctl conf_param scratch.quota.mdt=ug
# lctl conf_param scratch.quota.ost=ug
# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.scratch-MDT0000.quota_slave.info=
target name:    scratch-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: setup
space acct:     ug
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The same behavior is observed on the OSTs.&lt;/p&gt;

&lt;p&gt;It would be better to:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;get the current version of the MDT server instead of the one recorded in the llog&lt;/li&gt;
	&lt;li&gt;or modify the operations manual to perform a writeconf while upgrading to a major release&lt;/li&gt;
	&lt;li&gt;or add a CWARN before lustre/obdclass/obd_mount_server.c:785 GOTO(out, rc = 0);&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I think this issue can be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5192&quot; title=&quot;upgrade 2.1 -&amp;gt; 2.4.3 quota errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5192&quot;&gt;&lt;del&gt;LU-5192&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</description>
                <environment>RHEL6 w/ patched kernel for Lustre server</environment>
        <key id="25444">LU-5298</key>
            <summary>The lwp device cannot be started when we migrate from Lustre 2.1 to Lustre 2.4</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="bruno.travouillon">Bruno Travouillon</reporter>
                        <labels>
                    </labels>
                <created>Mon, 7 Jul 2014 07:53:41 +0000</created>
                <updated>Fri, 1 Jul 2016 06:13:42 +0000</updated>
                            <resolved>Fri, 1 Jul 2016 06:13:42 +0000</resolved>
                                    <version>Lustre 2.4.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="88255" author="johann" created="Mon, 7 Jul 2014 13:19:37 +0000"  >&lt;p&gt;For reference, the issue was introduced by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3929&quot; title=&quot;2.1.6-&amp;gt;2.4.1 rolling upgrade: lustre-MDT0000: recovery is timed out, evict stale exports&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3929&quot;&gt;&lt;del&gt;LU-3929&lt;/del&gt;&lt;/a&gt; and I agree that the patch is bogus. I am not sure I understand how we (and Bull as well) were able to pass the rolling upgrade tests.&lt;br/&gt;
Yujian, could you please advise whether we execute a writeconf as part of our rolling upgrade tests?&lt;/p&gt;</comment>
                            <comment id="88315" author="jfc" created="Mon, 7 Jul 2014 16:27:48 +0000"  >&lt;p&gt;Niu,&lt;br/&gt;
Could you please take a look at this issue.&lt;br/&gt;
Thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="88397" author="niu" created="Tue, 8 Jul 2014 02:29:10 +0000"  >&lt;p&gt;We did some upgrade tests, which showed that the config log won&apos;t be converted automatically; this means the fix of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3929&quot; title=&quot;2.1.6-&amp;gt;2.4.1 rolling upgrade: lustre-MDT0000: recovery is timed out, evict stale exports&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3929&quot;&gt;&lt;del&gt;LU-3929&lt;/del&gt;&lt;/a&gt; is wrong. I think we should just revert the bad fix of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3929&quot; title=&quot;2.1.6-&amp;gt;2.4.1 rolling upgrade: lustre-MDT0000: recovery is timed out, evict stale exports&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3929&quot;&gt;&lt;del&gt;LU-3929&lt;/del&gt;&lt;/a&gt; (&lt;a href=&quot;http://review.whamcloud.com/#/c/8328/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8328/&lt;/a&gt;). Oleg, could you help to do that? Thanks.&lt;/p&gt;</comment>
                            <comment id="88406" author="yujian" created="Tue, 8 Jul 2014 06:52:15 +0000"  >&lt;blockquote&gt;&lt;p&gt;Yujian, could you please advise whether we execute a writeconf as part of our rolling upgrade tests?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Hi Johann, we did not perform writeconf in the rolling upgrade tests. Unfortunately, quotas were not tested in the rolling upgrade tests, which is why the rolling upgrade from 2.1.6 to 2.4.2 passed: &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/12d033c6-6bd7-11e3-a73e-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/12d033c6-6bd7-11e3-a73e-52540035b04c&lt;/a&gt; (the test output was lost due to the Maloo cutover).&lt;/p&gt;

&lt;p&gt;We tested quotas in the clean upgrade tests (conf-sanity test 32*). However, writeconf was performed in those tests. So, while fixing this bug, we also need to improve our tests.&lt;/p&gt;</comment>
                            <comment id="88408" author="johann" created="Tue, 8 Jul 2014 08:11:50 +0000"  >&lt;p&gt;Niu, yes, I think we should revert the patch. A workaround for the initial problem would be to check whether OBD_CONNECT_LIGHTWEIGHT is set back in the connect reply and disconnect if it is not.&lt;/p&gt;

&lt;p&gt;Yujian, thanks for your reply.&lt;/p&gt;</comment>
                            <comment id="88412" author="niu" created="Tue, 8 Jul 2014 09:47:49 +0000"  >&lt;blockquote&gt;
&lt;p&gt;We tested quotas in the clean upgrade tests (conf-sanity test 32*). However, writeconf was performed in those tests. So, while fixing this bug, we also need to improve our tests.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;conf-sanity 32 requires writeconf to generate correct NIDs for the devices; we probably need to verify quota in manual test cases?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A workaround for the initial problem would be to check whether OBD_CONNECT_LIGHTWEIGHT is set back in the connect reply and disconnect if it is not.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;So there will be a window that could leave a stub in the last_rcvd. Actually, I think denying the LWP connection on an old server is a simple and robust solution (we have the patch in &lt;a href=&quot;http://review.whamcloud.com/#/c/8086/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8086/&lt;/a&gt;); the drawback is that the customer has to upgrade the MDS to a newer 2.1 before rolling upgrade to 2.4. Is that acceptable?&lt;/p&gt;</comment>
                            <comment id="88413" author="johann" created="Tue, 8 Jul 2014 10:06:48 +0000"  >&lt;blockquote&gt;
&lt;p&gt;So there will be a window that could leave a stub in the last_rcvd.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Right, but the window should be pretty small and the side effect isn&apos;t that serious (i.e. we have to wait for the recovery timer to expire). The benefit is that it works with any version &amp;lt; 2.4.0.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actually, I think denying the LWP connection on an old server is a simple and robust solution (we have the patch in &lt;a href=&quot;http://review.whamcloud.com/#/c/8086/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8086/&lt;/a&gt;); the drawback is that the customer has to upgrade the MDS to a newer 2.1 before rolling upgrade to 2.4. Is that acceptable?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I think it is difficult to impose this constraint now that 2.4.0 was released a while ago. Actually, the two &quot;solutions&quot; are not incompatible and we can do both.&lt;/p&gt;</comment>
                            <comment id="97457" author="bruno.travouillon" created="Fri, 24 Oct 2014 19:57:50 +0000"  >&lt;p&gt;Reverting &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3929&quot; title=&quot;2.1.6-&amp;gt;2.4.1 rolling upgrade: lustre-MDT0000: recovery is timed out, evict stale exports&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3929&quot;&gt;&lt;del&gt;LU-3929&lt;/del&gt;&lt;/a&gt; lwp: don&apos;t connect LWP to old MDT&quot; in 2.5.3 solves the issue (commit a43e0e4ce4).&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="157480" author="niu" created="Fri, 1 Jul 2016 06:13:42 +0000"  >&lt;p&gt;The patch that led to this problem has been reverted.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="20901">LU-3929</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 24 Oct 2014 07:53:41 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Quota</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwqnj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14784</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 7 Jul 2014 07:53:41 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>