<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:02:30 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6702] shutting down OSTs in parallel with MDT(s)</title>
                <link>https://jira.whamcloud.com/browse/LU-6702</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;When shutting down OSTs and MDTs in parallel, we see some OSTs that shut down quite quickly:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun  9 10:45:56 eagle-8.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST002e
Jun  9 10:45:56 eagle-8.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST002e complete
Jun  9 10:45:57 eagle-8.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0002
Jun  9 10:45:57 eagle-8.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST0002 complete
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And yet in other cases, some OSTs being shut down get hung up on timeouts, seemingly while communicating with the MDT:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun  9 10:45:57 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST000c
Jun  9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.47@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun  9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 52 previous similar messages
Jun  9 10:45:58 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST000c complete
Jun  9 10:46:00 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0038
Jun  9 10:46:00 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST0038 complete
Jun  9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1585:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433871982/real 1433871982]  req@ffff8802d3ce1800 x1497165981090008/t0(0) o400-&amp;gt;testfs-MDT0000-lwp-OST0022@10.100.4.2@tcp:12/10 lens 224/224 e 0 to 1 dl 1433871989 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun  9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: testfs-MDT0000-lwp-OST004d: Connection to testfs-MDT0000 (at 10.100.4.2@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jun  9 10:46:29 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1585:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun  9 10:47:14 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun  9 10:47:14 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 35 previous similar messages
Jun  9 10:47:55 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872064/real 1433872064]  req@ffff880101e93800 x1497165981090052/t0(0) o38-&amp;gt;testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872075 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun  9 10:47:55 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Jun  9 10:49:44 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST0023_UUID: not available for connect from 10.100.4.54@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun  9 10:49:44 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 77 previous similar messages
Jun  9 10:51:05 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872239/real 1433872239]  req@ffff88028aab4800 x1497165981090128/t0(0) o38-&amp;gt;testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872265 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun  9 10:51:05 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Jun  9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST004e_UUID: not available for connect from 10.100.4.33@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun  9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: 137-5: testfs-OST000c_UUID: not available for connect from 10.100.4.33@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun  9 10:54:47 eagle-18.eagle.hpdd.intel.com kernel: LustreError: Skipped 167 previous similar messages
Jun  9 10:56:20 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1433872539/real 1433872539]  req@ffff8803250dc800 x1497165981090224/t0(0) o38-&amp;gt;testfs-MDT0000-lwp-OST0022@10.100.4.1@tcp:12/10 lens 400/544 e 0 to 1 dl 1433872580 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun  9 10:56:20 eagle-18.eagle.hpdd.intel.com kernel: Lustre: 1582:0:(client.c:1920:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Jun  9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Failing over testfs-OST0022
Jun  9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: server umount testfs-OST004d complete
Jun  9 11:01:01 eagle-18.eagle.hpdd.intel.com kernel: Lustre: Skipped 1 previous similar message
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Apparently (if my log reading is not too rusty) the OSTs that got hung up while being stopped were timing out trying to communicate with the MDT, presumably because the MDT beat those OSTs to the stopped state.  Is my analysis accurate?  If so, a couple of questions:&lt;/p&gt;

&lt;p&gt;What is this connection from the OST to the MDT being used for?&lt;/p&gt;

&lt;p&gt;Is this a connection that the OST initiates to the MDT or vice versa?&lt;/p&gt;
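
&lt;p&gt;(For reference, the OST-to-MDT import in question shows up in the OSS device list as a lwp device, matching the testfs-MDT0000-lwp-OST0022 names in the logs above; the command is standard lctl, but the output line here is illustrative, not captured from these nodes:)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;oss# lctl dl | grep lwp
 12 UP lwp testfs-MDT0000-lwp-OST0022 testfs-MDT0000-lwp-OST0022_UUID 5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;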

&lt;p&gt;I had always understood that the ideal order for shutting down Lustre was to shut down the MDT(s) first and then the OST(s), so as not to leave the MDT up and handing out references to OSTs that are no longer able to service requests.  If that understanding is correct, how does it square with the timeouts seen while trying to shut down an OST after the MDT is down?&lt;/p&gt;</description>
                <environment></environment>
        <key id="30574">LU-6702</key>
            <summary>shutting down OSTs in parallel with MDT(s)</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="brian">Brian Murrell</reporter>
                        <labels>
                    </labels>
                <created>Tue, 9 Jun 2015 20:36:43 +0000</created>
                <updated>Fri, 12 Jun 2015 18:50:48 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="117987" author="adilger" created="Wed, 10 Jun 2015 01:59:40 +0000"  >&lt;p&gt;Brian,&lt;br/&gt;
You are right that shutting down the MDS first is probably best. I think shutting down the MDS and OSS at the same time causes some RPCs to be accepted but then dropped, rather than rejected outright. &lt;/p&gt;

&lt;p&gt;The OSS-&amp;gt;MDS connection is needed for quota and FLDB service, and is separate from the MDS-&amp;gt;OSS connection. &lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                            <outwardlinks description="is blocking">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxfbr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>