<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:36:57 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10646] Lost access to storage hardware during fsck</title>
                <link>https://jira.whamcloud.com/browse/LU-10646</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After updating the firmware and OS on the hardware, I was running fsck -fy against the MDT and OSTs in a large file system when the multipath devices became inaccessible. Currently the processes are in a &quot;D&quot; state, but I believe they had progressed past the pass 5 stage and were all in the process of updating quota inconsistencies.&lt;/p&gt;

&lt;p&gt;I am working with the vendors to determine the reason for the multipath failures, but my real concern at the moment is the state of the file systems. I am reluctant to simply reboot the systems because I don&apos;t want to risk damage.&lt;/p&gt;

&lt;p&gt;I&apos;m hoping that Andreas can weigh in here and give me some advice.&lt;/p&gt;</description>
                <environment>Dell/DDN hardware</environment>
        <key id="50719">LU-10646</key>
            <summary>Lost access to storage hardware during fsck</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="jamervi">Joe Mervini</reporter>
                        <labels>
                    </labels>
                <created>Thu, 8 Feb 2018 18:54:27 +0000</created>
                <updated>Sun, 11 Feb 2018 01:22:33 +0000</updated>
                            <resolved>Sun, 11 Feb 2018 01:22:33 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="220524" author="pjones" created="Fri, 9 Feb 2018 04:55:39 +0000"  >&lt;p&gt;Fan Yong&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="220542" author="yong.fan" created="Fri, 9 Feb 2018 06:11:30 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=jamervi&quot; class=&quot;user-hover&quot; rel=&quot;jamervi&quot;&gt;jamervi&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;Andreas is on vacation. I hope I can help some.&lt;br/&gt;
Generally, I would NOT suggest breaking an in-progress &lt;tt&gt;e2fsck&lt;/tt&gt; by force, to avoid damage. For your case, let&apos;s check the system status first. Please run &quot;echo t &amp;gt; /proc/sysrq-trigger&quot;, then attach the &quot;dmesg&quot; output.&lt;/p&gt;</comment>
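A minimal sketch of the diagnostic step suggested above, assuming a Linux host with standard procps tools and root access for the sysrq step; file paths are placeholders:

```shell
# List processes currently in uninterruptible sleep ("D" state), where a
# hung e2fsck would appear. The ps/awk usage here is standard procps/POSIX.
ps -eo pid,stat,comm | awk '$2 ~ /D/'

# As root, ask the kernel to log a stack trace for every task (sysrq "t"),
# then capture the kernel log to attach to the ticket:
#   echo t > /proc/sysrq-trigger
#   dmesg > /tmp/dmesg-tasks.txt
```

The task dump shows which kernel function each D-state process is blocked in, which is what distinguishes a stuck e2fsck from a stuck block-device path.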
                            <comment id="220556" author="jamervi" created="Fri, 9 Feb 2018 12:54:02 +0000"  >&lt;p&gt;We pretty much moved past the &quot;breaking the in-process e2fsck&quot; stage. There was no way we could re-establish the paths to the devices.&lt;/p&gt;

&lt;p&gt;On reboot we were able to successfully run fsck -n on all devices and they all came up clean but on one server it reported that it was skipping journal recovery because of the read-only nature of the fsck. &lt;/p&gt;

&lt;p&gt;The problem we are encountering is that any time we run fsck -fy, it disrupts multipath, causing all paths to the device being checked to fail. It is unclear whether the problem is with e2fsck, device-mapper-multipath, the underlying ib_srp subsystem, or a combination of all three.&lt;/p&gt;

&lt;p&gt;Currently I am running a generic fsck against one of the OSTs (fsck /dev/mapper/&amp;lt;device&amp;gt;) and it doesn&apos;t appear to be doing anything. Although the path is still active, I am not seeing any IO via iostat on the server or via the monitoring services on the storage controllers.&lt;/p&gt;

&lt;p&gt;Although I have only checked one other OST on that same system, I was able to mount that device as ldiskfs.&lt;/p&gt;

&lt;p&gt;Any assistance would be greatly appreciated.  &lt;/p&gt;</comment>
                            <comment id="220576" author="yong.fan" created="Fri, 9 Feb 2018 15:19:09 +0000"  >&lt;blockquote&gt;
&lt;p&gt;On reboot we were able to successfully run fsck -n on all devices and they all came up clean but on one server it reported that it was skipping journal recovery because of the read-only nature of the fsck.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I am not sure whether I understand the issue correctly. You said that you ran &quot;fsck -n&quot; on all OSTs successfully, and only one OST reported &quot;skipping journal recovery&quot;. That means the data paths are available (at least for read) for all OSTs, right? Otherwise, the read-only mode &lt;tt&gt;fsck&lt;/tt&gt; should fail.&lt;/p&gt;

&lt;p&gt;Then you said that once you run &quot;fsck -fy&quot; on some OST, the fsck will break because all data paths to that device become unavailable, right? If so, it seems the data paths to that OST are downgraded to read-only, not writable (just a suspicion). Is that OST the one with &quot;skipping journal recovery&quot;? Or an OST that reported clean under &quot;fsck -n&quot;? Or will all OSTs fail under &quot;fsck -fy&quot;?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Currently I am running a generic fsck against one of the OSTs (fsck /dev/mapper/&amp;lt;device&amp;gt;) and it doesn&apos;t appear to be doing anything. Although the path is still active, I am not seeing any IO via iostat on the server or via the monitoring services on the storage controllers.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;You mean neither reads nor writes are detected during the &quot;fsck -fy&quot;, right? Does the monitoring service see nothing from the very beginning of the &quot;fsck -fy&quot;? And what does the monitoring service see if you run &quot;fsck -n&quot;? Normal activity?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Although I have only checked one other OST on that same system, I was to mount the device ldiskfs.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Sorry, I am not clear about that. You mean you can mount the device as ldiskfs? If so, then that OST&apos;s data path is still available for read.&lt;/p&gt;

&lt;p&gt;Anyway, it seems that you have already interrupted the &quot;fsck -fy&quot; via the reboot, right?&lt;/p&gt;

&lt;p&gt;Have you configured HA for the OST? If so, what is the status when accessing the OST via the other OSS node?&lt;/p&gt;</comment>
                            <comment id="220583" author="jamervi" created="Fri, 9 Feb 2018 15:53:52 +0000"  >&lt;p&gt;Yes - All the systems were rebooted yesterday because the ib_srp paths could not be re-established.&lt;/p&gt;

&lt;p&gt;After rebooting all the systems, &apos;fsck -n&apos; was run on all OSTs and the MDT. All targets reported clean in the read-only fsck. On one OSS, all OSTs reported skipping journal recovery during the &apos;fsck -n&apos;.&lt;/p&gt;

&lt;p&gt;When &apos;fsck -fy&apos; was run on one OST, after a period of time both paths to the device failed. Afterwards the device was completely inaccessible from the host running the fsck. There is a possibility that access through the individual /dev/sd device is still available; the reason I say this is that if I go to the failover node and try to access the device, I get errors due to multi-mount protection. (Now that you mention it, that seems to point more to a device-mapper-multipath problem.) The experience Wednesday night was that all targets failed when fsck -fy was run.&lt;/p&gt;

&lt;p&gt;When I run fsck -n against one of the OSTs, I can see read activity on the storage controller.&lt;/p&gt;

&lt;p&gt;So to reiterate: it appears that the file systems are intact on all targets. fsck in read-only mode will run without error on all targets. fsck -fy will hang in a &quot;D&quot; state on all targets due to failure of the paths to devices under multipath control.&lt;/p&gt;

&lt;p&gt;Hopefully this offers a little more clarity. &lt;/p&gt;</comment>
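For reference, the two fsck modes being contrasted in this thread can be sketched as follows; the device path is a placeholder, and the flags are standard e2fsprogs options:

```shell
DEV=/dev/mapper/OSTDEV   # placeholder for the OST block device

# -n: open the filesystem read-only and answer "no" to every prompt,
#     so nothing is modified (the safe verification pass described above).
fsck -n "$DEV"

# -f: force a full check even if the filesystem is marked clean.
# -y: answer "yes" to every prompt, repairing in place; only run this
#     once the underlying paths are stable, and ideally after a backup.
fsck -fy "$DEV"
```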
                            <comment id="220643" author="jamervi" created="Fri, 9 Feb 2018 20:45:38 +0000"  >&lt;p&gt;I am bypassing dm-mapper-multipath and running fsck. It is not disrupting the path.&lt;/p&gt;

&lt;p&gt;The device that I am running fsck against is one that I could not mount as ldiskfs. I ran fsck without any flags and it came back after flushing the journal. I then ran fsck -fp and got the message: &lt;br/&gt;
[root@goss13 ~]# fsck -fp /dev/sde&lt;br/&gt;
fsck from util-linux 2.23.2&lt;br/&gt;
gscratch-OST007c: Interior extent node level 0 of inode 3:&lt;br/&gt;
Logical start 0 does not match logical start 73 at next level.&lt;/p&gt;

&lt;p&gt;gscratch-OST007c: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.&lt;br/&gt;
        (i.e., without -a or -p options)&lt;/p&gt;

&lt;p&gt;I am now running the fsck without flags and am seeing this:&lt;/p&gt;

&lt;p&gt;[root@goss13 ~]# fsck /dev/sde&lt;br/&gt;
fsck from util-linux 2.23.2&lt;br/&gt;
e2fsck 1.42.13.wc5 (15-Apr-2016)&lt;br/&gt;
gscratch-OST007c contains a file system with errors, check forced.&lt;br/&gt;
Pass 1: Checking inodes, blocks, and sizes&lt;br/&gt;
Interior extent node level 0 of inode 3:&lt;br/&gt;
Logical start 0 does not match logical start 73 at next level.  Fix&amp;lt;y&amp;gt;? no&lt;br/&gt;
Inode 3, i_blocks is 600, should be 16.  Fix&amp;lt;y&amp;gt;? no&lt;/p&gt;

&lt;p&gt;Please advise.&lt;/p&gt;</comment>
                            <comment id="220695" author="yong.fan" created="Sat, 10 Feb 2018 17:14:26 +0000"  >&lt;p&gt;Inode &amp;lt;3&amp;gt; is an ldiskfs internal inode, used for user quota. That means the user quota is broken. There seems to be no better way; you have to choose &quot;yes&quot; to fix the corruption. The worst case is that the user quota becomes inconsistent, but that is not fatal; we can rebuild it later.&lt;/p&gt;

&lt;p&gt;To be safe, if possible, please make a device-level backup of the OST (such as with dd) before the repair.&lt;/p&gt;</comment>
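A minimal sketch of the device-level backup suggested above, assuming GNU dd; the device and image paths are placeholders:

```shell
# Copy the raw OST block device to an image file before running a
# repairing e2fsck. bs=4M is just a throughput-friendly block size;
# status=progress (GNU coreutils) prints a running byte count.
dd if=/dev/mapper/OSTDEV of=/backup/OSTDEV.img bs=4M status=progress

# Verify the copy matches the source before proceeding with the repair:
cmp /dev/mapper/OSTDEV /backup/OSTDEV.img
```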
                            <comment id="220697" author="jamervi" created="Sat, 10 Feb 2018 21:53:45 +0000"  >&lt;p&gt;The other affected OST had the same error and was unwittingly run with the -fy option. Everything came up fine. I repaired the OST mentioned above and was able to mount both OSTs as ldiskfs.&lt;/p&gt;

&lt;p&gt;We discovered the reason the fsck was causing the multipath paths to disappear: there was an inconsistency in the values of max_sectors_kb between the dm device and the associated sd devices. This inconsistency had been lurking for quite some time, but somehow the fsck tickled the system in a way that caused it to rear its head.&lt;/p&gt;

&lt;p&gt;We are holding off mounting lustre until we replace a faulty storage controller but I believe that everything is on track to bring the file system back online. &lt;/p&gt;

&lt;p&gt;Thanks for your assistance. Please close this ticket.&lt;/p&gt;</comment>
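The max_sectors_kb mismatch described above can be inspected from sysfs. A hypothetical sketch (check_msk is an illustrative helper name, not an existing tool; the sysfs root is parameterized so the sketch is easy to exercise):

```shell
# Print max_sectors_kb for a dm (multipath) device and each of its
# underlying sd "slaves". The values should normally agree; a mismatch
# can surface as path failures under heavy I/O such as a forced e2fsck.
check_msk() {
    root=$1; dm=$2          # root is normally /sys; dm e.g. dm-0
    echo "$dm: $(cat "$root/block/$dm/queue/max_sectors_kb")"
    for s in "$root/block/$dm/slaves"/*; do
        sd=$(basename "$s")
        echo "$sd: $(cat "$root/block/$sd/queue/max_sectors_kb")"
    done
}
# e.g. check_msk /sys dm-0
```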
                            <comment id="220698" author="yong.fan" created="Sun, 11 Feb 2018 01:22:33 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=jamervi&quot; class=&quot;user-hover&quot; rel=&quot;jamervi&quot;&gt;jamervi&lt;/a&gt;,&lt;br/&gt;
glad to know the system recovered.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzslj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>