<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:25:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2518] Corrupted files</title>
                <link>https://jira.whamcloud.com/browse/LU-2518</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;End customer is Lund University, who has support for our hardware and Lustre through us. I have done some basic troubleshooting with them, had them try rm -f and unlink, but they cannot perform any operations on the files. I suggested they run a file system check, but they cannot take the filesystem down for that. They realize they may have hit a bug and that upgrading could fix the issue, but they would really like to find a way to remove the files right now without going through that yet. I&apos;ll attach the logs I have, and below you can find the last email I have from the customer, which provides a good summary of the issue.&lt;/p&gt;

&lt;p&gt;----- From Customer -----&lt;/p&gt;

&lt;p&gt;The problematic files were sockets/pipes defined on a server not using &lt;br/&gt;
Lustre, and rsynced into Lustre from that server. That went just fine. &lt;br/&gt;
The next step we did was copying the files from one part of our Lustre &lt;br/&gt;
fs to another, and thereby acquiring proper ACLs. This has worked fine &lt;br/&gt;
for all normal files, directories and links, but these sockets have &lt;br/&gt;
turned into something broken, that we can&apos;t remove.&lt;/p&gt;

&lt;p&gt;A little googling brought this up:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-784&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-784&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we&apos;re on 1.8.8, it seems very similar to our problem. What I&apos;d like &lt;br/&gt;
to know is if there&apos;s someone from WhamCloud (or Intel, these days...) &lt;br/&gt;
that can give us any hints on what we can do, other than upgrading to &lt;br/&gt;
Lustre 2.X? I&apos;m guessing there are ways to clear the faulty inodes &lt;br/&gt;
directly from the MDS and/or OSTs, but I&apos;d need some guidance for that. &lt;br/&gt;
We&apos;d really like to have this fixed before talking about a Lustre upgrade...&lt;/p&gt;</description>
                <environment>RHEL 6.2</environment>
        <key id="17014">LU-2518</key>
            <summary>Corrupted files</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="chrislocke">Chris Locke</reporter>
                        <labels>
                            <label>mn8</label>
                    </labels>
                <created>Fri, 21 Dec 2012 09:41:35 +0000</created>
                <updated>Wed, 26 Mar 2014 22:03:17 +0000</updated>
                            <resolved>Tue, 29 Jan 2013 11:29:57 +0000</resolved>
                                    <version>Lustre 1.8.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="49545" author="chrislocke" created="Fri, 21 Dec 2012 09:47:16 +0000"  >&lt;p&gt;I should add, I am a Support Engineer for NetApp&lt;/p&gt;</comment>
                            <comment id="49546" author="pjones" created="Fri, 21 Dec 2012 10:32:38 +0000"  >&lt;p&gt;Bruno will help with this issue&lt;/p&gt;</comment>
                            <comment id="49548" author="bfaccini" created="Fri, 21 Dec 2012 11:03:59 +0000"  >&lt;p&gt;I am setting up what is needed to reproduce the issue on our test system and will try to get you a procedure to fix/remove the files ASAP.&lt;/p&gt;</comment>
                            <comment id="49579" author="bfaccini" created="Fri, 21 Dec 2012 19:11:54 +0000"  >&lt;p&gt;It seems that mounting the MDT as ldiskfs on the MDS and then issuing &quot;setfacl -b &amp;lt;mount-point&amp;gt;/ROOT/&amp;lt;relative-path-to-file&amp;gt;&quot; fixes the problem &quot;live&quot;, without any issue for still-mounted clients that have even recently accessed the affected files/pipes/sockets.&lt;/p&gt;

&lt;p&gt;It looks like the problem comes from the fact that setting an ACL on such files in 1.8 does only half the work ... and thus triggers errors upon further access.&lt;/p&gt;
</comment>
                            <comment id="49581" author="bfaccini" created="Fri, 21 Dec 2012 20:10:16 +0000"  >&lt;p&gt;Since &quot;setfacl -b&quot; resets all ACLs, you may instead want to unset only specific/non-default ACLs on affected files, still via/under the ldiskfs mount of the MDT, by first displaying them with &quot;getfacl&quot; and then selectively removing them with &quot;setfacl -x&quot;.&lt;/p&gt;

&lt;p&gt;Also, having a look at the concerned source code and at the ACL/EA content with debugfs, it seems that setfacl is doing things well, but that further actions (stat()) on the file are missing/expecting something more (EA.LOV ?) to interpret it correctly ...&lt;/p&gt;</comment>
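The getfacl / "setfacl -x" / "setfacl -b" sequence described above can be tried safely on an ordinary scratch file first. A minimal sketch; the target file is a throwaway created by mktemp (not the real MDT path), and the demo exits quietly where the ACL tools or filesystem ACL support are missing:

```shell
#!/bin/sh
# Demonstrates inspecting and selectively removing ACL entries on a
# throwaway file. On the real system, the target would instead be
# <mount-point>/ROOT/<relative-path-to-file> under the ldiskfs mount.
set -eu
command -v setfacl >/dev/null 2>&1 || { echo 'acl tools not installed; skipping'; exit 0; }

f=$(mktemp)
# Add one non-default ACL entry; skip quietly if this filesystem lacks ACL support.
setfacl -m u:root:r "$f" 2>/dev/null || { echo 'no ACL support here; skipping'; rm -f "$f"; exit 0; }

getfacl --omit-header "$f"      # shows the extra "user:root:r--" entry
setfacl -x u:root "$f"          # selectively remove just that entry
getfacl --omit-header "$f"      # back to owner/group/other only
setfacl -b "$f"                 # or: remove ALL extended ACL entries at once
rm -f "$f"
```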
                            <comment id="49587" author="bobijam" created="Fri, 21 Dec 2012 23:26:05 +0000"  >&lt;p&gt;here is the fix patch (should fix MDS code) &lt;a href=&quot;http://review.whamcloud.com/4887&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/4887&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;commit message&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LU-2518 mds: handle reply buffer correctly

* obd_valid is a 64-bit value, mds_shrink_reply()&apos;s 3rd and 4th param
  takes boolean value, we need make proper conversion.
* Fix glitch in mds_shrink_reply().
* Add test cases.

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="49591" author="pjones" created="Sat, 22 Dec 2012 00:43:42 +0000"  >&lt;p&gt;Thanks Bobijam. Will the same fix apply to b1_8? The customer is running 1.8.8-wc1.&lt;/p&gt;</comment>
                            <comment id="49592" author="bobijam" created="Sat, 22 Dec 2012 00:46:17 +0000"  >&lt;p&gt;Yes, this patch is only for b1_8; b2_x uses a different (new) ptlrpc pack method.&lt;/p&gt;</comment>
                            <comment id="49593" author="pjones" created="Sat, 22 Dec 2012 00:50:57 +0000"  >&lt;p&gt;Hmm. But the patch seems to be based on b2_1?!&lt;/p&gt;</comment>
                            <comment id="49595" author="bobijam" created="Sat, 22 Dec 2012 01:03:34 +0000"  >&lt;p&gt;Sorry, my fault; I pushed it to the wrong branch. Fixing it now.&lt;/p&gt;</comment>
                            <comment id="49644" author="bfaccini" created="Mon, 24 Dec 2012 10:14:41 +0000"  >&lt;p&gt;Hello Chris,&lt;/p&gt;

&lt;p&gt;Did you read all the updates?&lt;/p&gt;

&lt;p&gt;Do you and/or your customer feel comfortable with the rescue procedure I described (&quot;mount -t ldiskfs&quot;/&quot;setfacl&quot; on the MDS) to fix the corrupted files &quot;live&quot;? If not, feel free to ask me for more details.&lt;/p&gt;

&lt;p&gt;Also, Zhenyu has already submitted a fix for the root bug; it is currently under testing and should be included in the next 1.8 release.&lt;/p&gt;

&lt;p&gt;Last, are named pipes/sockets in common use by your customer? Do you think new ones could be created to run applications in place? Or were they only old leftovers that were just carried over by the rsync? Are you and the customer aware that rsync has options to enable/disable the transfer of these kinds of files?&lt;/p&gt;</comment>
                            <comment id="49645" author="chrislocke" created="Mon, 24 Dec 2012 10:18:31 +0000"  >&lt;p&gt;Bruno,&lt;/p&gt;

&lt;p&gt;Yes, I kept up over the weekend, and forwarded the info to my customer. I doubt I&apos;ll hear anything back till Wednesday due to the holiday, but as soon as I do I&apos;ll let you know the response. Thank you all for working on this.&lt;/p&gt;</comment>
                            <comment id="49681" author="chrislocke" created="Wed, 26 Dec 2012 12:34:57 +0000"  >&lt;p&gt;Customer let me know that they are uncomfortable applying the patch themselves, and would like to know if Whamcloud could possibly do a remote session with them to go through it. If you guys are willing to do this I could get you in contact with the customer.&lt;/p&gt;</comment>
                            <comment id="49713" author="chrislocke" created="Thu, 27 Dec 2012 09:06:28 +0000"  >&lt;p&gt;Got another response back, here it is:&lt;/p&gt;

&lt;p&gt;&amp;gt; Last, are named pipes/sockets in common use by your customer?&lt;/p&gt;

&lt;p&gt;Nope, not on the Lustre FS (so far).&lt;/p&gt;

&lt;p&gt;&amp;gt; Do you think new ones could be created to run applications in place?&lt;/p&gt;

&lt;p&gt;Probably not, but as we&apos;re not sure what SW-packages we&apos;ll have to install for our bioinformatics people in the future, we can&apos;t rule it out.&lt;/p&gt;

&lt;p&gt;&amp;gt; Or were they only old leftovers that were just carried over by the rsync?&lt;/p&gt;

&lt;p&gt;In this case, yes. A &quot;full&quot; rsync copy from another server was done to Lustre, and then, when parts of it were to be moved into another directory on Lustre, these were accidentally included.&lt;/p&gt;

&lt;p&gt;&amp;gt; Are you and the customer aware that rsync has options to enable/disable the transfer of these kinds of files?&lt;/p&gt;

&lt;p&gt;We do now... &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; Mostly we&apos;ve been after the &quot;copy everything, keep all info&quot; approach when using rsync, so we tend to use rsync -a.&lt;/p&gt;


&lt;p&gt;As we have a planned and promised upgrade of Lustre this spring (date &lt;br/&gt;
not decided yet) to a stable 2.X, and can&apos;t foresee any immediate need &lt;br/&gt;
for more named-pipes/socket in our Lustre FS, we&apos;re not sure if we need &lt;br/&gt;
to patch our setup of 1.8, but we&apos;d like to know:&lt;/p&gt;

&lt;p&gt;1) Are these &quot;untouchable&quot; files in any way harmful to the FS? Could &lt;br/&gt;
there be inode allocation problems, or anything like that? On a system &lt;br/&gt;
level, we guess that we could just avoid them during backup and hence &lt;br/&gt;
more or less forget about them until our upgrade.&lt;/p&gt;

&lt;p&gt;2) The process described to unset the ACLs on the specific files isn&apos;t &lt;br/&gt;
clear enough for me to just jump in and fix it. Could it be specified in &lt;br/&gt;
a better step-by-step version? Should I mount the MDT on the active or &lt;br/&gt;
the non-active MDS? Is mounting the FS on an MDS a standard procedure &lt;br/&gt;
for debugging/fixing? I&apos;m not sure if our contract allows us to ask for &lt;br/&gt;
hands-on (remote console) help in solving a task such as this, but it &lt;br/&gt;
would be nice to know.&lt;/p&gt;

&lt;p&gt;As you might have guessed by now, we&apos;re a bit cautious about breaking &lt;br/&gt;
the FS right now - it&apos;s been a long and repetitive process setting it &lt;br/&gt;
up, with backup problems as an added bonus.&lt;/p&gt;

&lt;p&gt;3) I stated earlier that we&apos;re on version 1.8.8 of Lustre, which turns &lt;br/&gt;
out to be only partly true - our clients are, but the MDS &amp;amp; ODS-machines &lt;br/&gt;
are still on 1.8.7. If we decide that patching is needed after all, &lt;br/&gt;
would the patch apply to 1.8.7? Should it be applied to all Lustre &lt;br/&gt;
machines (MDS, ODS, clients...) or a specific set?&lt;/p&gt;


&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;/Mattias&lt;/p&gt;</comment>
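On the rsync point raised above: "-a" implies "-D" (that is, "--devices --specials"), so fifos and sockets are transferred by default; adding "--no-specials" (or "--no-D") skips them. A small sketch with throwaway directories, not the customer's real paths (assumes a reasonably recent rsync, and skips quietly if rsync is not installed):

```shell
#!/bin/sh
# Show that "rsync -a" carries fifos along, while "--no-specials" leaves
# them behind. All paths are scratch directories created with mktemp.
set -eu
command -v rsync >/dev/null 2>&1 || { echo 'rsync not installed; skipping demo'; exit 0; }

src=$(mktemp -d); dst1=$(mktemp -d); dst2=$(mktemp -d)
echo data > "$src/regular.txt"
mkfifo "$src/leftover.pipe"          # stand-in for the accidental socket/pipe

rsync -a "$src/" "$dst1/"                 # -a implies -D: the fifo is copied
rsync -a --no-specials "$src/" "$dst2/"   # fifo is skipped, regular file kept

echo "with -a:"
find "$dst1" -mindepth 1 | sort
echo "with -a --no-specials:"
find "$dst2" -mindepth 1 | sort
```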
                            <comment id="49717" author="bfaccini" created="Thu, 27 Dec 2012 10:10:57 +0000"  >&lt;p&gt;Hello Chris,&lt;/p&gt;

&lt;p&gt;Your answers are what I expected, so we can safely presume that no more/new named pipes/sockets will be created.&lt;/p&gt;

&lt;p&gt;About the customer&apos;s questions, here are my answers:&lt;/p&gt;

&lt;p&gt;   1) The currently affected files will not cause any problem other than the known error each time they are accessed. In fact, the inode is not corrupted; it is just mis-interpreted by the MDS layer.&lt;/p&gt;

&lt;p&gt;   2) The work-around I found to fix the problem is not what we could call a &quot;standard&quot; procedure for fixing inodes on the MDS side, particularly if you run it live/in parallel with the file-system being mounted and used. So if you/the users can live with such files, just wait for scheduled filesystem down-time to apply it. That said, the exact procedure/commands can be described as:&lt;/p&gt;

&lt;p&gt;         _ on the running/primary MDS, mount the MDT device with ldiskfs: &quot;mkdir &amp;lt;mount-point&amp;gt; ; mount -t ldiskfs &amp;lt;MDT-device&amp;gt; &amp;lt;mount-point&amp;gt;&quot;.&lt;/p&gt;

&lt;p&gt;         _ then, for each of the affected named pipes/sockets, run &quot;setfacl -b &amp;lt;mount-point&amp;gt;/ROOT/&amp;lt;relative-path-to-file&amp;gt;&quot; (where &amp;lt;relative-path-to-file&amp;gt; is the path to the file relative to the file-system root, from the Lustre/client point of view) to reset all of its ACLs to the default set. If you don&apos;t know the exact list/paths of the affected files, you can find candidates by running &quot;find &amp;lt;LustreFS-mountpoint&amp;gt; -type p -print&quot; and &quot;find &amp;lt;LustreFS-mountpoint&amp;gt; -type s -print&quot; from any client, then try to access each one with a command like &quot;file&quot; and see if you get an &quot;ERROR: cannot open &amp;lt;path/file&amp;gt; (Operation not supported)&quot; error.&lt;/p&gt;

&lt;p&gt;   3) As I wrote for 1), the issue is on the MDS side, so I presume (Zhenyu, correct me if I am wrong; my understanding is that only the MDS side is wrong and has to be fixed here ...) that an upgrade of the MDS/servers should be OK for this particular problem. But, as usual, a partial upgrade should be done only under very specific circumstances and cannot be a scenario fully tested or validated on our side.&lt;/p&gt;

&lt;p&gt;Is this detailed and clear enough for you to report to the customer?&lt;br/&gt;
Best regards, and don&apos;t hesitate to ask more/again.&lt;br/&gt;
Bruno.&lt;/p&gt;
</comment>
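The two-part procedure above can be sketched in shell. The triage half below is runnable against a scratch directory standing in for a Lustre client mount (FSROOT and the demo fifo are placeholders, not from the ticket); the ldiskfs/"setfacl -b" repair half appears only as comments, with a placeholder device name, since it must run on the MDS and should not be run blindly:

```shell
#!/bin/sh
# Triage: list named pipes (-type p) and sockets (-type s) under the
# filesystem root, then probe each with "file". Entries hit by this bug
# fail with "Operation not supported". FSROOT defaults to a scratch
# directory seeded with a demo fifo so the script runs anywhere.
set -eu
FSROOT="${FSROOT:-$(mktemp -d)}"
mkfifo "$FSROOT/demo.pipe" 2>/dev/null || true

find "$FSROOT" \( -type p -o -type s \) | while IFS= read -r f; do
    if file "$f" >/dev/null 2>&1; then
        printf 'ok:     %s\n' "$f"
    else
        printf 'broken: %s\n' "$f"
    fi
done

# Repair, run on the primary MDS (placeholder device/paths; schedule
# down-time where possible, per the discussion above):
#   mkdir -p /mnt/mdt
#   mount -t ldiskfs /dev/mdtdev /mnt/mdt     # /dev/mdtdev = MDT block device
#   setfacl -b /mnt/mdt/ROOT/<relative-path-to-file>
#   umount /mnt/mdt
```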
                            <comment id="49719" author="ludc-mbo" created="Thu, 27 Dec 2012 11:31:45 +0000"  >&lt;p&gt;Hi Bruno!&lt;/p&gt;

&lt;p&gt;This is the customer speaking... &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;1) OK, good to know.&lt;/p&gt;

&lt;p&gt;2) The work-around seems clearer now. I&apos;ll have to discuss it with my colleagues as to if/when we can try this. We had a backup-crisis recently (had to classify all used tapes as rubbish, due to mechanical problems), and are slowly resyncing 140-150TB to tape. We&apos;ll probably wait at least until that&apos;s finished...&lt;/p&gt;

&lt;p&gt;3) OK, so you&apos;d recommend that all machines (MDS, ODS, clients) be patched/updated to the same level? If so, we&apos;ll probably wait until we can go to 2.X sometime during spring 2013. Is step 2) necessary before an upgrade to 2.X?&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;/Mattias, Lund University&lt;/p&gt;</comment>
                            <comment id="49723" author="bfaccini" created="Thu, 27 Dec 2012 13:05:54 +0000"  >&lt;p&gt;Hello Mattias,&lt;/p&gt;

&lt;p&gt;To be complete on item 3), let&apos;s say that even if Zhenyu (who developed the patch) confirms the issue is on the server/MDS side only, and that no regression should exist between servers running the next 1.8 version (including the patch) and clients left on 1.8.7, this would be an untested environment, so there is still the possibility of a hidden issue.&lt;/p&gt;

&lt;p&gt;Prior to joining the Whamcloud/Intel team I worked for Bull, mainly at a customer site, and each time it was decided to go this (unbalanced) way, because their Lustre data was mission-critical, we ran our own regression test-suite for hours before using that configuration for the production workload.&lt;/p&gt;

&lt;p&gt;But frankly, I don&apos;t think the problem you currently face here can justify the work+risk we are discussing.&lt;/p&gt;

&lt;p&gt;Hope this helps and clarifies.&lt;br/&gt;
Best Regards.&lt;br/&gt;
Bruno.&lt;/p&gt;</comment>
                            <comment id="50105" author="bfaccini" created="Tue, 8 Jan 2013 04:18:00 +0000"  >
&lt;p&gt;BTW, I forgot to confirm that build #11616 from patch set #5 no longer shows the problem, in a full clients+servers 1.8.8 configuration.&lt;/p&gt;

&lt;p&gt;Any news/updates from the site/NetApp side?&lt;/p&gt;</comment>
                            <comment id="50111" author="ludc-mbo" created="Tue, 8 Jan 2013 06:54:53 +0000"  >&lt;p&gt;Hi!&lt;/p&gt;

&lt;p&gt;We&apos;re still not done with our backups, but as soon as we&apos;re satisfied that we have it all on tape (I&apos;m tempted to add more tape drives...) we&apos;ll try the live fix. Should happen later this week, methinks. I&apos;ll let you know how it works out.&lt;/p&gt;

&lt;p&gt;/Mattias, Lund University&lt;/p&gt;</comment>
                            <comment id="50334" author="ludc-mbo" created="Fri, 11 Jan 2013 05:23:27 +0000"  >&lt;p&gt;Hi!&lt;/p&gt;

&lt;p&gt;OK, we&apos;ve finally been able to execute the &quot;live fix&quot;, and it seems to have worked just fine!&lt;/p&gt;

&lt;p&gt;Thanks a bundle - your response was quick, tested and accurate, couldn&apos;t really ask for more.&lt;/p&gt;

&lt;p&gt;/Mattias, now back at planning for expansion &amp;amp; upgrade of our Lustre setup... :-&amp;gt;&lt;/p&gt;</comment>
                            <comment id="51204" author="bfaccini" created="Fri, 25 Jan 2013 09:55:04 +0000"  >&lt;p&gt;Mattias, Chris,&lt;br/&gt;
Do you think we can close this ticket now?&lt;br/&gt;
Thanks again, and in advance, for your help and answers.&lt;br/&gt;
Best regards.&lt;br/&gt;
Bruno.&lt;/p&gt;</comment>
                            <comment id="51281" author="ludc-mbo" created="Sat, 26 Jan 2013 10:12:47 +0000"  >&lt;p&gt;Maybe I should have stated that more clearly in my last comment... &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;I&apos;m fine with closing this ticket - our problem has been solved!&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;/Mattias&lt;/p&gt;</comment>
                            <comment id="51344" author="bfaccini" created="Mon, 28 Jan 2013 12:49:15 +0000"  >&lt;p&gt;OK, thanks, so closing as resolved!&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="12217">LU-784</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="12113" name="lustredebug.tar.gz" size="292631" author="chrislocke" created="Fri, 21 Dec 2012 09:41:35 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzve5r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5933</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>