<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:22:31 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9015] Meaning of revalidate FID rc=-4</title>
                <link>https://jira.whamcloud.com/browse/LU-9015</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have a curious problem where a user is managing to wedge a few COS6 compute nodes every day, to the point that they don&apos;t even respond to the console. The only thing we see logged before this happens is:&lt;/p&gt;

&lt;p&gt;Jan 12 15:59:14 n602 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;3463048.109323&amp;#93;&lt;/span&gt; LustreError: 21705:0:(file.c:3256:ll_inode_revalidate_fini()) fouo5: revalidate FID &lt;span class=&quot;error&quot;&gt;&amp;#91;0x2000068c4:0x16fd0:0x0&amp;#93;&lt;/span&gt; error: rc = -4&lt;/p&gt;

&lt;p&gt;I have no idea if this is a symptom of the node breaking down in some way unrelated to Lustre, or if this is part of the cause, so I&apos;d like to figure out what this error means. After a quick look at the code, this should be the return code from either md_intent_lock or md_getattr, but I can&apos;t find where the error code -4 is defined.&lt;/p&gt;

&lt;p&gt;Any tips?&lt;/p&gt;</description>
                <environment>lustre ee client-2.5.42.4-2.6.32_642.6.2</environment>
        <key id="42970">LU-9015</key>
            <summary>Meaning of revalidate FID rc=-4</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ys">Yang Sheng</assignee>
                                    <reporter username="zino">Peter Bortas</reporter>
                        <labels>
                    </labels>
                <created>Thu, 12 Jan 2017 18:20:39 +0000</created>
                <updated>Thu, 19 Jan 2017 16:59:46 +0000</updated>
                            <resolved>Thu, 19 Jan 2017 13:38:20 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="180817" author="green" created="Fri, 13 Jan 2017 18:28:41 +0000"  >&lt;p&gt;-4 is EINTR, and it likely has nothing to do with the hang at hand. This particular one was silenced in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6627&quot; title=&quot;Client inode close failed: ll_close_inode_openhandle())&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6627&quot;&gt;&lt;del&gt;LU-6627&lt;/del&gt;&lt;/a&gt; I believe.&lt;br/&gt;
It means that an application got some sort of signal (somebody pressed ^C, or some other source) in the middle of waiting for an RPC reply from the MDS.&lt;br/&gt;
The application pid was 21705, if you can somehow track what it was back then.&lt;br/&gt;
It&apos;s defined in the Linux kernel source, in the include/uapi/asm-generic/errno-base.h file.&lt;/p&gt;

&lt;p&gt;Now, if you want to investigate the hang, you probably want to do something like sysrq-t. Do you have other debugging enabled? NMI watchdogs, a serial console attached? Are you set up to collect crashdumps?&lt;/p&gt;</comment>
                            <comment id="180820" author="pjones" created="Fri, 13 Jan 2017 18:33:21 +0000"  >&lt;p&gt;Yang Sheng&lt;/p&gt;

&lt;p&gt;Could you please assist with this one as further information is supplied?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="180895" author="zino" created="Mon, 16 Jan 2017 11:55:01 +0000"  >&lt;p&gt;Thanks Oleg,&lt;/p&gt;

&lt;p&gt;That makes sense. It looks like (probably) all occurrences of this hang correspond with the user&apos;s jobs getting preempted and thus killed by Slurm.&lt;/p&gt;

&lt;p&gt;We do not currently have sysrq-t enabled in the node images, but I&apos;ll enable that and roll out new images during the week. Serial console is available via IPMI but we are not set up to collect crash dumps. Making a device available for crash dumps might not be possible with the limited space available.&lt;/p&gt;</comment>
                            <comment id="181392" author="zino" created="Thu, 19 Jan 2017 13:26:44 +0000"  >&lt;p&gt;We pushed out sysrq-enabled images yesterday, and unfortunately this hang does not look like it can be escaped with sysrq on the console or a break via SOL.&lt;/p&gt;

&lt;p&gt;Lustre is now only one of many suspects, since the error message probably has nothing to do with it. So I&apos;m fine with closing this ticket for now. I can reopen it if further testing on our side points more fingers at Lustre.&lt;/p&gt;</comment>
                            <comment id="181396" author="ys" created="Thu, 19 Jan 2017 13:38:20 +0000"  >&lt;p&gt;Many thanks, Peter. &lt;/p&gt;</comment>
                            <comment id="181438" author="green" created="Thu, 19 Jan 2017 16:50:01 +0000"  >&lt;p&gt;Just a late note in case you did not know, but you don&apos;t need a dedicated crashdump device for every node.&lt;br/&gt;
You can have an NFS share that all nodes dump to once the need arises; this kind of setup is very common and I use it too.&lt;/p&gt;

&lt;p&gt;I highly recommend looking into setting something like that up (it does not even need to be crazy fast or have redundant storage; just a cheap, huge multi-terabyte HDD in a node that&apos;s up all the time and has an NFS server on it).&lt;/p&gt;

&lt;p&gt;Having this setup early will save you tons of trouble later, when you actually need a crashdump to debug some other problem.&lt;/p&gt;</comment>
                            <comment id="181442" author="zino" created="Thu, 19 Jan 2017 16:59:46 +0000"  >&lt;p&gt;I did not know you could crash-dump over network filesystems. Very interesting, thanks!&lt;/p&gt;

&lt;p&gt;Dump capability is now added to the roadmap for our cluster environment development.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz0if:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>