<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:18:02 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1597] Reads and Writes failing with -13 (-EACCES)</title>
                <link>https://jira.whamcloud.com/browse/LU-1597</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&apos;re currently seeing a user&apos;s reads and writes failing with -13 (-EACCES) errors. The errors are coming from a set of clients from a single cluster, but are using multiple different filesystems. From what I can tell, the -EACCES is coming from this part of the server code:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;filter_capa.c:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;138         &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (capa == NULL) {
139                 &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (fid)
140                         CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;seq/fid/opc &quot;&lt;/span&gt;LPU64&lt;span class=&quot;code-quote&quot;&gt;&quot;/&quot;&lt;/span&gt;DFID&lt;span class=&quot;code-quote&quot;&gt;&quot;/&quot;&lt;/span&gt;LPX64
141                                &lt;span class=&quot;code-quote&quot;&gt;&quot;: no capability has been passed\n&quot;&lt;/span&gt;,
142                                seq, PFID(fid), opc);
143                 &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
144                         CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;seq/opc &quot;&lt;/span&gt;LPU64&lt;span class=&quot;code-quote&quot;&gt;&quot;/&quot;&lt;/span&gt;LPX64
145                                &lt;span class=&quot;code-quote&quot;&gt;&quot;: no capability has been passed\n&quot;&lt;/span&gt;,
146                                seq, opc);
147                 RETURN(-EACCES);
148         }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The message on the client is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jul  3 13:26:50 ansel242 kernel: LustreError: 11-0: lsc-OST00b4-osc-ffff8806244c3800: Communicating with 172.19.1.113@o2ib100, operation ost_read failed with -13.
Jul  3 13:26:50 ansel242 kernel: LustreError: Skipped 3495061 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And there are corresponding messages on the server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jul  3 13:26:51 sumom13 kernel: LustreError: 24607:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
Jul  3 13:26:51 sumom13 kernel: LustreError: 24607:0:(filter_capa.c:146:filter_auth_capa()) Skipped 3495057 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It appears that for each &quot;ost_{read|write} failed&quot; message on the client, there is a &quot;no capability&quot; message on the server.&lt;/p&gt;

&lt;p&gt;I&apos;m unsure why the capability isn&apos;t being set by the client, but it seems that is causing the -EACCES error to get propagated to the clients.&lt;/p&gt;

&lt;p&gt;Lustre versions:&lt;/p&gt;

&lt;p&gt;Client: lustre-modules-2.1.1-13chaos_2.6.32_220.17.1.3chaos.ch5.x86_64.x86_64&lt;br/&gt;
Server: lustre-modules-2.1.1-4chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64&lt;/p&gt;</description>
                <environment>Client: lustre-modules-2.1.1-13chaos_2.6.32_220.17.1.3chaos.ch5.x86_64.x86_64&lt;br/&gt;
Server: lustre-modules-2.1.1-4chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64</environment>
        <key id="15123">LU-1597</key>
            <summary>Reads and Writes failing with -13 (-EACCES)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="prakash">Prakash Surya</reporter>
                        <labels>
                    </labels>
                <created>Tue, 3 Jul 2012 16:39:36 +0000</created>
                <updated>Wed, 30 Apr 2014 23:02:11 +0000</updated>
                            <resolved>Thu, 27 Sep 2012 18:44:53 +0000</resolved>
                                                    <fixVersion>Lustre 2.3.0</fixVersion>
                    <fixVersion>Lustre 2.1.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="41422" author="adilger" created="Tue, 3 Jul 2012 17:27:21 +0000"  >&lt;p&gt;I previously saw some patches from LLNL in the security/capability code.  Any chance this is due to a delta between 2.1.1-4chaos and 2.1.1-13chaos, or are there other clients running 2.1.1-13chaos that are working correctly?&lt;/p&gt;

&lt;p&gt;Can the user access the same files correctly from other clients, and conversely do other users on the 2.1.1-13chaos clients run without problems?&lt;/p&gt;</comment>
                            <comment id="41427" author="morrone" created="Tue, 3 Jul 2012 18:12:00 +0000"  >&lt;p&gt;Andreas, perhaps you are thinking of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1102&quot; title=&quot;NULL pointer dereference in capa_encrypt_id+0x8b/0x3e0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1102&quot;&gt;&lt;del&gt;LU-1102&lt;/del&gt;&lt;/a&gt;?  We had questions about capa stuff there that were never answered.  We STILL don&apos;t know why Oleg claims capa is off by default when it appears to be on by default.&lt;br/&gt;
We wound up with a patch that avoids an assertion, but that is about all.&lt;/p&gt;

&lt;p&gt;We have other 2.1.1-13chaos clients that are talking to 2.1.1-4chaos servers without this error (as far as I know...).  But as far as I understand it the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1102&quot; title=&quot;NULL pointer dereference in capa_encrypt_id+0x8b/0x3e0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1102&quot;&gt;&lt;del&gt;LU-1102&lt;/del&gt;&lt;/a&gt; patch was to address a server-side crash, so we&apos;re not even using that in production yet.&lt;/p&gt;
</comment>
                            <comment id="41429" author="prakash" created="Tue, 3 Jul 2012 18:34:03 +0000"  >&lt;p&gt;From what I was told by the admins, the user is able to access the files from another cluster running the same lustre version (2.1.1-13chaos) just fine. The problem was only seen on this specific cluster (ansel).&lt;/p&gt;</comment>
                            <comment id="41431" author="prakash" created="Tue, 3 Jul 2012 18:47:28 +0000"  >&lt;p&gt;After rebooting the clients, we are still seeing the -13 errors persist. Can we raise the priority, as this is currently affecting production?&lt;/p&gt;

&lt;p&gt;EDIT: Sorry, I was wrong. It looks like the nodes which were rebooted are no longer hitting the issue. Why the reboot helped is still an open question.&lt;/p&gt;</comment>
                            <comment id="41466" author="pjones" created="Thu, 5 Jul 2012 01:59:10 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Can you please look into this one?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="41678" author="prakash" created="Tue, 10 Jul 2012 19:54:55 +0000"  >&lt;p&gt;Is there any update on this? We&apos;re now seeing this on one of our SCF clusters.&lt;/p&gt;</comment>
                            <comment id="41680" author="morrone" created="Tue, 10 Jul 2012 20:35:02 +0000"  >&lt;p&gt;Whamcloud, please update bug to blocker status.  This is taking our largest cluster out of action at the moment.  Thanks.&lt;/p&gt;</comment>
                            <comment id="41681" author="bobijam" created="Tue, 10 Jul 2012 20:58:56 +0000"  >&lt;p&gt;So the -13 error happens on some clients, and after reboot, the clients are no longer hitting the issue? Can you please upload the MDS/OSS and affected client logs here?&lt;/p&gt;</comment>
                            <comment id="41685" author="morrone" created="Tue, 10 Jul 2012 21:24:15 +0000"  >&lt;p&gt;No, the current problems are on the SCF, so no logs are available.&lt;/p&gt;

&lt;p&gt;Why is the server complaining about capabilities?  Can you explain the listed error messages?&lt;/p&gt;

&lt;p&gt;On one of the OSS nodes, it looks to me like the ll_ost_io threads print a message like one of the following:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed
(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x280: no capability has been passed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and the ldlm_cn thread will print these two lines together:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed
(ost_handler.c:1534:ost_blocking_ast()) Error -13 syncing data on lock cancel
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I don&apos;t see anything else in the OSS&apos;s log at our normal log level except for some noise about a few router nodes that are down.&lt;/p&gt;</comment>
                            <comment id="41686" author="morrone" created="Tue, 10 Jul 2012 22:14:56 +0000"  >&lt;p&gt;Our default client log level isn&apos;t showing much there:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ptlrpc_check_status...Communicating with X, operation ost_write failed with -13
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We also see ost_read failures with -13.&lt;/p&gt;

&lt;p&gt;And these messages too:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;vvp_io_commit_write()) Write page X of inode Y failed -13
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                            <comment id="41687" author="morrone" created="Tue, 10 Jul 2012 22:22:41 +0000"  >&lt;blockquote&gt;&lt;p&gt;So the -13 error happens on some clients, and after reboot, the clients are no longer hitting the issue?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;We believe that rebooting made it go away on clients previously.  But we don&apos;t understand what is going on yet, so I can&apos;t claim definitively that a client reboot clears the problem.  We are going to drain the 300 or so nodes that are printing the errors overnight, and then reboot them in the morning (leaving a few drained for investigation).&lt;/p&gt;

&lt;p&gt;But that is just a wild attempt to restore normality.  We really need to get to the bottom of the cause.&lt;/p&gt;</comment>
                            <comment id="41690" author="bobijam" created="Tue, 10 Jul 2012 22:55:55 +0000"  >&lt;p&gt;We are investigating it; a patch will be out soon.&lt;/p&gt;

&lt;p&gt;It looks like ost_blocking_ast() hasn&apos;t set capa when calling obd_sync().&lt;/p&gt;</comment>
                            <comment id="41691" author="bobijam" created="Wed, 11 Jul 2012 00:25:20 +0000"  >&lt;p&gt;patch tracking at &lt;a href=&quot;http://review.whamcloud.com/3372&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3372&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;obdfilter: set default capa for OST&lt;/p&gt;

&lt;p&gt;A capability should be set for filter_sync(), and when the operation&lt;br/&gt;
comes from the OSS itself, the capability check can pass.&lt;/p&gt;

&lt;p&gt;If clients do not support capability, the server capability check&lt;br/&gt;
should be bypassed.&lt;/p&gt;</comment>
                            <comment id="41704" author="morrone" created="Wed, 11 Jul 2012 12:54:19 +0000"  >&lt;p&gt;Some of the opc make me think that there are other problematic areas.  For instance, filter_setattr() seems to be one of the call paths triggering the &quot;no capability has been passed&quot; error.&lt;/p&gt;

&lt;p&gt;I see that there is now a ticket and patch to disable capa by default: &lt;a href=&quot;http://jira.whamcloud.com/browse/LU-1621&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-1621&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So it sounds like our quick work-around is to just disable capa.&lt;/p&gt;</comment>
                            <comment id="41709" author="bobijam" created="Wed, 11 Jul 2012 13:20:07 +0000"  >&lt;p&gt;Unfortunately, the capability feature is not done yet; we need to disable it for now.&lt;/p&gt;</comment>
                            <comment id="41719" author="morrone" created="Wed, 11 Jul 2012 18:38:09 +0000"  >&lt;p&gt;Disabling capa on the servers got rid of the server error messages as expected, since filter_auth_capa() will now just always return 0.&lt;/p&gt;

&lt;p&gt;However, our clients are still throwing many -13 errors.  So the question now is: is this something that is negotiated at mount time, and now that we&apos;ve disabled it on servers, are the clients going to be unhappy until we force them to reconnect?&lt;/p&gt;

&lt;p&gt;I&apos;d rather not tell the admins to reboot 15000 nodes until I&apos;m sure it will actually fix the problem.&lt;/p&gt;</comment>
                            <comment id="41723" author="morrone" created="Wed, 11 Jul 2012 21:06:05 +0000"  >&lt;p&gt;Hello?&lt;/p&gt;

&lt;p&gt;It looks like the clients are throwing the following two messages, not always at the same time:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Communicating with X, operation ost_write failed with -13
vvp_io_commit_write() write page X of inode Y failed -13
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &quot;Communicating&quot; error is thrown by client.c:ptlrpc_check_status(), which seems to be checking the rpc reply status.  That would seem to imply that the server returned this error without a console message.&lt;/p&gt;

&lt;p&gt;I caught a log on a client with rpctrace enabled.  I can&apos;t share it, but there didn&apos;t seem to be too much info there.  Server logs might be better if I can get them.  But that is harder without an automated trigger.&lt;/p&gt;</comment>
                            <comment id="41727" author="bobijam" created="Wed, 11 Jul 2012 22:08:42 +0000"  >&lt;p&gt;Don&apos;t reconnect clients for now, we&apos;ll try to fix it on the server side.&lt;/p&gt;

&lt;p&gt;updated &lt;a href=&quot;http://review.whamcloud.com/3372&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3372&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;obdfilter: fix some capa code for OST&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;A capability should be set for filter_sync(), and when the operation&lt;br/&gt;
  comes from the OSS itself, the capability check can pass.&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;filter_capa_fixoa() needs to check whether the filter has capability enabled.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="41744" author="morrone" created="Thu, 12 Jul 2012 11:30:16 +0000"  >&lt;p&gt;Ah, great.  We&apos;ll be trying that today on our test system.&lt;/p&gt;
</comment>
                            <comment id="41890" author="morrone" created="Mon, 16 Jul 2012 12:52:04 +0000"  >&lt;p&gt;See &lt;a href=&quot;https://github.com/chaos/lustre/tree/2.1.1-17chaos&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/chaos/lustre/tree/2.1.1-17chaos&lt;/a&gt; for our production solution.  We took the one fix from &lt;a href=&quot;http://review.whamcloud.com/3372&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3372&lt;/a&gt; to honor the capabilities disabled flag, and then took the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1621&quot; title=&quot;Disable lustre capa by force&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1621&quot;&gt;&lt;del&gt;LU-1621&lt;/del&gt;&lt;/a&gt; to disable capabilities by default.&lt;/p&gt;</comment>
                            <comment id="45682" author="jlevi" created="Thu, 27 Sep 2012 18:44:53 +0000"  >&lt;p&gt;Please let me know if this needs to be reopened.&lt;/p&gt;</comment>
                            <comment id="68228" author="dzieko" created="Thu, 3 Oct 2013 13:12:53 +0000"  >&lt;p&gt;I see the following on worker nodes:&lt;/p&gt;

&lt;p&gt; Oct  3 08:50:58 wn612 kernel: LustreError: 22347:0:(vvp_io.c:1018:vvp_io_commit_write()) Write page 0 of inode ffff8101eae5f010 failed -13&lt;/p&gt;

&lt;p&gt;and on servers:&lt;/p&gt;

&lt;p&gt; Oct  3 08:57:43 oss6 kernel: LustreError: 5343:0:(filter_capa.c:151:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed&lt;/p&gt;

&lt;p&gt;On both sides I have:&lt;/p&gt;

&lt;p&gt; lustre: 2.1.5&lt;br/&gt;
 kernel: patchless_client&lt;br/&gt;
 build:  v2_1_5_0--PRISTINE-2.6.18-348.3.1.el5&lt;/p&gt;

&lt;p&gt;The worker nodes run Scientific Linux 5.9 and OFED-1.5.3.2; the servers run plain CentOS 6.3.&lt;/p&gt;

&lt;p&gt;Is this the same bug?&lt;/p&gt;</comment>
                            <comment id="82928" author="skcoulter" created="Wed, 30 Apr 2014 21:51:15 +0000"  >&lt;p&gt;We are seeing the same problem.&lt;br/&gt;
Client is 2.1.4-5chaos.&lt;/p&gt;

&lt;p&gt;Has a solution been found?&lt;br/&gt;
It is killing jobs here.&lt;/p&gt;</comment>
                            <comment id="82950" author="morrone" created="Wed, 30 Apr 2014 23:02:00 +0000"  >&lt;p&gt;Susan, the solution that I described for Lustre 2.1.1-17chaos is still in the Lustre 2.1.4-*chaos releases.  You should open a new ticket describing your issue.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="15188">LU-1621</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv2zb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3982</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>