[LU-1597] Reads and Writes failing with -13 (-EACCES) Created: 03/Jul/12  Updated: 30/Apr/14  Resolved: 27/Sep/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.3.0, Lustre 2.1.4

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Client: lustre-modules-2.1.1-13chaos_2.6.32_220.17.1.3chaos.ch5.x86_64.x86_64
Server: lustre-modules-2.1.1-4chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64


Issue Links:
Related
is related to LU-1621 Disable lustre capa by force Resolved
Severity: 3
Rank (Obsolete): 3982

 Description   

We're currently seeing a user's reads and writes fail with -13 (-EACCES) errors. The errors are coming from a set of clients in a single cluster, but they involve multiple different filesystems. From what I can tell, the -EACCES is coming from this part of the server code:

filter_capa.c:
138         if (capa == NULL) {
139                 if (fid)
140                         CERROR("seq/fid/opc "LPU64"/"DFID"/"LPX64
141                                ": no capability has been passed\n",
142                                seq, PFID(fid), opc);
143                 else
144                         CERROR("seq/opc "LPU64"/"LPX64
145                                ": no capability has been passed\n",
146                                seq, opc);
147                 RETURN(-EACCES);
148         }

The message on the client is:

Jul  3 13:26:50 ansel242 kernel: LustreError: 11-0: lsc-OST00b4-osc-ffff8806244c3800: Communicating with 172.19.1.113@o2ib100, operation ost_read failed with -13.
Jul  3 13:26:50 ansel242 kernel: LustreError: Skipped 3495061 previous similar messages

And there are corresponding messages on the server:

Jul  3 13:26:51 sumom13 kernel: LustreError: 24607:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
Jul  3 13:26:51 sumom13 kernel: LustreError: 24607:0:(filter_capa.c:146:filter_auth_capa()) Skipped 3495057 previous similar messages

It appears that for each "ost_{read|write} failed" message on the client, there is a "no capability" message on the server.

I'm unsure why the capability isn't being set by the client, but it seems that is causing the -EACCES error to get propagated to the clients.

Lustre versions:

Client: lustre-modules-2.1.1-13chaos_2.6.32_220.17.1.3chaos.ch5.x86_64.x86_64
Server: lustre-modules-2.1.1-4chaos_2.6.32_220.7.1.7chaos.ch5.x86_64.x86_64



 Comments   
Comment by Andreas Dilger [ 03/Jul/12 ]

I previously saw some patches from LLNL in the security/capability code. Any chance this is due to a delta between 2.1.1-4chaos and 2.1.1-13chaos, or are there other clients running 2.1.1-13chaos that are working correctly?

Can the user access the same files correctly from other clients, and conversely do other users on the 2.1.1-13chaos clients run without problems?

Comment by Christopher Morrone [ 03/Jul/12 ]

Andreas, perhaps you are thinking of LU-1102? We had questions about capa stuff there that were never answered. We STILL don't know why Oleg claims capa is off by default when it appears to be on by default.
We wound up with a patch that avoids an assertion, but that is about all.

We have other 2.1.1-13chaos clients that are talking to 2.1.1-4chaos servers without this error (as far as I know...). But as far as I understand it the LU-1102 patch was to address a server-side crash, so we're not even using that in production yet.

Comment by Prakash Surya (Inactive) [ 03/Jul/12 ]

From what I was told by the admins, the user is able to access the files from another cluster running the same lustre version (2.1.1-13chaos) just fine. The problem was only seen on this specific cluster (ansel).

Comment by Prakash Surya (Inactive) [ 03/Jul/12 ]

After rebooting the clients, we are still seeing the -13 errors. Can we raise the priority, as this is currently affecting production?

EDIT: Sorry, I was wrong. It looks like the nodes which were rebooted are no longer hitting the issue. Why the reboot helped is still an open question.

Comment by Peter Jones [ 05/Jul/12 ]

Bobijam

Can you please look into this one?

Peter

Comment by Prakash Surya (Inactive) [ 10/Jul/12 ]

Is there any update on this? We're now seeing this on one of our SCF clusters.

Comment by Christopher Morrone [ 10/Jul/12 ]

Whamcloud, please update this bug to blocker status. This is taking our largest cluster out of action at the moment. Thanks.

Comment by Zhenyu Xu [ 10/Jul/12 ]

So the -13 error happens on some clients, and after reboot, the clients are no longer hitting the issue? Can you please upload the MDS/OSS and affected clients' logs here?

Comment by Christopher Morrone [ 10/Jul/12 ]

No, the current problems are on the SCF, so no logs are available.

Why is the server complaining about capabilities? Can you explain the listed error messages?

On one of the OSS nodes, it looks to me like the ll_ost_io threads will print a message like one of the following:

(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed
(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x280: no capability has been passed

and the ldlm_cn thread will print these two lines together:

(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed
(ost_handler.c:1534:ost_blocking_ast()) Error -13 syncing data on lock cancel

I don't see anything else in the OSS's log at our normal log level except for some noise about a few router nodes that are down.
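
For reference, the opc values in these messages look like CAPA_OPC_* bitmasks. Below is a minimal user-space sketch to decode them; the bit values are assumed from the 2.1-era lustre_capa.h (only the OSS/metadata bits are reproduced) and should be double-checked against the actual tree rather than taken as authoritative.

/*
 * Decode the "seq/opc" values in the filter_auth_capa() messages above.
 * The CAPA_OPC_* bit values here are an assumption, not copied from this
 * exact tree.
 */
#include <stdio.h>

#define CAPA_OPC_OSS_WRITE    (1ULL << 5)   /* 0x20  */
#define CAPA_OPC_OSS_READ     (1ULL << 6)   /* 0x40  */
#define CAPA_OPC_OSS_TRUNC    (1ULL << 7)   /* 0x80  */
#define CAPA_OPC_OSS_DESTROY  (1ULL << 8)   /* 0x100 */
#define CAPA_OPC_META_WRITE   (1ULL << 9)   /* 0x200 */
#define CAPA_OPC_META_READ    (1ULL << 10)  /* 0x400 */

static void decode_opc(unsigned long long opc)
{
        static const struct {
                unsigned long long bit;
                const char *name;
        } bits[] = {
                { CAPA_OPC_OSS_WRITE,   "OSS_WRITE"   },
                { CAPA_OPC_OSS_READ,    "OSS_READ"    },
                { CAPA_OPC_OSS_TRUNC,   "OSS_TRUNC"   },
                { CAPA_OPC_OSS_DESTROY, "OSS_DESTROY" },
                { CAPA_OPC_META_WRITE,  "META_WRITE"  },
                { CAPA_OPC_META_READ,   "META_READ"   },
        };
        unsigned int i;

        printf("opc 0x%llx =", opc);
        for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
                if (opc & bits[i].bit)
                        printf(" %s", bits[i].name);
        printf("\n");
}

int main(void)
{
        decode_opc(0x40);   /* from the description: OSS_READ */
        decode_opc(0x20);   /* ll_ost_io threads: OSS_WRITE */
        decode_opc(0x280);  /* OSS_TRUNC | META_WRITE */
        return 0;
}

Under those assumed values, 0x40 is OSS_READ, 0x20 is OSS_WRITE, and 0x280 decodes to OSS_TRUNC | META_WRITE, i.e. a truncate/setattr-style path rather than plain read/write I/O.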

Comment by Christopher Morrone [ 10/Jul/12 ]

Our default client log level isn't showing much there:

ptlrpc_check_status...Communicating with X, operation ost_write failed with -13

We also see ost_read failures with -13.

And these messages too:

vvp_io_commit_write()) Write page X of inode Y failed -13

Comment by Christopher Morrone [ 10/Jul/12 ]

So the -13 error happens on some clients, and after reboot, the clients are no longer hitting the issue?

We believe that rebooting made it go away on clients previously. But we don't understand what is going on yet, so I can't claim definitively that a client reboot clears the problem. We are going to drain the 300 or so nodes that are printing the errors overnight, and then reboot them in the morning (leaving a few drained for investigation).

But that is just a wild attempt to restore normality. We really need to get to the bottom of the cause.

Comment by Zhenyu Xu [ 10/Jul/12 ]

We are investigating it; a patch will be out soon.

It looks like ost_blocking_ast() does not set a capa when calling obd_sync().

Comment by Zhenyu Xu [ 11/Jul/12 ]

patch tracking at http://review.whamcloud.com/3372

obdfilter: set default capa for OST

A capability should be set for filter_sync(), and when the operation
comes from the OSS itself, the capability check can be passed.

If clients do not support capabilities, the server capability check
should be bypassed.
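
A minimal user-space sketch of the check being described (not the gerrit patch itself; all type and field names below are stand-ins chosen for illustration): requests generated by the OSS itself, and requests from clients that never negotiated OSS capabilities, should pass, while a genuinely missing capability still fails with -EACCES.

/*
 * Sketch only; the struct and field names are stand-ins so it compiles
 * on its own.
 */
#include <stddef.h>
#include <errno.h>

struct lustre_capa;                     /* opaque for this sketch */

struct client_export {
        int negotiated_oss_capa;        /* stand-in for the OBD_CONNECT_OSS_CAPA
                                         * connect flag agreed at mount time */
};

int auth_capa_sketch(const struct client_export *exp,
                     const struct lustre_capa *capa)
{
        /*
         * Request generated by the OSS itself, e.g. the filter_sync() run
         * from ost_blocking_ast() on lock cancel: no client was involved,
         * so there is nobody to have attached a capability. Let it pass.
         */
        if (exp == NULL)
                return 0;

        /* Client never negotiated OSS capabilities: do not demand one. */
        if (!exp->negotiated_oss_capa)
                return 0;

        /* Capability expected but missing: the -EACCES seen in the logs. */
        if (capa == NULL)
                return -EACCES;

        /* A real check would go on to verify the capa's fid/opc/signature. */
        return 0;
}

If that reading of the patch description is right, the first branch is what addresses the "Error -13 syncing data on lock cancel" messages, and the second covers clients that simply never send capabilities.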

Comment by Christopher Morrone [ 11/Jul/12 ]

Some of the opc values make me think that there are other problematic areas. For instance, filter_setattr() seems to be one of the call paths triggering the "no capability has been passed" error.

I see that there is now a ticket and patch to disable capa by default: http://jira.whamcloud.com/browse/LU-1621.

So it sounds like our quick work-around is to just disable capa.

Comment by Zhenyu Xu [ 11/Jul/12 ]

Unfortunately the capability feature is not complete yet; we need to disable it for now.

Comment by Christopher Morrone [ 11/Jul/12 ]

Disabling capa on the servers got rid of the server error messages as expected, since filter_auth_capa() will now just always return 0.

However, our clients are still throwing many -13 errors. So the question now is whether this is something that is negotiated at mount time, and whether, now that we've disabled it on the servers, the clients will be unhappy until we force them to reconnect.

I'd rather not tell the admins to reboot 15000 nodes until I'm sure it will actually fix the problem.

Comment by Christopher Morrone [ 11/Jul/12 ]

Hello?

It looks like the clients are throwing the following two messages, not always at the same time:

Communicating with X, operation ost_write failed with -13
vvp_io_commit_write() write page X of inode Y failed -13

The "Communicating" error is thrown by client.c:ptlrpc_check_status(), which seems to be checking the rpc reply status. That would seem to imply that the server returned this error without a console message.

I caught a log on a client with rpctrace enabled. I can't share it, but there didn't seem to be too much info there. Server logs might be better if I can get them. But that is harder without an automated trigger.

Comment by Zhenyu Xu [ 11/Jul/12 ]

Don't reconnect clients for now, we'll try to fix it on the server side.

updated http://review.whamcloud.com/3372

obdfilter: fix some capa code for OST

  • A capability should be set for filter_sync(), and when the operation
    comes from the OSS itself, the capability check can be passed.
  • filter_capa_fixoa() needs to check whether the filter has capabilities
    enabled.
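
For the second bullet, a similarly rough sketch (again with stand-in names rather than the real filter_capa_fixoa()) of the guard being described: when capabilities are disabled on the filter, leave the capability carried in the obdo alone instead of assuming one must be present.

#include <stddef.h>

struct lustre_capa;                     /* opaque for this sketch */

struct obdo_sketch {
        struct lustre_capa *o_capa;     /* capability attached to the request,
                                         * if any */
};

struct filter_sketch {
        int oss_capa_enabled;           /* stand-in for the per-filter
                                         * "capabilities enabled" setting */
};

void capa_fixoa_sketch(const struct filter_sketch *filter,
                       struct obdo_sketch *oa)
{
        /* The fix as described: bail out early when the feature is off. */
        if (!filter->oss_capa_enabled)
                return;

        /* Nothing attached, nothing to fix up. */
        if (oa->o_capa == NULL)
                return;

        /* Whatever fixing-up the real routine does would only happen here. */
}

This guard appears to be the piece that matters once capabilities are disabled server-wide via the LU-1621 approach.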

Comment by Christopher Morrone [ 12/Jul/12 ]

Ah, great. We'll be trying that today on our test system.

Comment by Christopher Morrone [ 16/Jul/12 ]

See https://github.com/chaos/lustre/tree/2.1.1-17chaos for our production solution. We took the one fix from http://review.whamcloud.com/3372 that honors the capabilities-disabled flag, and then took the patch from LU-1621 to disable capabilities by default.

Comment by Jodi Levi (Inactive) [ 27/Sep/12 ]

Please let me know if this needs to be reopened.

Comment by Pawel Dziekonski [ 03/Oct/13 ]

I see the following on worker nodes:

Oct 3 08:50:58 wn612 kernel: LustreError: 22347:0:(vvp_io.c:1018:vvp_io_commit_write()) Write page 0 of inode ffff8101eae5f010 failed -13

and on servers:

Oct 3 08:57:43 oss6 kernel: LustreError: 5343:0:(filter_capa.c:151:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed

On both sides I have:

lustre: 2.1.5
kernel: patchless_client
build: v2_1_5_0--PRISTINE-2.6.18-348.3.1.el5

The worker nodes run Scientific Linux 5.9 with OFED 1.5.3.2; the servers run plain CentOS 6.3.

Is this the same bug?

Comment by Susan Coulter [ 30/Apr/14 ]

We are seeing the same problem.
Client is 2.1.4-5chaos.

Has a solution been found?
It is killing jobs here.

Comment by Christopher Morrone [ 30/Apr/14 ]

Susan, the solution that I described above for Lustre 2.1.1-17chaos is still in the Lustre 2.1.4-*chaos releases. You should open a new ticket describing your issue.
