[LU-1597] Reads and Writes failing with -13 (-EACCES) Created: 03/Jul/12 Updated: 30/Apr/14 Resolved: 27/Sep/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
Client: lustre-modules-2.1.1-13chaos_2.6.32_220.17.1.3chaos.ch5.x86_64.x86_64 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 3982 |
| Description |
|
We're currently seeing a user's reads and writes failing with -13 (-EACCES) errors. The errors are coming from a set of clients in a single cluster, but affect multiple different filesystems. From what I can tell, the -EACCES is coming from this part of the server code, in filter_capa.c:

138         if (capa == NULL) {
139                 if (fid)
140                         CERROR("seq/fid/opc "LPU64"/"DFID"/"LPX64
141                                ": no capability has been passed\n",
142                                seq, PFID(fid), opc);
143                 else
144                         CERROR("seq/opc "LPU64"/"LPX64
145                                ": no capability has been passed\n",
146                                seq, opc);
147                 RETURN(-EACCES);
148         }

The message on the client is:

Jul 3 13:26:50 ansel242 kernel: LustreError: 11-0: lsc-OST00b4-osc-ffff8806244c3800: Communicating with 172.19.1.113@o2ib100, operation ost_read failed with -13.
Jul 3 13:26:50 ansel242 kernel: LustreError: Skipped 3495061 previous similar messages

And there are corresponding messages on the server:

Jul 3 13:26:51 sumom13 kernel: LustreError: 24607:0:(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x40: no capability has been passed
Jul 3 13:26:51 sumom13 kernel: LustreError: 24607:0:(filter_capa.c:146:filter_auth_capa()) Skipped 3495057 previous similar messages

It appears that for each "ost_{read|write} failed" message on the client, there is a "no capability" message on the server. I'm unsure why the capability isn't being set by the client, but it seems that is what causes the -EACCES error to propagate back to the clients.

Lustre versions:

Client: lustre-modules-2.1.1-13chaos_2.6.32_220.17.1.3chaos.ch5.x86_64.x86_64 |
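To make the failure path easier to follow, here is a minimal standalone model of it in C. It is not Lustre source; every name in it is invented for illustration. It shows the two halves of the behaviour described above: the server refusing an OST request that carries no capability with -EACCES, and the client turning that reply status into the "operation ost_read failed with -13" console message.

/* Minimal standalone model -- not Lustre code; all names here are illustrative. */
#include <errno.h>
#include <stdio.h>

struct capa { unsigned long long opc; };   /* stand-in for a client-supplied capability */

/* Models the server-side check quoted above: refuse OST requests that
 * carry no capability at all. */
static int auth_capa(const struct capa *capa, unsigned long long opc)
{
        if (capa == NULL) {
                fprintf(stderr, "seq/opc 0/0x%llx: no capability has been passed\n", opc);
                return -EACCES;            /* -13, sent back as the RPC reply status */
        }
        return 0;                          /* real code would also verify the capability */
}

/* Models the client side: a non-zero reply status becomes the console error
 * "Communicating with <server>, operation <op> failed with -13". */
static void check_reply_status(const char *op, int status)
{
        if (status != 0)
                fprintf(stderr, "operation %s failed with %d\n", op, status);
}

int main(void)
{
        /* A read RPC that never attached a capability. */
        check_reply_status("ost_read", auth_capa(NULL, 0x40));
        return 0;
}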
| Comments |
| Comment by Andreas Dilger [ 03/Jul/12 ] |
|
I previously saw some patches from LLNL in the security/capability code. Any chance this is due to a delta between 2.1.1-4chaos and 2.1.1-13chaos, or are there other clients running 2.1.1-13chaos that are working correctly? Can the user access the same files correctly from other clients, and conversely do other users on the 2.1.1-13chaos clients run without problems? |
| Comment by Christopher Morrone [ 03/Jul/12 ] |
|
Andreas, perhaps you are thinking of We have other 2.1.1-13chaos clients that are talking to 2.1.1-4chaos servers without this error (as far as I know...). But as far as I understand it the |
| Comment by Prakash Surya (Inactive) [ 03/Jul/12 ] |
|
From what I was told by the admins, the user is able to access the files from another cluster running the same lustre version (2.1.1-13chaos) just fine. The problem was only seen on this specific cluster (ansel). |
| Comment by Prakash Surya (Inactive) [ 03/Jul/12 ] |
|
After rebooting the clients, we are still seeing the -13 errors persist. Can we raise the priority, as this is currently affecting production?

EDIT: Sorry, I was wrong. It looks like the nodes which were rebooted are no longer hitting the issue. Why the reboot helped is still an open question. |
| Comment by Peter Jones [ 05/Jul/12 ] |
|
Bobijam, can you please look into this one? Peter |
| Comment by Prakash Surya (Inactive) [ 10/Jul/12 ] |
|
Is there any update on this? We're now seeing this on one of our SCF clusters. |
| Comment by Christopher Morrone [ 10/Jul/12 ] |
|
Whamcloud, please update bug to blocker status. This is taking our largest cluster out of action at the moment. Thanks. |
| Comment by Zhenyu Xu [ 10/Jul/12 ] |
|
So the -13 error happens on some clients, and after a reboot the clients are no longer hitting the issue? Can you please upload logs from the MDS/OSS and the affected clients here? |
| Comment by Christopher Morrone [ 10/Jul/12 ] |
|
No, the current problems are on the SCF, so no logs are available.

Why is the server complaining about capabilities? Can you explain the listed error messages?

On one of the OSS nodes, it looks to me like the ll_ost_io threads print a message like one of the following:

(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed
(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x280: no capability has been passed

and the ldlm_cn thread prints these two lines together:

(filter_capa.c:146:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed
(ost_handler.c:1534:ost_blocking_ast()) Error -13 syncing data on lock cancel

I don't see anything else in the OSS's log at our normal log level, except for some noise about a few router nodes that are down. |
| Comment by Christopher Morrone [ 10/Jul/12 ] |
|
Our default client log level isn't showing much there:

ptlrpc_check_status...Communicating with X, operation ost_write failed with -13

We also see ost_read failures with -13. And these messages too:

vvp_io_commit_write() Write page X of inode Y failed -13 |
| Comment by Christopher Morrone [ 10/Jul/12 ] |
|
We believe that rebooting made it go away on clients previously. But we don't understand what is going on yet, so I can't claim definitively that a client reboot clears the problem. We are going to drain the 300 or so nodes that are printing the errors overnight, and then reboot them in the morning (leaving a few drained for investigation). But that is just a wild attempt to restore normality. We really need to get to the bottom of the cause. |
| Comment by Zhenyu Xu [ 10/Jul/12 ] |
|
We are investigating it; a patch will be out soon. It looks like ost_blocking_ast() doesn't set a capa when calling obd_sync(). |
| Comment by Zhenyu Xu [ 11/Jul/12 ] |
|
Patch tracking at http://review.whamcloud.com/3372

obdfilter: set default capa for OST

A capability should be set for filter_sync(), and when the operation

If clients do not support capability, the server capability check |
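A small self-contained sketch of the diagnosis and the shape of the fix (names are assumed for illustration; this is not the actual change 3372): the sync performed from the blocking AST on lock cancel goes through the same capability check as client I/O, but passes no capability, so it fails with -EACCES; attaching a server-generated default capability to that internal sync avoids the failure.

/* Illustrative model only -- function and type names are invented, not Lustre's. */
#include <errno.h>
#include <stdio.h>

struct capa { int server_default; };

static int auth_capa(const struct capa *capa)
{
        if (capa == NULL)
                return -EACCES;          /* "no capability has been passed" */
        return 0;
}

/* Before: the sync done from the blocking AST carries no capability. */
static int sync_on_lock_cancel_old(void)
{
        return auth_capa(NULL);          /* -> -13, "Error -13 syncing data on lock cancel" */
}

/* After: a server-generated default capability is attached to the sync. */
static int sync_on_lock_cancel_new(void)
{
        static const struct capa default_capa = { .server_default = 1 };
        return auth_capa(&default_capa);
}

int main(void)
{
        printf("old sync path: %d\n", sync_on_lock_cancel_old());
        printf("new sync path: %d\n", sync_on_lock_cancel_new());
        return 0;
}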
| Comment by Christopher Morrone [ 11/Jul/12 ] |
|
Some of the opc values make me think that there are other problematic areas. For instance, filter_setattr() seems to be one of the call paths triggering the "no capability has been passed" error.

I see that there is now a ticket and patch to disable capa by default: http://jira.whamcloud.com/browse/LU-1621. So it sounds like our quick work-around is to just disable capa. |
| Comment by Zhenyu Xu [ 11/Jul/12 ] |
|
Unfortunately the capability feature is not complete yet; we need to disable it for now. |
| Comment by Christopher Morrone [ 11/Jul/12 ] |
|
Disabling capa on the servers got rid of the server error messages as expected, since filter_auth_capa() will now just always return 0. However, our clients are still throwing many -13 errors.

So the question now is whether this is something that is negotiated at mount time, and now that we've disabled it on the servers, whether the clients are going to be unhappy until we force them to reconnect. I'd rather not tell the admins to reboot 15000 nodes until I'm sure it will actually fix the problem. |
| Comment by Christopher Morrone [ 11/Jul/12 ] |
|
Hello? It looks like the clients are throwing the following two messages, not always at the same time:

Communicating with X, operation ost_write failed with -13
vvp_io_commit_write() write page X of inode Y failed -13

The "Communicating" error is thrown by client.c:ptlrpc_check_status(), which seems to be checking the RPC reply status. That would seem to imply that the server returned this error without a console message.

I caught a log on a client with rpctrace enabled. I can't share it, but there didn't seem to be too much info there. Server logs might be better if I can get them, but that is harder without an automated trigger. |
| Comment by Zhenyu Xu [ 11/Jul/12 ] |
|
Don't reconnect clients for now; we'll try to fix it on the server side.

Updated http://review.whamcloud.com/3372

obdfilter: fix some capa code for OST
|
| Comment by Christopher Morrone [ 12/Jul/12 ] |
|
Ah, great. We'll be trying that today on our test system. |
| Comment by Christopher Morrone [ 16/Jul/12 ] |
|
See https://github.com/chaos/lustre/tree/2.1.1-17chaos for our production solution. We took the one fix from http://review.whamcloud.com/3372 to honor the capabilities disabled flag, and then took the patch from |
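For context, "honor the capabilities disabled flag" amounts to the check in the tiny illustrative sketch below (invented names; not the actual LLNL or Whamcloud patch): when capability enforcement is switched off on the OST, the check returns success before ever looking for a client-supplied capability, so nothing is refused and nothing is logged.

/* Illustrative only -- the flag and function names are invented for this sketch. */
#include <errno.h>
#include <stddef.h>

struct ost_device {
        int capa_enabled;        /* e.g. toggled by an "OSS capa off" style setting */
};

struct capa;                     /* opaque client-supplied capability */

static int auth_capa(const struct ost_device *ost, const struct capa *capa)
{
        /* Honor the "capabilities disabled" flag first: with enforcement off,
         * a missing capability is not an error. */
        if (!ost->capa_enabled)
                return 0;

        if (capa == NULL)
                return -EACCES;  /* enforcement on and no capability passed */

        return 0;                /* real code would verify the capability here */
}

int main(void)
{
        struct ost_device ost = { .capa_enabled = 0 };
        return auth_capa(&ost, NULL);   /* returns 0: check skipped when disabled */
}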
| Comment by Jodi Levi (Inactive) [ 27/Sep/12 ] |
|
Please let me know if this needs to be reopened. |
| Comment by Pawel Dziekonski [ 03/Oct/13 ] |
|
I see the following on worker nodes:

Oct 3 08:50:58 wn612 kernel: LustreError: 22347:0:(vvp_io.c:1018:vvp_io_commit_write()) Write page 0 of inode ffff8101eae5f010 failed -13

and on servers:

Oct 3 08:57:43 oss6 kernel: LustreError: 5343:0:(filter_capa.c:151:filter_auth_capa()) seq/opc 0/0x20: no capability has been passed

On both sides I have Lustre 2.1.5. The worker nodes run Scientific Linux 5.9 and OFED-1.5.3.2; the servers run pure CentOS 6.3.

Is this the same bug? |
| Comment by Susan Coulter [ 30/Apr/14 ] |
|
We are seeing the same problem. Has a solution been found? |
| Comment by Christopher Morrone [ 30/Apr/14 ] |
|
Susan, the solution that I described above for Lustre 2.1.1-17chaos is still in the Lustre 2.1.4-*chaos releases. You should open a new ticket describing your issue. |