[LU-2145] The llog_origin_handle_create operation failed with -2 Created: 04/Jan/12 Updated: 16/Sep/13 Resolved: 16/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sequoia | ||
| Severity: | 3 |
| Project: | Orion |
| Rank (Obsolete): | 2852 |
| Description |
|
When running IOR jobs writing to our Orion test system, I am seeing many Lustre error messages of the form:

Jan 4 16:24:52 zwicky142 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:52 zwicky146 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:53 zwicky150 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:53 zwicky125 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:53 zwicky123 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:53 zwicky151 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:53 zwicky115 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
Jan 4 16:24:54 zwicky147 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2

These are all messages from the clients, with 10.1.1.211 being the address of our MDS. The clients are running:

$ rpm -qa | grep lustre-modules
lustre-modules-2.1.0-15chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64

And the servers:

# zwicky1 /root > rpm -qa | grep lustre-orion-modules
lustre-orion-modules-2.2.49.50-2chaos_2.6.32_220.2.1.1chaos.ch5.x86_64.x86_64 |
| Comments |
| Comment by Christopher Morrone [ 04/Jan/12 ] |
|
Blocker for ORI on Sequoia. |
| Comment by Peter Jones [ 04/Jan/12 ] |
|
Should this ticket have been opened under the ORI project rather than the LU project? |
| Comment by Prakash Surya (Inactive) [ 05/Jan/12 ] |
|
Yeah, sorry, it probably should have been an ORI ticket. It was too late to fix by the time I realized it was an LU. |
| Comment by Peter Jones [ 05/Jan/12 ] |
|
No problem. I have moved it to the Orion project. |
| Comment by Mikhail Pershin [ 05/Jan/12 ] |
|
Are these messages seen upon client mount, or during the IOR run itself? These can be harmless messages related to the sptlrpc llog. Is there a log (syslog or console) from any client? This bug is set as a blocker; are these messages affecting IOR execution somehow? |
| Comment by Prakash Surya (Inactive) [ 05/Jan/12 ] |
|
I haven't checked during mount time, but they are definitely seen during IOR execution itself. We'll need to get confirmation from Chris, but I think it can be demoted from a blocker since they don't appear to be reaching the application and causing failures. As far as I can tell, they are relatively harmless, although they do muddy the syslog messages quite a bit. If these messages can be safely ignored, they shouldn't make it to syslog at all, IMO. I can upload the client syslog info if it's helpful? |
| Comment by Mikhail Pershin [ 06/Jan/12 ] |
|
Yes, please upload some logs. I also agree that these messages are too noisy and should be avoided if they are harmless. |
| Comment by Prakash Surya (Inactive) [ 06/Jan/12 ] |
|
Here's one of our Lustre syslog files. zwicky[101-164] are compute nodes. Let me know if any other information or logs would be helpful. |
| Comment by Brian Behlendorf [ 04/May/12 ] |
|
With the various console cleanup patches applied, things are a little clearer. The error is in fact for the MGS, not the MDT; since it's my understanding that there's no need for the llog here, ENOENT isn't unexpected. What's unclear to me is why we're making this request to the MGS in the first place. Below is an excerpt from the MGS Lustre log.

LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2.

If someone could propose a patch to resolve this, that would be great, since we're getting a ton of noise in the console because of it.

00000100:00100000:7.0:1336148328.439828:0:5585:0:(service.c:1706:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_00:8b011063-f502-34dc-d164-c95bb23ad469+7:6944:x1400718304314232:12345-192.
20000000:00000001:7.0:1336148328.439830:0:5585:0:(mgs_handler.c:602:mgs_handle()) Process entered
20000000:01000000:7.0:1336148328.439831:0:5585:0:(mgs_handler.c:694:mgs_handle()) @@@ llog_init req@ffff880302ff9850 x1400718304314232/t0(0) o501->8b011063-f502-34dc-d164-c95bb23ad469@192.168.65.108@o2ib6:0/0 lens 264/0 e 0
00000040:00000001:7.0:1336148328.439835:0:5585:0:(llog_server.c:78:llog_origin_handle_open()) Process entered
00000040:00000040:7.0:1336148328.439837:0:5585:0:(llog_server.c:91:llog_origin_handle_open()) MGS: opening log lcz-sptlrpc
00000040:00000040:7.0:1336148328.439838:0:5585:0:(lustre_log.h:529:llog_ctxt_get()) GETting ctxt ffff8803143d8780 : new refcount 2
00000040:00000001:7.0:1336148328.439839:0:5585:0:(llog.c:691:llog_open()) Process entered
00000040:00000001:7.0:1336148328.439840:0:5585:0:(llog.c:70:llog_alloc_handle()) Process entered
00000040:00000010:7.0:1336148328.439841:0:5585:0:(llog.c:72:llog_alloc_handle()) kmalloced 'loghandle': 184 at ffff8805f800e680.
00000040:00000001:7.0:1336148328.439841:0:5585:0:(llog.c:80:llog_alloc_handle()) Process leaving (rc=18446612157949863552 : -131915759688064 : ffff8805f800e680)
00000040:00000001:7.0:1336148328.439843:0:5585:0:(llog_osd.c:925:llog_osd_open()) Process entered
00000004:00000001:7.0:1336148328.439845:0:5585:0:(osd_handler.c:1231:osd_object_read_lock()) obj=ffff880303190bf8 owner=(null) r_locks=0 w_locks=0
00000004:00000001:7.0:1336148328.439846:0:5585:0:(osd_handler.c:2349:osd_index_lookup()) Process entered
00000004:00000001:7.0:1336148328.439859:0:5585:0:(osd_handler.c:2358:osd_index_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000004:00000001:7.0:1336148328.439860:0:5585:0:(osd_handler.c:1262:osd_object_read_unlock()) obj=ffff880303190bf8 owner=(null) r_locks=1 w_locks=0
00000040:00000001:7.0:1336148328.439861:0:5585:0:(llog_osd.c:951:llog_osd_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
00000040:00000001:7.0:1336148328.439862:0:5585:0:(llog_osd.c:988:llog_osd_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000010:7.0:1336148328.439863:0:5585:0:(llog.c:100:llog_free_handle()) kfreed 'loghandle': 184 at ffff8805f800e680.
00000040:00000001:7.0:1336148328.439864:0:5585:0:(llog.c:717:llog_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000001:7.0:1336148328.439865:0:5585:0:(llog_server.c:104:llog_origin_handle_open()) Process leaving via out_pop (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
00000040:00000040:7.0:1336148328.439866:0:5585:0:(lustre_log.h:539:llog_ctxt_put()) PUTting ctxt ffff8803143d8780 : new refcount 1
00000040:00000001:7.0:1336148328.439867:0:5585:0:(llog_server.c:117:llog_origin_handle_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe) |
| Comment by Andreas Dilger [ 04/May/12 ] |
|
The problem here is that this RPC handler function is returning an error code of -2 (i.e. as if it had hit some error handling the RPC itself), instead of returning 0 and setting rq_status = -2 (i.e. the RPC was handled correctly, but the operation it performed produced an error).
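
To make the two return paths concrete, here is a minimal sketch of the pattern being asked for; the struct and function names are illustrative stand-ins, not the actual MGS code:

#include <errno.h>

/* Illustrative stand-ins for ptlrpc's request handling -- not the
 * actual Lustre code. */
struct fake_request {
        int rq_status;  /* per-operation result, carried in the reply */
};

static int fake_llog_open(struct fake_request *req)
{
        return -ENOENT; /* pretend the requested llog does not exist */
}

static int fake_handler(struct fake_request *req)
{
        int rc = fake_llog_open(req);

        /* Wrong: "return rc;" here would make the RPC layer treat the
         * missing llog as a failure to process the RPC itself, which is
         * what produces the client console errors in this ticket. */

        /* Right: the RPC was processed fine; the operation result goes
         * into rq_status and the handler itself reports success. */
        req->rq_status = rc;
        return 0;
}

With the second form, the client receives a normal reply and checks rq_status itself, with nothing printed to the console. |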
| Comment by Mikhail Pershin [ 04/May/12 ] |
|
This should be done in the same manner as the MDT handler, using the err_serious() primitive to distinguish serious errors (returned as error codes) from processing errors (returned in rq_status). I expect that shouldn't be hard. |
| Comment by Brian Behlendorf [ 14/May/12 ] |
|
We're also seeing bursts of the following warning, presumably for reasons similar to those described above. Because these errors add quite a bit of noise to the logs, there's a good chance that this week I'll put a patch together to address them as Mike suggests, using err_serious(). It looks as if I can just use the MDT as an example and add the needed infrastructure to the MGS.

May 14 13:52:43 grove608 kernel: LustreError: 42024:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2 |
| Comment by Prakash Surya (Inactive) [ 16/May/12 ] |
|
I'm looking at making a patch for this issue, but I'm a little confused as to when err_serious() should be used and when it shouldn't. Can anybody provide me with some background info as to when and why err_serious() should be used? I'm looking at the MDT handler code, but can't spot a pattern to its usage there. |
| Comment by Andreas Dilger [ 16/May/12 ] |
|
In the "good old days" (before err_serious() existed) the rule was that if there was an error processing the RPC itself (e.g. corruption in the request, badly-formed RPC message buffers, bad opcode, etc.) then the request handler would return a non-zero error (which eventually is stored in rq_status/pb_status), rq_type = PTL_RPC_MSG_ERR is returned (via ptlrpc_send_error()), and this would result in a (hopefully rare) console message. If there was an error while executing the request (e.g. a permission error on a file, ENOMEM/ENOSPC/... in the filesystem, etc.) then the operation error is stored in rq_status and 0 is returned to the RPC handler code, which sets rq_type to PTL_RPC_MSG_REPLY (via ptlrpc_send_reply()), with nothing printed on the console. In some cases, code got this wrong and incorrectly caused console messages to be printed (which should be considered a bug to be fixed). I don't know the details of how err_serious() factors into this, but maybe this gives you some background. Hopefully Alex or Mike can chime in on the details.
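
A condensed sketch of that dispatch decision, with types and send paths reduced to stand-ins (the real logic lives in ptlrpc's service code, so treat every name below as illustrative):

/* Illustrative stand-ins for ptlrpc's reply paths -- not the real code. */
enum fake_msg_type { FAKE_MSG_REPLY, FAKE_MSG_ERR };

struct fake_request {
        int rq_status;               /* operation result for the client */
        enum fake_msg_type rq_type;  /* kind of message sent back */
};

static void fake_dispatch_reply(struct fake_request *req, int handler_rc)
{
        if (handler_rc != 0) {
                /* The handler failed to process the RPC itself (bad
                 * opcode, unpack failure, ...): send an error-type
                 * message, which the client logs on the console. */
                req->rq_status = handler_rc;
                req->rq_type = FAKE_MSG_ERR;
        } else {
                /* The RPC was processed; rq_status already carries the
                 * operation result (0, -ENOENT, ...) and travels
                 * silently inside a normal reply. */
                req->rq_type = FAKE_MSG_REPLY;
        }
}

Under this split, an expected ENOENT from an optional llog belongs on the quiet reply path, not the console-visible error path. |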
| Comment by Mikhail Pershin [ 17/May/12 ] |
|
Right, err_serious() serves the same purpose. See mdt_req_handle(): it is the common handler which calls the proper operation handler and gets 'rc' back, and there is no way to know what kind of 'rc' it is; check the comments there:

if (likely(rc == 0)) {
        /*
         * Process request, there can be two types of rc:
         * 1) errors with msg unpack/pack, other failures outside the
         * operation itself. This is counted as serious errors;
         * 2) errors during fs operation, should be placed in rq_status
         * only
         */
        rc = h->mh_act(info);
        if (rc == 0 && !req->rq_no_reply && req->rq_reply_state == NULL) {
                DEBUG_REQ(D_ERROR, req, "MDT \"handler\" %s did not "
                          "pack reply and returned 0 error\n", h->mh_name);
                LBUG();
        }
        serious = is_serious(rc);
        rc = clear_serious(rc);

This technique can be used for mgs_handler as well, if it calls a function which does not do pure processing but also does pack/unpack work and other sanity checks of RPC consistency.
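
For readers unfamiliar with the primitive, here is a plausible sketch of the tagging behind err_serious()/is_serious()/clear_serious(), assuming the MDT's scheme of OR-ing a flag bit into the negative errno (the flag value is illustrative; the real definitions live in the MDT headers):

/* Sketch of serious-error tagging -- flag value is illustrative. */
#define ESERIOUS 0x0001000

static inline int err_serious(int rc)
{
        /* Tag a negative errno as a "serious" (RPC-level) error. */
        return -(-rc | ESERIOUS);
}

static inline int is_serious(int rc)
{
        return rc < 0 && (-rc & ESERIOUS);
}

static inline int clear_serious(int rc)
{
        /* Strip the tag to recover the plain errno for rq_status. */
        if (is_serious(rc))
                rc = -(-rc & ~ESERIOUS);
        return rc;
}

An operation handler can then return err_serious(rc) for RPC-level failures and a plain rc for operation results; the common handler splits them apart with is_serious()/clear_serious() before deciding between rq_status and an error reply. |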
| Comment by Prakash Surya (Inactive) [ 17/May/12 ] |
|
Thanks! That should help me get started. |
| Comment by Brian Behlendorf [ 06/Jun/12 ] |
|
Mike, the following patch fully implements the MDT-style ptlrpc handlers for the MGS. In the process of implementing that support, the console warnings which prompted me to file this issue are fixed. |
| Comment by Mikhail Pershin [ 11/Oct/12 ] |
|
Can I move this ticket to the Lustre project, or does someone need it with its historical ORI number? |
| Comment by Peter Jones [ 11/Oct/12 ] |
|
It's ok Mike - anyone trying to use the original ORI number will find it resolves to this ticket in its new location |
| Comment by Ian Colle (Inactive) [ 25/Oct/12 ] |
|
Patches http://review.whamcloud.com/#change,4258 and http://review.whamcloud.com/#change,4273 both landed. Can it be closed now? |
| Comment by Christopher Morrone [ 25/Oct/12 ] |
|
I've pulled the first one into our tree and will make sure the message is really gone. |
| Comment by Mikhail Pershin [ 30/Oct/12 ] |
|
Sorry, I missed the last discussion here. No, it is far from being closed; these two patches are just preparation for more patches, which will be pushed in a couple of days. |
| Comment by Mikhail Pershin [ 12/Nov/12 ] |
|
It is taking more time than expected. I am trying to do it the correct way: first move the ptlrpc services start/stop into the target, and then handle request processing in a common way. The estimated date for completion of the first part is next week. |
| Comment by Andreas Dilger [ 16/Nov/12 ] |
|
Mike, is the original problem in this bug fixed, and are you now just working on the unified target code cleanup? If that is the case, I'd like to reduce the severity of this bug from being a 2.4.0 blocker. |
| Comment by Christopher Morrone [ 27/Nov/12 ] |
|
How is it going, Mike? |
| Comment by Mikhail Pershin [ 29/Nov/12 ] |
|
Chris, still in progress: I am adapting Brian's patch to master, trying to keep it small and avoid touching the MDS side, as it is critical for the DNE landing. I expect to push the patch to Gerrit in a few days. |
| Comment by Mikhail Pershin [ 04/Dec/12 ] |
|
Move MGS ptlrpc services into the unified server: http://review.whamcloud.com/#change,4715 |
| Comment by Mikhail Pershin [ 13/Dec/12 ] |
|
http://review.whamcloud.com/4826 is a combined patch to fix the initial problem. All service handling is kept in the MGS for now, but the unified request handler is used to handle MGS requests. |
| Comment by Jodi Levi (Inactive) [ 08/Mar/13 ] |
|
Please confirm that there is no defect outstanding and this is just the Unified Target patch being landed. |
| Comment by Mikhail Pershin [ 10/Mar/13 ] |
|
This patch solves the initial problem with the spamming error messages, and it is related to the unified target work at the same time. |
| Comment by James A Simmons [ 02/Apr/13 ] |
|
This patch, http://review.whamcloud.com/4826, seems to have gone into limbo. |
| Comment by Peter Jones [ 02/Apr/13 ] |
|
James, indeed. The senior engineers involved are entirely focused on the remaining issues blocking 2.4. This work is expected to be brought to completion for 2.5. Peter |
| Comment by Christopher Morrone [ 30/Apr/13 ] |
|
Something needs to happen before 2.4.0 to stop the console error message spam. |
| Comment by Andreas Dilger [ 01/May/13 ] |
|
Chris, what are the current spamming messages? The llog code has undergone some significant restructuring already, and I can't find either llog_origin_handle_create() or "operation failed" in the current code. I'm happy to land a patch to quiet the spammy messages, but the unified target patch is 1500 lines, and at this point we are down to the last few blockers and trying to limit the size/complexity of landed patches. |
| Comment by Christopher Morrone [ 01/May/13 ] |
|
Andreas, I think the error comes from ptlrpc_check_status(). On master the message has changed a bit, so a simple grep for "operation failed" will miss it, and "llog_origin_handle_create" is a %s that is filled in by ll_opcode2str(opc). But as far as I know, the situation that triggers ptlrpc_check_status() into printing an error message still remains. According to the most recent comment from Mike in this ticket, the original problem for which this ticket was opened is only fixed in the larger UT patch that has not yet landed. I too am fine with a patch that just quiets the console message, as long as nothing bad is really happening behind the scenes that will bite us down the road. That is why I posted that something needs to happen; I wasn't particularly advocating for UT this close to the release date.
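
To illustrate why the grep misses it, here is a rough sketch of the client-side check being described; every type, helper, and message string below is an approximation, not the actual ptlrpc_check_status() code:

#include <stdio.h>

/* Approximate stand-ins; the real check is ptlrpc_check_status(). */
enum fake_msg_type { FAKE_MSG_REPLY, FAKE_MSG_ERR };

struct fake_request {
        enum fake_msg_type rq_type;
        int rq_status;
        int rq_opc;
};

static const char *fake_opcode2str(int opc)
{
        /* Stands in for ll_opcode2str(): the opcode name is a %s
         * argument, so grepping the sources for the name never finds
         * the message. Opcode 501 matches the "o501" llog request seen
         * in the MGS debug log earlier in this ticket. */
        return opc == 501 ? "llog_origin_handle_create" : "unknown";
}

static int fake_check_status(struct fake_request *req)
{
        if (req->rq_type == FAKE_MSG_ERR)
                /* Only an error-type reply reaches the console. */
                fprintf(stderr, "LustreError: operation %s failed: rc = %d\n",
                        fake_opcode2str(req->rq_opc), req->rq_status);
        return req->rq_status;
}

So quieting the noise means keeping expected operation errors on the normal-reply path server-side, rather than filtering the message here. |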
| Comment by Mikhail Pershin [ 16/Sep/13 ] |
|
The patch has landed. |