
The llog_origin_handle_create operation failed with -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • 3
    • Orion
    • 2852

    Description

      When running IOR jobs that write to our Orion test system, I am seeing many Lustre error messages of the form:

      Jan  4 16:24:52 zwicky142 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:52 zwicky146 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky150 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky125 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky123 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky151 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky115 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:54 zwicky147 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      

      These are all messages from the clients, with 10.1.1.211 being the address of our MDS.

      The clients are running:

      $ rpm -qa | grep lustre-modules
      lustre-modules-2.1.0-15chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
      

      And the servers:

      # zwicky1 /root > rpm -qa | grep lustre-orion-modules
      lustre-orion-modules-2.2.49.50-2chaos_2.6.32_220.2.1.1chaos.ch5.x86_64.x86_64
      

      Attachments

        Issue Links

          Activity

            [LU-2145] The llog_origin_handle_create operation failed with -2

            Can I move this ticket to the Lustre project, or does someone need it under its historical ORI-455 ID?

            tappro Mikhail Pershin added a comment

            Mike, the following patch fully implements the MDT-style ptlrpc handlers for the MGS. In the process of implementing that support, the console warnings which prompted me to file this issue are fixed.

            http://review.whamcloud.com/3055

            behlendorf Brian Behlendorf added a comment

            Thanks! That should help me get started.

            prakash Prakash Surya (Inactive) added a comment

            Right, err_serious() serves the same purpose; see mdt_req_handle(). It is the common handler which calls the proper operation handler and gets 'rc' back, and there is no way to tell the type of 'rc' otherwise; check the comments there:

                    if (likely(rc == 0)) {
                            /*
                             * Process request, there can be two types of rc:
                             * 1) errors with msg unpack/pack, other failures outside the
                             * operation itself. This is counted as serious errors;
                             * 2) errors during fs operation, should be placed in rq_status
                             * only
                             */
                            rc = h->mh_act(info);
                            if (rc == 0 &&
                                !req->rq_no_reply && req->rq_reply_state == NULL) {
                                    DEBUG_REQ(D_ERROR, req, "MDT \"handler\" %s did not "
                                              "pack reply and returned 0 error\n",
                                              h->mh_name);
                                    LBUG();
                            }
                            serious = is_serious(rc);
                            rc = clear_serious(rc);
            

            The same technique can be used for mgs_handler as well, if it calls a function that does not only do the pure processing but also pack/unpack work and other sanity checks of RPC consistency.

            tappro Mikhail Pershin added a comment
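
            For reference, here is a minimal sketch of how the same split could look on the MGS side, reusing the is_serious()/clear_serious() helpers from the MDT excerpt above. This is not the code from http://review.whamcloud.com/3055; mgs_req_handle() and mgs_op_handle() are hypothetical names used only to illustrate the pattern.

            /* Sketch only: mgs_req_handle() and mgs_op_handle() are hypothetical
             * names illustrating the MDT-style split between "serious" RPC-level
             * errors and per-operation errors that belong in rq_status. */
            static int mgs_req_handle(struct ptlrpc_request *req)
            {
                    int rc;
                    int serious;

                    /* The operation handler wraps pack/unpack and protocol failures
                     * with err_serious(); ordinary errors such as -ENOENT come back
                     * as plain negative values. */
                    rc = mgs_op_handle(req);

                    serious = is_serious(rc);
                    rc = clear_serious(rc);

                    if (serious)
                            /* Fail the RPC itself: the error becomes the handler's
                             * return code and the client logs a console message. */
                            return rc;

                    /* Operation error (or success): report it in rq_status and
                     * return 0 so ptlrpc sends a normal reply, not an error. */
                    req->rq_status = rc;
                    return 0;
            }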

            In the "good old days" (before err_serious existed) the rule was that if there was an error processing the RPC itself (e.g. corruption in the request, badly-formed RPC message buffers, bad opcode, etc) then the request handler would return a non-zero error (that eventually is stored in rq_status/pb_status) and rq_type = PTL_RPC_MSG_ERR is returned (via ptlrpc_send_error()) and this would result in a (hopefully rare) console message.

            If there was an error while executing the request (e.g. permission error on a file, ENOMEM/ENOSPC/... in the filesystem, etc) then the operation error is stored in rq_status and returns 0 to the RPC handler code, which set the rq_type is PTL_RPC_MSG_REPLY (via ptlrpc_send_reply()), with nothing printed on the console. In some cases, code got this wrong and incorrectly caused console messages to be printed (which should be considered a bug to be fixed).

            I don't know the details of how error_serious() factors into this, but maybe this gives you some background. Hopefully Alex or Mike can chime in on the details.

            adilger Andreas Dilger added a comment
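
            To make the rule above concrete, here is a simplified sketch of the server-side decision from the dispatcher's point of view. It is a stand-in for the real ptlrpc service code rather than a copy of it, and send_result_sketch() is a made-up name; only ptlrpc_send_error(), ptlrpc_send_reply(), rq_status and the PTL_RPC_MSG_* types come from the description above.

            /* Simplified stand-in for the ptlrpc service behaviour described above. */
            static void send_result_sketch(struct ptlrpc_request *req, int handler_rc)
            {
                    if (handler_rc != 0) {
                            /* The handler failed to process the RPC itself: rq_type
                             * becomes PTL_RPC_MSG_ERR via ptlrpc_send_error() and the
                             * client prints a console message. */
                            req->rq_status = handler_rc;
                            ptlrpc_send_error(req, 0);
                    } else {
                            /* The RPC was handled: any operation error is already in
                             * rq_status and travels in a normal PTL_RPC_MSG_REPLY via
                             * ptlrpc_send_reply(), with nothing on the console. */
                            ptlrpc_send_reply(req, 0);
                    }
            }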

            I'm looking at making a patch for this issue, but I'm a little confused as to when err_serious() should be used and when it shouldn't. Can anybody provide me with some background info as to when and why err_serious() should be used? I'm looking at the MDT handler code, but can't spot a pattern to its usage there.

            prakash Prakash Surya (Inactive) added a comment

            We're also seeing bursts of the following warning, presumably for similar reasons to those described above. Because these errors add quite a bit of noise to the logs, there's a good chance that this week I'll put a patch together to address them, as Mike suggests, using err_serious(). It looks as if I can just use the MDT as an example and add the needed infrastructure to the MGS.

            May 14 13:52:43 grove608 kernel: LustreError: 42024:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
            
            behlendorf Brian Behlendorf added a comment

            This should be done in the same manner as the MDT handler, using the err_serious() primitive to distinguish serious errors, which are returned as an error code, from processing errors, which are returned in rq_status. I expect that shouldn't be hard.

            tappro Mikhail Pershin added a comment

            The problem here is that this RPC handler function is returning an error code of -2 (i.e. as if it had some error handling the RPC itself), instead of returning 0 and setting rq_status = -2 (i.e. the RPC was handled correctly, but the operation produced an error).

            adilger Andreas Dilger added a comment
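
            To illustrate that distinction for this particular failure, here is a sketch of how an llog open handler could report a missing log. This is not the actual llog_origin_handle_open() code; llog_open_on_mgs() is a hypothetical stand-in for the llog_open() path seen in the debug trace below, which returns -ENOENT when the requested log (e.g. the sptlrpc log) does not exist.

            /* Sketch only: llog_open_on_mgs() is a hypothetical helper standing in
             * for the real llog open path. */
            static int llog_open_handler_sketch(struct ptlrpc_request *req,
                                                const char *logname)
            {
                    int rc;

                    rc = llog_open_on_mgs(logname);
                    if (rc == -ENOENT) {
                            /* The RPC was processed fine; the log simply isn't there.
                             * Report that in rq_status and send a normal reply, so the
                             * client does not print a console error. */
                            req->rq_status = rc;
                            return 0;
                    }
                    if (rc < 0)
                            /* Anything else is treated as a failure of the request
                             * itself and becomes the handler's return code. */
                            return rc;

                    req->rq_status = 0;
                    return 0;
            }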
            behlendorf Brian Behlendorf added a comment - edited

            With the various console cleanup patches applied, things are a little clearer. The error is in fact for the MGS, not the MDT, and since it's my understanding that there's no need for the llog here, ENOENT isn't unexpected. What's unclear to me is why we're making this request to the MGS in the first place. Below is an excerpt from the MGS Lustre log.

             LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2. 

            If someone could propose a patch to resolve this, that would be great, since we're getting a ton of noise on the console because of it.

            00000100:00100000:7.0:1336148328.439828:0:5585:0:(service.c:1706:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_00:8b011063-f502-34dc-d164-c95bb23ad469+7:6944:x1400718304314232:12345-192.
            20000000:00000001:7.0:1336148328.439830:0:5585:0:(mgs_handler.c:602:mgs_handle()) Process entered
            20000000:01000000:7.0:1336148328.439831:0:5585:0:(mgs_handler.c:694:mgs_handle()) @@@ llog_init  req@ffff880302ff9850 x1400718304314232/t0(0) o501->8b011063-f502-34dc-d164-c95bb23ad469@192.168.65.108@o2ib6:0/0 lens 264/0 e 0
            00000040:00000001:7.0:1336148328.439835:0:5585:0:(llog_server.c:78:llog_origin_handle_open()) Process entered
            00000040:00000040:7.0:1336148328.439837:0:5585:0:(llog_server.c:91:llog_origin_handle_open()) MGS: opening log lcz-sptlrpc
            00000040:00000040:7.0:1336148328.439838:0:5585:0:(lustre_log.h:529:llog_ctxt_get()) GETting ctxt ffff8803143d8780 : new refcount 2
            00000040:00000001:7.0:1336148328.439839:0:5585:0:(llog.c:691:llog_open()) Process entered
            00000040:00000001:7.0:1336148328.439840:0:5585:0:(llog.c:70:llog_alloc_handle()) Process entered
            00000040:00000010:7.0:1336148328.439841:0:5585:0:(llog.c:72:llog_alloc_handle()) kmalloced 'loghandle': 184 at ffff8805f800e680.
            00000040:00000001:7.0:1336148328.439841:0:5585:0:(llog.c:80:llog_alloc_handle()) Process leaving (rc=18446612157949863552 : -131915759688064 : ffff8805f800e680)
            00000040:00000001:7.0:1336148328.439843:0:5585:0:(llog_osd.c:925:llog_osd_open()) Process entered
            00000004:00000001:7.0:1336148328.439845:0:5585:0:(osd_handler.c:1231:osd_object_read_lock()) obj=ffff880303190bf8 owner=(null) r_locks=0 w_locks=0
            00000004:00000001:7.0:1336148328.439846:0:5585:0:(osd_handler.c:2349:osd_index_lookup()) Process entered
            00000004:00000001:7.0:1336148328.439859:0:5585:0:(osd_handler.c:2358:osd_index_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000004:00000001:7.0:1336148328.439860:0:5585:0:(osd_handler.c:1262:osd_object_read_unlock()) obj=ffff880303190bf8 owner=(null) r_locks=1 w_locks=0
            00000040:00000001:7.0:1336148328.439861:0:5585:0:(llog_osd.c:951:llog_osd_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            00000040:00000001:7.0:1336148328.439862:0:5585:0:(llog_osd.c:988:llog_osd_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000040:00000010:7.0:1336148328.439863:0:5585:0:(llog.c:100:llog_free_handle()) kfreed 'loghandle': 184 at ffff8805f800e680.
            00000040:00000001:7.0:1336148328.439864:0:5585:0:(llog.c:717:llog_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000040:00000001:7.0:1336148328.439865:0:5585:0:(llog_server.c:104:llog_origin_handle_open()) Process leaving via out_pop (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            00000040:00000040:7.0:1336148328.439866:0:5585:0:(lustre_log.h:539:llog_ctxt_put()) PUTting ctxt ffff8803143d8780 : new refcount 1
            00000040:00000001:7.0:1336148328.439867:0:5585:0:(llog_server.c:117:llog_origin_handle_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            

            Here's one of our Lustre syslog files.

            zwicky[101-164] are compute nodes.
            zwicky[1-2],zwicky-mds1 are the Orion servers.

            Let me know if any other information or logs would be helpful.

            prakash Prakash Surya (Inactive) added a comment

            People

              Assignee: tappro Mikhail Pershin
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 8
