
The llog_origin_handle_create operation failed with -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • 3
    • Orion
    • 2852

    Description

      When running IOR jobs that write to our Orion test system, I am seeing many Lustre error messages of the form:

      Jan  4 16:24:52 zwicky142 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:52 zwicky146 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky150 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky125 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky123 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky151 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky115 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:54 zwicky147 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      

      These are all messages from the clients, with 10.1.1.211 being the address of our MDS.

      The clients are running:

      $ rpm -qa | grep lustre-modules
      lustre-modules-2.1.0-15chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
      

      And the servers:

      # zwicky1 /root > rpm -qa | grep lustre-orion-modules
      lustre-orion-modules-2.2.49.50-2chaos_2.6.32_220.2.1.1chaos.ch5.x86_64.x86_64
      

      Attachments

        Issue Links

          Activity

            [LU-2145] The llog_origin_handle_create operation failed with -2

            Can I move this ticket to the Lustre project, or does someone need it under its historical ORI-455 ID?

            tappro Mikhail Pershin added a comment

            Mike, the following patch fully implements the MDT-style ptlrpc handlers for the MGS. In the process of implementing that support, the console warnings which prompted me to file this issue are fixed.

            http://review.whamcloud.com/3055

            behlendorf Brian Behlendorf added a comment

            Thanks! That should help me get started.

            prakash Prakash Surya (Inactive) added a comment

            Right, err_serious() serves the same purpose; see mdt_req_handle(). It is the common handler which calls the proper operation handler and gets 'rc' back, and there is no way to tell the type of 'rc' otherwise; check the comments there:

                    if (likely(rc == 0)) {
                            /*
                             * Process request, there can be two types of rc:
                             * 1) errors with msg unpack/pack, other failures outside the
                             * operation itself. This is counted as serious errors;
                             * 2) errors during fs operation, should be placed in rq_status
                             * only
                             */
                            rc = h->mh_act(info);
                            if (rc == 0 &&
                                !req->rq_no_reply && req->rq_reply_state == NULL) {
                                    DEBUG_REQ(D_ERROR, req, "MDT \"handler\" %s did not "
                                              "pack reply and returned 0 error\n",
                                              h->mh_name);
                                    LBUG();
                            }
                            serious = is_serious(rc);
                            rc = clear_serious(rc);
            

            The same technique can be used for mgs_handler as well, if it calls a function that does not only do the pure processing but also pack/unpack work and other sanity checks of RPC consistency.

            tappro Mikhail Pershin added a comment
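
            For reference, here is a minimal sketch of how the same split could look on the MGS side, reusing the is_serious()/clear_serious() helpers from the MDT excerpt above. This is not the code from http://review.whamcloud.com/3055; mgs_req_handle() and mgs_op_handle() are hypothetical names used only to illustrate the pattern.

            /* Sketch only: mgs_req_handle() and mgs_op_handle() are hypothetical
             * names illustrating the MDT-style split between "serious" RPC-level
             * errors and per-operation errors that belong in rq_status. */
            static int mgs_req_handle(struct ptlrpc_request *req)
            {
                    int rc;
                    int serious;

                    /* The operation handler wraps pack/unpack and protocol failures
                     * with err_serious(); ordinary errors such as -ENOENT come back
                     * as plain negative values. */
                    rc = mgs_op_handle(req);

                    serious = is_serious(rc);
                    rc = clear_serious(rc);

                    if (serious)
                            /* Fail the RPC itself: the error becomes the handler's
                             * return code and the client logs a console message. */
                            return rc;

                    /* Operation error (or success): report it in rq_status and
                     * return 0 so ptlrpc sends a normal reply, not an error. */
                    req->rq_status = rc;
                    return 0;
            }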

            In the "good old days" (before err_serious existed) the rule was that if there was an error processing the RPC itself (e.g. corruption in the request, badly-formed RPC message buffers, bad opcode, etc) then the request handler would return a non-zero error (that eventually is stored in rq_status/pb_status) and rq_type = PTL_RPC_MSG_ERR is returned (via ptlrpc_send_error()) and this would result in a (hopefully rare) console message.

            If there was an error while executing the request (e.g. permission error on a file, ENOMEM/ENOSPC/... in the filesystem, etc) then the operation error is stored in rq_status and returns 0 to the RPC handler code, which set the rq_type is PTL_RPC_MSG_REPLY (via ptlrpc_send_reply()), with nothing printed on the console. In some cases, code got this wrong and incorrectly caused console messages to be printed (which should be considered a bug to be fixed).

            I don't know the details of how error_serious() factors into this, but maybe this gives you some background. Hopefully Alex or Mike can chime in on the details.

            adilger Andreas Dilger added a comment
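
            To make the rule above concrete, here is a simplified sketch of the server-side decision from the dispatcher's point of view. It is a stand-in for the real ptlrpc service code rather than a copy of it, and send_result_sketch() is a made-up name; only ptlrpc_send_error(), ptlrpc_send_reply(), rq_status and the PTL_RPC_MSG_* types come from the description above.

            /* Simplified stand-in for the ptlrpc service behaviour described above. */
            static void send_result_sketch(struct ptlrpc_request *req, int handler_rc)
            {
                    if (handler_rc != 0) {
                            /* The handler failed to process the RPC itself: rq_type
                             * becomes PTL_RPC_MSG_ERR via ptlrpc_send_error() and the
                             * client prints a console message. */
                            req->rq_status = handler_rc;
                            ptlrpc_send_error(req, 0);
                    } else {
                            /* The RPC was handled: any operation error is already in
                             * rq_status and travels in a normal PTL_RPC_MSG_REPLY via
                             * ptlrpc_send_reply(), with nothing on the console. */
                            ptlrpc_send_reply(req, 0);
                    }
            }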

            I'm looking at making a patch for this issue, but I'm a little confused as to when err_serious() should be used and when it shouldn't. Can anybody provide me with some background info as to when and why err_serious() should be used? I'm looking at the MDT handler code, but can't spot a pattern to its usage there.

            prakash Prakash Surya (Inactive) added a comment

            We're also seeing bursts of the following warning, presumably for similar reasons to those described above. Because these errors add quite a bit of noise to the logs, there's a good chance that this week I'll put a patch together to address them, as Mike suggests, using err_serious(). It looks as if I can just use the MDT as an example and add the needed infrastructure to the MGS.

            May 14 13:52:43 grove608 kernel: LustreError: 42024:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
            
            behlendorf Brian Behlendorf added a comment

            This should be done in the same manner as the MDT handler, using the err_serious() primitive to distinguish serious errors, which are returned as an error code, from processing errors, which are returned in rq_status. I expect that shouldn't be hard.

            tappro Mikhail Pershin added a comment

            The problem here is that this RPC handler function is returning an error code of -2 (i.e. as if it had some error handling the RPC itself), instead of returning 0 and setting rq_status = -2 (i.e. the RPC was handled correctly, but the operation produced an error).

            adilger Andreas Dilger added a comment
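
            To illustrate that distinction for this particular failure, here is a sketch of how an llog open handler could report a missing log. This is not the actual llog_origin_handle_open() code; llog_open_on_mgs() is a hypothetical stand-in for the llog_open() path seen in the debug trace below, which returns -ENOENT when the requested log (e.g. the sptlrpc log) does not exist.

            /* Sketch only: llog_open_on_mgs() is a hypothetical helper standing in
             * for the real llog open path. */
            static int llog_open_handler_sketch(struct ptlrpc_request *req,
                                                const char *logname)
            {
                    int rc;

                    rc = llog_open_on_mgs(logname);
                    if (rc == -ENOENT) {
                            /* The RPC was processed fine; the log simply isn't there.
                             * Report that in rq_status and send a normal reply, so the
                             * client does not print a console error. */
                            req->rq_status = rc;
                            return 0;
                    }
                    if (rc < 0)
                            /* Anything else is treated as a failure of the request
                             * itself and becomes the handler's return code. */
                            return rc;

                    req->rq_status = 0;
                    return 0;
            }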
            behlendorf Brian Behlendorf added a comment - edited

            With the various console cleanup patches applied, things are a little clearer. The error is in fact for the MGS, not the MDT, and since it's my understanding that there's no need for the llog here, ENOENT isn't unexpected. What's unclear to me is why we're making this request to the MGS in the first place. Below is an excerpt from the MGS Lustre log.

             LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2. 

            If someone could propose a patch to resolve this, that would be great, since we're getting a ton of noise on the console because of it.

            00000100:00100000:7.0:1336148328.439828:0:5585:0:(service.c:1706:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_00:8b011063-f502-34dc-d164-c95bb23ad469+7:6944:x1400718304314232:12345-192.
            20000000:00000001:7.0:1336148328.439830:0:5585:0:(mgs_handler.c:602:mgs_handle()) Process entered
            20000000:01000000:7.0:1336148328.439831:0:5585:0:(mgs_handler.c:694:mgs_handle()) @@@ llog_init  req@ffff880302ff9850 x1400718304314232/t0(0) o501->8b011063-f502-34dc-d164-c95bb23ad469@192.168.65.108@o2ib6:0/0 lens 264/0 e 0
            00000040:00000001:7.0:1336148328.439835:0:5585:0:(llog_server.c:78:llog_origin_handle_open()) Process entered
            00000040:00000040:7.0:1336148328.439837:0:5585:0:(llog_server.c:91:llog_origin_handle_open()) MGS: opening log lcz-sptlrpc
            00000040:00000040:7.0:1336148328.439838:0:5585:0:(lustre_log.h:529:llog_ctxt_get()) GETting ctxt ffff8803143d8780 : new refcount 2
            00000040:00000001:7.0:1336148328.439839:0:5585:0:(llog.c:691:llog_open()) Process entered
            00000040:00000001:7.0:1336148328.439840:0:5585:0:(llog.c:70:llog_alloc_handle()) Process entered
            00000040:00000010:7.0:1336148328.439841:0:5585:0:(llog.c:72:llog_alloc_handle()) kmalloced 'loghandle': 184 at ffff8805f800e680.
            00000040:00000001:7.0:1336148328.439841:0:5585:0:(llog.c:80:llog_alloc_handle()) Process leaving (rc=18446612157949863552 : -131915759688064 : ffff8805f800e680)
            00000040:00000001:7.0:1336148328.439843:0:5585:0:(llog_osd.c:925:llog_osd_open()) Process entered
            00000004:00000001:7.0:1336148328.439845:0:5585:0:(osd_handler.c:1231:osd_object_read_lock()) obj=ffff880303190bf8 owner=(null) r_locks=0 w_locks=0
            00000004:00000001:7.0:1336148328.439846:0:5585:0:(osd_handler.c:2349:osd_index_lookup()) Process entered
            00000004:00000001:7.0:1336148328.439859:0:5585:0:(osd_handler.c:2358:osd_index_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000004:00000001:7.0:1336148328.439860:0:5585:0:(osd_handler.c:1262:osd_object_read_unlock()) obj=ffff880303190bf8 owner=(null) r_locks=1 w_locks=0
            00000040:00000001:7.0:1336148328.439861:0:5585:0:(llog_osd.c:951:llog_osd_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            00000040:00000001:7.0:1336148328.439862:0:5585:0:(llog_osd.c:988:llog_osd_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000040:00000010:7.0:1336148328.439863:0:5585:0:(llog.c:100:llog_free_handle()) kfreed 'loghandle': 184 at ffff8805f800e680.
            00000040:00000001:7.0:1336148328.439864:0:5585:0:(llog.c:717:llog_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000040:00000001:7.0:1336148328.439865:0:5585:0:(llog_server.c:104:llog_origin_handle_open()) Process leaving via out_pop (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            00000040:00000040:7.0:1336148328.439866:0:5585:0:(lustre_log.h:539:llog_ctxt_put()) PUTting ctxt ffff8803143d8780 : new refcount 1
            00000040:00000001:7.0:1336148328.439867:0:5585:0:(llog_server.c:117:llog_origin_handle_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            

            Here's one of our Lustre syslog files.

            zwicky[101-164] are compute nodes.
            zwicky[1-2],zwicky-mds1 are the Orion servers.

            Let me know if any other information or logs would be helpful.

            prakash Prakash Surya (Inactive) added a comment

            People

              Assignee: tappro Mikhail Pershin
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 8
