Lustre / LU-2145

The llog_origin_handle_create operation failed with -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.5.0
    • Affects Versions: Lustre 2.4.0, Lustre 2.5.0
    • 3
    • Orion
    • 2852

    Description

      When running IOR writes on our Orion test system, I am seeing many Lustre error messages of the form:

      Jan  4 16:24:52 zwicky142 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:52 zwicky146 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky150 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky125 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky123 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky151 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:53 zwicky115 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      Jan  4 16:24:54 zwicky147 kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.211@o2ib9. The llog_origin_handle_create operation failed with -2
      

      These are all messages from the clients, with 10.1.1.211 being the address of our MDS.

      The clients are running:

      $ rpm -qa | grep lustre-modules
      lustre-modules-2.1.0-15chaos_2.6.32_220.1chaos.ch5.x86_64.x86_64
      

      And the servers:

      # zwicky1 /root > rpm -qa | grep lustre-orion-modules
      lustre-orion-modules-2.2.49.50-2chaos_2.6.32_220.2.1.1chaos.ch5.x86_64.x86_64
      

      Attachments

      Issue Links

      Activity

            [LU-2145] The llog_origin_handle_create operation failed with -2

            This should be done in the same manner as MDT handler, using err_serious() primitive to distinguish serious errors returning error code and processing errors to be returned in rq_status. I expect that shouldn't be hard.

            tappro Mikhail Pershin added a comment

            The problem here is that this RPC handler function is returning an error code of -2 (i.e. as if it had an error handling the RPC itself), instead of returning 0 and setting rq_status=2 (i.e. the RPC was handled correctly, but it produced an error).

            adilger Andreas Dilger added a comment
            behlendorf Brian Behlendorf added a comment - - edited

            With the various console cleanup patches applied, things are a little clearer. The error is in fact for the MGS, not the MDT. Since, to my understanding, there's no need for the llog here, ENOENT isn't unexpected. What's unclear to me is why we're making this request to the MGS in the first place. Below is an excerpt from the MGS Lustre log.

             LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2. 

            If someone could propose a patch to resolve this, that would be great, since we're getting a ton of noise in the console because of it.

            00000100:00100000:7.0:1336148328.439828:0:5585:0:(service.c:1706:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_00:8b011063-f502-34dc-d164-c95bb23ad469+7:6944:x1400718304314232:12345-192.
            20000000:00000001:7.0:1336148328.439830:0:5585:0:(mgs_handler.c:602:mgs_handle()) Process entered
            20000000:01000000:7.0:1336148328.439831:0:5585:0:(mgs_handler.c:694:mgs_handle()) @@@ llog_init  req@ffff880302ff9850 x1400718304314232/t0(0) o501->8b011063-f502-34dc-d164-c95bb23ad469@192.168.65.108@o2ib6:0/0 lens 264/0 e 0
            00000040:00000001:7.0:1336148328.439835:0:5585:0:(llog_server.c:78:llog_origin_handle_open()) Process entered
            00000040:00000040:7.0:1336148328.439837:0:5585:0:(llog_server.c:91:llog_origin_handle_open()) MGS: opening log lcz-sptlrpc
            00000040:00000040:7.0:1336148328.439838:0:5585:0:(lustre_log.h:529:llog_ctxt_get()) GETting ctxt ffff8803143d8780 : new refcount 2
            00000040:00000001:7.0:1336148328.439839:0:5585:0:(llog.c:691:llog_open()) Process entered
            00000040:00000001:7.0:1336148328.439840:0:5585:0:(llog.c:70:llog_alloc_handle()) Process entered
            00000040:00000010:7.0:1336148328.439841:0:5585:0:(llog.c:72:llog_alloc_handle()) kmalloced 'loghandle': 184 at ffff8805f800e680.
            00000040:00000001:7.0:1336148328.439841:0:5585:0:(llog.c:80:llog_alloc_handle()) Process leaving (rc=18446612157949863552 : -131915759688064 : ffff8805f800e680)
            00000040:00000001:7.0:1336148328.439843:0:5585:0:(llog_osd.c:925:llog_osd_open()) Process entered
            00000004:00000001:7.0:1336148328.439845:0:5585:0:(osd_handler.c:1231:osd_object_read_lock()) obj=ffff880303190bf8 owner=(null) r_locks=0 w_locks=0
            00000004:00000001:7.0:1336148328.439846:0:5585:0:(osd_handler.c:2349:osd_index_lookup()) Process entered
            00000004:00000001:7.0:1336148328.439859:0:5585:0:(osd_handler.c:2358:osd_index_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000004:00000001:7.0:1336148328.439860:0:5585:0:(osd_handler.c:1262:osd_object_read_unlock()) obj=ffff880303190bf8 owner=(null) r_locks=1 w_locks=0
            00000040:00000001:7.0:1336148328.439861:0:5585:0:(llog_osd.c:951:llog_osd_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            00000040:00000001:7.0:1336148328.439862:0:5585:0:(llog_osd.c:988:llog_osd_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000040:00000010:7.0:1336148328.439863:0:5585:0:(llog.c:100:llog_free_handle()) kfreed 'loghandle': 184 at ffff8805f800e680.
            00000040:00000001:7.0:1336148328.439864:0:5585:0:(llog.c:717:llog_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            00000040:00000001:7.0:1336148328.439865:0:5585:0:(llog_server.c:104:llog_origin_handle_open()) Process leaving via out_pop (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
            00000040:00000040:7.0:1336148328.439866:0:5585:0:(lustre_log.h:539:llog_ctxt_put()) PUTting ctxt ffff8803143d8780 : new refcount 1
            00000040:00000001:7.0:1336148328.439867:0:5585:0:(llog_server.c:117:llog_origin_handle_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
            

            Here's one of our Lustre syslog files.

            zwicky[101-164] are compute nodes.
            zwicky[1-2],zwicky-mds1 are the Orion servers.

            Let me know if any other information or logs would be helpful.

            prakash Prakash Surya (Inactive) added a comment

            Yes, please upload some logs. I also agree that these messages are too noisy and can be avoided if they are harmless.

            tappro Mikhail Pershin added a comment

            I haven't checked during mount time, but they are definitely seen during IOR execution itself.

            We'll need to get confirmation from Chris, but I think it can be demoted from a blocker since they don't appear to be reaching the application and causing failures.

            As far as I can tell, they are relatively harmless, although it does muddy the syslog messages quite a bit. If these messages can be safely ignored, they shouldn't make it to syslog at all IMO.

            I can upload the client syslog info if it's helpful?

            prakash Prakash Surya (Inactive) added a comment

            Are these messages seen upon client mount, or during IOR itself? These may be harmless messages related to the sptlrpc llog. Is there a log (syslog or console) from any client?

            This bug is set as a blocker; are these messages affecting IOR execution somehow?

            tappro Mikhail Pershin added a comment
            pjones Peter Jones added a comment -

            No problem. I have moved it to the Orion project.


            Yeah, sorry, it probably should have been an ORI ticket. It was too late to fix by the time I realized it was an LU.

            prakash Prakash Surya (Inactive) added a comment
            pjones Peter Jones added a comment -

            Should this ticket have been opened under the ORI project rather than the LU project?


            Blocker for ORI on Sequoia.

            morrone Christopher Morrone (Inactive) added a comment

            People

              Assignee: tappro Mikhail Pershin
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 8
