Lustre / LU-17535

lsvcgssd daemon random crashes/SEGVs


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • None
    • None
    • Lustre versions: 2.15, master.
    • 3

    Description

      All the core files look very similar, with a backtrace like the following:

      Core was generated by `/usr/sbin/lsvcgssd'.
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0  0x00007f0b31369e01 in gssint_get_mechanism (oid=0x6d6f) at g_initialize.c:1137
      #0  0x00007f0b31369e01 in gssint_get_mechanism (oid=0x6d6f) at g_initialize.c:1137 
      #1  0x00007f0b3136408d in gssint_delete_internal_sec_context (minor_status=0x7ffc5c429310, mech_type=<optimized out>, internal_ctx=0x1000020, output_token=0x7ffc5c4292f0) at g_glue.c:603 
      #2  0x00007f0b31361e0a in gss_delete_sec_context (minor_status=minor_status@entry=0x7ffc5c429310, context_handle=context_handle@entry=0x7ffc5c4293f0, output_token=output_token@entry=0x7ffc5c4292f0) at g_delete_sec_context.c:91 
      #3  0x000000000040b6ee in handle_krb (snd=0x7ffc5c429370) at svcgssd_proc.c:879 
      #4  handle_channel_request (fd=fd@entry=6) at svcgssd_proc.c:1057 
      #5  0x000000000040984f in svcgssd_run () at svcgssd_main_loop.c:119 
      #6  0x00000000004039f6 in main (argc=8, argv=<optimized out>) at svcgssd.c:335
      

       

      Further investigation of the core files shows that the SEGV occurs when the g_OID_equal() macro dereferences oid->elements, because the oid pointer (0x6d6f in frame #0) appears to be corrupted.
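To make the failure mode concrete, here is a minimal sketch of an OID comparison in the style of g_OID_equal(). The type `oid_desc`, the macro `OID_EQUAL` and the helper `oid_equal` are stand-ins written for this illustration, not the krb5 library's own definitions; the point is only that both operands are dereferenced, so a garbage pointer faults immediately.

```c
#include <string.h>

/* Stand-in for an OID descriptor: a byte count plus a pointer to the bytes.
 * Hypothetical type for illustration, not the library's definition. */
typedef struct {
    unsigned int length;
    void *elements;
} oid_desc;

/* Approximation of an OID equality test: compares lengths, then bytes.
 * Both o1 and o2 are dereferenced, so neither may be a corrupted pointer. */
#define OID_EQUAL(o1, o2) \
    (((o1)->length == (o2)->length) && \
     (memcmp((o1)->elements, (o2)->elements, (o1)->length) == 0))

/* Returns 1 when the two OIDs match, 0 otherwise. */
int oid_equal(const oid_desc *a, const oid_desc *b)
{
    return OID_EQUAL(a, b) ? 1 : 0;
}
```

With a value such as 0x6d6f in place of a valid descriptor address, the very first field access in the comparison reads unmapped memory, which matches the SIGSEGV seen in frame #0.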

      The oid value comes from snd->ctx->mech_type:

      *snd = {lustre_svc = 0x2, nid = 0x50000ac100053, handle_seq = 0x6e47b94ae23aa8c4, nm_name = {0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, in_tok = {length = 0x77b, value = 0x1020ae0}, out_tok = {
          length = 0x9c, value = 0x101f130}, in_handle = {length = 0x0, value = 0x7ffc5c4292d2}, out_handle = {length = 0x8, value = 0x7ffc5c4292e1}, maj_stat = 0x0, min_stat = 0x0, mech = 0x611520, ctx = 0x1000010, ctx_token = {length = 0x80, 
          value = 0x10231b0}}
      

      and

       *(snd->ctx) = {loopback = 0x1000010, mech_type = 0x6d6f, internal_ctx_id = 0x20}
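The mech_type value 0x6d6f is far too small to be a heap address in this process. As a side check, the sketch below reads such a word as little-endian bytes, the way x86-64 stores it, and prints the printable ones. The helper name `word_to_ascii` is hypothetical, and the result is only a hint about what may have overwritten the field, not proof.

```c
#include <stddef.h>
#include <stdint.h>

/* Render a 64-bit word as text, lowest byte first (little-endian order).
 * Non-printable bytes become '.', and trailing zero bytes are dropped.
 * 'out' must have room for 8 characters plus the terminating NUL.
 * Returns the number of characters written (excluding the NUL). */
size_t word_to_ascii(uint64_t word, char out[9])
{
    size_t n = 0;

    while (word != 0 && n < 8) {
        unsigned char c = (unsigned char)(word & 0xff);
        out[n++] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
        word >>= 8;
    }
    out[n] = '\0';
    return n;
}
```

Applied to 0x6d6f this yields the two characters "om", and the word 0x000000000073746c seen at 0xffff38 in the heap dump yields "lts". That would be consistent with the memory having been reused for string data, although the dump alone does not establish it.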

       

      Looking at the corresponding heap memory content:

      0xffff00:       0x0000000000000007      0x00000000010045e0
      0xffff10:       0x0000000000000007      0x0000000001004600
      0xffff20:       0x0000000000000000      0x0000000000000021  <-
      0xffff30:       0x0000000001004b6f      0x000000000073746c     
      0xffff40:       0x0000000000000020      0x00000000000000c1  <- previous malloc_chunk
      0xffff50:       0x0000000000000fff      0x1bbfbabed19442ca
      0xffff60:       0x6d6f632e00000007      0x00000000010064b0
      0xffff70:       0x64695f3500000007      0x0000000001001e70
      0xffff80:       0x0000000000000007      0x0000000001000010
      0xffff90:       0x0000000000000007      0x0000000001002910
      0xffffa0:       0x0000000000000007      0x0000000001002ae0
      0xffffb0:       0x0000000000000007      0x00000000010017b0
      0xffffc0:       0x0000000000000007      0x0000000001001710
      0xffffd0:       0x632d383200000007      0x0000000000fffaa0
      0xffffe0:       0x0000000000000007      0x0000000000fffe60
      0xfffff0:       0x0000000000000007      0x0000000001006870         
      0x1000000:      0x0000000000000000      0x0000000000000021  <- ctx malloc_chunk
      0x1000010:      0x0000000001000010      0x0000000000006d6f
      0x1000020:      0x0000000000000020      0x00000000000000d1 <- next malloc_chunk
      0x1000030:      0x0000000000001000      0x1bbfbabed19442ca
      

      We can see that the heap area/malloc_chunk that originally held snd->ctx (a gss_union_ctx_id_t, i.e. a pointer to a struct gss_union_ctx_id_struct, which is made of 3 pointers and is therefore 24/0x18 bytes in size) must already have been freed and reallocated by the time of the crash, because the corresponding malloc_chunk is now of size 16/0x10.
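The chunk headers marked in the dump can be decoded mechanically. The sketch below assumes the 64-bit glibc malloc layout, in which the second word of a chunk header holds the chunk size with three flag bits in its low bits. The struct and function names are hypothetical helpers for reading the dump, not glibc APIs.

```c
#include <stdint.h>

/* Flag bits kept in the low three bits of a glibc chunk size field. */
#define PREV_INUSE     0x1u
#define IS_MMAPPED     0x2u
#define NON_MAIN_ARENA 0x4u
#define SIZE_FLAG_MASK 0x7u

/* Decoded view of one size field taken from a heap dump. */
struct chunk_info {
    uint64_t size;        /* chunk size in bytes, header included */
    int prev_inuse;       /* 1 if the previous chunk is in use */
    int is_mmapped;       /* 1 if the chunk was obtained via mmap */
    int non_main_arena;   /* 1 if the chunk belongs to a non-main arena */
};

/* Split a raw size field into the chunk size and its flag bits. */
struct chunk_info decode_size_field(uint64_t field)
{
    struct chunk_info info;

    info.size = field & ~(uint64_t)SIZE_FLAG_MASK;
    info.prev_inuse = (field & PREV_INUSE) != 0;
    info.is_mmapped = (field & IS_MMAPPED) != 0;
    info.non_main_arena = (field & NON_MAIN_ARENA) != 0;
    return info;
}
```

For the ctx chunk, the field 0x21 at 0x1000008 decodes to a chunk size of 0x20 with PREV_INUSE set; the neighbouring fields 0xc1 and 0xd1 decode to 0xc0 and 0xd0 in the same way.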

       

      Looking into the relevant lustre-gss/krb5-lib source code, the problem appears to be that handle_krb() calls gss_delete_sec_context() after the serialize_context_for_kernel()/serialize_krb5_ctx()/gss_krb5_export_lucid_sec_context() call path succeeds, when it should not, because the context may already have been deleted by the chosen implementation.

      This is the case with the lucid implementation. There is also a second problem in the lustre-gss interfacing code that prevents snd->ctx from being set to GSS_C_NO_CONTEXT: the context is passed to serialize_context_for_kernel()/serialize_krb5_ctx() by value rather than by address, so only the callee's on-stack copy is reset.
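The interaction of the two problems can be reproduced in miniature. The sketch below is not the Lustre code or the eventual patch: `ctx_t`, `NO_CONTEXT`, `ctx_delete`, `export_by_value` and `export_by_address` are hypothetical stand-ins for the GSS context handle, GSS_C_NO_CONTEXT, the delete call and the export path. The stand-in delete only flips a flag instead of freeing memory, so the double delete can be counted safely.

```c
/* Stand-in for a security context; 'alive' replaces real heap ownership. */
struct ctx { int alive; };
typedef struct ctx *ctx_t;          /* plays the role of the context handle */
#define NO_CONTEXT ((ctx_t)0)       /* plays the role of GSS_C_NO_CONTEXT */

int double_deletes = 0;             /* number of deletes on a dead context */

/* Delete a context and reset the caller's handle. In real code the second
 * delete of the same context would touch freed (possibly reused) memory. */
void ctx_delete(ctx_t *ctx)
{
    if (*ctx == NO_CONTEXT)
        return;
    if (!(*ctx)->alive)
        double_deletes++;
    (*ctx)->alive = 0;
    *ctx = NO_CONTEXT;
}

/* Export that consumes the context, handle passed BY VALUE:
 * only the local copy is reset, the caller keeps a stale handle. */
int export_by_value(ctx_t ctx)
{
    ctx_delete(&ctx);
    return 0;
}

/* Export that consumes the context, handle passed BY ADDRESS:
 * the caller's handle becomes NO_CONTEXT. */
int export_by_address(ctx_t *ctx)
{
    ctx_delete(ctx);
    return 0;
}

/* Caller that deletes after a successful by-value export. */
int run_buggy(void)
{
    struct ctx storage = { 1 };
    ctx_t ctx = &storage;

    double_deletes = 0;
    if (export_by_value(ctx) == 0)
        ctx_delete(&ctx);           /* stale handle: second delete */
    return double_deletes;
}

/* Same caller, but the export receives the handle's address. */
int run_fixed(void)
{
    struct ctx storage = { 1 };
    ctx_t ctx = &storage;

    double_deletes = 0;
    if (export_by_address(&ctx) == 0)
        ctx_delete(&ctx);           /* handle is NO_CONTEXT: no-op */
    return double_deletes;
}
```

In this sketch `run_buggy()` records one double delete while `run_fixed()` records none, because passing the handle by address lets the callee's reset reach the caller and turns the later delete into a no-op.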

      We have successfully tested a fix on-site for these 2 issues and I will push it soon.

          People

            bfaccini-nvda Bruno Faccini