Details
Bug
Resolution: Fixed
Major
Lustre versions: 2.15, master.
Description
All the corefiles look very similar, with a backtrace like the following:
Core was generated by `/usr/sbin/lsvcgssd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f0b31369e01 in gssint_get_mechanism (oid=0x6d6f) at g_initialize.c:1137
#0 0x00007f0b31369e01 in gssint_get_mechanism (oid=0x6d6f) at g_initialize.c:1137
#1 0x00007f0b3136408d in gssint_delete_internal_sec_context (minor_status=0x7ffc5c429310, mech_type=<optimized out>, internal_ctx=0x1000020, output_token=0x7ffc5c4292f0) at g_glue.c:603
#2 0x00007f0b31361e0a in gss_delete_sec_context (minor_status=minor_status@entry=0x7ffc5c429310, context_handle=context_handle@entry=0x7ffc5c4293f0, output_token=output_token@entry=0x7ffc5c4292f0) at g_delete_sec_context.c:91
#3 0x000000000040b6ee in handle_krb (snd=0x7ffc5c429370) at svcgssd_proc.c:879
#4 handle_channel_request (fd=fd@entry=6) at svcgssd_proc.c:1057
#5 0x000000000040984f in svcgssd_run () at svcgssd_main_loop.c:119
#6 0x00000000004039f6 in main (argc=8, argv=<optimized out>) at svcgssd.c:335
Further investigation of the corefiles shows that the SEGV occurs when dereferencing oid->elements in the g_OID_equal() macro, because the oid pointer appears to be corrupted.
The oid value comes from snd->ctx->mech_type:
*snd = {lustre_svc = 0x2, nid = 0x50000ac100053, handle_seq = 0x6e47b94ae23aa8c4, nm_name = {0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, in_tok = {length = 0x77b, value = 0x1020ae0}, out_tok = { length = 0x9c, value = 0x101f130}, in_handle = {length = 0x0, value = 0x7ffc5c4292d2}, out_handle = {length = 0x8, value = 0x7ffc5c4292e1}, maj_stat = 0x0, min_stat = 0x0, mech = 0x611520, ctx = 0x1000010, ctx_token = {length = 0x80, value = 0x10231b0}}
and
*(snd->ctx) = {loopback = 0x1000010, mech_type = 0x6d6f, internal_ctx_id = 0x20}
Looking at the corresponding heap memory content:
0xffff00: 0x0000000000000007 0x00000000010045e0
0xffff10: 0x0000000000000007 0x0000000001004600
0xffff20: 0x0000000000000000 0x0000000000000021 <-
0xffff30: 0x0000000001004b6f 0x000000000073746c
0xffff40: 0x0000000000000020 0x00000000000000c1 <- previous malloc_chunk
0xffff50: 0x0000000000000fff 0x1bbfbabed19442ca
0xffff60: 0x6d6f632e00000007 0x00000000010064b0
0xffff70: 0x64695f3500000007 0x0000000001001e70
0xffff80: 0x0000000000000007 0x0000000001000010
0xffff90: 0x0000000000000007 0x0000000001002910
0xffffa0: 0x0000000000000007 0x0000000001002ae0
0xffffb0: 0x0000000000000007 0x00000000010017b0
0xffffc0: 0x0000000000000007 0x0000000001001710
0xffffd0: 0x632d383200000007 0x0000000000fffaa0
0xffffe0: 0x0000000000000007 0x0000000000fffe60
0xfffff0: 0x0000000000000007 0x0000000001006870
0x1000000: 0x0000000000000000 0x0000000000000021 <- ctx malloc_chunk
0x1000010: 0x0000000001000010 0x0000000000006d6f
0x1000020: 0x0000000000000020 0x00000000000000d1 <- next malloc_chunk
0x1000030: 0x0000000000001000 0x1bbfbabed19442ca
We can see that the heap area/malloc_chunk originally containing snd->ctx (a *gss_union_ctx_id_t, i.e. a struct gss_union_ctx_id_struct, made of 3 pointers and therefore of size 24/0x18 bytes) must already have been freed and reallocated by the time of the crash, because the corresponding malloc_chunk is now of size 16/0x10.
Looking into the relevant lustre-gss/krb5-lib source code, the problem seems to come from the fact that handle_krb() calls gss_delete_sec_context() after the serialize_context_for_kernel()/serialize_krb5_ctx()/gss_krb5_export_lucid_sec_context() call path succeeds, when it should not, because the context may already have been deleted by the chosen implementation.
This is the case with the lucid implementation. But there is another problem in the lustre-gss interfacing code, which prevents the GSS_C_NO_CONTEXT value from being set in snd->ctx: its value, not its address, is passed to serialize_context_for_kernel()/serialize_krb5_ctx(), so only the callee's on-stack copy is set.
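The by-value versus by-address problem described above can be illustrated with a minimal, self-contained C sketch. It does not use the real GSS-API or lustre-gss code: all names here (demo_ctx_t, DEMO_NO_CONTEXT, demo_delete_ctx, serialize_by_value, serialize_by_address) are hypothetical stand-ins, and a "live" flag stands in for the actual free() so that the stale handle can be inspected safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GSS context handle (not the real gss_ctx_id_t). */
struct demo_ctx { int live; };
typedef struct demo_ctx *demo_ctx_t;

/* Plays the role of GSS_C_NO_CONTEXT in this sketch. */
#define DEMO_NO_CONTEXT ((demo_ctx_t)NULL)

/* Releases the context and resets the handle it was given the address of.
 * Clearing 'live' stands in for free(). */
static void demo_delete_ctx(demo_ctx_t *ctx)
{
    assert(ctx != NULL);
    if (*ctx == DEMO_NO_CONTEXT)
        return;                 /* nothing to release */
    (*ctx)->live = 0;
    *ctx = DEMO_NO_CONTEXT;
}

/* Buggy shape: the handle is passed by value, so the reset to
 * DEMO_NO_CONTEXT only reaches this function's on-stack copy. */
static void serialize_by_value(demo_ctx_t ctx)
{
    demo_delete_ctx(&ctx);      /* export consumes the context */
}

/* Fixed shape: the handle's address is passed, so the caller sees the reset. */
static void serialize_by_address(demo_ctx_t *ctx)
{
    demo_delete_ctx(ctx);
}

/* Returns 1 if the caller still holds a non-null handle to a released context. */
static int stale_after_by_value(void)
{
    struct demo_ctx storage = { 1 };
    demo_ctx_t ctx = &storage;

    serialize_by_value(ctx);
    return ctx != DEMO_NO_CONTEXT && !storage.live;
}

/* Returns 1 if the caller's handle is still set after serialization. */
static int stale_after_by_address(void)
{
    struct demo_ctx storage = { 1 };
    demo_ctx_t ctx = &storage;

    serialize_by_address(&ctx);
    return ctx != DEMO_NO_CONTEXT;
}
```

In this sketch, stale_after_by_value() reports a dangling handle: a later delete through it would operate on already-released memory, which is the pattern seen in the corefiles. With the by-address variant the caller's handle is reset, so a caller can check for the no-context value and skip the second delete.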
We have successfully tested a fix on-site for these 2 issues and I will push it soon.