Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.16.0
Affects Version/s: None
Labels:
None
Environment:
Lustre versions: 2.15, master.

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

All the core files look very similar, with a backtrace like the following:

Core was generated by `/usr/sbin/lsvcgssd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f0b31369e01 in gssint_get_mechanism (oid=0x6d6f) at g_initialize.c:1137

#0  0x00007f0b31369e01 in gssint_get_mechanism (oid=0x6d6f) at g_initialize.c:1137 
#1  0x00007f0b3136408d in gssint_delete_internal_sec_context (minor_status=0x7ffc5c429310, mech_type=<optimized out>, internal_ctx=0x1000020, output_token=0x7ffc5c4292f0) at g_glue.c:603 
#2  0x00007f0b31361e0a in gss_delete_sec_context (minor_status=minor_status@entry=0x7ffc5c429310, context_handle=context_handle@entry=0x7ffc5c4293f0, output_token=output_token@entry=0x7ffc5c4292f0) at g_delete_sec_context.c:91 
#3  0x000000000040b6ee in handle_krb (snd=0x7ffc5c429370) at svcgssd_proc.c:879 
#4  handle_channel_request (fd=fd@entry=6) at svcgssd_proc.c:1057 
#5  0x000000000040984f in svcgssd_run () at svcgssd_main_loop.c:119 
#6  0x00000000004039f6 in main (argc=8, argv=<optimized out>) at svcgssd.c:335

Further investigation of the core files shows that the SEGV occurs when dereferencing oid->elements in the g_OID_equal() macro, because the oid pointer appears to be corrupted.
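As an illustration of why a bad oid pointer faults at this point, here is a minimal sketch of an OID comparison in the style of g_OID_equal(). The struct and function names (oid_desc, oid_equal) are hypothetical stand-ins, not the krb5 library definitions; the sketch only assumes that an OID is a length plus a pointer to its encoded bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an OID descriptor: a byte count plus a
 * pointer to the encoded OID bytes. */
struct oid_desc {
    unsigned int length;
    void *elements;
};

/* Compare two OIDs by length, then by content. Both pointers are
 * dereferenced unconditionally, so a garbage pointer value such as
 * 0x6d6f would fault on the first field access. */
static int oid_equal(const struct oid_desc *a, const struct oid_desc *b)
{
    return a->length == b->length &&
           memcmp(a->elements, b->elements, a->length) == 0;
}
```

With valid descriptors the comparison is harmless; the crash only needs one of the two pointers to reference unmapped memory.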

The oid value comes from snd->ctx->mech_type:

*snd = {lustre_svc = 0x2, nid = 0x50000ac100053, handle_seq = 0x6e47b94ae23aa8c4, nm_name = {0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, in_tok = {length = 0x77b, value = 0x1020ae0}, out_tok = {
    length = 0x9c, value = 0x101f130}, in_handle = {length = 0x0, value = 0x7ffc5c4292d2}, out_handle = {length = 0x8, value = 0x7ffc5c4292e1}, maj_stat = 0x0, min_stat = 0x0, mech = 0x611520, ctx = 0x1000010, ctx_token = {length = 0x80, 
    value = 0x10231b0}}

and

 *(snd->ctx) = {loopback = 0x1000010, mech_type = 0x6d6f, internal_ctx_id = 0x20}

Looking at the corresponding heap memory content:

0xffff00:       0x0000000000000007      0x00000000010045e0
0xffff10:       0x0000000000000007      0x0000000001004600
0xffff20:       0x0000000000000000      0x0000000000000021  <-
0xffff30:       0x0000000001004b6f      0x000000000073746c     
0xffff40:       0x0000000000000020      0x00000000000000c1  <- previous malloc_chunk
0xffff50:       0x0000000000000fff      0x1bbfbabed19442ca
0xffff60:       0x6d6f632e00000007      0x00000000010064b0
0xffff70:       0x64695f3500000007      0x0000000001001e70
0xffff80:       0x0000000000000007      0x0000000001000010
0xffff90:       0x0000000000000007      0x0000000001002910
0xffffa0:       0x0000000000000007      0x0000000001002ae0
0xffffb0:       0x0000000000000007      0x00000000010017b0
0xffffc0:       0x0000000000000007      0x0000000001001710
0xffffd0:       0x632d383200000007      0x0000000000fffaa0
0xffffe0:       0x0000000000000007      0x0000000000fffe60
0xfffff0:       0x0000000000000007      0x0000000001006870         
0x1000000:      0x0000000000000000      0x0000000000000021  <- ctx malloc_chunk
0x1000010:      0x0000000001000010      0x0000000000006d6f
0x1000020:      0x0000000000000020      0x00000000000000d1 <- next malloc_chunk
0x1000030:      0x0000000000001000      0x1bbfbabed19442ca

We can see that the heap area (malloc_chunk) that originally held snd->ctx must already have been freed and reallocated by the time of the crash. snd->ctx is a gss_union_ctx_id_t (a pointer to struct gss_union_ctx_id_struct), which is made of 3 pointers and is therefore 24 (0x18) bytes in size, whereas the corresponding malloc_chunk at the time of the crash is only 16 (0x10) bytes.
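The size reading above can be reproduced with a small helper. This is a sketch only, assuming a glibc-style 64-bit chunk layout in which the low three bits of the size word are flags and each chunk starts with a 16-byte header (prev_size and size). The helper names are hypothetical, and "body size" here means the bytes between the user pointer and the next chunk header.

```c
#include <stdint.h>

/* Assumption: glibc-style malloc_chunk on a 64-bit system. */
#define CHUNK_FLAG_MASK   ((uint64_t)0x7)  /* low 3 bits of the size word are flags */
#define CHUNK_HEADER_SIZE ((uint64_t)16)   /* prev_size word + size word */

/* Total chunk size, header included, with the flag bits masked off. */
static uint64_t chunk_total_size(uint64_t size_word)
{
    return size_word & ~CHUNK_FLAG_MASK;
}

/* Bytes between the user pointer and the next chunk header. */
static uint64_t chunk_body_size(uint64_t size_word)
{
    return chunk_total_size(size_word) - CHUNK_HEADER_SIZE;
}
```

Applied to the size word 0x21 at address 0x1000008 in the dump, this gives a total of 0x20 and a body of 0x10, which is smaller than a structure of three 8-byte pointers.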

Looking at the relevant lustre-gss and krb5 library source code, the problem seems to come from the fact that handle_krb() calls gss_delete_sec_context() after a successful return from the serialize_context_for_kernel()/serialize_krb5_ctx()/gss_krb5_export_lucid_sec_context() call path. It should not, because the chosen implementation may already have deleted the context.

This is the case with the lucid implementation. There is also a second problem in the lustre-gss interfacing code that prevents snd->ctx from being set to GSS_C_NO_CONTEXT: the value of snd->ctx, rather than its address, is passed to serialize_context_for_kernel()/serialize_krb5_ctx(), so only the callee's on-stack copy is updated.
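The pass-by-value problem can be shown with a minimal, self-contained sketch. All names below (fake_ctx, ctx_id_t, NO_CONTEXT, export_and_release, serialize_by_value, serialize_by_address) are hypothetical stand-ins for illustration; this is neither the lustre-gss code nor the actual fix, only the general C behaviour being described.

```c
#include <stddef.h>

/* Hypothetical context handle: a pointer type, like a GSS context id. */
typedef struct fake_ctx { int live; } *ctx_id_t;
#define NO_CONTEXT ((ctx_id_t)0)

/* Stand-in for an export routine that releases the context and clears
 * the handle it was given. */
static void export_and_release(ctx_id_t *ctx)
{
    (*ctx)->live = 0;       /* context is gone after export */
    *ctx = NO_CONTEXT;      /* tell the caller not to reuse it */
}

/* Buggy shape: the handle is received by value, so clearing it only
 * changes this function's on-stack copy. */
static void serialize_by_value(ctx_id_t ctx)
{
    export_and_release(&ctx);
}

/* Corrected shape: the handle's address is received, so the caller's
 * variable is the one that gets cleared. */
static void serialize_by_address(ctx_id_t *ctx)
{
    export_and_release(ctx);
}
```

After serialize_by_value() the caller still holds the old pointer to a released context, so a later delete acts on stale memory. After serialize_by_address() the caller sees NO_CONTEXT and can skip the delete.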

We have successfully tested a fix on-site for these two issues, and I will push it soon.

People

Assignee: Bruno Faccini

Reporter: Bruno Faccini

Votes: 0

Watchers: 3

Dates

Created: 13/Feb/24 11:13 AM

Updated: 24/Feb/24 3:46 AM

Resolved: 23/Feb/24 2:26 PM