[LU-7182] LBUG during key reestablishment with GSS clients Created: 18/Sep/15  Updated: 01/Jun/16  Resolved: 24/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Jeremy Filizetti Assignee: John Hammond
Resolution: Duplicate Votes: 0
Labels: SSK, kerberos

Issue Links:
Related
is related to LU-5951 sanity test_39k: mtime is lost on close Resolved
is related to LU-3289 IU Shared Secret Key authentication a... Resolved
Epic/Theme: gss
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The server fails during reconnect and during key expiration, when the client attempts to instantiate a new key with the server. The server logs the following error and then hits an LBUG:

<3>LustreError: 10260:0:(tgt_handler.c:679:tgt_request_handle()) @@@ Unexpected xid 0 vs. last_xid 55e7a7650006b

See the following dmesg output:
<4>Lustre: 9770:0:(sec_gss.c:2086:gss_svc_handle_init()) create svc ctx ffff8800a3f2d240: user from 192.168.1.108@tcp authenticated as root
<4>Lustre: 9770:0:(sec_gss.c:394:gss_cli_ctx_uptodate()) server installed reverse ctx ffff8800a8d74b40 idx 0xbf1831de90d710e6, expiry 3588728678(+2147483647s)
<4>Lustre: 9770:0:(tgt_handler.c:850:tgt_init_sec_level()) client 192.168.1.108@tcp -> target lustre-MDT0000 uses old version, run under security level 0.
<4>Lustre: 10260:0:(sec_gss.c:2346:gss_svc_handle_destroy()) destroy svc ctx ffff8800a3f2d240 idx 0xbd752af10f590f8d (0->192.168.1.108@tcp)
<3>LustreError: 10260:0:(tgt_handler.c:679:tgt_request_handle()) @@@ Unexpected xid 0 vs. last_xid 55e7a7650006b
<3>  req@ffff8800b3b85cc0 x0/t0(0) o803->2fc4a29f-142e-0b7c-f3c3-9b04e56ce460@192.168.1.108@tcp:0/0 lens 224/0 e 0 to 0 dl 1441245045 ref 1 fl Interpret:/0/ffffffff rc 0/-1
<4>Lustre: 10280:0:(gss_svc_upcall.c:1076:gss_svc_upcall_get_ctx()) Invalid gss ctx idx 0xbd752af10f590f8d from 192.168.1.108@tcp
<0>LustreError: 10260:0:(tgt_handler.c:681:tgt_request_handle()) LBUG
<4>Pid: 10260, comm: mdt00_004
<4>
<4>Call Trace:
<4> [<ffffffffa100d875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa100de77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa069db98>] tgt_request_handle+0xc68/0x1230 [ptlrpc]
<4> [<ffffffffa06455d1>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
<4> [<ffffffffa0644790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 10260, comm: mdt00_004 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
<4> [<ffffffffa100decb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa069db98>] ? tgt_request_handle+0xc68/0x1230 [ptlrpc]
<4> [<ffffffffa06455d1>] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
<4> [<ffffffffa0644790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
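
For triage, the two values in the "Unexpected xid" message can be pulled out of a dmesg capture with a small script. This is only a sketch and not part of Lustre: the helper name `parse_unexpected_xid` is hypothetical, and it assumes both values are printed as bare hexadecimal, which the letters in the last_xid value suggest. It shows that the request arrived with xid 0 (consistent with the `x0` in the req@ line) while the export had already recorded a much larger last_xid.

```python
import re

# Matches the message format seen in the dmesg output above.
# Assumption: both values are bare hexadecimal without a 0x prefix.
XID_RE = re.compile(r"Unexpected xid ([0-9a-f]+) vs\. last_xid ([0-9a-f]+)")


def parse_unexpected_xid(line):
    """Return (xid, last_xid) as integers, or None if the line does not match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


# The error line from this ticket, copied from the dmesg output.
line = ("<3>LustreError: 10260:0:(tgt_handler.c:679:tgt_request_handle()) "
        "@@@ Unexpected xid 0 vs. last_xid 55e7a7650006b")

xid, last_xid = parse_unexpected_xid(line)
# The request xid is zero and therefore below the last xid seen by the server.
print(xid, hex(last_xid), xid < last_xid)
```

Running the same helper over a full dmesg capture, line by line, would list every occurrence of the message together with its xid pair.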


 Comments   
Comment by Jeremy Filizetti [ 18/Sep/15 ]

Linking to shared key, but I believe this affects Kerberos as well.

Comment by Joseph Gmitter (Inactive) [ 18/Sep/15 ]

Hi John,
Can you take a look at this one as well?
Thanks.
Joe

Comment by Sebastien Buisson (Inactive) [ 21/Sep/15 ]

Hi,

I think the following patch from Niu should land before this issue is investigated:
http://review.whamcloud.com/15473

Indeed, the Kerberos revival and multi-slot last-rcvd features were developed at the same time, and because they both change portions of code in CLIO, this can lead to such issues.

Sebastien.

Comment by Joseph Gmitter (Inactive) [ 02/Oct/15 ]

Update: http://review.whamcloud.com/#/c/15473/ has landed to master on 10/2

Comment by Peter Jones [ 26/Oct/15 ]

Jeremy

Do you still see this as an issue with the tip of master?

Peter

Comment by Andreas Dilger [ 24/Nov/15 ]

The patch http://review.whamcloud.com/15473 "LU-5951 ptlrpc: track unreplied requests" was landed to master for 2.8.0.

Since we haven't hit this issue ourselves, I'm closing this as a duplicate of that bug. Jeremy, please reopen if that patch did not resolve your problem.

Comment by Jeremy Filizetti [ 01/Dec/15 ]

I'm not sure about 15473; I ended up using http://review.whamcloud.com/#/c/16759/.

Comment by Jeremy Filizetti [ 01/Dec/15 ]

Either I don't have access to reopen this or I don't know how. Either way, it needs to be reopened until 16759 lands. After looking at my test system, I can see that without 16759 I was still hitting the LBUG.

Comment by Peter Jones [ 01/Dec/15 ]

Jeremy

I have now given you permissions to reopen tickets, but in this case, wouldn't it be better to just track LU-5951 for 2.8?

Peter

Comment by Jeremy Filizetti [ 01/Dec/15 ]

Thanks. That is fine, we can keep it in LU-5951.
