[LU-634] LBUG in Kerberos sec.c::sptlrpc_req_ctx_switch() Created: 25/Aug/11  Updated: 17/Feb/21  Resolved: 28/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Josephine Palencia Assignee: WC Triage
Resolution: Incomplete Votes: 0
Labels: None
Environment:

Lustre 2.0.63 clients
Kernel 2.6.18-238.12.1.el5xen

Crashes the client only, even with a simple ls. Apparently no pattern, and not easily reproducible.
Patched/fixed in later versions?


Attachments: Text File LU-634-crash.log     Text File LU-634-crash2.log     File LU634-lfs-2.1.54.patch     File console-07.09.12    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 8152

 Description   

Aug 18 14:14:36 extenci kernel: LustreError: 3532:0:(sec.c:468:sptlrpc_req_ctx_switch()) ASSERTION(req->rq_reqmsg) failed
Aug 18 14:14:36 extenci kernel: LustreError: 3532:0:(sec.c:468:sptlrpc_req_ctx_switch()) LBUG



 Comments   
Comment by Oleg Drokin [ 20/Sep/11 ]

Can we at least get some logs? It's impossible to evaluate this otherwise.

Comment by Dave Dykstra [ 19/Jan/12 ]

Here's a much more detailed console log of the crash that Josephine reported. This is with the latest lustre tag from which I built lustre-client-2.1.54-2.6.18_274.12.1.el5.x86_64. In case it matters, the configure options were --disable-server --enable-dependency-tracking --enable-posix-osd --enable-panic_dumplog --enable-health_write --enable-lru-resize --enable-gss --enable-quota --enable-ext4 --enable-mindf.

Comment by Josephine Palencia [ 24/Jan/12 ]

Also for kernel-2.6.18-274.17.1.el5xen,
lustre: 2.1.54

Jan 24 10:13:09 extenci kernel: Lustre: 1775:0:(gss_keyring.c:970:gss_sec_gc_ctx_kr()) running gc
Jan 24 10:25:23 extenci kernel: Lustre: 3695:0:(sec_gss.c:405:gss_cli_ctx_uptodate()) client refreshed ctx ffff88030bfa9c80 idx 0xbf75804b3ef9f4af (77513->extenci-MDT0000_UUID), expiry 1327471191(+52468s)
Jan 24 10:26:40 extenci kernel: Lustre: 3740:0:(sec_gss.c:405:gss_cli_ctx_uptodate()) client refreshed ctx ffff88030bfa9880 idx 0xbf75804b3ef9f4b0 (77513->extenci-MDT0000_UUID), expiry 1327471191(+52391s)
Jan 24 10:26:43 extenci kernel: Lustre: 3784:0:(sec_gss.c:405:gss_cli_ctx_uptodate()) client refreshed ctx ffff8802640873c0 idx 0xbf75804b3ef9f4b1 (77513->extenci-MDT0000_UUID), expiry 1327471191(+52388s)
Jan 24 11:13:09 extenci kernel: Lustre: 1775:0:(gss_keyring.c:970:gss_sec_gc_ctx_kr()) running gc
Jan 24 11:57:16 extenci kernel: Lustre: 31454:0:(sec_gss.c:345:cli_ctx_expire()) ctx ffff880236f42280(77602->extenci-MDT0000_UUID) get expired: 1327422052(-2184s)
Jan 24 11:57:16 extenci kernel: LustreError: 31454:0:(sec.c:468:sptlrpc_req_ctx_switch()) ASSERTION(req->rq_reqmsg) failed
Jan 24 11:57:16 extenci kernel: LustreError: 31454:0:(sec.c:468:sptlrpc_req_ctx_switch()) LBUG
Jan 24 11:57:16 extenci kernel: Pid: 31454, comm: bash
Jan 24 11:57:16 extenci kernel:
Jan 24 11:57:16 extenci kernel: Call Trace:
Jan 24 11:57:16 extenci kernel: [<ffffffff88425641>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Jan 24 11:57:16 extenci kernel: [<ffffffff88425b7a>] lbug_with_loc+0x7a/0xd0 [libcfs]
Jan 24 11:57:16 extenci kernel: [<ffffffff88430cf0>] cfs_tracefile_init+0x0/0x10a [libcfs]
Jan 24 11:57:16 extenci kernel: [<ffffffff8860f81c>] sptlrpc_req_replace_dead_ctx+0x24c/0xad0 [ptlrpc]
Jan 24 11:57:16 extenci kernel: [<ffffffff8031b3d1>] request_key_and_link+0x41/0x4e9
Jan 24 11:57:16 extenci kernel: [<ffffffff8860c3bc>] sptlrpc_import_sec_ref+0x1c/0x30 [ptlrpc]
Jan 24 11:57:16 extenci kernel: [<ffffffff8860f1f8>] import_sec_validate_get+0xd8/0x1f0 [ptlrpc]
Jan 24 17:43:10 extenci syslogd 1.4.1: restart.

Comment by Dave Dykstra [ 01/Feb/12 ]

On kernel-2.6.18-274.17.1.el5 (with lustre 2.1.54) it seems to take a lot longer to reproduce the problem, but it still happens. Another crash log is attached.

Comment by Oleg Drokin [ 08/Feb/12 ]

Ah, so it's some sort of kerberos deployment?
This is not really supported at the moment, sorry.

Comment by Dave Dykstra [ 08/Feb/12 ]

That's disappointing news to me. Can you please tell us who contributed the Kerberos code then? Perhaps they would be interested in improving its quality, or could at least advise us on how to debug it. We are part of an NSF-funded project (extenci.org) that is in part evaluating wide-area lustre for use by Large Hadron Collider experiments. Working, non-crashing Kerberos functionality is a vital part of that.

Comment by Peter Jones [ 09/Feb/12 ]

Dave

I'll reach out to you directly to discuss this

Peter

Comment by santosh kulkarni (Inactive) [ 25/Oct/12 ]

Details regarding the issues and the patch.

Fixes are specific to the lustre-2.1.54 code for providing Kerberos support.
LU-634 was not limited to the ASSERT crashes; it had other issues too. Attaching the patch; the write-up below describes the issues and the corresponding fixes.

1. ASSERTION Crashes

After a TGT expires, the ticket is no longer refreshed automatically. The user must authenticate with Kerberos again to get a new TGT.

For the assertion failure that currently results in a crash, we remove the ASSERT check carried out in sptlrpc_req_ctx_switch(). The request buffers are normally allocated before sptlrpc_req_ctx_switch() is called, but an import check context creates a fake request that has no request message buffer allocated. The check is therefore not required, as the code further down the line already handles this case.

NOTE: the root user will not encounter this problem, because root uses a pre-installed keytab service credential and hence can refresh its tickets automatically.

2. ldlm_cli_cancel_local LBUG

When a new lock has been created by ldlm_cli_enqueue() and the subsequent memory allocation for a new ptlrpc request fails, control reaches ldlm_cli_cancel_local().

Fixed as part of the latest Lustre code. Since the relevant lock fields have not yet been filled in (lock->l_conn_export is NULL), control enters the else branch, and the LBUG catches it because it is a client-side lock. Part of the fix is to fill in these lock fields before the memory allocation for the new request.

3. mdc_lock NULL pointer dereference

When reserving memory in mdc_enqueue() fails after a call from ldlm_cli_enqueue(), control enters mdc_clear_replay_flag(), which tries to clear the flags so that error requests are not held for replay. Because the ptlrpc request is still NULL at that point, the code dereferenced a NULL pointer. Code has been added to handle the NULL pointer.

Comment by santosh kulkarni (Inactive) [ 25/Oct/12 ]

> cd srcdir
> patch -p1 < LU634-lfs-2.1.54.patch

Comment by Andreas Dilger [ 11/Feb/13 ]

Please submit this patch to Gerrit for inspection, testing, and landing. The patch submission process is described at http://wiki.whamcloud.com/display/PUB/Patch+Landing+Process+Summary

Comment by Andreas Dilger [ 07/May/13 ]

Alex, Josephine, Santosh, Dave,
is someone with an interest in functioning Kerberos able to update these patches to match the Lustre Coding Guidelines (https://wiki.hpdd.intel.com/display/PUB/Coding+Guidelines), test them locally against 2.1 and/or 2.4, and submit them to Gerrit for review and regression testing?

The patches themselves need a bit of work, and cannot be landed as-is, since they would spew messages onto the console under normal operation, but that should be apparent during your local testing. Unfortunately, we do not have any facilities or expertise to test Kerberos-enabled Lustre ourselves, and no funding to hire someone to do this.

That said, I'm always interested to get bug fixes into the released versions of Lustre, so if these patches are important to you it is in your own best interest to move them through the process for landing.

Comment by Peter Jones [ 12/Sep/13 ]

As per Xyratex on the CDWG call, this patch was only a prototype and is not being upstreamed - http://wiki.opensfs.org/CDWG_Minutes_2013-09-11

Comment by Andreas Dilger [ 28/May/17 ]

Close old issue.

Generated at Sat Feb 10 01:08:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.