[LU-634] LBUG in Kerberos sec.c::sptlrpc_req_ctx_switch() Created: 25/Aug/11 Updated: 17/Feb/21 Resolved: 28/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Josephine Palencia | Assignee: | WC Triage |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 2.0.63 clients. Crashes the client only, even with a simple ls. Apparently no pattern, and not easily reproducible. |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 8152 | ||||
| Description |
|
Aug 18 14:14:36 extenci kernel: LustreError: 3532:0:(sec.c:468:sptlrpc_req_ctx_switch()) ASSERTION(req->rq_reqmsg) failed |
| Comments |
| Comment by Oleg Drokin [ 20/Sep/11 ] |
|
Can we at least get some logs? It's impossible to evaluate this otherwise. |
| Comment by Dave Dykstra [ 19/Jan/12 ] |
|
Here's a much more detailed console log of the crash that Josephine reported. This is with the latest lustre tag from which I built lustre-client-2.1.54-2.6.18_274.12.1.el5.x86_64. In case it matters, the configure options were --disable-server --enable-dependency-tracking --enable-posix-osd --enable-panic_dumplog --enable-health_write --enable-lru-resize --enable-gss --enable-quota --enable-ext4 --enable-mindf. |
| Comment by Josephine Palencia [ 24/Jan/12 ] |
|
Also for kernel-2.6.18-274.17.1.el5xen, Jan 24 10:13:09 extenci kernel: Lustre: 1775:0:(gss_keyring.c:970:gss_sec_gc_ctx_kr()) running gc |
| Comment by Dave Dykstra [ 01/Feb/12 ] |
|
On kernel-2.6.18-274.17.1.el5 (with lustre 2.1.54) it seems to take a lot longer to reproduce the problem, but it still happens. Another crash log is attached. |
| Comment by Oleg Drokin [ 08/Feb/12 ] |
|
Ah, so it's some sort of kerberos deployment? |
| Comment by Dave Dykstra [ 08/Feb/12 ] |
|
That's disappointing news to me. Can you please tell us who contributed the Kerberos code then? Perhaps they would be interested in improving its quality, or could at least advise us on how to debug it. We are part of an NSF-funded project (extenci.org) that is in part evaluating wide-area lustre for use by Large Hadron Collider experiments. Working, non-crashing Kerberos functionality is a vital part of that. |
| Comment by Peter Jones [ 09/Feb/12 ] |
|
Dave, I'll reach out to you directly to discuss this. Peter |
| Comment by santosh kulkarni (Inactive) [ 25/Oct/12 ] |
|
Details regarding the issues and the patch. The fixes are specific to the lustre-2.1.54 code and are intended to provide Kerberos support.
1. ASSERTION crashes. After a TGT expires, the ticket is no longer refreshed automatically; the user must authenticate with Kerberos again to get a new TGT. To fix the assertion failure that currently results in a crash, we are removing the ASSERT check carried out in sptlrpc_req_ctx_switch(). Request buffers are normally allocated before sptlrpc_req_ctx_switch() is called, but the import context check creates a fake request that has no request message buffer allocated. The check is therefore not required, as the code further down the line takes care of this case. NOTE: the root user will not encounter this problem, because root uses a pre-installed keytab service credential and can therefore refresh its tickets automatically.
2. ldlm_cli_cancel_local LBUG. When a new lock has been created by ldlm_cli_enqueue and memory allocation for the new request (ptlrpc_request) then fails, control reaches ldlm_cli_cancel_local(). At that point the relevant structures have not been filled in (lock->l_conn_export is still NULL), so control enters the else branch, and the LBUG catches it because this is a client-side lock. Part of the fix is to fill in some lock fields before the memory allocation for the new request. This is fixed as part of the latest Lustre code.
3. mdc_lock NULL pointer dereference. When ldlm_cli_enqueue() fails while memory is being reserved in mdc_enqueue, control enters mdc_clear_replay_flag(), which tries to clear the flags so that failed requests are not held for replay. The ptlrpc_request is still NULL at that point, and the code does not handle the NULL pointer. Code has been added to handle NULL pointers. |
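The defensive pattern behind items 1 and 3 in the comment above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the actual patch and not Lustre code: `struct sketch_request`, `sketch_ctx_switch()` and `sketch_clear_replay_flag()` are hypothetical stand-ins, and the only thing taken from the comment is the idea of tolerating a missing request message buffer or a NULL request instead of asserting or dereferencing it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; NOT the real ptlrpc_request. */
struct sketch_request {
    void *reqmsg;      /* request message buffer; NULL for a "fake" request */
    int   replay;      /* non-zero while the request is flagged for replay */
    int   ctx_version; /* incremented on every context switch */
    int   rewrapped;   /* how many times the buffer was re-processed */
};

/*
 * Item 1 (assumed shape): rather than asserting that reqmsg is set,
 * switch the context unconditionally and touch the buffer only if
 * one exists.
 */
int sketch_ctx_switch(struct sketch_request *req)
{
    if (req == NULL)
        return -EINVAL;

    req->ctx_version++;

    if (req->reqmsg == NULL)
        return 0; /* fake request: no buffer to re-process */

    /* A real implementation would re-encode the message here. */
    req->rewrapped++;
    return 0;
}

/*
 * Item 3 (assumed shape): when an earlier allocation failed, the
 * request pointer may still be NULL, so check before dereferencing.
 */
void sketch_clear_replay_flag(struct sketch_request *req, int rc)
{
    if (req == NULL)
        return; /* nothing was allocated, nothing to clear */

    if (rc != 0)
        req->replay = 0; /* do not keep a failed request for replay */
}
```

Calling `sketch_ctx_switch()` on a request whose `reqmsg` is NULL returns 0 and leaves `rewrapped` at zero, and `sketch_clear_replay_flag(NULL, -ENOMEM)` returns without touching memory. Whether silently skipping the buffer is safe in the real code depends on the later code paths that the comment says handle this case.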
| Comment by santosh kulkarni (Inactive) [ 25/Oct/12 ] |
|
> cd srcdir |
| Comment by Andreas Dilger [ 11/Feb/13 ] |
|
Please submit this patch to Gerrit for inspection, testing, and landing. The patch submission process is described at http://wiki.whamcloud.com/display/PUB/Patch+Landing+Process+Summary |
| Comment by Andreas Dilger [ 07/May/13 ] |
|
Alex, Josephine, Santosh, Dave, The patches themselves need a bit of work, and cannot be landed as-is, since they would spew messages onto the console under normal operation, but that should be apparent during your local testing. Unfortunately, we do not have any facilities or expertise to test Kerberos-enabled Lustre ourselves, and no funding to hire someone to do this. That said, I'm always interested to get bug fixes into the released versions of Lustre, so if these patches are important to you it is in your own best interest to move them through the process for landing. |
| Comment by Peter Jones [ 12/Sep/13 ] |
|
As per Xyratex on the CDWG call, this patch was only a prototype and is not being upstreamed - http://wiki.opensfs.org/CDWG_Minutes_2013-09-11 |
| Comment by Andreas Dilger [ 28/May/17 ] |
|
Close old issue. |