[LU-7030] import_sec_validate_get()) import ffff8819c6497800 (FULL) with no sec (again) Created: 21/Aug/15  Updated: 29/Mar/22  Resolved: 29/Mar/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: John Hammond
Resolution: Incomplete Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-3353 import_sec_validate_get() import ffff... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The patch from LU-3353 does not appear to have eliminated the stated error message. We are still seeing it with Lustre 2.5.4 + the patch from LU-3353.

Please try again.



 Comments   
Comment by Peter Jones [ 22/Aug/15 ]

Niu

You worked on LU-3353 previously. What do you suggest here?

Peter

Comment by Niu Yawei (Inactive) [ 24/Aug/15 ]

Looks the fix of LU-3353 isn't quite right, see client_destroy_import(): It calls sptlrpc_import_sec_put() to clear imp_sec first, then calls class_import_put() to add the import onto zombie list.

Then look at the import_sec_validate_get():

        /* Only output an error when the import is still active */
        if (*sec == NULL) {
                if (list_empty(&imp->imp_zombie_chain))
                        CERROR("import %p (%s) with no sec\n",
                                imp, ptlrpc_import_state_name(imp->imp_state));
                return -EACCES;
        }

When checking imp->imp_zombie_chain, the import may not be put on zombie list yet.

Amir, could you take a look?

Comment by Gerrit Updater [ 25/Aug/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16071
Subject: LU-7030 security: put imp_sec after all requests drained off
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e545cf24ca487432d7093e7bf0e474f8f5b84efe

Comment by Gerrit Updater [ 13/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16071/
Subject: LU-7030 security: put imp_sec after all requests drained off
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 504ca288d99779812495a91345421ad4ad8f7d95

Comment by Peter Jones [ 14/Dec/15 ]

Landed for 2.8

Comment by Jeremy Filizetti [ 22/Dec/15 ]

Patch 16071 breaks Lustre security for everything but the default flavor. Every security flavor except the default NULL (no security) grabs a reference on the import in create_sec. There is an equivalent destroy_sec but that is called from the sptlrpc_sec_put. With this patch that is no longer called until the import is released but this creates a circular dependency for everything but the default security flavor. We should revert the patch and create one that doesn't affect other security flavors.

Comment by Jeremy Filizetti [ 22/Dec/15 ]

A proper fix is needed that won't break security flavors other than the default

Comment by Niu Yawei (Inactive) [ 22/Dec/15 ]

Indeed. The purpose of sec flavor holding import refcount looks not clear to me, if import needs sec flavor until all RPCs drained off and the flavor will be destroyed before freeing import, why should sec flavor holds import refcount?

I suggest to remove the import get/put in the create_sec/destroy_sec, what do you think?

Comment by Gerrit Updater [ 22/Dec/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17709
Subject: Revert "LU-7030 security: put imp_sec after all requests drained off"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1335d385d7e22ae184cb7f3b78fdecc0a5b1e972

Comment by Jeremy Filizetti [ 22/Dec/15 ]

From what I can tell its reasonable to remove class_import_get from gss_sec_create_common and plain_create_sec as well as class_import_put from the destroy functions. I'm running it through sanity now just to confirm but it does avoid the earlier mentioned problem.

Comment by Andreas Dilger [ 23/Dec/15 ]

Jeremy, I'd be happy to drop the reversion patch if you can submit a patch to remove those extra refcounts for the security flavours.

Comment by Gerrit Updater [ 04/Jan/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17798
Subject: LU-7030 gss: don't get refcount on import
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a0b58607782fd890dd98627dad2e0d87d2b1c798

Comment by Sebastien Buisson (Inactive) [ 05/Jan/16 ]

Hi,

When running with patch http://review.whamcloud.com/17798 on top of master branch, I do not have stalled import anymore at unmount, but now I hit the following LBUG.

LustreError: 2027:0:(connection.c:99:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:
LustreError: 2027:0:(connection.c:99:ptlrpc_connection_put()) LBUG
Pid: 2027, comm: obd_zombid

Call Trace:
[ffffffffa01eb7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[ffffffffa01ebd75>] lbug_with_loc+0x45/0xc0 [libcfs]
[ffffffffa0591db3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc]
[ffffffffa031318c>] obd_zombie_impexp_cull+0x1bc/0xa80 [obdclass]
[ffffffffa0313abd>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[ffffffff810a9510>] ? default_wake_function+0x0/0x20
[ffffffffa0313a50>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
[ffffffff8109727f>] kthread+0xcf/0xe0
[ffffffff810971b0>] ? kthread+0x0/0xe0
[ffffffff81614358>] ret_from_fork+0x58/0x90
[ffffffff810971b0>] ? kthread+0x0/0xe0

Kernel panic - not syncing: LBUG
CPU: 3 PID: 2027 Comm: obd_zombid Tainted: GF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
ffffffffa0207e0f 000000009fc69d5f ffff88040771bd40 ffffffff816045b6
ffff88040771bdc0 ffffffff815fde5a ffffffff00000008 ffff88040771bdd0
ffff88040771bd70 000000009fc69d5f ffffffffa06305c0 0000000000000000
Call Trace:
[ffffffff816045b6>] dump_stack+0x19/0x1b
[ffffffff815fde5a>] panic+0xd8/0x1e7
[ffffffffa01ebddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[ffffffffa0591db3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc]
[ffffffffa031318c>] obd_zombie_impexp_cull+0x1bc/0xa80 [obdclass]
[ffffffffa0313abd>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
[ffffffff810a9510>] ? wake_up_state+0x20/0x20
[ffffffffa0313a50>] ? obd_zombie_impexp_cull+0xa80/0xa80 [obdclass]
[ffffffff8109727f>] kthread+0xcf/0xe0
[ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
[ffffffff81614358>] ret_from_fork+0x58/0x90
[ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140

I have krb5n connection from client to MDS. Mount is just fine, but when I unmount the client, the assertion is triggered.
Before the LBUG, there is this error message on the client:

(gss_cli_upcall.c:432:gss_do_ctx_fini_rpc()) ctx ffff880407223b40(0->ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ): rpc error -19, destroy locally

Maybe this is the reason why the connection is put one time too many?

Sebastien.

Comment by Sebastien Buisson (Inactive) [ 06/Jan/16 ]

And, as Jeremy said, because of http://review.whamcloud.com/16071 landed in master, all security flavors except null are broken at the moment in master branch: when the security flavor is set to something else than null, some imports stay stalled at unmount.

Comment by Niu Yawei (Inactive) [ 06/Jan/16 ]

Let's revert the original fix first, then take some time to investigate why can't we remove the import refcounting code in security flavors.

The patch to revert the original fix: http://review.whamcloud.com/#/c/17709/

Comment by Sebastien Buisson (Inactive) [ 06/Jan/16 ]

It seems the 'ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed' occurs because of a race between the following 2 processes:

 0xffffffffa057e5b3 : class_import_put+0xe3/0x400 [obdclass]
 0xffffffffa056cf61 : llog_ctxt_destroy+0x31/0x360 [obdclass]
 0xffffffffa056d32e : __llog_ctxt_put+0x9e/0x140 [obdclass]
 0xffffffffa056d813 : llog_cleanup+0xc3/0x490 [obdclass]
 0xffffffffa03e251b : mdc_precleanup+0x11b/0x450 [mdc]
 0xffffffffa05a010c : class_cleanup+0x29c/0xcc0 [obdclass]
 0xffffffffa05a2e43 : class_process_config+0x1bf3/0x2cf0 [obdclass]
 0xffffffffa05a402f : class_manual_cleanup+0xef/0xba0 [obdclass]
 0xffffffffa07911f0 : ll_put_super+0x120/0xf00 [lustre]
 0xffffffffa05a7d15 : lustre_kill_super+0x45/0x50 [obdclass] (inexact)

The process above will put the import in the zombie list, and then the import will be killed in obd_zombie_impexp_cull(). But obd_zombie_impexp_cull() clears import->imp_zombie_chain.

 0xffffffffa057e5b3 : class_import_put+0xe3/0x400 [obdclass]
 0xffffffffa09a6384 : __ptlrpc_req_finished+0x1b4/0x690 [ptlrpc]
 0xffffffffa09a6870 : ptlrpc_req_finished+0x10/0x20 [ptlrpc]
 0xffffffffa033bdf8 : gss_do_ctx_fini_rpc+0x178/0x500 [ptlrpc_gss]
 0xffffffffa0334d3c : gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
 0xffffffffa034ab08 : ctx_destroy_kr+0x88/0x580 [ptlrpc_gss]
 0xffffffffa034b45d : gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
 0xffffffffa09e1b62 : sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
 0xffffffffa09ed77d : sec_process_ctx_list+0xdd/0x180 [ptlrpc]
 0xffffffffa09ed9c1 : sec_gc_main+0x71/0x400 [ptlrpc]

Then this process will need to send a request to finish the security context, by using the import that was killed by the other process

Comment by Gerrit Updater [ 06/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17709/
Subject: Revert "LU-7030 security: put imp_sec after all requests drained off"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 952a681628d685f1afa8e68c09f10179d03c7244

Comment by Teresa Kamakea [ 06/Jan/17 ]

We are seeing this again in lustre 2.5.5:

LustreError: 22883:0:(sec.c:394:import_sec_validate_get()) import ffff880d5697b800 (FULL) with no sec



Comment by Peter Jones [ 21/Jul/17 ]

John

What do you advise here?

Peter

Comment by John Hammond [ 03/Aug/17 ]

Hi Teresa,

What security flavor are you using when you see this?

Generated at Sat Feb 10 02:05:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.