[LU-7030] import_sec_validate_get()) import ffff8819c6497800 (FULL) with no sec (again) Created: 21/Aug/15 Updated: 29/Mar/22 Resolved: 29/Mar/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | John Hammond |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | llnl | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
The patch from Please try again. |
| Comments |
| Comment by Peter Jones [ 22/Aug/15 ] |
|
Niu You worked on Peter |
| Comment by Niu Yawei (Inactive) [ 24/Aug/15 ] |
|
Looks the fix of Then look at the import_sec_validate_get(): /* Only output an error when the import is still active */ if (*sec == NULL) { if (list_empty(&imp->imp_zombie_chain)) CERROR("import %p (%s) with no sec\n", imp, ptlrpc_import_state_name(imp->imp_state)); return -EACCES; } When checking imp->imp_zombie_chain, the import may not be put on zombie list yet. Amir, could you take a look? |
| Comment by Gerrit Updater [ 25/Aug/15 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16071 |
| Comment by Gerrit Updater [ 13/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16071/ |
| Comment by Peter Jones [ 14/Dec/15 ] |
|
Landed for 2.8 |
| Comment by Jeremy Filizetti [ 22/Dec/15 ] |
|
Patch 16071 breaks Lustre security for everything but the default flavor. Every security flavor except the default NULL (no security) grabs a reference on the import in create_sec. There is an equivalent destroy_sec but that is called from the sptlrpc_sec_put. With this patch that is no longer called until the import is released but this creates a circular dependency for everything but the default security flavor. We should revert the patch and create one that doesn't affect other security flavors. |
| Comment by Jeremy Filizetti [ 22/Dec/15 ] |
|
A proper fix is needed that won't break security flavors other than the default |
| Comment by Niu Yawei (Inactive) [ 22/Dec/15 ] |
|
Indeed. The purpose of sec flavor holding import refcount looks not clear to me, if import needs sec flavor until all RPCs drained off and the flavor will be destroyed before freeing import, why should sec flavor holds import refcount? I suggest to remove the import get/put in the create_sec/destroy_sec, what do you think? |
| Comment by Gerrit Updater [ 22/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17709 |
| Comment by Jeremy Filizetti [ 22/Dec/15 ] |
|
From what I can tell its reasonable to remove class_import_get from gss_sec_create_common and plain_create_sec as well as class_import_put from the destroy functions. I'm running it through sanity now just to confirm but it does avoid the earlier mentioned problem. |
| Comment by Andreas Dilger [ 23/Dec/15 ] |
|
Jeremy, I'd be happy to drop the reversion patch if you can submit a patch to remove those extra refcounts for the security flavours. |
| Comment by Gerrit Updater [ 04/Jan/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17798 |
| Comment by Sebastien Buisson (Inactive) [ 05/Jan/16 ] |
|
Hi, When running with patch http://review.whamcloud.com/17798 on top of master branch, I do not have stalled import anymore at unmount, but now I hit the following LBUG. LustreError: 2027:0:(connection.c:99:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed: LustreError: 2027:0:(connection.c:99:ptlrpc_connection_put()) LBUG Pid: 2027, comm: obd_zombid Call Trace: [ffffffffa01eb7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] [ffffffffa01ebd75>] lbug_with_loc+0x45/0xc0 [libcfs] [ffffffffa0591db3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc] [ffffffffa031318c>] obd_zombie_impexp_cull+0x1bc/0xa80 [obdclass] [ffffffffa0313abd>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass] [ffffffff810a9510>] ? default_wake_function+0x0/0x20 [ffffffffa0313a50>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass] [ffffffff8109727f>] kthread+0xcf/0xe0 [ffffffff810971b0>] ? kthread+0x0/0xe0 [ffffffff81614358>] ret_from_fork+0x58/0x90 [ffffffff810971b0>] ? kthread+0x0/0xe0 Kernel panic - not syncing: LBUG CPU: 3 PID: 2027 Comm: obd_zombid Tainted: GF O-------------- 3.10.0-229.20.1.el7.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 ffffffffa0207e0f 000000009fc69d5f ffff88040771bd40 ffffffff816045b6 ffff88040771bdc0 ffffffff815fde5a ffffffff00000008 ffff88040771bdd0 ffff88040771bd70 000000009fc69d5f ffffffffa06305c0 0000000000000000 Call Trace: [ffffffff816045b6>] dump_stack+0x19/0x1b [ffffffff815fde5a>] panic+0xd8/0x1e7 [ffffffffa01ebddb>] lbug_with_loc+0xab/0xc0 [libcfs] [ffffffffa0591db3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc] [ffffffffa031318c>] obd_zombie_impexp_cull+0x1bc/0xa80 [obdclass] [ffffffffa0313abd>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass] [ffffffff810a9510>] ? wake_up_state+0x20/0x20 [ffffffffa0313a50>] ? obd_zombie_impexp_cull+0xa80/0xa80 [obdclass] [ffffffff8109727f>] kthread+0xcf/0xe0 [ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140 [ffffffff81614358>] ret_from_fork+0x58/0x90 [ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140 I have krb5n connection from client to MDS. Mount is just fine, but when I unmount the client, the assertion is triggered. (gss_cli_upcall.c:432:gss_do_ctx_fini_rpc()) ctx ffff880407223b40(0->ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ): rpc error -19, destroy locally Maybe this is the reason why the connection is put one time too many? Sebastien. |
| Comment by Sebastien Buisson (Inactive) [ 06/Jan/16 ] |
|
And, as Jeremy said, because of http://review.whamcloud.com/16071 landed in master, all security flavors except null are broken at the moment in master branch: when the security flavor is set to something else than null, some imports stay stalled at unmount. |
| Comment by Niu Yawei (Inactive) [ 06/Jan/16 ] |
|
Let's revert the original fix first, then take some time to investigate why can't we remove the import refcounting code in security flavors. The patch to revert the original fix: http://review.whamcloud.com/#/c/17709/ |
| Comment by Sebastien Buisson (Inactive) [ 06/Jan/16 ] |
|
It seems the 'ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed' occurs because of a race between the following 2 processes: 0xffffffffa057e5b3 : class_import_put+0xe3/0x400 [obdclass] 0xffffffffa056cf61 : llog_ctxt_destroy+0x31/0x360 [obdclass] 0xffffffffa056d32e : __llog_ctxt_put+0x9e/0x140 [obdclass] 0xffffffffa056d813 : llog_cleanup+0xc3/0x490 [obdclass] 0xffffffffa03e251b : mdc_precleanup+0x11b/0x450 [mdc] 0xffffffffa05a010c : class_cleanup+0x29c/0xcc0 [obdclass] 0xffffffffa05a2e43 : class_process_config+0x1bf3/0x2cf0 [obdclass] 0xffffffffa05a402f : class_manual_cleanup+0xef/0xba0 [obdclass] 0xffffffffa07911f0 : ll_put_super+0x120/0xf00 [lustre] 0xffffffffa05a7d15 : lustre_kill_super+0x45/0x50 [obdclass] (inexact) The process above will put the import in the zombie list, and then the import will be killed in obd_zombie_impexp_cull(). But obd_zombie_impexp_cull() clears import->imp_zombie_chain. 0xffffffffa057e5b3 : class_import_put+0xe3/0x400 [obdclass] 0xffffffffa09a6384 : __ptlrpc_req_finished+0x1b4/0x690 [ptlrpc] 0xffffffffa09a6870 : ptlrpc_req_finished+0x10/0x20 [ptlrpc] 0xffffffffa033bdf8 : gss_do_ctx_fini_rpc+0x178/0x500 [ptlrpc_gss] 0xffffffffa0334d3c : gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss] 0xffffffffa034ab08 : ctx_destroy_kr+0x88/0x580 [ptlrpc_gss] 0xffffffffa034b45d : gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss] 0xffffffffa09e1b62 : sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc] 0xffffffffa09ed77d : sec_process_ctx_list+0xdd/0x180 [ptlrpc] 0xffffffffa09ed9c1 : sec_gc_main+0x71/0x400 [ptlrpc] Then this process will need to send a request to finish the security context, by using the import that was killed by the other process |
| Comment by Gerrit Updater [ 06/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17709/ |
| Comment by Teresa Kamakea [ 06/Jan/17 ] |
|
We are seeing this again in lustre 2.5.5: LustreError: 22883:0:(sec.c:394:import_sec_validate_get()) import ffff880d5697b800 (FULL) with no sec
|
| Comment by Peter Jones [ 21/Jul/17 ] |
|
John What do you advise here? Peter |
| Comment by John Hammond [ 03/Aug/17 ] |
|
Hi Teresa, What security flavor are you using when you see this? |