  Lustre / LU-7030

import_sec_validate_get()) import ffff8819c6497800 (FULL) with no sec (again)

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Minor

    Description

      The patch from LU-3353 does not appear to have eliminated the stated error message. We are still seeing it with Lustre 2.5.4 + the patch from LU-3353.

      Please try again.
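
      For reference, the message in the summary is emitted by a check of roughly the following shape. The sketch below is a simplified, hypothetical userspace rendering, not the actual lustre/ptlrpc/sec.c code; the struct and field names are illustrative stand-ins, and it only assumes that an import can still be reported as FULL after its security pointer has been dropped.

      /*
       * Hypothetical, simplified sketch of the "no sec" check; NOT the
       * real import_sec_validate_get() from lustre/ptlrpc/sec.c.
       */
      #include <stdio.h>
      #include <stddef.h>

      struct sec_ctx;                        /* stand-in for the ptlrpc sec */

      struct import {
              const char     *state_name;    /* e.g. "FULL" */
              struct sec_ctx *imp_sec;       /* NULL once the sec is gone */
      };

      /* Return 0 on success, -1 when the import has no security context. */
      static int import_sec_validate_get(struct import *imp)
      {
              if (imp->imp_sec == NULL) {
                      fprintf(stderr, "import %p (%s) with no sec\n",
                              (void *)imp, imp->state_name);
                      return -1;
              }
              return 0;
      }

      int main(void)
      {
              /* An import still marked FULL whose sec was already released. */
              struct import imp = { .state_name = "FULL", .imp_sec = NULL };

              return import_sec_validate_get(&imp) ? 1 : 0;
      }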


          Activity

            jhammond John Hammond added a comment -

            Hi Teresa,

            What security flavor are you using when you see this?

            pjones Peter Jones added a comment -

            John

            What do you advise here?

            Peter


            kamakea1 Teresa Kamakea (Inactive) added a comment - edited

            We are seeing this again in Lustre 2.5.5:

            LustreError: 22883:0:(sec.c:394:import_sec_validate_get()) import ffff880d5697b800 (FULL) with no sec
            
            
            
            

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17709/
            Subject: Revert "LU-7030 security: put imp_sec after all requests drained off"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 952a681628d685f1afa8e68c09f10179d03c7244


            sbuisson Sebastien Buisson (Inactive) added a comment -

            It seems the 'ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed' occurs because of a race between the following two processes:

             0xffffffffa057e5b3 : class_import_put+0xe3/0x400 [obdclass]
             0xffffffffa056cf61 : llog_ctxt_destroy+0x31/0x360 [obdclass]
             0xffffffffa056d32e : __llog_ctxt_put+0x9e/0x140 [obdclass]
             0xffffffffa056d813 : llog_cleanup+0xc3/0x490 [obdclass]
             0xffffffffa03e251b : mdc_precleanup+0x11b/0x450 [mdc]
             0xffffffffa05a010c : class_cleanup+0x29c/0xcc0 [obdclass]
             0xffffffffa05a2e43 : class_process_config+0x1bf3/0x2cf0 [obdclass]
             0xffffffffa05a402f : class_manual_cleanup+0xef/0xba0 [obdclass]
             0xffffffffa07911f0 : ll_put_super+0x120/0xf00 [lustre]
             0xffffffffa05a7d15 : lustre_kill_super+0x45/0x50 [obdclass] (inexact)
            

            The process above puts the import on the zombie list, and the import is then killed in obd_zombie_impexp_cull(). Note that obd_zombie_impexp_cull() clears import->imp_zombie_chain.

             0xffffffffa057e5b3 : class_import_put+0xe3/0x400 [obdclass]
             0xffffffffa09a6384 : __ptlrpc_req_finished+0x1b4/0x690 [ptlrpc]
             0xffffffffa09a6870 : ptlrpc_req_finished+0x10/0x20 [ptlrpc]
             0xffffffffa033bdf8 : gss_do_ctx_fini_rpc+0x178/0x500 [ptlrpc_gss]
             0xffffffffa0334d3c : gss_cli_ctx_fini_common+0x5c/0x2d0 [ptlrpc_gss]
             0xffffffffa034ab08 : ctx_destroy_kr+0x88/0x580 [ptlrpc_gss]
             0xffffffffa034b45d : gss_sec_release_ctx_kr+0x2d/0xa0 [ptlrpc_gss]
             0xffffffffa09e1b62 : sptlrpc_cli_ctx_put+0x42/0xb0 [ptlrpc]
             0xffffffffa09ed77d : sec_process_ctx_list+0xdd/0x180 [ptlrpc]
             0xffffffffa09ed9c1 : sec_gc_main+0x71/0x400 [ptlrpc]
            

            This second process then needs to send a request to finish the security context, using the import that was already killed by the first process.
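
            To make the ordering concrete, below is a small, self-contained pthread sketch that only models the race described above. It is not Lustre code: struct import, import_put(), zombie_cull() and sec_gc() are illustrative stand-ins, and "freeing" the import is modelled with a flag rather than a real kfree().

            /*
             * Hypothetical userspace model of the race above: one thread (the
             * "zombie cull") drops the last import reference and frees it while
             * another thread (the "sec GC") still needs that import to send the
             * context-destroy RPC.  All names are illustrative stand-ins.
             */
            #include <pthread.h>
            #include <stdatomic.h>
            #include <stdio.h>
            #include <unistd.h>

            struct import {
                    atomic_int refcount;
                    atomic_int freed;            /* models kfree() of the import */
            };

            static struct import imp = { 1, 0 };

            static void import_put(struct import *i)
            {
                    if (atomic_fetch_sub(&i->refcount, 1) == 1)
                            atomic_store(&i->freed, 1);  /* last ref: "free" it */
            }

            /* Thread 1: umount -> obd_zombie_impexp_cull() drops the last ref. */
            static void *zombie_cull(void *arg)
            {
                    (void)arg;
                    import_put(&imp);
                    return NULL;
            }

            /* Thread 2: sec_gc_main() -> gss_do_ctx_fini_rpc() uses the import. */
            static void *sec_gc(void *arg)
            {
                    (void)arg;
                    usleep(1000);                /* lose the race on purpose */
                    if (atomic_load(&imp.freed))
                            printf("sec GC would use a freed import: use-after-free\n");
                    else
                            printf("import still alive, fini RPC can be sent\n");
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t1, t2;

                    pthread_create(&t1, NULL, zombie_cull, NULL);
                    pthread_create(&t2, NULL, sec_gc, NULL);
                    pthread_join(t1, NULL);
                    pthread_join(t2, NULL);
                    return 0;
            }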


            niu Niu Yawei (Inactive) added a comment -

            Let's revert the original fix first, then take some time to investigate why we can't remove the import refcounting code in the security flavors.

            The patch to revert the original fix: http://review.whamcloud.com/#/c/17709/


            sbuisson Sebastien Buisson (Inactive) added a comment -

            And, as Jeremy said, because http://review.whamcloud.com/16071 landed in master, all security flavors except null are currently broken on the master branch: when the security flavor is set to anything other than null, some imports stay stalled at unmount.


            sbuisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            When running with patch http://review.whamcloud.com/17798 on top of the master branch, I no longer see stalled imports at unmount, but now I hit the following LBUG.

            LustreError: 2027:0:(connection.c:99:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:
            LustreError: 2027:0:(connection.c:99:ptlrpc_connection_put()) LBUG
            Pid: 2027, comm: obd_zombid
            
            Call Trace:
            [<ffffffffa01eb7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
            [<ffffffffa01ebd75>] lbug_with_loc+0x45/0xc0 [libcfs]
            [<ffffffffa0591db3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc]
            [<ffffffffa031318c>] obd_zombie_impexp_cull+0x1bc/0xa80 [obdclass]
            [<ffffffffa0313abd>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
            [<ffffffff810a9510>] ? default_wake_function+0x0/0x20
            [<ffffffffa0313a50>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
            [<ffffffff8109727f>] kthread+0xcf/0xe0
            [<ffffffff810971b0>] ? kthread+0x0/0xe0
            [<ffffffff81614358>] ret_from_fork+0x58/0x90
            [<ffffffff810971b0>] ? kthread+0x0/0xe0
            
            Kernel panic - not syncing: LBUG
            CPU: 3 PID: 2027 Comm: obd_zombid Tainted: GF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
            Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
            ffffffffa0207e0f 000000009fc69d5f ffff88040771bd40 ffffffff816045b6
            ffff88040771bdc0 ffffffff815fde5a ffffffff00000008 ffff88040771bdd0
            ffff88040771bd70 000000009fc69d5f ffffffffa06305c0 0000000000000000
            Call Trace:
            [<ffffffff816045b6>] dump_stack+0x19/0x1b
            [<ffffffff815fde5a>] panic+0xd8/0x1e7
            [<ffffffffa01ebddb>] lbug_with_loc+0xab/0xc0 [libcfs]
            [<ffffffffa0591db3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc]
            [<ffffffffa031318c>] obd_zombie_impexp_cull+0x1bc/0xa80 [obdclass]
            [<ffffffffa0313abd>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
            [<ffffffff810a9510>] ? wake_up_state+0x20/0x20
            [<ffffffffa0313a50>] ? obd_zombie_impexp_cull+0xa80/0xa80 [obdclass]
            [<ffffffff8109727f>] kthread+0xcf/0xe0
            [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
            [<ffffffff81614358>] ret_from_fork+0x58/0x90
            [<ffffffff810971b0>] ? kthread_create_on_node+0x140/0x140
            

            I have a krb5n connection from the client to the MDS. Mounting works fine, but when I unmount the client, the assertion is triggered.
            Before the LBUG, there is this error message on the client:

            (gss_cli_upcall.c:432:gss_do_ctx_fini_rpc()) ctx ffff880407223b40(0->ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ): rpc error -19, destroy locally
            

            Maybe this is the reason the connection is put one too many times?

            Sebastien.
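
            To illustrate the question above, here is a hypothetical stand-alone C sketch of the refcount discipline the failed assertion suggests: the put path assumes some other holder (for example a connection hash) always keeps at least one reference, so putting the same connection one extra time trips the check. The names are simplified stand-ins, not the ptlrpc code, and the second put below is the deliberate "one time too many".

            /*
             * Hypothetical sketch, NOT ptlrpc code: a put that asserts the
             * caller never drops the base reference, as in the reported
             * 'ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed'.
             */
            #include <assert.h>
            #include <stdatomic.h>
            #include <stdio.h>

            struct connection {
                    atomic_int c_refcount;       /* base ref held by a hash table */
            };

            static void connection_put(struct connection *conn)
            {
                    /* Mirrors the LASSERT: never drop below the base reference. */
                    assert(atomic_load(&conn->c_refcount) > 1);
                    atomic_fetch_sub(&conn->c_refcount, 1);
            }

            int main(void)
            {
                    /* 1 base ref (hash) + 1 ref held by the import = 2. */
                    struct connection conn = { 2 };

                    connection_put(&conn);       /* import drops its ref: fine */
                    connection_put(&conn);       /* extra put: assertion fires */

                    printf("not reached when the extra put happens\n");
                    return 0;
            }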


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17798
            Subject: LU-7030 gss: don't get refcount on import
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a0b58607782fd890dd98627dad2e0d87d2b1c798


            adilger Andreas Dilger added a comment -

            Jeremy, I'd be happy to drop the reversion patch if you can submit a patch to remove those extra refcounts for the security flavours.


            People

              jhammond John Hammond
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 11
