Lustre / LU-1194

llog_recov_thread_stop+0x1ae/0x1b0 asserting

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.3
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Environment: RHEL6 + 2.1.0 (xyratex 1.0)
    • Severity: 3
    • Rank (Obsolete): 4581

    Description

      1329761763787093,7345,4,0,0,"snxs2n003","","kernel","[272274.418338] Call Trace:"
      1329761763794294,7346,4,0,0,"snxs2n003","","kernel","[272274.422654] [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20"
      1329761763794316,7347,4,0,0,"snxs2n003","","kernel","[272274.429093] [<ffffffffa0442855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]"
      1329761763808114,7348,4,0,0,"snxs2n003","","kernel","[272274.436284] [<ffffffffa0442e95>] lbug_with_loc+0x75/0xe0 [libcfs]"
      1329761763815465,7349,4,0,0,"snxs2n003","","kernel","[272274.442714] [<ffffffffa065805e>] llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]"
      1329761763815505,7350,4,0,0,"snxs2n003","","kernel","[272274.450178] [<ffffffffa0658097>] llog_recov_thread_fini+0x37/0x220 [ptlrpc]"
      1329761763820472,7351,4,0,0,"snxs2n003","","kernel","[272274.457480] [<ffffffff81093fae>] ? down+0x2e/0x50"
      1329761763835467,7352,4,0,0,"snxs2n003","","kernel","[272274.462464] [<ffffffffa0a56558>] filter_llog_finish+0x148/0x2f0 [obdfilter]"
      1329761763842393,7353,4,0,0,"snxs2n003","","kernel","[272274.469818] [<ffffffffa0609e5a>] ? target_cleanup_recovery+0x4a/0x320 [ptlrpc]"
      1329761763842426,7354,4,0,0,"snxs2n003","","kernel","[272274.477457] [<ffffffffa0512f58>] obd_llog_finish+0x98/0x230 [obdclass]"
      1329761763857637,7355,4,0,0,"snxs2n003","","kernel","[272274.484336] [<ffffffffa0a54baa>] filter_precleanup+0x12a/0x440 [obdfilter]"
      1329761763871327,7356,4,0,0,"snxs2n003","","kernel","[272274.491538] [<ffffffffa0523316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]"
      1329761763871364,7357,4,0,0,"snxs2n003","","kernel","[272274.499607] [<ffffffffa053b772>] class_cleanup+0x192/0xe30 [obdclass]"
      1329761763878975,7358,4,0,0,"snxs2n003","","kernel","[272274.506423] [<ffffffffa051bc76>] ? class_name2dev+0x56/0xd0 [obdclass]"
      1329761763879013,7359,4,0,0,"snxs2n003","","kernel","[272274.513330] [<ffffffffa053d603>] class_process_config+0x11f3/0x1fd0 [obdclass]"
      1329761763892326,7360,4,0,0,"snxs2n003","","kernel","[272274.520948] [<ffffffffa0443a13>] ? cfs_alloc+0x63/0x90 [libcfs]"
      1329761763899616,7361,4,0,0,"snxs2n003","","kernel","[272274.527229] [<ffffffffa0537fdb>] ? lustre_cfg_new+0x33b/0x880 [obdclass]"
      1329761763899661,7362,4,0,0,"snxs2n003","","kernel","[272274.534335] [<ffffffffa044d6b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]"
      1329761763907044,7363,4,0,0,"snxs2n003","","kernel","[272274.541589] [<ffffffffa053e545>] class_manual_cleanup+0x165/0x790 [obdclass]"
      1329761763913937,7364,4,0,0,"snxs2n003","","kernel","[272274.549022] [<ffffffffa051bc76>] ? class_name2dev+0x56/0xd0 [obdclass]"
      1329761763920982,7365,4,0,0,"snxs2n003","","kernel","[272274.555894] [<ffffffffa0549170>] server_put_super+0xa50/0xf60 [obdclass]"
      1329761763927266,7366,4,0,0,"snxs2n003","","kernel","[272274.562905] [<ffffffff8118d5b6>] ? invalidate_inodes+0xf6/0x190"
      1329761763933628,7367,4,0,0,"snxs2n003","","kernel","[272274.569150] [<ffffffff811750ab>] generic_shutdown_super+0x5b/0xe0"
      1329761763939475,7368,4,0,0,"snxs2n003","","kernel","[272274.575568] [<ffffffff81175196>] kill_anon_super+0x16/0x60"
      1329761763946424,7369,4,0,0,"snxs2n003","","kernel","[272274.581419] [<ffffffffa05402c6>] lustre_kill_super+0x36/0x60 [obdclass]"
      1329761763952386,7370,4,0,0,"snxs2n003","","kernel","[272274.588352] [<ffffffff81176210>] deactivate_super+0x70/0x90"
      1329761763963859,7371,4,0,0,"snxs2n003","","kernel","[272274.594278] [<ffffffff8119174f>] mntput_no_expire+0xbf/0x110"
      1329761763963903,7372,4,0,0,"snxs2n003","","kernel","[272274.600238] [<ffffffff81191b7b>] sys_umount+0x7b/0x3a0"

      Attachments

        Issue Links

          Activity

            [LU-1194] llog_recov_thread_stop+0x1ae/0x1b0 asserting

            nrutman Nathan Rutman added a comment - Xyratex-bug-id: MRP-456
            bogl Bob Glossman (Inactive) added a comment - http://review.whamcloud.com/#change,3480 backport to b2_1
            pjones Peter Jones added a comment -

            Landed for 2.3


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            At the IFERC cluster, while trying to stop all OSSes with some clients still connected, they got a bunch of the same LBUG in llog_recov_thread_stop() on multiple OSSes.
            On all the affected OSSes, the panic stack and the preceding messages forwarded by the Support team look like:
            ===============================================================================
            LustreError: 137-5: UUID 'work-OST006a_UUID' is not available for connect (stopping)
            Lustre: work-OST006e: shutting down for failover; client state will be preserved.
            LustreError: 137-5: UUID 'work-OST0069_UUID' is not available for connect (stopping)
            LustreError: Skipped 218 previous similar messages
            LustreError: 783:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff8805a151d600
            LustreError: 783:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG
            Pid: 783, comm: umount

            Call Trace:
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa056a855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            [<ffffffffa056ae95>] lbug_with_loc+0x75/0xe0 [libcfs]
            [<ffffffffa073bc7e>] llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]
            [<ffffffffa073bcb7>] llog_recov_thread_fini+0x37/0x220 [ptlrpc]
            [<ffffffff8108009e>] ? down+0x2e/0x50
            [<ffffffffa09db5b8>] filter_llog_finish+0x148/0x2f0 [obdfilter]
            [<ffffffffa05fcc85>] ? obd_zombie_impexp_notify+0x15/0x20 [obdclass]
            [<ffffffff8107a5cb>] ? remove_wait_queue+0x5b/0x70
            [<ffffffffa05f4ec8>] obd_llog_finish+0x98/0x230 [obdclass]
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa09d99d1>] filter_precleanup+0x131/0x670 [obdfilter]
            [<ffffffffa0605316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]
            [<ffffffffa05fe553>] ? dump_exports+0x133/0x1a0 [obdclass]
            [<ffffffffa061d7e5>] class_cleanup+0x1b5/0xe20 [obdclass]
            [<ffffffffa06d4c32>] ? mgc_process_config+0x232/0xed0 [mgc]
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa061f643>] class_process_config+0x11f3/0x1fd0 [obdclass]
            [<ffffffffa056ba13>] ? cfs_alloc+0x63/0x90 [libcfs]
            [<ffffffffa061a02b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0625d7b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0620585>] class_manual_cleanup+0x165/0x790 [obdclass]
            [<ffffffffa0626aa8>] ? lustre_end_log+0x1c8/0x490 [obdclass]
            [<ffffffff81041f25>] ? fair_select_idle_sibling+0x95/0x150
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa062b1f0>] server_put_super+0xa50/0xf60 [obdclass]
            [<ffffffff811782bc>] ? dispose_list+0x11c/0x140
            [<ffffffff81178748>] ? invalidate_inodes+0x158/0x1a0
            [<ffffffff8115feab>] generic_shutdown_super+0x5b/0x110
            [<ffffffff8115ffc6>] kill_anon_super+0x16/0x60
            [<ffffffffa0622306>] lustre_kill_super+0x36/0x60 [obdclass]
            [<ffffffff81161060>] deactivate_super+0x70/0x90
            [<ffffffff8117c76f>] mntput_no_expire+0xbf/0x110
            [<ffffffff8117cba8>] sys_umount+0x78/0x3c0
            [<ffffffff81003172>] system_call_fastpath+0x16/0x1b

            Kernel panic - not syncing: LBUG
            Pid: 783, comm: umount Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1
            Call Trace:
            [<ffffffff8147d803>] ? panic+0x78/0x143
            [<ffffffffa056aeeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
            [<ffffffffa073bc7e>] ? llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]
            [<ffffffffa073bcb7>] ? llog_recov_thread_fini+0x37/0x220 [ptlrpc]
            [<ffffffff8108009e>] ? down+0x2e/0x50
            [<ffffffffa09db5b8>] ? filter_llog_finish+0x148/0x2f0 [obdfilter]
            [<ffffffffa05fcc85>] ? obd_zombie_impexp_notify+0x15/0x20 [obdclass]
            Lustre: OST work-OST006e has stopped.
            Lustre: work-OST0069: shutting down for failover; client state will be preserved.
            [<ffffffff8107a5cb>] ? remove_wait_queue+0x5b/0x70
            [<ffffffffa05f4ec8>] ? obd_llog_finish+0x98/0x230 [obdclass]
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa09d99d1>] ? filter_precleanup+0x131/0x670 [obdfilter]
            [<ffffffffa0605316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]
            [<ffffffffa05fe553>] ? dump_exports+0x133/0x1a0 [obdclass]
            [<ffffffffa061d7e5>] ? class_cleanup+0x1b5/0xe20 [obdclass]
            [<ffffffffa06d4c32>] ? mgc_process_config+0x232/0xed0 [mgc]
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa061f643>] ? class_process_config+0x11f3/0x1fd0 [obdclass]
            [<ffffffffa056ba13>] ? cfs_alloc+0x63/0x90 [libcfs]
            [<ffffffffa061a02b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0625d7b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0620585>] ? class_manual_cleanup+0x165/0x790 [obdclass]
            [<ffffffffa0626aa8>] ? lustre_end_log+0x1c8/0x490 [obdclass]
            [<ffffffff81041f25>] ? fair_select_idle_sibling+0x95/0x150
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa062b1f0>] ? server_put_super+0xa50/0xf60 [obdclass]
            [<ffffffff811782bc>] ? dispose_list+0x11c/0x140
            [<ffffffff81178748>] ? invalidate_inodes+0x158/0x1a0
            [<ffffffff8115feab>] ? generic_shutdown_super+0x5b/0x110
            [<ffffffff8115ffc6>] ? kill_anon_super+0x16/0x60
            [<ffffffffa0622306>] ? lustre_kill_super+0x36/0x60 [obdclass]
            [<ffffffff81161060>] ? deactivate_super+0x70/0x90
            [<ffffffff8117c76f>] ? mntput_no_expire+0xbf/0x110
            [<ffffffff8117cba8>] ? sys_umount+0x78/0x3c0
            [<ffffffff81003172>] ? system_call_fastpath+0x16/0x1b
            ===============================================================================

            It definitely looks like this issue.

            aboyko Alexander Boyko added a comment - request http://review.whamcloud.com/2789

            aboyko Alexander Boyko added a comment -

            The root cause is probably a window between llog_sync(ctxt, NULL) and class_import_put() in filter_llog_finish():

                llog_sync(ctxt, NULL);

                /*
                 * Balance class_import_get() in llog_receptor_accept().
                 * This is safe to do, as llog is already synchronized
                 * and its import may go.
                 */
                cfs_mutex_down(&ctxt->loc_sem);
                if (ctxt->loc_imp) {
                        class_import_put(ctxt->loc_imp);
                        ctxt->loc_imp = NULL;
                }
                cfs_mutex_up(&ctxt->loc_sem);
                llog_ctxt_put(ctxt);

            If llog_obd_repl_cancel() happens in this window, a new llcd will be cached on the ctxt, and the LBUG is then hit in llog_recov_thread_stop(). I am working on a fix for this issue.
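
            A minimal, self-contained sketch of one way such a window can be closed: set a "stopping" flag under the same lock that protects loc_imp, so a racing cancel can no longer cache a new llcd after the final sync. This is a standalone userspace model built on that assumption; the names ctxt_model, cancel_model and finish_model are invented for illustration and are not the actual change in http://review.whamcloud.com/2789.

                /* Standalone model of the shutdown guard; compile with -pthread. */
                #include <errno.h>
                #include <pthread.h>
                #include <stdbool.h>
                #include <stdio.h>

                struct ctxt_model {
                        pthread_mutex_t lock;          /* models ctxt->loc_sem         */
                        bool            stopping;      /* set once final sync is done  */
                        int             cached_llcds;  /* models busy llcds on the lcm */
                };

                /* Models the llog_obd_repl_cancel() path that slips into the window. */
                static int cancel_model(struct ctxt_model *c)
                {
                        int rc = 0;

                        pthread_mutex_lock(&c->lock);
                        if (c->stopping)
                                rc = -ENXIO;           /* refuse: shutdown in progress */
                        else
                                c->cached_llcds++;     /* would cache a new llcd       */
                        pthread_mutex_unlock(&c->lock);
                        return rc;
                }

                /* Models filter_llog_finish(): close the window before the import goes. */
                static void finish_model(struct ctxt_model *c)
                {
                        /* ... llog_sync() would run here ... */
                        pthread_mutex_lock(&c->lock);
                        c->stopping = true;            /* no new llcds from now on     */
                        pthread_mutex_unlock(&c->lock);
                        /* ... class_import_put() and llog_recov_thread_stop() follow;
                         * the busy-llcd assertion can no longer trip. */
                }

                int main(void)
                {
                        struct ctxt_model c = { PTHREAD_MUTEX_INITIALIZER, false, 0 };

                        finish_model(&c);
                        printf("cancel after stop: %d (expect %d)\n",
                               cancel_model(&c), -ENXIO);
                        printf("busy llcds: %d (expect 0)\n", c.cached_llcds);
                        return 0;
                }

            Under this model, a cancel that arrives after finish_model() fails fast with -ENXIO instead of leaving a busy llcd behind for llog_recov_thread_stop() to assert on.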


            shadow Alexey Lyashkov added a comment -

            These are logs from a different hit, but I think we have the same root cause.


            shadow Alexey Lyashkov added a comment -

            1329761744817699,6366,4,0,0,"snxs2n003","Lustre","kernel","[269602.953789] Lustre: 15857:0:(import.c:526:import_select_connection()) Skipped 8 previous similar messages"
            1329761744817727,6368,4,0,0,"snxs2n003","Lustre","kernel","[269658.938516] Lustre: 15856:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1394071067253911 sent from MGC10.10.100.5@o2ib to NID 10.10.100.5@o2ib has timed out for slow reply: [sent 1329759091] [real_sent 1329759091] [current 1329759147] [deadline 56s] [delay 0s] req@ffff880364381000 x1394071067253911/t0(0) o-1->MGS@MGC10.10.100.5@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1329759147 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1"
            1329761744817743,6370,4,0,0,"snxs2n003","Lustre","kernel","[269658.977157] Lustre: 15856:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 8 previous similar messages"
            1329761747023362,6378,4,0,0,"snxs2n002","Lustre","kernel","[272256.552763] Lustre: server umount snxs2-OST0003 complete"
            1329761747381152,6383,4,0,0,"snxs2n002","Lustre","kernel","[272256.913120] Lustre: Failing over snxs2-OST0000"
            1329761747414521,6385,4,0,0,"snxs2n003","Lustre","kernel","[272257.995434] Lustre: Failing over snxs2-OST0007"
            1329761748292808,6387,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.830463] Lustre: 3868:0:(import.c:526:import_select_connection()) snxs2-OST000e-osc-MDT0000: tried all connections, increasing latency to 11s"
            1329761748292852,6389,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.845109] Lustre: 3868:0:(import.c:526:import_select_connection()) Skipped 2 previous similar messages"
            1329761748292981,6391,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.853216] Lustre: snxs2-OST0000-osc-MDT0000: Connection to service snxs2-OST0000 via nid 10.10.100.1@o2ib was lost; in progress operations using this service will wait for recovery to complete."
            1329761748368977,6393,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.895529] Lustre: snxs2-OST0003-osc-MDT0000: Connection to service snxs2-OST0003 via nid 10.10.100.1@o2ib was lost; in progress operations using this service will wait for recovery to complete."
            1329761748369012,6395,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.896566] Lustre: 3867:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1394349653885019 sent from snxs2-OST000e-osc-MDT0000 to NID 10.10.100.4@o2ib has failed due to network error: [sent 1329761748] [real_sent 1329761748] [current 1329761748] [deadline 16s] [delay -16s] req@ffff88085bed0000 x1394349653885019/t0(0) o-1->snxs2-OST000e_UUID@10.10.100.4@o2ib:28/4 lens 368/512 e 0 to 1 dl 1329761764 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1"
            1329761748369033,6397,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.896574] Lustre: 3867:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 3 previous similar messages"
            1329761748793789,6399,4,0,0,"snxs2n003","Lustre","kernel","[272259.407949] Lustre: snxs2-OST0007: shutting down for failover; client state will be preserved."
            1329761763772702,6400,3,0,0,"snxs2n003","LustreError","kernel","[272274.395479] LustreError: 4998:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff880419da7800"
            1329761748844011,6402,6,0,0,"snxs2n003","Lustre","kernel","[272259.437366] Lustre: OST snxs2-OST0007 has stopped."
            1329761763772748,6403,0,0,0,"snxs2n003","LustreError","kernel","[272274.406830] LustreError: 4998:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG"


            adilger Andreas Dilger added a comment -

            Hi Shadow,
            thanks for your bug report.

            Unfortunately, there isn't enough information in this bug to investigate what is failing here. What kind of load is being run before hitting this problem? What did the CERROR() before the LBUG() report?


            People

              Assignee: wc-triage WC Triage
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: