Lustre / LU-1194

llog_recov_thread_stop+0x1ae/0x1b0 asserting

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.3
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Environment: RHEL6 + 2.1.0 (xyratex 1.0)
    • Severity: 3
    • Rank: 4581

    Description

      1329761763787093,7345,4,0,0,"snxs2n003","","kernel","[272274.418338] Call Trace:"
      1329761763794294,7346,4,0,0,"snxs2n003","","kernel","[272274.422654] [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20"
      1329761763794316,7347,4,0,0,"snxs2n003","","kernel","[272274.429093] [<ffffffffa0442855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]"
      1329761763808114,7348,4,0,0,"snxs2n003","","kernel","[272274.436284] [<ffffffffa0442e95>] lbug_with_loc+0x75/0xe0 [libcfs]"
      1329761763815465,7349,4,0,0,"snxs2n003","","kernel","[272274.442714] [<ffffffffa065805e>] llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]"
      1329761763815505,7350,4,0,0,"snxs2n003","","kernel","[272274.450178] [<ffffffffa0658097>] llog_recov_thread_fini+0x37/0x220 [ptlrpc]"
      1329761763820472,7351,4,0,0,"snxs2n003","","kernel","[272274.457480] [<ffffffff81093fae>] ? down+0x2e/0x50"
      1329761763835467,7352,4,0,0,"snxs2n003","","kernel","[272274.462464] [<ffffffffa0a56558>] filter_llog_finish+0x148/0x2f0 [obdfilter]"
      1329761763842393,7353,4,0,0,"snxs2n003","","kernel","[272274.469818] [<ffffffffa0609e5a>] ? target_cleanup_recovery+0x4a/0x320 [ptlrpc]"
      1329761763842426,7354,4,0,0,"snxs2n003","","kernel","[272274.477457] [<ffffffffa0512f58>] obd_llog_finish+0x98/0x230 [obdclass]"
      1329761763857637,7355,4,0,0,"snxs2n003","","kernel","[272274.484336] [<ffffffffa0a54baa>] filter_precleanup+0x12a/0x440 [obdfilter]"
      1329761763871327,7356,4,0,0,"snxs2n003","","kernel","[272274.491538] [<ffffffffa0523316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]"
      1329761763871364,7357,4,0,0,"snxs2n003","","kernel","[272274.499607] [<ffffffffa053b772>] class_cleanup+0x192/0xe30 [obdclass]"
      1329761763878975,7358,4,0,0,"snxs2n003","","kernel","[272274.506423] [<ffffffffa051bc76>] ? class_name2dev+0x56/0xd0 [obdclass]"
      1329761763879013,7359,4,0,0,"snxs2n003","","kernel","[272274.513330] [<ffffffffa053d603>] class_process_config+0x11f3/0x1fd0 [obdclass]"
      1329761763892326,7360,4,0,0,"snxs2n003","","kernel","[272274.520948] [<ffffffffa0443a13>] ? cfs_alloc+0x63/0x90 [libcfs]"
      1329761763899616,7361,4,0,0,"snxs2n003","","kernel","[272274.527229] [<ffffffffa0537fdb>] ? lustre_cfg_new+0x33b/0x880 [obdclass]"
      1329761763899661,7362,4,0,0,"snxs2n003","","kernel","[272274.534335] [<ffffffffa044d6b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]"
      1329761763907044,7363,4,0,0,"snxs2n003","","kernel","[272274.541589] [<ffffffffa053e545>] class_manual_cleanup+0x165/0x790 [obdclass]"
      1329761763913937,7364,4,0,0,"snxs2n003","","kernel","[272274.549022] [<ffffffffa051bc76>] ? class_name2dev+0x56/0xd0 [obdclass]"
      1329761763920982,7365,4,0,0,"snxs2n003","","kernel","[272274.555894] [<ffffffffa0549170>] server_put_super+0xa50/0xf60 [obdclass]"
      1329761763927266,7366,4,0,0,"snxs2n003","","kernel","[272274.562905] [<ffffffff8118d5b6>] ? invalidate_inodes+0xf6/0x190"
      1329761763933628,7367,4,0,0,"snxs2n003","","kernel","[272274.569150] [<ffffffff811750ab>] generic_shutdown_super+0x5b/0xe0"
      1329761763939475,7368,4,0,0,"snxs2n003","","kernel","[272274.575568] [<ffffffff81175196>] kill_anon_super+0x16/0x60"
      1329761763946424,7369,4,0,0,"snxs2n003","","kernel","[272274.581419] [<ffffffffa05402c6>] lustre_kill_super+0x36/0x60 [obdclass]"
      1329761763952386,7370,4,0,0,"snxs2n003","","kernel","[272274.588352] [<ffffffff81176210>] deactivate_super+0x70/0x90"
      1329761763963859,7371,4,0,0,"snxs2n003","","kernel","[272274.594278] [<ffffffff8119174f>] mntput_no_expire+0xbf/0x110"
      1329761763963903,7372,4,0,0,"snxs2n003","","kernel","[272274.600238] [<ffffffff81191b7b>] sys_umount+0x7b/0x3a0"

          Activity

            [LU-1194] llog_recov_thread_stop+0x1ae/0x1b0 asserting

            louveta Alexandre Louvet (Inactive) added a comment -

            As per Intel's request, I have opened a new issue (LU-2615) to track this down.


            louveta Alexandre Louvet (Inactive) added a comment -

            I can't upload crashes; the site is classified.
            We can export the contents of structures or per-thread (tid) backtraces as text, so let me know what you want to see.

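            For reference, that kind of text can usually be pulled out of a vmcore with the crash utility, assuming the matching kernel and Lustre debuginfo are available. The commands below are a generic sketch, with placeholders for the umount PID and for the lcm address printed in the "Busy llcds found" console message:

            crash> mod -S                                   # load debuginfo for the loaded modules (ptlrpc, obdfilter, ...)
            crash> bt <umount-pid>                          # plain-text backtrace of the asserting task
            crash> struct llog_commit_master <lcm-address>  # dump the commit master structure
            crash> log > console.txt                        # kernel message buffer as text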

            shadow Alexey Lyashkov added a comment -

            Could you put the crash dump somewhere?

            jhammond John Hammond added a comment -

            We are seeing this on 2.1.3 when we unmount OSTs.

            crash> bt
            PID: 58311  TASK: ffff880285dbe040  CPU: 4   COMMAND: "umount"
             #0 [ffff880364a158c0] machine_kexec at ffffffff8103281b
             #1 [ffff880364a15920] crash_kexec at ffffffff810ba8e2
             #2 [ffff880364a159f0] panic at ffffffff814fd6d1
             #3 [ffff880364a15a70] lbug_with_loc at ffffffffa05bae5b [libcfs]
             #4 [ffff880364a15a90] llog_recov_thread_stop at ffffffffa0c23afb [ptlrpc]
             #5 [ffff880364a15ac0] llog_recov_thread_fini at ffffffffa0c23b33 [ptlrpc]
             #6 [ffff880364a15ae0] filter_llog_finish at ffffffffa06c132d [obdfilter]
             #7 [ffff880364a15b10] obd_llog_finish at ffffffffa0ac22d8 [obdclass]
             #8 [ffff880364a15b30] filter_precleanup at ffffffffa06c0d1f [obdfilter]
             #9 [ffff880364a15b90] class_cleanup at ffffffffa0ae9997 [obdclass]
            #10 [ffff880364a15c10] class_process_config at ffffffffa0aebcdb [obdclass]
            #11 [ffff880364a15ca0] class_manual_cleanup at ffffffffa0aeca19 [obdclass]
            #12 [ffff880364a15d60] server_put_super at ffffffffa0af88bc [obdclass]
            #13 [ffff880364a15e30] generic_shutdown_super at ffffffff8117d49b
            #14 [ffff880364a15e50] kill_anon_super at ffffffff8117d586
            #15 [ffff880364a15e70] lustre_kill_super at ffffffffa0aee656 [obdclass]
            #16 [ffff880364a15e90] deactivate_super at ffffffff8117e600
            #17 [ffff880364a15eb0] mntput_no_expire at ffffffff8119a64f
            #18 [ffff880364a15ee0] sys_umount at ffffffff8119b0eb
            #19 [ffff880364a15f80] system_call_fastpath at ffffffff8100b0f2
                RIP: 00007fc434e45717  RSP: 00007fff3fb87248  RFLAGS: 00010206
                RAX: 00000000000000a6  RBX: ffffffff8100b0f2  RCX: 0000000000000010
                RDX: 0000000000000000  RSI: 0000000000000001  RDI: 00007fc436843b60
                RBP: 00007fc436843b40   R8: 00007fc436843b80   R9: 0000000000000001
                R10: 00007fff3fb87070  R11: 0000000000000206  R12: 0000000000000000
                R13: 0000000000000000  R14: 0000000000000000  R15: 00007fc436843bc0
                ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
            

            The logs leading up to this were:

            [1733729.589047] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733729.600508] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.124@o2ib (151): c: 7, oc: 0, rc: 8
            [1733730.586896] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [1733730.597628] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 1 previous similar message
            [1733730.608257] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.129@o2ib (151): c: 7, oc: 0, rc: 8
            [1733730.621140] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 1 previous similar message
            [1733733.580433] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733733.591675] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
            [1733733.602613] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.156@o2ib (155): c: 7, oc: 0, rc: 8
            [1733733.615604] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 4 previous similar messages
            [1733742.561072] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733742.571750] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
            [1733742.582543] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.127@o2ib (162): c: 7, oc: 0, rc: 8
            [1733742.595489] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 4 previous similar messages
            [1733971.067888] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733971.080940] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
            [1733971.093679] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.243@o2ib (150): c: 7, oc: 0, rc: 8
            [1733971.109722] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 3 previous similar messages
            [1733976.057048] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733976.069835] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 8 previous similar messages
            [1733976.082962] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.181@o2ib (153): c: 7, oc: 0, rc: 8
            [1733976.097100] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 8 previous similar messages
            [1733985.037612] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [1733985.049931] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 168 previous similar messages
            [1733985.062783] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.249@o2ib (162): c: 7, oc: 0, rc: 8
            [1733985.078215] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 168 previous similar messages
            [1735454.496518] LustreError: 48744:0:(sec.c:383:import_sec_validate_get()) import ffff88050b301800 (CLOSED) with no sec
            [1735454.508690] LustreError: 48744:0:(recov_thread.c:394:llcd_push()) Couldn't send llcd ffff880131a16038 (-13)
            [1735454.520599] LustreError: 48744:0:(filter.c:3062:filter_sync_llogs()) error flushing logs to MDS: rc -13
            [1735604.539999] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1735604.550820] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 42 previous similar messages
            [1735604.561625] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.107@o2ib (150): c: 7, oc: 0, rc: 8
            [1735604.576129] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 42 previous similar messages
            [1735608.531438] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1735608.542183] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 12 previous similar messages
            [1735608.553255] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.38@o2ib (152): c: 7, oc: 0, rc: 8
            [1735608.566020] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 12 previous similar messages
            [1735613.520678] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [1735613.531552] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 15 previous similar messages
            [1735613.542474] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.74@o2ib (159): c: 7, oc: 0, rc: 8
            [1735613.557368] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 15 previous similar messages
            [1740454.405061] Lustre: 48598:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1355869330/real 1355869330]  req@ffff88086b1f9c00
             x1421268042665725/t0(0) o400->MGC192.168.65.5@o2ib@192.168.65.5@o2ib:26/25 lens 192/192 e 0 to 1 dl 1355869337 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [1740454.436739] LustreError: 166-1: MGC192.168.65.5@o2ib: Connection to MGS (at 192.168.65.5@o2ib) was lost; in progress operations using this service will fail
            [1740460.438928] Lustre: 48599:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1355869337/real 1355869337]  req@ffff88086b1f9000
             x1421268042665726/t0(0) o250->MGC192.168.65.5@o2ib@192.168.65.5@o2ib:26/25 lens 368/512 e 0 to 1 dl 1355869343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [1740468.526512] LustreError: 58311:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff880699307000
            [1740468.538720] LustreError: 58311:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG
            
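            For reference, the assertion that fires here (recov_thread.c:447 and :467 in the messages above) boils down to a check that no llog cancel descriptors (llcds) are still attached to the commit master (lcm) once the recovery thread has been stopped. The sketch below is a simplified illustration of that check, with struct and field names approximated from the error messages rather than quoted from the Lustre 2.1 source:

            /*
             * Simplified sketch of the check behind "Busy llcds found (N) on
             * lcm %p"; names are approximations, not verbatim Lustre source.
             */
            static void llog_recov_thread_stop_sketch(struct llog_commit_master *lcm,
                                                      int force)
            {
                    /* refuse new llcds and stop the processing thread,
                     * forcibly if requested (omitted) */

                    if (atomic_read(&lcm->lcm_count) != 0) {
                            struct llog_canceld_ctxt *llcd;

                            /* a cancel descriptor that was never sent or freed,
                             * possibly the one whose llcd_push() failed with -13
                             * above, is still attached at unmount time */
                            CERROR("Busy llcds found (%d) on lcm %p\n",
                                   atomic_read(&lcm->lcm_count), lcm);

                            spin_lock(&lcm->lcm_lock);
                            list_for_each_entry(llcd, &lcm->lcm_llcds, llcd_list)
                                    llcd_print(llcd, __func__, __LINE__);
                            spin_unlock(&lcm->lcm_lock);

                            LBUG();  /* recov_thread.c:467 in the traces above */
                    }
            }

            If that reading is right, it would explain why an llcd send that fails during network trouble (the -13 from llcd_push() above) later turns a routine OST umount into a panic.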

            louveta Alexandre Louvet (Inactive) added a comment -

            That doesn't look like enough.
            While running 2.1.3, we got 4 nodes that crashed at umount with exactly the same backtrace.


            nrutman Nathan Rutman added a comment -

            Xyratex-bug-id: MRP-456

            bogl Bob Glossman (Inactive) added a comment -

            Backport to b2_1: http://review.whamcloud.com/#change,3480

            pjones Peter Jones added a comment -

            Landed for 2.3


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            At the IFERC cluster, while trying to stop all OSSes with some clients still mounted, they hit the same LBUG in llog_recov_thread_stop() on multiple OSSes.
            On all the affected OSSes, the panic stack and the preceding messages sent by the support team look like this:
            ===============================================================================
            LustreError: 137-5: UUID 'work-OST006a_UUID' is not available for connect (stopping)
            Lustre: work-OST006e: shutting down for failover; client state will be preserved.
            LustreError: 137-5: UUID 'work-OST0069_UUID' is not available for connect (stopping)
            LustreError: Skipped 218 previous similar messages
            LustreError: 783:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff8805a151d600
            LustreError: 783:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG
            Pid: 783, comm: umount

            Call Trace:
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa056a855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            [<ffffffffa056ae95>] lbug_with_loc+0x75/0xe0 [libcfs]
            [<ffffffffa073bc7e>] llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]
            [<ffffffffa073bcb7>] llog_recov_thread_fini+0x37/0x220 [ptlrpc]
            [<ffffffff8108009e>] ? down+0x2e/0x50
            [<ffffffffa09db5b8>] filter_llog_finish+0x148/0x2f0 [obdfilter]
            [<ffffffffa05fcc85>] ? obd_zombie_impexp_notify+0x15/0x20 [obdclass]
            [<ffffffff8107a5cb>] ? remove_wait_queue+0x5b/0x70
            [<ffffffffa05f4ec8>] obd_llog_finish+0x98/0x230 [obdclass]
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa09d99d1>] filter_precleanup+0x131/0x670 [obdfilter]
            [<ffffffffa0605316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]
            [<ffffffffa05fe553>] ? dump_exports+0x133/0x1a0 [obdclass]
            [<ffffffffa061d7e5>] class_cleanup+0x1b5/0xe20 [obdclass]
            [<ffffffffa06d4c32>] ? mgc_process_config+0x232/0xed0 [mgc]
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa061f643>] class_process_config+0x11f3/0x1fd0 [obdclass]
            [<ffffffffa056ba13>] ? cfs_alloc+0x63/0x90 [libcfs]
            [<ffffffffa061a02b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0625d7b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0620585>] class_manual_cleanup+0x165/0x790 [obdclass]
            [<ffffffffa0626aa8>] ? lustre_end_log+0x1c8/0x490 [obdclass]
            [<ffffffff81041f25>] ? fair_select_idle_sibling+0x95/0x150
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa062b1f0>] server_put_super+0xa50/0xf60 [obdclass]
            [<ffffffff811782bc>] ? dispose_list+0x11c/0x140
            [<ffffffff81178748>] ? invalidate_inodes+0x158/0x1a0
            [<ffffffff8115feab>] generic_shutdown_super+0x5b/0x110
            [<ffffffff8115ffc6>] kill_anon_super+0x16/0x60
            [<ffffffffa0622306>] lustre_kill_super+0x36/0x60 [obdclass]
            [<ffffffff81161060>] deactivate_super+0x70/0x90
            [<ffffffff8117c76f>] mntput_no_expire+0xbf/0x110
            [<ffffffff8117cba8>] sys_umount+0x78/0x3c0
            [<ffffffff81003172>] system_call_fastpath+0x16/0x1b

            Kernel panic - not syncing: LBUG
            Pid: 783, comm: umount Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1
            Call Trace:
            [<ffffffff8147d803>] ? panic+0x78/0x143
            [<ffffffffa056aeeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
            [<ffffffffa073bc7e>] ? llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]
            [<ffffffffa073bcb7>] ? llog_recov_thread_fini+0x37/0x220 [ptlrpc]
            [<ffffffff8108009e>] ? down+0x2e/0x50
            [<ffffffffa09db5b8>] ? filter_llog_finish+0x148/0x2f0 [obdfilter]
            [<ffffffffa05fcc85>] ? obd_zombie_impexp_notify+0x15/0x20 [obdclass]
            Lustre: OST work-OST006e has stopped.
            Lustre: work-OST0069: shutting down for failover; client state will be preserved.
            [<ffffffff8107a5cb>] ? remove_wait_queue+0x5b/0x70
            [<ffffffffa05f4ec8>] ? obd_llog_finish+0x98/0x230 [obdclass]
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa09d99d1>] ? filter_precleanup+0x131/0x670 [obdfilter]
            [<ffffffffa0605316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]
            [<ffffffffa05fe553>] ? dump_exports+0x133/0x1a0 [obdclass]
            [<ffffffffa061d7e5>] ? class_cleanup+0x1b5/0xe20 [obdclass]
            [<ffffffffa06d4c32>] ? mgc_process_config+0x232/0xed0 [mgc]
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa061f643>] ? class_process_config+0x11f3/0x1fd0 [obdclass]
            [<ffffffffa056ba13>] ? cfs_alloc+0x63/0x90 [libcfs]
            [<ffffffffa061a02b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0625d7b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0620585>] ? class_manual_cleanup+0x165/0x790 [obdclass]
            [<ffffffffa0626aa8>] ? lustre_end_log+0x1c8/0x490 [obdclass]
            [<ffffffff81041f25>] ? fair_select_idle_sibling+0x95/0x150
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa062b1f0>] ? server_put_super+0xa50/0xf60 [obdclass]
            [<ffffffff811782bc>] ? dispose_list+0x11c/0x140
            [<ffffffff81178748>] ? invalidate_inodes+0x158/0x1a0
            [<ffffffff8115feab>] ? generic_shutdown_super+0x5b/0x110
            [<ffffffff8115ffc6>] ? kill_anon_super+0x16/0x60
            [<ffffffffa0622306>] ? lustre_kill_super+0x36/0x60 [obdclass]
            [<ffffffff81161060>] ? deactivate_super+0x70/0x90
            [<ffffffff8117c76f>] ? mntput_no_expire+0xbf/0x110
            [<ffffffff8117cba8>] ? sys_umount+0x78/0x3c0
            [<ffffffff81003172>] ? system_call_fastpath+0x16/0x1b
            ===============================================================================

            It definitely looks like this issue.

            aboyko Alexander Boyko added a comment -

            Review request: http://review.whamcloud.com/2789

            People

              Assignee: wc-triage WC Triage
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 8
