Lustre / LU-1194

llog_recov_thread_stop+0x1ae/0x1b0 asserting

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.3
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Environment: RHEL6 + 2.1.0 (xyratex 1.0)
    • Severity: 3
    • 4581

    Description

      1329761763787093,7345,4,0,0,"snxs2n003","","kernel","[272274.418338] Call Trace:"
      1329761763794294,7346,4,0,0,"snxs2n003","","kernel","[272274.422654] [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20"
      1329761763794316,7347,4,0,0,"snxs2n003","","kernel","[272274.429093] [<ffffffffa0442855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]"
      1329761763808114,7348,4,0,0,"snxs2n003","","kernel","[272274.436284] [<ffffffffa0442e95>] lbug_with_loc+0x75/0xe0 [libcfs]"
      1329761763815465,7349,4,0,0,"snxs2n003","","kernel","[272274.442714] [<ffffffffa065805e>] llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]"
      1329761763815505,7350,4,0,0,"snxs2n003","","kernel","[272274.450178] [<ffffffffa0658097>] llog_recov_thread_fini+0x37/0x220 [ptlrpc]"
      1329761763820472,7351,4,0,0,"snxs2n003","","kernel","[272274.457480] [<ffffffff81093fae>] ? down+0x2e/0x50"
      1329761763835467,7352,4,0,0,"snxs2n003","","kernel","[272274.462464] [<ffffffffa0a56558>] filter_llog_finish+0x148/0x2f0 [obdfilter]"
      1329761763842393,7353,4,0,0,"snxs2n003","","kernel","[272274.469818] [<ffffffffa0609e5a>] ? target_cleanup_recovery+0x4a/0x320 [ptlrpc]"
      1329761763842426,7354,4,0,0,"snxs2n003","","kernel","[272274.477457] [<ffffffffa0512f58>] obd_llog_finish+0x98/0x230 [obdclass]"
      1329761763857637,7355,4,0,0,"snxs2n003","","kernel","[272274.484336] [<ffffffffa0a54baa>] filter_precleanup+0x12a/0x440 [obdfilter]"
      1329761763871327,7356,4,0,0,"snxs2n003","","kernel","[272274.491538] [<ffffffffa0523316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]"
      1329761763871364,7357,4,0,0,"snxs2n003","","kernel","[272274.499607] [<ffffffffa053b772>] class_cleanup+0x192/0xe30 [obdclass]"
      1329761763878975,7358,4,0,0,"snxs2n003","","kernel","[272274.506423] [<ffffffffa051bc76>] ? class_name2dev+0x56/0xd0 [obdclass]"
      1329761763879013,7359,4,0,0,"snxs2n003","","kernel","[272274.513330] [<ffffffffa053d603>] class_process_config+0x11f3/0x1fd0 [obdclass]"
      1329761763892326,7360,4,0,0,"snxs2n003","","kernel","[272274.520948] [<ffffffffa0443a13>] ? cfs_alloc+0x63/0x90 [libcfs]"
      1329761763899616,7361,4,0,0,"snxs2n003","","kernel","[272274.527229] [<ffffffffa0537fdb>] ? lustre_cfg_new+0x33b/0x880 [obdclass]"
      1329761763899661,7362,4,0,0,"snxs2n003","","kernel","[272274.534335] [<ffffffffa044d6b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]"
      1329761763907044,7363,4,0,0,"snxs2n003","","kernel","[272274.541589] [<ffffffffa053e545>] class_manual_cleanup+0x165/0x790 [obdclass]"
      1329761763913937,7364,4,0,0,"snxs2n003","","kernel","[272274.549022] [<ffffffffa051bc76>] ? class_name2dev+0x56/0xd0 [obdclass]"
      1329761763920982,7365,4,0,0,"snxs2n003","","kernel","[272274.555894] [<ffffffffa0549170>] server_put_super+0xa50/0xf60 [obdclass]"
      1329761763927266,7366,4,0,0,"snxs2n003","","kernel","[272274.562905] [<ffffffff8118d5b6>] ? invalidate_inodes+0xf6/0x190"
      1329761763933628,7367,4,0,0,"snxs2n003","","kernel","[272274.569150] [<ffffffff811750ab>] generic_shutdown_super+0x5b/0xe0"
      1329761763939475,7368,4,0,0,"snxs2n003","","kernel","[272274.575568] [<ffffffff81175196>] kill_anon_super+0x16/0x60"
      1329761763946424,7369,4,0,0,"snxs2n003","","kernel","[272274.581419] [<ffffffffa05402c6>] lustre_kill_super+0x36/0x60 [obdclass]"
      1329761763952386,7370,4,0,0,"snxs2n003","","kernel","[272274.588352] [<ffffffff81176210>] deactivate_super+0x70/0x90"
      1329761763963859,7371,4,0,0,"snxs2n003","","kernel","[272274.594278] [<ffffffff8119174f>] mntput_no_expire+0xbf/0x110"
      1329761763963903,7372,4,0,0,"snxs2n003","","kernel","[272274.600238] [<ffffffff81191b7b>] sys_umount+0x7b/0x3a0"

          Activity


            shadow Alexey Lyashkov added a comment -

            Could you put the crashdump somewhere?
            jhammond John Hammond added a comment -

            We are seeing this on 2.1.3 when we unmount OSTs.

            crash> bt
            PID: 58311  TASK: ffff880285dbe040  CPU: 4   COMMAND: "umount"
             #0 [ffff880364a158c0] machine_kexec at ffffffff8103281b
             #1 [ffff880364a15920] crash_kexec at ffffffff810ba8e2
             #2 [ffff880364a159f0] panic at ffffffff814fd6d1
             #3 [ffff880364a15a70] lbug_with_loc at ffffffffa05bae5b [libcfs]
             #4 [ffff880364a15a90] llog_recov_thread_stop at ffffffffa0c23afb [ptlrpc]
             #5 [ffff880364a15ac0] llog_recov_thread_fini at ffffffffa0c23b33 [ptlrpc]
             #6 [ffff880364a15ae0] filter_llog_finish at ffffffffa06c132d [obdfilter]
             #7 [ffff880364a15b10] obd_llog_finish at ffffffffa0ac22d8 [obdclass]
             #8 [ffff880364a15b30] filter_precleanup at ffffffffa06c0d1f [obdfilter]
             #9 [ffff880364a15b90] class_cleanup at ffffffffa0ae9997 [obdclass]
            #10 [ffff880364a15c10] class_process_config at ffffffffa0aebcdb [obdclass]
            #11 [ffff880364a15ca0] class_manual_cleanup at ffffffffa0aeca19 [obdclass]
            #12 [ffff880364a15d60] server_put_super at ffffffffa0af88bc [obdclass]
            #13 [ffff880364a15e30] generic_shutdown_super at ffffffff8117d49b
            #14 [ffff880364a15e50] kill_anon_super at ffffffff8117d586
            #15 [ffff880364a15e70] lustre_kill_super at ffffffffa0aee656 [obdclass]
            #16 [ffff880364a15e90] deactivate_super at ffffffff8117e600
            #17 [ffff880364a15eb0] mntput_no_expire at ffffffff8119a64f
            #18 [ffff880364a15ee0] sys_umount at ffffffff8119b0eb
            #19 [ffff880364a15f80] system_call_fastpath at ffffffff8100b0f2
                RIP: 00007fc434e45717  RSP: 00007fff3fb87248  RFLAGS: 00010206
                RAX: 00000000000000a6  RBX: ffffffff8100b0f2  RCX: 0000000000000010
                RDX: 0000000000000000  RSI: 0000000000000001  RDI: 00007fc436843b60
                RBP: 00007fc436843b40   R8: 00007fc436843b80   R9: 0000000000000001
                R10: 00007fff3fb87070  R11: 0000000000000206  R12: 0000000000000000
                R13: 0000000000000000  R14: 0000000000000000  R15: 00007fc436843bc0
                ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
            

            The logs leading up to this were:

            [1733729.589047] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733729.600508] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.124@o2ib (151): c: 7, oc: 0, rc: 8
            [1733730.586896] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [1733730.597628] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 1 previous similar message
            [1733730.608257] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.129@o2ib (151): c: 7, oc: 0, rc: 8
            [1733730.621140] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 1 previous similar message
            [1733733.580433] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733733.591675] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
            [1733733.602613] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.156@o2ib (155): c: 7, oc: 0, rc: 8
            [1733733.615604] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 4 previous similar messages
            [1733742.561072] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733742.571750] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
            [1733742.582543] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.127@o2ib (162): c: 7, oc: 0, rc: 8
            [1733742.595489] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 4 previous similar messages
            [1733971.067888] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733971.080940] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
            [1733971.093679] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.243@o2ib (150): c: 7, oc: 0, rc: 8
            [1733971.109722] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 3 previous similar messages
            [1733976.057048] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1733976.069835] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 8 previous similar messages
            [1733976.082962] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.181@o2ib (153): c: 7, oc: 0, rc: 8
            [1733976.097100] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 8 previous similar messages
            [1733985.037612] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [1733985.049931] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 168 previous similar messages
            [1733985.062783] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.249@o2ib (162): c: 7, oc: 0, rc: 8
            [1733985.078215] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 168 previous similar messages
            [1735454.496518] LustreError: 48744:0:(sec.c:383:import_sec_validate_get()) import ffff88050b301800 (CLOSED) with no sec
            [1735454.508690] LustreError: 48744:0:(recov_thread.c:394:llcd_push()) Couldn't send llcd ffff880131a16038 (-13)
            [1735454.520599] LustreError: 48744:0:(filter.c:3062:filter_sync_llogs()) error flushing logs to MDS: rc -13
            [1735604.539999] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1735604.550820] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 42 previous similar messages
            [1735604.561625] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.107@o2ib (150): c: 7, oc: 0, rc: 8
            [1735604.576129] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 42 previous similar messages
            [1735608.531438] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            [1735608.542183] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 12 previous similar messages
            [1735608.553255] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.38@o2ib (152): c: 7, oc: 0, rc: 8
            [1735608.566020] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 12 previous similar messages
            [1735613.520678] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [1735613.531552] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 15 previous similar messages
            [1735613.542474] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.74@o2ib (159): c: 7, oc: 0, rc: 8
            [1735613.557368] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 15 previous similar messages
            [1740454.405061] Lustre: 48598:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1355869330/real 1355869330]  req@ffff88086b1f9c00
             x1421268042665725/t0(0) o400->MGC192.168.65.5@o2ib@192.168.65.5@o2ib:26/25 lens 192/192 e 0 to 1 dl 1355869337 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [1740454.436739] LustreError: 166-1: MGC192.168.65.5@o2ib: Connection to MGS (at 192.168.65.5@o2ib) was lost; in progress operations using this service will fail
            [1740460.438928] Lustre: 48599:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1355869337/real 1355869337]  req@ffff88086b1f9000
             x1421268042665726/t0(0) o250->MGC192.168.65.5@o2ib@192.168.65.5@o2ib:26/25 lens 368/512 e 0 to 1 dl 1355869343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [1740468.526512] LustreError: 58311:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff880699307000
            [1740468.538720] LustreError: 58311:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG
            

            louveta Alexandre Louvet (Inactive) added a comment -

            This doesn't look like enough.
            While running 2.1.3, we got 4 nodes that crashed at umount with exactly the same bt.

            nrutman Nathan Rutman added a comment -

            Xyratex-bug-id: MRP-456
            bogl Bob Glossman (Inactive) added a comment -

            Backport to b2_1: http://review.whamcloud.com/#change,3480
            pjones Peter Jones added a comment -

            Landed for 2.3


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            On the IFERC cluster, while trying to stop all OSSes with some clients still mounted, they hit a bunch of the same LBUG in llog_recov_thread_stop() on multiple OSSes.
            On all the affected OSSes, the panic stack and preceding messages forwarded by the Support team look like:
            ===============================================================================
            LustreError: 137-5: UUID 'work-OST006a_UUID' is not available for connect (stopping)
            Lustre: work-OST006e: shutting down for failover; client state will be preserved.
            LustreError: 137-5: UUID 'work-OST0069_UUID' is not available for connect (stopping)
            LustreError: Skipped 218 previous similar messages
            LustreError: 783:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff8805a151d600
            LustreError: 783:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG
            Pid: 783, comm: umount

            Call Trace:
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa056a855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            [<ffffffffa056ae95>] lbug_with_loc+0x75/0xe0 [libcfs]
            [<ffffffffa073bc7e>] llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]
            [<ffffffffa073bcb7>] llog_recov_thread_fini+0x37/0x220 [ptlrpc]
            [<ffffffff8108009e>] ? down+0x2e/0x50
            [<ffffffffa09db5b8>] filter_llog_finish+0x148/0x2f0 [obdfilter]
            [<ffffffffa05fcc85>] ? obd_zombie_impexp_notify+0x15/0x20 [obdclass]
            [<ffffffff8107a5cb>] ? remove_wait_queue+0x5b/0x70
            [<ffffffffa05f4ec8>] obd_llog_finish+0x98/0x230 [obdclass]
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa09d99d1>] filter_precleanup+0x131/0x670 [obdfilter]
            [<ffffffffa0605316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]
            [<ffffffffa05fe553>] ? dump_exports+0x133/0x1a0 [obdclass]
            [<ffffffffa061d7e5>] class_cleanup+0x1b5/0xe20 [obdclass]
            [<ffffffffa06d4c32>] ? mgc_process_config+0x232/0xed0 [mgc]
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa061f643>] class_process_config+0x11f3/0x1fd0 [obdclass]
            [<ffffffffa056ba13>] ? cfs_alloc+0x63/0x90 [libcfs]
            [<ffffffffa061a02b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0625d7b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0620585>] class_manual_cleanup+0x165/0x790 [obdclass]
            [<ffffffffa0626aa8>] ? lustre_end_log+0x1c8/0x490 [obdclass]
            [<ffffffff81041f25>] ? fair_select_idle_sibling+0x95/0x150
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa062b1f0>] server_put_super+0xa50/0xf60 [obdclass]
            [<ffffffff811782bc>] ? dispose_list+0x11c/0x140
            [<ffffffff81178748>] ? invalidate_inodes+0x158/0x1a0
            [<ffffffff8115feab>] generic_shutdown_super+0x5b/0x110
            [<ffffffff8115ffc6>] kill_anon_super+0x16/0x60
            [<ffffffffa0622306>] lustre_kill_super+0x36/0x60 [obdclass]
            [<ffffffff81161060>] deactivate_super+0x70/0x90
            [<ffffffff8117c76f>] mntput_no_expire+0xbf/0x110
            [<ffffffff8117cba8>] sys_umount+0x78/0x3c0
            [<ffffffff81003172>] system_call_fastpath+0x16/0x1b

            Kernel panic - not syncing: LBUG
            Pid: 783, comm: umount Not tainted 2.6.32-131.17.1.bl6.Bull.27.0.x86_64 #1
            Call Trace:
            [<ffffffff8147d803>] ? panic+0x78/0x143
            [<ffffffffa056aeeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
            [<ffffffffa073bc7e>] ? llog_recov_thread_stop+0x1ae/0x1b0 [ptlrpc]
            [<ffffffffa073bcb7>] ? llog_recov_thread_fini+0x37/0x220 [ptlrpc]
            [<ffffffff8108009e>] ? down+0x2e/0x50
            [<ffffffffa09db5b8>] ? filter_llog_finish+0x148/0x2f0 [obdfilter]
            [<ffffffffa05fcc85>] ? obd_zombie_impexp_notify+0x15/0x20 [obdclass]
            Lustre: OST work-OST006e has stopped.
            Lustre: work-OST0069: shutting down for failover; client state will be preserved
            .
            [<ffffffff8107a5cb>] ? remove_wait_queue+0x5b/0x70
            [<ffffffffa05f4ec8>] ? obd_llog_finish+0x98/0x230 [obdclass]
            [<ffffffff8104c780>] ? default_wake_function+0x0/0x20
            [<ffffffffa09d99d1>] ? filter_precleanup+0x131/0x670 [obdfilter]
            [<ffffffffa0605316>] ? class_disconnect_exports+0x116/0x2b0 [obdclass]
            [<ffffffffa05fe553>] ? dump_exports+0x133/0x1a0 [obdclass]
            [<ffffffffa061d7e5>] ? class_cleanup+0x1b5/0xe20 [obdclass]
            [<ffffffffa06d4c32>] ? mgc_process_config+0x232/0xed0 [mgc]
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa061f643>] ? class_process_config+0x11f3/0x1fd0 [obdclass]
            [<ffffffffa056ba13>] ? cfs_alloc+0x63/0x90 [libcfs]
            [<ffffffffa061a02b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0625d7b>] ? lustre_cfg_new+0x33b/0x880 [obdclass]
            [<ffffffffa0620585>] ? class_manual_cleanup+0x165/0x790 [obdclass]
            [<ffffffffa0626aa8>] ? lustre_end_log+0x1c8/0x490 [obdclass]
            [<ffffffff81041f25>] ? fair_select_idle_sibling+0x95/0x150
            [<ffffffffa05fe0a6>] ? class_name2dev+0x56/0xd0 [obdclass]
            [<ffffffffa062b1f0>] ? server_put_super+0xa50/0xf60 [obdclass]
            [<ffffffff811782bc>] ? dispose_list+0x11c/0x140
            [<ffffffff81178748>] ? invalidate_inodes+0x158/0x1a0
            [<ffffffff8115feab>] ? generic_shutdown_super+0x5b/0x110
            [<ffffffff8115ffc6>] ? kill_anon_super+0x16/0x60
            [<ffffffffa0622306>] ? lustre_kill_super+0x36/0x60 [obdclass]
            [<ffffffff81161060>] ? deactivate_super+0x70/0x90
            [<ffffffff8117c76f>] ? mntput_no_expire+0xbf/0x110
            [<ffffffff8117cba8>] ? sys_umount+0x78/0x3c0
            [<ffffffff81003172>] ? system_call_fastpath+0x16/0x1b
            ===============================================================================

            It definitely looks like this issue.

            aboyko Alexander Boyko added a comment -

            Review request: http://review.whamcloud.com/2789

            aboyko Alexander Boyko added a comment -

            The probable root cause is the window between llog_sync(ctxt, NULL) and class_import_put() in filter_llog_finish():

                            llog_sync(ctxt, NULL);                                         
                            
                            /*
                             * Balance class_import_get() in llog_receptor_accept().       
                             * This is safe to do, as llog is already synchronized         
                             * and its import may go.
                             */                                                            
                            cfs_mutex_down(&ctxt->loc_sem);                                
                            if (ctxt->loc_imp) {
                                    class_import_put(ctxt->loc_imp);
                                    ctxt->loc_imp = NULL;                                  
                            }
                            cfs_mutex_up(&ctxt->loc_sem);
                            llog_ctxt_put(ctxt);
            

            If llog_obd_repl_cancel() happens in this window, a new llcd will be cached on the ctxt, and the LBUG is then hit in llog_recov_thread_stop().
            I am working on a fix for this issue.
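
            To make the window concrete, here is a minimal standalone C sketch (userspace, pthreads) of the race described above. This is not Lustre code: ctxt_model, cancel_path(), cleanup_path() and the stopping flag are hypothetical names invented only to illustrate a cancel racing against the sync/import-drop sequence, and one possible way such a window could be closed (refusing to cache new llcds once shutdown has started).

            /*
             * Hypothetical model of the race window, NOT Lustre source.
             * Build with: gcc -pthread race_sketch.c -o race_sketch
             */
            #include <pthread.h>
            #include <stdbool.h>
            #include <stdio.h>
            #include <unistd.h>

            struct ctxt_model {                  /* stand-in for a llog_ctxt */
                    pthread_mutex_t loc_sem;     /* stand-in for ctxt->loc_sem */
                    bool            has_import;  /* stand-in for ctxt->loc_imp != NULL */
                    bool            stopping;    /* hypothetical flag a fix could add */
                    int             busy_llcds;  /* what llog_recov_thread_stop() asserts on */
            };

            static struct ctxt_model ctxt = {
                    .loc_sem    = PTHREAD_MUTEX_INITIALIZER,
                    .has_import = true,
            };

            /* Models llog_obd_repl_cancel() caching an llcd while the import is alive. */
            static void *cancel_path(void *arg)
            {
                    (void)arg;
                    pthread_mutex_lock(&ctxt.loc_sem);
                    if (ctxt.has_import && !ctxt.stopping)
                            ctxt.busy_llcds++;   /* a new llcd gets cached on the ctxt */
                    pthread_mutex_unlock(&ctxt.loc_sem);
                    return NULL;
            }

            /* Models the filter_llog_finish() sequence: sync, then drop the import. */
            static void cleanup_path(bool with_fix)
            {
                    if (with_fix) {
                            /* Fix sketch: close the window before syncing, so the
                             * cancel path refuses to cache new llcds from here on. */
                            pthread_mutex_lock(&ctxt.loc_sem);
                            ctxt.stopping = true;
                            pthread_mutex_unlock(&ctxt.loc_sem);
                    }

                    /* "llog_sync(ctxt, NULL)": flush whatever is cached right now. */
                    pthread_mutex_lock(&ctxt.loc_sem);
                    ctxt.busy_llcds = 0;
                    pthread_mutex_unlock(&ctxt.loc_sem);

                    usleep(1000);   /* <-- the window: cancel_path() may run here */

                    /* "class_import_put(); ctxt->loc_imp = NULL" */
                    pthread_mutex_lock(&ctxt.loc_sem);
                    ctxt.has_import = false;
                    pthread_mutex_unlock(&ctxt.loc_sem);

                    /* "llog_recov_thread_stop()": assert nothing is left cached. */
                    pthread_mutex_lock(&ctxt.loc_sem);
                    if (ctxt.busy_llcds != 0)
                            printf("LBUG: busy llcds found (%d)\n", ctxt.busy_llcds);
                    else
                            printf("clean stop\n");
                    pthread_mutex_unlock(&ctxt.loc_sem);
            }

            int main(void)
            {
                    pthread_t t;

                    pthread_create(&t, NULL, cancel_path, NULL);
                    cleanup_path(false);         /* pass true to model the closed window */
                    pthread_join(t, NULL);
                    return 0;
            }

            In this toy model the unpatched run can report the same "Busy llcds found" condition, while setting the stopping flag before the sync makes the cancel path back off; whether the real patch takes that approach or a different one is determined by the Gerrit review linked above.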


            shadow Alexey Lyashkov added a comment -

            These are logs from a different hit, but I think we have the same root cause.

            shadow Alexey Lyashkov added a comment -

            1329761744817699,6366,4,0,0,"snxs2n003","Lustre","kernel","[269602.953789] Lustre: 15857:0:(import.c:526:import_select_connection()) Skipped 8 previous similar messages"
            1329761744817727,6368,4,0,0,"snxs2n003","Lustre","kernel","[269658.938516] Lustre: 15856:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1394071067253911 sent from MGC10.10.100.5@o2ib to NID 10.10.100.5@o2ib has timed out for slow reply: [sent 1329759091] [real_sent 1329759091] [current 1329759147] [deadline 56s] [delay 0s] req@ffff880364381000 x1394071067253911/t0(0) o-1->MGS@MGC10.10.100.5@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1329759147 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1"
            1329761744817743,6370,4,0,0,"snxs2n003","Lustre","kernel","[269658.977157] Lustre: 15856:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 8 previous similar messages"
            1329761747023362,6378,4,0,0,"snxs2n002","Lustre","kernel","[272256.552763] Lustre: server umount snxs2-OST0003 complete"
            1329761747381152,6383,4,0,0,"snxs2n002","Lustre","kernel","[272256.913120] Lustre: Failing over snxs2-OST0000"
            1329761747414521,6385,4,0,0,"snxs2n003","Lustre","kernel","[272257.995434] Lustre: Failing over snxs2-OST0007"
            1329761748292808,6387,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.830463] Lustre: 3868:0:(import.c:526:import_select_connection()) snxs2-OST000e-osc-MDT0000: tried all connections, increasing latency to 11s"
            1329761748292852,6389,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.845109] Lustre: 3868:0:(import.c:526:import_select_connection()) Skipped 2 previous similar messages"
            1329761748292981,6391,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.853216] Lustre: snxs2-OST0000-osc-MDT0000: Connection to service snxs2-OST0000 via nid 10.10.100.1@o2ib was lost; in progress operations using this service will wait for recovery to complete."
            1329761748368977,6393,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.895529] Lustre: snxs2-OST0003-osc-MDT0000: Connection to service snxs2-OST0003 via nid 10.10.100.1@o2ib was lost; in progress operations using this service will wait for recovery to complete."
            1329761748369012,6395,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.896566] Lustre: 3867:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1394349653885019 sent from snxs2-OST000e-osc-MDT0000 to NID 10.10.100.4@o2ib has failed due to network error: [sent 1329761748] [real_sent 1329761748] [current 1329761748] [deadline 16s] [delay -16s] req@ffff88085bed0000 x1394349653885019/t0(0) o-1->snxs2-OST000e_UUID@10.10.100.4@o2ib:28/4 lens 368/512 e 0 to 1 dl 1329761764 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1"
            1329761748369033,6397,4,0,0,"snxs2n001","Lustre","kernel","[ 6339.896574] Lustre: 3867:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 3 previous similar messages"
            1329761748793789,6399,4,0,0,"snxs2n003","Lustre","kernel","[272259.407949] Lustre: snxs2-OST0007: shutting down for failover; client state will be preserved."
            1329761763772702,6400,3,0,0,"snxs2n003","LustreError","kernel","[272274.395479] LustreError: 4998:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff880419da7800"
            1329761748844011,6402,6,0,0,"snxs2n003","Lustre","kernel","[272259.437366] Lustre: OST snxs2-OST0007 has stopped."
            1329761763772748,6403,0,0,0,"snxs2n003","LustreError","kernel","[272274.406830] LustreError: 4998:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG"


            People

              Assignee: wc-triage WC Triage
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 8
