[LU-1194] llog_recov_thread_stop+0x1ae/0x1b0 asserting Created: 07/Mar/12 Updated: 14/Jan/13 Resolved: 25/Jul/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexey Lyashkov | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6 + 2.1.0 (xyratex 1.0). |
||
| Severity: | 3 |
| Rank (Obsolete): | 4581 |
| Description |
|
1329761763787093,7345,4,0,0,"snxs2n003","","kernel","[272274.418338] Call Trace:" |
| Comments |
| Comment by Andreas Dilger [ 12/Mar/12 ] |
|
Hi Shadow, unfortunately there isn't enough information in this bug to investigate what is failing here. What kind of load was being run before hitting this problem? What did the CERROR() before the LBUG() report? |
| Comment by Alexey Lyashkov [ 12/Mar/12 ] |
|
1329761744817699,6366,4,0,0,"snxs2n003","Lustre","kernel","[269602.953789] Lustre: 15857:0:(import.c:526:import_select_connection()) Skipped 8 previous similar messages" |
| Comment by Alexey Lyashkov [ 12/Mar/12 ] |
|
Those are logs from a different hit, but I think we have the same root cause. |
| Comment by Alexander Boyko [ 12/Apr/12 ] |
|
Probably the root cause is a window between llog_sync(ctxt, NULL) and class_import_put() in filter_llog_finish():

llog_sync(ctxt, NULL);
/*
 * Balance class_import_get() in llog_receptor_accept().
 * This is safe to do, as llog is already synchronized
 * and its import may go.
 */
cfs_mutex_down(&ctxt->loc_sem);
if (ctxt->loc_imp) {
        class_import_put(ctxt->loc_imp);
        ctxt->loc_imp = NULL;
}
cfs_mutex_up(&ctxt->loc_sem);
llog_ctxt_put(ctxt);

If llog_obd_repl_cancel() happens in this window, a new llcd will be cached on the ctxt, and the LBUG then happens in llog_recov_thread_stop().
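For illustration, here is a self-contained userspace analogue of that race (a minimal sketch, not Lustre code: the ctxt/loc_imp/cached_llcds names and both thread bodies are stand-ins for the kernel structures, and the usleep() calls only make the bad interleaving likely, not deterministic):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Stand-in for llog_ctxt: an import pointer and the llcds parked on
 * it, both protected by loc_sem. */
struct ctxt {
        pthread_mutex_t loc_sem;
        void           *loc_imp;       /* stands in for the obd_import */
        int             cached_llcds;  /* llcds cached on this ctxt */
};

static struct ctxt ctx = {
        .loc_sem      = PTHREAD_MUTEX_INITIALIZER,
        .loc_imp      = (void *)1,     /* pretend we hold an import ref */
        .cached_llcds = 0,
};

/* Analogue of llog_obd_repl_cancel(): if the import is still present,
 * park a new llcd on the context. */
static void *cancel_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&ctx.loc_sem);
        if (ctx.loc_imp != NULL)
                ctx.cached_llcds++;             /* new llcd cached */
        pthread_mutex_unlock(&ctx.loc_sem);
        return NULL;
}

/* Analogue of filter_llog_finish(): flush first, then drop the import
 * under loc_sem.  The race window lies between those two steps. */
static void *finish_thread(void *arg)
{
        (void)arg;
        /* llog_sync(ctxt, NULL): flush whatever is cached *now*. */
        pthread_mutex_lock(&ctx.loc_sem);
        ctx.cached_llcds = 0;
        pthread_mutex_unlock(&ctx.loc_sem);

        usleep(1000);                           /* <-- the window */

        /* class_import_put() under loc_sem, as in the code above. */
        pthread_mutex_lock(&ctx.loc_sem);
        ctx.loc_imp = NULL;
        pthread_mutex_unlock(&ctx.loc_sem);
        return NULL;
}

int main(void)
{
        pthread_t f, c;

        pthread_create(&f, NULL, finish_thread, NULL);
        usleep(500);            /* land the cancel inside the window */
        pthread_create(&c, NULL, cancel_thread, NULL);
        pthread_join(f, NULL);
        pthread_join(c, NULL);

        /* Analogue of llog_recov_thread_stop(): no llcd may survive. */
        if (ctx.cached_llcds != 0) {
                fprintf(stderr, "Busy llcds found (%d) -- LBUG\n",
                        ctx.cached_llcds);
                abort();
        }
        return 0;
}

When the cancel lands inside the window it finds loc_imp still set and caches an llcd that nobody will ever flush, so the shutdown-time check trips, matching the assertion in llog_recov_thread_stop(). |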
| Comment by Alexander Boyko [ 15/May/12 ] |
|
Patch request: http://review.whamcloud.com/2789 |
| Comment by Sebastien Buisson (Inactive) [ 30/May/12 ] |
|
Hi, on the IFERC cluster, while trying to stop all OSSes with some clients still up, they got a bunch of the same LBUG in llog_recov_thread_stop() on multiple OSSes:

Call Trace:
Kernel panic - not syncing: LBUG

It definitely looks like this issue. |
| Comment by Peter Jones [ 25/Jul/12 ] |
|
Landed for 2.3 |
| Comment by Bob Glossman (Inactive) [ 26/Jul/12 ] |
|
Back port to b2_1: http://review.whamcloud.com/#change,3480 |
| Comment by Nathan Rutman [ 21/Nov/12 ] |
|
Xyratex-bug-id: MRP-456 |
| Comment by Alexandre Louvet [ 14/Dec/12 ] |
|
The fix doesn't look sufficient. |
| Comment by John Hammond [ 18/Dec/12 ] |
|
We are seeing this on 2.1.3 when we unmount OSTs.

crash> bt
PID: 58311 TASK: ffff880285dbe040 CPU: 4 COMMAND: "umount"
#0 [ffff880364a158c0] machine_kexec at ffffffff8103281b
#1 [ffff880364a15920] crash_kexec at ffffffff810ba8e2
#2 [ffff880364a159f0] panic at ffffffff814fd6d1
#3 [ffff880364a15a70] lbug_with_loc at ffffffffa05bae5b [libcfs]
#4 [ffff880364a15a90] llog_recov_thread_stop at ffffffffa0c23afb [ptlrpc]
#5 [ffff880364a15ac0] llog_recov_thread_fini at ffffffffa0c23b33 [ptlrpc]
#6 [ffff880364a15ae0] filter_llog_finish at ffffffffa06c132d [obdfilter]
#7 [ffff880364a15b10] obd_llog_finish at ffffffffa0ac22d8 [obdclass]
#8 [ffff880364a15b30] filter_precleanup at ffffffffa06c0d1f [obdfilter]
#9 [ffff880364a15b90] class_cleanup at ffffffffa0ae9997 [obdclass]
#10 [ffff880364a15c10] class_process_config at ffffffffa0aebcdb [obdclass]
#11 [ffff880364a15ca0] class_manual_cleanup at ffffffffa0aeca19 [obdclass]
#12 [ffff880364a15d60] server_put_super at ffffffffa0af88bc [obdclass]
#13 [ffff880364a15e30] generic_shutdown_super at ffffffff8117d49b
#14 [ffff880364a15e50] kill_anon_super at ffffffff8117d586
#15 [ffff880364a15e70] lustre_kill_super at ffffffffa0aee656 [obdclass]
#16 [ffff880364a15e90] deactivate_super at ffffffff8117e600
#17 [ffff880364a15eb0] mntput_no_expire at ffffffff8119a64f
#18 [ffff880364a15ee0] sys_umount at ffffffff8119b0eb
#19 [ffff880364a15f80] system_call_fastpath at ffffffff8100b0f2
RIP: 00007fc434e45717 RSP: 00007fff3fb87248 RFLAGS: 00010206
RAX: 00000000000000a6 RBX: ffffffff8100b0f2 RCX: 0000000000000010
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007fc436843b60
RBP: 00007fc436843b40 R8: 00007fc436843b80 R9: 0000000000000001
R10: 00007fff3fb87070 R11: 0000000000000206 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00007fc436843bc0
ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b
The logs leading up to this were:

[1733729.589047] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1733729.600508] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.124@o2ib (151): c: 7, oc: 0, rc: 8
[1733730.586896] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1733730.597628] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 1 previous similar message
[1733730.608257] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.129@o2ib (151): c: 7, oc: 0, rc: 8
[1733730.621140] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 1 previous similar message
[1733733.580433] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1733733.591675] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
[1733733.602613] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.156@o2ib (155): c: 7, oc: 0, rc: 8
[1733733.615604] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 4 previous similar messages
[1733742.561072] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1733742.571750] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
[1733742.582543] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.127@o2ib (162): c: 7, oc: 0, rc: 8
[1733742.595489] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 4 previous similar messages
[1733971.067888] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1733971.080940] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
[1733971.093679] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.243@o2ib (150): c: 7, oc: 0, rc: 8
[1733971.109722] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 3 previous similar messages
[1733976.057048] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1733976.069835] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 8 previous similar messages
[1733976.082962] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.181@o2ib (153): c: 7, oc: 0, rc: 8
[1733976.097100] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 8 previous similar messages
[1733985.037612] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1733985.049931] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 168 previous similar messages
[1733985.062783] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.249@o2ib (162): c: 7, oc: 0, rc: 8
[1733985.078215] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 168 previous similar messages
[1735454.496518] LustreError: 48744:0:(sec.c:383:import_sec_validate_get()) import ffff88050b301800 (CLOSED) with no sec
[1735454.508690] LustreError: 48744:0:(recov_thread.c:394:llcd_push()) Couldn't send llcd ffff880131a16038 (-13)
[1735454.520599] LustreError: 48744:0:(filter.c:3062:filter_sync_llogs()) error flushing logs to MDS: rc -13
[1735604.539999] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1735604.550820] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 42 previous similar messages
[1735604.561625] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.107@o2ib (150): c: 7, oc: 0, rc: 8
[1735604.576129] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 42 previous similar messages
[1735608.531438] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[1735608.542183] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 12 previous similar messages
[1735608.553255] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.38@o2ib (152): c: 7, oc: 0, rc: 8
[1735608.566020] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 12 previous similar messages
[1735613.520678] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1735613.531552] LustreError: 48595:0:(o2iblnd_cb.c:2991:kiblnd_check_txs_locked()) Skipped 15 previous similar messages
[1735613.542474] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Timed out RDMA with 192.168.66.74@o2ib (159): c: 7, oc: 0, rc: 8
[1735613.557368] LustreError: 48595:0:(o2iblnd_cb.c:3054:kiblnd_check_conns()) Skipped 15 previous similar messages
[1740454.405061] Lustre: 48598:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1355869330/real 1355869330] req@ffff88086b1f9c00 x1421268042665725/t0(0) o400->MGC192.168.65.5@o2ib@192.168.65.5@o2ib:26/25 lens 192/192 e 0 to 1 dl 1355869337 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[1740454.436739] LustreError: 166-1: MGC192.168.65.5@o2ib: Connection to MGS (at 192.168.65.5@o2ib) was lost; in progress operations using this service will fail
[1740460.438928] Lustre: 48599:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1355869337/real 1355869337] req@ffff88086b1f9000 x1421268042665726/t0(0) o250->MGC192.168.65.5@o2ib@192.168.65.5@o2ib:26/25 lens 368/512 e 0 to 1 dl 1355869343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[1740468.526512] LustreError: 58311:0:(recov_thread.c:447:llog_recov_thread_stop()) Busy llcds found (1) on lcm ffff880699307000
[1740468.538720] LustreError: 58311:0:(recov_thread.c:467:llog_recov_thread_stop()) LBUG |
| Comment by Alexey Lyashkov [ 19/Dec/12 ] |
|
Could you put the crashdump somewhere? |
| Comment by Alexandre Louvet [ 02/Jan/13 ] |
|
I can't upload crash dumps; the site is classified. |
| Comment by Alexandre Louvet [ 14/Jan/13 ] |
|
As per Intel's request, I have opened a new issue (