Details
Bug
Resolution: Fixed
Major
Lustre 2.0.0
None
x86_64, RHEL6
3
24,420
5040
Description
As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the fix landed in 2.1.
Below is the initial description, copied from bugzilla 24420:
We hit this bug when we reboot some OSSs. It is raised during the recovery phase and causes a
long Lustre service interruption.
At each crash, the stack trace of the panicking thread looks like the following:
=========================================================================
#0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
#1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
#2 [ffff881021fd1368] panic at ffffffff8145210d
#3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
#4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
#5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
#6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
#7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
#8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
#9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
#10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
#11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
#12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
#13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
#14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
=========================================================================
In one analysis from our on-site support, we got the following values when the LBUG was
raised in the "filter_finish_transno" function:
lcd_last_transno=0x4ddebb
oti_transno=last_rcvd=0x4ddeba
lsd_last_transno=0x4de0ee
So the client's recorded transaction number (lcd_last_transno) is bad: the actual transaction
number (oti_transno) is lower than the client's one, which, according to the ASSERT, is invalid.
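To make the comparison concrete, here is a minimal sketch in C that plugs the values above into the kind of check the ASSERT appears to enforce. The struct, the function names and the exact form of the invariant (request transno >= client's last transno) are assumptions made for illustration; this is not the code of filter_finish_transno().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three values read from the crash dump. */
struct transno_sample {
    uint64_t lcd_last_transno; /* last transno recorded for the client */
    uint64_t oti_transno;      /* transno of the request being processed */
    uint64_t lsd_last_transno; /* last transno recorded on the server side */
};

/* Assumed invariant: the request's transno must not be lower than the
 * last transno already recorded for that client. */
static bool transno_invariant_holds(const struct transno_sample *s)
{
    return s->oti_transno >= s->lcd_last_transno;
}

/* The values reported in the analysis above. */
static const struct transno_sample reported_sample = {
    .lcd_last_transno = 0x4ddebb,
    .oti_transno      = 0x4ddeba,
    .lsd_last_transno = 0x4de0ee,
};

/* Shows that the reported values violate the assumed invariant. */
static void check_reported_sample(void)
{
    assert(!transno_invariant_holds(&reported_sample));
}
```

With the reported values, oti_transno (0x4ddeba) is exactly one less than lcd_last_transno (0x4ddebb), so the assumed invariant does not hold, which matches the LBUG being raised.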
I can see there is a similar bug (bz23296), but I don't think it is related to this one: in
bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is used
only for tests, not in production as in our case.
Does this sound like a known bug to you? As a workaround, what would be the consequences of
disabling this LBUG? I think we would lose some data on a client, but I don't know whether
there is any other important consequence.
I am also attaching the patch from bugzilla 24420, which has already landed in 1.8.6.
Thanks,
Sebastien.
Attachments
Activity
Resolution | New: Fixed [ 1 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Priority | Original: Blocker [ 1 ] | New: Major [ 3 ] |
Priority | Original: Major [ 3 ] | New: Blocker [ 1 ] |
Priority | Original: Blocker [ 1 ] | New: Major [ 3 ] |
Comment |
[ Hi, Tappro
The following check in target_queue_recovery_request():
    if (transno < obd->obd_next_recovery_transno) {
            /* Processing the queue right now, don't re-add. */
            LASSERT(cfs_list_empty(&req->rq_list));
            cfs_spin_unlock(&obd->obd_recovery_task_lock);
            RETURN(1);
    }
not only allows open requests to pass, it also allows resent replay requests to be processed immediately, and such resent replay requests could trigger this LBUG.
There are two kinds of resent replays:
1) The replay request (transno A) times out (reply lost): in this case, if A wasn't committed on the server, the client will reconnect and start replay from transno A; otherwise, the client will reconnect and start replay from the next transno (greater than A). No matter which transno the re-replay starts from, the timed-out request in the sending list has to be replayed again after all replays and lock replays are done. At that time, transno A is quite possibly less than the current lcd_last_transno, so the LBUG could be triggered.
2) The replay request (transno A) gets an error reply: in this case, the client will reconnect and start replay from transno A. If A was already committed on the server, then we'll get (last_rcvd == lcd_last_transno) in filter_finish_transno(). Nothing is resent from the sending list for this kind of resent replay, so I think it shouldn't trigger the LBUG.
If there is anything wrong, please correct me. ] |
Fix Version/s | New: Lustre 2.1.0 [ 10021 ] |
Priority | Original: Critical [ 2 ] | New: Blocker [ 1 ] |
Assignee | Original: Robert Read [ rread ] | New: Niu Yawei [ niu ] |
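The two resent-replay cases described in the comment above can be modeled with a small sketch in C. Everything here is a simplified, hypothetical model: the struct, the function names and the sample transnos are illustrative assumptions, and only the comparison `transno < next_recovery_transno` is taken from the check quoted in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified recovery state; not the Lustre structures. */
struct recovery_model {
    uint64_t next_recovery_transno; /* next transno the replay queue expects */
    uint64_t lcd_last_transno;      /* last transno recorded for the client */
};

/* Mirrors the quoted check: a request whose transno is below the next
 * expected recovery transno is processed immediately, not queued. */
static bool bypasses_recovery_queue(const struct recovery_model *m,
                                    uint64_t transno)
{
    return transno < m->next_recovery_transno;
}

/* Assumed failure condition: the request is processed immediately and
 * its transno is lower than the client's last recorded transno. */
static bool would_trip_assertion(const struct recovery_model *m,
                                 uint64_t transno)
{
    return bypasses_recovery_queue(m, transno) &&
           transno < m->lcd_last_transno;
}
```

For example, in case 1, if the timed-out request carries transno 100 and replay has since advanced so that lcd_last_transno is 104 and the next expected transno is 105, then would_trip_assertion() returns true. In case 2, where the resent transno equals lcd_last_transno, it returns false.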