Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.0.0
- None
- x86_64, RHEL6
- 3
- 24,420
- 5040
Description
As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the fix landed in 2.1.
Below is the initial description, copied from bugzilla 24420:
We hit this bug when we reboot some OSSs. It is raised during the recovery phase and causes a
long Lustre service interruption.
On each crash, the stack trace of the panicking thread looked like the following:
=========================================================================
#0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
#1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
#2 [ffff881021fd1368] panic at ffffffff8145210d
#3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
#4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
#5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
#6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
#7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
#8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
#9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
#10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
#11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
#12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
#13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
#14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
=========================================================================
In one analysis by our on-site support, we got the following values when the LBUG was
raised in the "filter_finish_transno" function:
lcd_last_transno=0x4ddebb
oti_transno=last_rcvd=0x4ddeba
lsd_last_transno=0x4de0ee
So the client's last transaction number (lcd_last_transno) looks wrong: the actual transaction
number (oti_transno) is lower than the client's, which, according to the ASSERT, is bad.
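To make the relationship between the three values concrete, here is a minimal Python sketch that only does the arithmetic on the numbers quoted above. It is not the Lustre code: the helper name transno_is_monotonic is hypothetical, and the exact condition checked by the ASSERT (assumed here to be "new transno >= client's last transno") is an assumption based on the description in this ticket. The comments describing each field are likewise my reading of the names.

```python
# Values reported at the time of the LBUG (taken from this ticket).
lcd_last_transno = 0x4ddebb  # last transno recorded for the client (assumed meaning)
oti_transno = 0x4ddeba       # transno of the operation being committed (= last_rcvd)
lsd_last_transno = 0x4de0ee  # last transno recorded server-wide (assumed meaning)


def transno_is_monotonic(new_transno, client_last_transno):
    """Hypothetical stand-in for the asserted condition: a client's
    transaction number is assumed to never go backwards."""
    return new_transno >= client_last_transno


# How far the new transno lags behind the client's and the server's records.
gap_to_client = lcd_last_transno - oti_transno
gap_to_server = lsd_last_transno - oti_transno

print("check holds:", transno_is_monotonic(oti_transno, lcd_last_transno))
print("behind client's last transno by:", gap_to_client)
print("behind server's last transno by:", gap_to_server)
```

With these values the new transaction number is exactly one behind the client's recorded value and several hundred behind the server-wide value, so under the assumed condition the check fails.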
I can see there is a similar bug (bz23296), but I don't think it is related to this one:
in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is used
only for tests, not in production as in our case.
Does this sound like a known bug to you? As a workaround, what would be the
consequences of disabling this LBUG? I think we would lose some data on a client, but I
don't know if there is any other important consequence.
I am also attaching the patch from bugzilla 24420, which has already landed in 1.8.6.
Thanks,
Sebastien.