Details
Bug
Resolution: Fixed
Major
Lustre 2.0.0
None
x86_64, RHEL6
3
24,420
5040
Description
As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the fix landed in 2.1.
I am simply copying here the initial description from Bugzilla 24420:
We hit this bug when we reboot some OSSs. It is raised during the recovery phase and causes a
long Lustre service interruption.
At each crash, the stack trace of the panicking thread looked like the following:
=========================================================================
#0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
#1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
#2 [ffff881021fd1368] panic at ffffffff8145210d
#3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
#4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
#5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
#6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
#7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
#8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
#9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
#10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
#11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
#12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
#13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
#14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
=========================================================================
In one analysis from our on-site support, we got the following values when the LBUG was
raised in the "filter_finish_transno" function:
lcd_last_transno=0x4ddebb
oti_transno=last_rcvd=0x4ddeba
lsd_last_transno=0x4de0ee
So the transaction number recorded for the client (lcd_last_transno) is bad: the actual
transaction number (oti_transno) is lower than the client's one, which, according to the ASSERT, must not happen.
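The comparison behind these values can be sketched as follows. This is only an illustration in Python, not Lustre code: it assumes the ASSERT has the form "last_rcvd >= lcd_last_transno", which is inferred from the description above rather than quoted from filter_finish_transno, and the helper name transno_is_consistent is hypothetical.

```python
# Values reported in this ticket at the time of the LBUG.
LCD_LAST_TRANSNO = 0x4DDEBB  # last transno recorded for the client
OTI_TRANSNO = 0x4DDEBA       # transno of the request (oti_transno = last_rcvd)
LSD_LAST_TRANSNO = 0x4DE0EE  # last transno recorded for the server


def transno_is_consistent(last_rcvd: int, lcd_last_transno: int) -> bool:
    """Assumed invariant (hypothetical helper, not a Lustre function):
    the transno being committed must not be lower than the last transno
    already recorded for that client."""
    return last_rcvd >= lcd_last_transno


if __name__ == "__main__":
    ok = transno_is_consistent(OTI_TRANSNO, LCD_LAST_TRANSNO)
    print("invariant holds:", ok)
    # How far the request's transno is behind the client record.
    print("client record ahead by:", LCD_LAST_TRANSNO - OTI_TRANSNO)
    # How far the request's transno is behind the server record.
    print("server record ahead by:", LSD_LAST_TRANSNO - OTI_TRANSNO)
```

With the reported values the check fails because the client record is one transaction ahead of the request's transno, while the server record is further ahead still.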
I can see there is a similar bug (bz23296), but I don't think it is related to this one:
in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is used
only for tests, not for production as in our case.
Does this sound like a known bug to you? As a workaround, what would be the
consequences of disabling this LBUG? I think we would lose some data on a client, but I
don't know whether there is any other important consequence.
I am also attaching here the patch from Bugzilla 24420 that has already landed in 1.8.6.
Thanks,
Sebastien.
The reason may be something else, just the 2.0.0 code itself: some bug that triggers this assertion. The assert can be removed from the patch; in that case there will only be client evictions, plus more debug info about why this is happening.