[LU-128] OSSs frequent crashes due to LBUG/[ASSERTION(last_rcvd>=le64_to_cpu(lcd->lcd_last_transno)) failed] in recovery - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.1.0
Affects Version/s: Lustre 2.0.0
Labels:
None
Environment:
x86_64, RHEL6

Severity:
3
Bugzilla ID:
24,420
Rank (Obsolete):
5040

Description

As suggested by Peter Jones, we open a Jira ticket for this issue in order to get the fix landed in 2.1.
I simply copy here the initial description from bugzilla 24420:

We are having this bug when we reboot some OSSs. It's being raised in the recovery phase and it's
provoking a long Lustre service interruption.

Each time/crash, the panic'ing thread stack-trace looked like following :
=========================================================================
#0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
#1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
#2 [ffff881021fd1368] panic at ffffffff8145210d
#3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
#4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
#5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
#6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
#7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
#8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
#9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
#10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
#11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
#12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
#13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
#14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
=========================================================================

In a particular analysis from our on-site support we get the following values when the LBUG is
raised on "filter_finish_transno" function:

lcd_last_transno=0x4ddebb
oti_transno=last_rcvd=0x4ddeba
lsd_last_transno=0x4de0ee

So we have the client (lcd_last_transno) having a bad transaction number with the actual
transaction number being lower than client's one which, according the the ASSERT, is bad.

I could see there is a similar bug (bz23296) but I don't think this bug is related with this one,
as in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c which is used
only for tests, not for production as it's our case.

Does this sound as a known bug for you? In order to work-around this bug, what would be the
consequences of disabling this LBUG? I mean, I think we would loss some data on a client but I
don't know if there is any other important consequence.

I also attach here the patch from bugzilla 24420 that is already landed in 1.8.6.

Thanks,
Sebastien.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

24420-b18.patch
3 kB
15/Mar/11 1:49 AM

Activity

People

Assignee:: Niu Yawei (Inactive)

Reporter:: Sebastien Buisson (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 15/Mar/11 1:49 AM

Updated:: 25/Aug/11 11:38 PM

Resolved:: 14/Jun/11 8:06 AM