[LU-3892] osp_sync.c:356:osp_sync_interpret()) ASSERTION( req->rq_transno == 0 ) failed Created: 06/Sep/13  Updated: 10/Oct/14  Resolved: 24/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5629 osp_sync_interpret() ASSERTION( rc ||... Resolved
Rank (Obsolete): 10156

 Description   

I recently started hitting this while running sanity in a loop. It occurs in different tests, with the same crash every time:

<0>[20903.330989] LustreError: 9397:0:(osp_sync.c:356:osp_sync_interpret()) ASSERTION( req->rq_transno == 0 ) failed: 
<0>[20903.331969] LustreError: 9397:0:(osp_sync.c:356:osp_sync_interpret()) LBUG
<4>[20903.332470] Pid: 9397, comm: ptlrpcd_2
<4>[20903.332898] 
<4>[20903.332899] Call Trace:
<4>[20903.333610]  [<ffffffffa0ac78a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[20903.334192]  [<ffffffffa0ac7ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[20903.334705]  [<ffffffffa07f1092>] osp_sync_interpret+0x492/0x500 [osp]
<4>[20903.335252]  [<ffffffffa127b2aa>] ptlrpc_check_set+0x2ca/0x1da0 [ptlrpc]
<4>[20903.335824]  [<ffffffffa12a700b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
<4>[20903.336580]  [<ffffffffa12a7553>] ptlrpcd+0x233/0x390 [ptlrpc]
<4>[20903.337121]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
<4>[20903.337649]  [<ffffffffa12a7320>] ? ptlrpcd+0x0/0x390 [ptlrpc]
<4>[20903.338147]  [<ffffffff81094606>] kthread+0x96/0xa0
<4>[20903.343428]  [<ffffffff8100c10a>] child_rip+0xa/0x20
<4>[20903.343939]  [<ffffffff81094570>] ? kthread+0x0/0xa0
<4>[20903.344452]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
<4>[20903.345028] 
<0>[20903.348400] Kernel panic - not syncing: LBUG

Crash and modules: /exports/crashdumps/192.168.10.219-2013-09-05-20\:55\:33/
Other crashes like this: /exports/crashdumps/192.168.10.224-2013-09-05-19:19:15 /exports/crashdumps/192.168.10.219-2013-09-05-15\:06\:22
Source tag in my tree: master-20130905



 Comments   
Comment by Alex Zhuravlev [ 16/Sep/13 ]

http://review.whamcloud.com/7664

Comment by Alex Zhuravlev [ 16/Sep/13 ]

Hopefully a better approach: http://review.whamcloud.com/#/c/7672/

Comment by Alex Zhuravlev [ 18/Sep/13 ]

I was able to reproduce the issue locally. The last patch should fix the root cause.

Comment by Peter Jones [ 24/Sep/13 ]

Landed for 2.5.0

Comment by Lukasz Flis [ 25/Sep/13 ]

Our MDS server panicked today due to this bug.
The problem is also present in 2.4.1; please remember to cherry-pick
the patch for the next 2.4 release.

Regards

Lukasz Flis
ACC Cyfronet

Comment by Lukasz Flis [ 25/Sep/13 ]

Peter, should I report this bug in a new ticket that points to 2.4 explicitly?

Sep 25 20:56:14 <user.notice> mds01.storage 3450:0:(osp_sync.c:359:osp_sync_interpret()) ASSERTION( req->rq_transno == 0 ) failed:
Sep 25 20:56:14 <user.notice> mds01.storage 3450:0:(osp_sync.c:359:osp_sync_interpret()) LBUG

Sep 25 20:56:14 <user.notice> mds01.storage Kernel[]: panic - not syncing: LBUG
Sep 25 20:56:14 <user.notice> mds01.storage Pid[]: 3450, comm: ptlrpcd_4 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
Sep 25 20:56:14 <user.notice> mds01.storage Call[]: Trace:
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffff8150de58>]: ? panic+0xa7/0x16f
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa052beeb>]: ? lbug_with_loc+0x9b/0xb0 [libcfs]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa0fe56b3>]: ? osp_sync_interpret+0x4a3/0x510 [osp]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa07f5edc>]: ? ptlrpc_check_set+0x2ac/0x1b20 [ptlrpc]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa082369b>]: ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa0823bc3>]: ? ptlrpcd+0x233/0x390 [ptlrpc]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffff81063410>]: ? default_wake_function+0x0/0x20
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa0823990>]: ? ptlrpcd+0x0/0x390 [ptlrpc]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffff8100c0ca>]: ? child_rip+0xa/0x20
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa0823990>]: ? ptlrpcd+0x0/0x390 [ptlrpc]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffffa0823990>]: ? ptlrpcd+0x0/0x390 [ptlrpc]
Sep 25 20:56:14 <user.notice> mds01.storage [<ffffffff8100c0c0>]: ? child_rip+0x0/0x20

It's exactly the same issue.

Lukasz Flis

Comment by Peter Jones [ 25/Sep/13 ]

Hi Lukasz

It's ok. This issue is under consideration for 2.4.2. There is no need to open a new ticket.

Peter

Comment by Patrick Farrell (Inactive) [ 08/Oct/14 ]

Was this patched in master as well? Cray saw this issue in 2.6 (LU-5193), so it's presumably in master.
[Edit]

Sorry, please forget that comment. For some reason I thought the original patch was against 2.5. Not sure what I was thinking here.
