[LU-9135] sanity test_313: osp_sync.c:571:osp_sync_interpret()) LBUG Created: 16/Feb/17  Updated: 12/Mar/18  Resolved: 29/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Minor
Reporter: Bob Glossman (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-10430 LBUG osp_sync.c:578:osp_sync_interpret Open
is duplicated by LU-9130 sanity: test_313 timeout Resolved
Related
is related to LU-8411 Fix Lustre filesystem corruption when... Resolved
is related to LU-5629 osp_sync_interpret() ASSERTION( rc ||... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1d4a3042-f3cc-11e6-8862-5254006e85c2.

The sub-test test_313 failed with the following error:

test failed to respond and timed out

The following panic was seen in the console log for the MDS:

20:34:29:[ 6925.325814] LustreError: 6289:0:(osp_sync.c:571:osp_sync_interpret()) ASSERTION( req->rq_transno == 0 || req->rq_import_generation < imp->imp_generation ) failed: transno 21474848133, rc -5, gen: req 1, imp 1
20:34:29:[ 6925.334095] LustreError: 6289:0:(osp_sync.c:571:osp_sync_interpret()) LBUG
20:34:29:[ 6925.337147] Pid: 6289, comm: ptlrpcd_00_00
20:34:29:[ 6925.338786] 
20:34:29:[ 6925.338786] Call Trace:
20:34:29:[ 6925.341582]  [<ffffffffa06e77f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
20:34:29:[ 6925.343382]  [<ffffffffa06e7861>] lbug_with_loc+0x41/0xb0 [libcfs]
20:34:29:[ 6925.345155]  [<ffffffffa0fb99b3>] osp_sync_interpret+0x363/0x520 [osp]
20:34:29:[ 6925.347107]  [<ffffffffa0a490b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
20:34:29:[ 6925.348959]  [<ffffffffa0a4aabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
20:34:29:[ 6925.350719]  [<ffffffffa0a76b8b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
20:34:29:[ 6925.352418]  [<ffffffffa0a76f3b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
20:34:29:[ 6925.354016]  [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
20:34:29:[ 6925.355646]  [<ffffffffa0a76c80>] ? ptlrpcd+0x0/0x560 [ptlrpc]
20:34:29:[ 6925.357242]  [<ffffffff810b064f>] kthread+0xcf/0xe0
20:34:29:[ 6925.358743]  [<ffffffff810b0580>] ? kthread+0x0/0xe0
20:34:29:[ 6925.360250]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
20:34:29:[ 6925.361794]  [<ffffffff810b0580>] ? kthread+0x0/0xe0
20:34:29:[ 6925.363307] 
20:34:29:[ 6925.364557] Kernel panic - not syncing: LBUG
20:34:29:[ 6925.365550] CPU: 0 PID: 6289 Comm: ptlrpcd_00_00 Tainted: G           OE  ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
20:34:29:[ 6925.365550] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
20:34:29:[ 6925.365550]  ffffffffa0705d8c 00000000cf1e6ef2 ffff880077eb7bc8 ffffffff816863f8
20:34:29:[ 6925.365550]  ffff880077eb7c48 ffffffff8167f823 ffffffff00000008 ffff880077eb7c58
20:34:29:[ 6925.365550]  ffff880077eb7bf8 00000000cf1e6ef2 00000000cf1e6ef2 ffff88007fc0f838
20:34:29:[ 6925.365550] Call Trace:
20:34:29:[ 6925.365550]  [<ffffffff816863f8>] dump_stack+0x19/0x1b
20:34:29:[ 6925.365550]  [<ffffffff8167f823>] panic+0xe3/0x1f2
20:34:29:[ 6925.365550]  [<ffffffffa06e7879>] lbug_with_loc+0x59/0xb0 [libcfs]
20:34:29:[ 6925.365550]  [<ffffffffa0fb99b3>] osp_sync_interpret+0x363/0x520 [osp]
20:34:29:[ 6925.365550]  [<ffffffffa0a490b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
20:34:29:[ 6925.365550]  [<ffffffffa0a4aabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
20:34:29:[ 6925.365550]  [<ffffffffa0a76b8b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
20:34:29:[ 6925.365550]  [<ffffffffa0a76f3b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
20:34:29:[ 6925.365550]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
20:34:29:[ 6925.365550]  [<ffffffffa0a76c80>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
20:34:29:[ 6925.365550]  [<ffffffff810b064f>] kthread+0xcf/0xe0
20:34:29:[ 6925.365550]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
20:34:29:[ 6925.365550]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
20:34:29:[ 6925.365550]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140

Info required for matching: sanity 313
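
For reference, the assertion text in the panic message requires that a reply carrying a non-zero transno either succeeded or was sent under an older import generation than the import currently has; here the reply failed with rc -5 (-EIO) while the generations match (req 1, imp 1), so the check fires and the MDS LBUGs. The standalone C model below only illustrates that invariant: the structs, the rc != 0 guard, and main() are assumptions for the sketch, not Lustre source, and it aborts with the exact values from the log above.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for the request/import fields named in the
 * assertion message; these are not the real Lustre structures. */
struct fake_request {
	uint64_t rq_transno;           /* transaction number from the reply */
	int      rq_import_generation; /* import generation the RPC was sent under */
};

struct fake_import {
	int imp_generation;            /* current import generation */
};

/* Model of the invariant reported in the console log: an errored reply may
 * carry a transno only if the import has been re-established since the RPC
 * was sent.  (Applying the check only when rc != 0 is an assumption here.) */
static void check_reply(const struct fake_request *req,
			const struct fake_import *imp, int rc)
{
	if (rc != 0)
		assert(req->rq_transno == 0 ||
		       req->rq_import_generation < imp->imp_generation);

	printf("reply accepted: transno %llu, rc %d\n",
	       (unsigned long long)req->rq_transno, rc);
}

int main(void)
{
	/* Values taken from the panic message above. */
	struct fake_request req = { .rq_transno = 21474848133ULL,
				    .rq_import_generation = 1 };
	struct fake_import imp = { .imp_generation = 1 };

	check_reply(&req, &imp, -5);   /* aborts here, mirroring the LBUG */
	return 0;
}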



 Comments   
Comment by Oleg Drokin [ 24/Jul/17 ]

Just hit this on my new testbed on master.
The crashdump is on onyx-68, though I don't have the path readily available right now. If you want it, ask me to look it up (this system is still being set up, so some things don't fully work yet).

Comment by Bob Glossman (Inactive) [ 16/Aug/17 ]

another on b2_10:
https://testing.hpdd.intel.com/test_sets/cd38ab92-82c8-11e7-bbd3-5254006e85c2

Comment by Bob Glossman (Inactive) [ 29/Aug/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/bd989d74-8cbd-11e7-b50a-5254006e85c2

Comment by Jian Yu [ 01/Oct/17 ]

More failure instances on master branch:
https://testing.hpdd.intel.com/test_sets/14fbac40-a515-11e7-bb19-5254006e85c2
https://testing.hpdd.intel.com/test_sets/749b4e68-a4c9-11e7-bb19-5254006e85c2

Comment by Bob Glossman (Inactive) [ 10/Oct/17 ]

failure on master:
https://testing.hpdd.intel.com/test_sets/60971de6-ad72-11e7-bb19-5254006e85c2

Not 100% sure this is the same failure.
No kernel stack trace is seen in the MDS console log.
There is only this partial line that looks like it might precede a panic:

[ 4835.796965] LustreError: 4565:0:(osp_sync.c:578:osp[ 0.000000]

followed by normal reboot logs.
I suspect the gathering of the console log is flawed.

Comment by Bob Glossman (Inactive) [ 13/Oct/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/d83edcfa-b059-11e7-a26c-5254006e85c2

Comment by Emoly Liu [ 16/Oct/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/2a3b68d6-b04e-11e7-9eeb-5254006e85c2

Comment by Andreas Dilger [ 07/Nov/17 ]

Again on master: https://testing.hpdd.intel.com/test_sets/97931696-b5e8-11e7-9d39-52540065bddc

Comment by Mikhail Pershin [ 14/Nov/17 ]

Seen several times on master:

https://testing.hpdd.intel.com/test_sets/8e069408-c935-11e7-8027-52540065bddc

https://testing.hpdd.intel.com/test_sets/87888c5e-c8e9-11e7-9c63-52540065bddc

 

Comment by Oleg Drokin [ 14/Nov/17 ]

This still fails for me all the time; I discussed it with Alex at length. I have hundreds of crashdumps of this if anybody wants to take a look.

Comment by Jinshan Xiong (Inactive) [ 15/Nov/17 ]

https://testing.hpdd.intel.com/sub_tests/bc8cf07c-c978-11e7-a066-52540065bddc

Comment by Gerrit Updater [ 16/Nov/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/30129
Subject: LU-9135 osp: do not fail if last_rcvd update failed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 57fbf5026c8fb5d84206c75645356ea1de1823db
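
The patch subject suggests the approach is to tolerate a failed last_rcvd update rather than letting it surface as an RPC-level failure. The sketch below only illustrates that general "warn and continue" pattern; the function names are hypothetical and this is not the actual fs/lustre-release change.

#include <stdio.h>

/* Hypothetical bookkeeping write that persists a last-received record;
 * returns 0 on success or a negative errno-style code. */
static int write_last_rcvd_record(void)
{
	return -5;  /* simulate the -EIO seen in the console log */
}

/* Instead of propagating the bookkeeping failure (which downstream code
 * could treat as a fatal inconsistency), warn and continue so the request
 * as a whole still succeeds. */
static int handle_request(void)
{
	int rc = write_last_rcvd_record();

	if (rc != 0) {
		fprintf(stderr, "last_rcvd update failed: rc = %d (ignored)\n", rc);
		rc = 0;  /* do not fail the request over this */
	}
	return rc;
}

int main(void)
{
	return handle_request();
}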

Comment by Gerrit Updater [ 29/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30129/
Subject: LU-9135 osp: do not fail if last_rcvd update failed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d764d538d601fa7e980a32a02899f9c32772844c

Comment by Peter Jones [ 29/Nov/17 ]

Landed for 2.11

Comment by Andreas Dilger [ 05/Dec/17 ]

This test was added in patch https://review.whamcloud.com/21398 "LU-8411 ofd: handle last_rcvd file can't update properly", which landed 2017-01-31, but it didn't start failing regularly until 2017-08-25.

Comment by Alex Zhuravlev [ 05/Dec/17 ]

IIRC, Oleg has had a very long (more than a year?) history of hitting this. The assertion was introduced with the initial OSP code, IIRC.

Comment by Gerrit Updater [ 25/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31013
Subject: LU-9135 osp: do not fail if last_rcvd update failed
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: da9e88954febbee279f10857a3bead519ab258d1

Comment by Gerrit Updater [ 12/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31013/
Subject: LU-9135 osp: do not fail if last_rcvd update failed
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 22ef5d606a629d79671ec3e340f068a7d0a38469
