[LU-7453] osp_sync_interpret assertion Created: 20/Nov/15  Updated: 11/Feb/16  Resolved: 11/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jesse Hanley Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

2.5.4-2.6.32_504.30.3.el6.atlas.x86_64.x86_64


Attachments: Text File crash-2015-11-18.dk.tar.gz    
Issue Links:
Related
is related to LU-5629 osp_sync_interpret() ASSERTION( rc ||... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Wednesday morning one of our production MDS nodes hit an assertion:

{{
2015-11-18 10:58:35 [1912759.384335] LustreError: 14428:0:(osp_sync.c:352:osp_sync_interpret()) ASSERTION( rc || req->rq_transno ) failed:
2015-11-18 10:58:35 [1912759.396346] LustreError: 14428:0:(osp_sync.c:352:osp_sync_interpret()) LBUG
2015-11-18 10:58:35 [1912759.404445] Pid: 14428, comm: ptlrpcd_2
2015-11-18 10:58:35 [1912759.409032]
2015-11-18 10:58:35 [1912759.409033] Call Trace:
2015-11-18 10:58:35 [1912759.414039] [<ffffffffa0430895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2015-11-18 10:58:35 [1912759.422141] [<ffffffffa0430e97>] lbug_with_loc+0x47/0xb0 [libcfs]
2015-11-18 10:58:35 [1912759.429363] [<ffffffffa0f6e3db>] osp_sync_interpret+0x50b/0x510 [osp]
2015-11-18 10:58:35 [1912759.437003] [<ffffffffa075aacd>] ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
2015-11-18 10:58:35 [1912759.444806] [<ffffffff8108802b>] ? try_to_del_timer_sync+0x7b/0xe0
2015-11-18 10:58:35 [1912759.452147] [<ffffffffa0788b13>] ptlrpcd_check+0x3d3/0x610 [ptlrpc]
2015-11-18 10:58:35 [1912759.459582] [<ffffffffa078924b>] ptlrpcd+0x20b/0x370 [ptlrpc]
2015-11-18 10:58:35 [1912759.466413] [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
2015-11-18 10:58:35 [1912759.473654] [<ffffffffa0789040>] ? ptlrpcd+0x0/0x370 [ptlrpc]
2015-11-18 10:58:35 [1912759.480485] [<ffffffff8109e78e>] kthread+0x9e/0xc0
2015-11-18 10:58:35 [1912759.486243] [<ffffffff8100c28a>] child_rip+0xa/0x20
2015-11-18 10:58:35 [1912759.492100] [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
2015-11-18 10:58:35 [1912759.497954] [<ffffffff8100c280>] ? child_rip+0x0/0x20
2015-11-18 10:58:35 [1912759.503992]
2015-11-18 10:58:35 [1912759.506435] Kernel panic - not syncing: LBUG
2015-11-18 10:58:35 [1912759.511515] Pid: 14428, comm: ptlrpcd_2 Not tainted 2.6.32-504.30.3.el6.atlas.x86_64 #1
2015-11-18 10:58:35 [1912759.520881] Call Trace:
2015-11-18 10:58:35 [1912759.523916] [<ffffffff81529cbc>] ? panic+0xa7/0x16f
2015-11-18 10:58:35 [1912759.529778] [<ffffffffa0430eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2015-11-18 10:58:35 [1912759.537188] [<ffffffffa0f6e3db>] ? osp_sync_interpret+0x50b/0x510 [osp]
2015-11-18 10:58:35 [1912759.545012] [<ffffffffa075aacd>] ? ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
2015-11-18 10:58:35 [1912759.553002] [<ffffffff8108802b>] ? try_to_del_timer_sync+0x7b/0xe0
2015-11-18 10:58:35 [1912759.560340] [<ffffffffa0788b13>] ? ptlrpcd_check+0x3d3/0x610 [ptlrpc]
2015-11-18 10:58:35 [1912759.567961] [<ffffffffa078924b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
2015-11-18 10:58:35 [1912759.574979] [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
2015-11-18 10:58:35 [1912759.582209] [<ffffffffa0789040>] ? ptlrpcd+0x0/0x370 [ptlrpc]
2015-11-18 10:58:35 [1912759.589034] [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
2015-11-18 10:58:35 [1912759.594981] [<ffffffff8100c28a>] ? child_rip+0xa/0x20
2015-11-18 10:58:35 [1912759.601024] [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
2015-11-18 10:58:35 [1912759.606874] [<ffffffff8100c280>] ? child_rip+0x0/0x20
}}

Is this related to https://jira.hpdd.intel.com/browse/LU-5629 ?



 Comments   
Comment by Jian Yu [ 20/Nov/15 ]

Hi Dmitry,

It looks like the assertion and back traces in this ticket are the same as those in LU-5629, which was not resolved by the fix of LU-3892.

Could you please take a look at this ticket as well? Thank you.

Comment by Alex Zhuravlev [ 07/Dec/15 ]

is it possible to get any logs for the case? the most valuable information at the moment is info on RPC (was it OST_DESTROY or OST_SETATTR).

Comment by Jian Yu [ 08/Dec/15 ]

Hi Jesse,

Does this assertion still occur? Is it possible to gather Lustre debug logs required by Alex above? Thank you.

Comment by Jesse Hanley [ 10/Dec/15 ]

Hey Alex and Jian,

Sorry for not responding sooner. We haven't seen this assertion since I submitted this. I don't have any debug logs from when it happened, but I do have a vmcore dump. I'll take a look and see if I can find anything relevant in it (not sure if it'll be there).

Comment by Jian Yu [ 10/Dec/15 ]

Thank you very much, Jesse.

Comment by Jesse Hanley [ 10/Dec/15 ]

From the backtrace, it looks like ptlrpcd called ptlrpcd_check, which called ptlrpc_check_set, which finally called osp_sync_interpret.

If I followed the code correctly (very likely that I didn't - I'm not that familiar with the internals of Lustre), it looks like ptlrpc_req_interpret is only called from ptlrpc_set_destroy and ptlrpc_check_set. Am I right in assuming that ptlrpc_set_destroy corresponds to an OST_DESTROY and ptlrpc_check_set corresponds to a OST_SETATTR call? If so, does that mean this is associated with an OST_SETATTR call?

Sorry if this isn't much help; I'm still learning how to dig around core dumps.

Comment by Jian Yu [ 17/Dec/15 ]

Hi Alex,
Could you please take a look at Jesse's comments and advise? Thank you.

Comment by Alex Zhuravlev [ 17/Dec/15 ]

there is a ticket with a binary and instructions: https://bugzilla.lustre.org/show_bug.cgi?id=13155
if you've got time - please, try to extract the log from the dump. thanks in advance!

Comment by Jesse Hanley [ 07/Jan/16 ]

I've uploaded the log. Sorry for the delay.

I've never seen this extension for crash before. Thanks for the info!


Jesse

Comment by Patrick Farrell (Inactive) [ 07/Jan/16 ]

Jesse -

Just FYI, those ptlrpc functions aren't related to the higher level operations you're describing. They both act on sets of RPCs, so they don't have anything to do with SETATTR or DESTROY.

  • Patrick
Comment by Jian Yu [ 13/Jan/16 ]

Hi Alex,

Jesse has uploaded the logs. Could you please investigate and suggest? Thank you.

Comment by Alex Zhuravlev [ 14/Jan/16 ]

it was a resend for OST_DESTROY, though I have no a good idea for the root cause yet..

Comment by Alex Zhuravlev [ 04/Feb/16 ]

yes, we can add a bit more debug.. which target branch is supposed?

Comment by Jian Yu [ 04/Feb/16 ]

Thank you, Alex.

The server version is Lustre 2.5.4 (2.5.4-2.6.32_504.30.3.el6.atlas.x86_64.x86_64).

Comment by Alex Zhuravlev [ 04/Feb/16 ]

any additional patches on top of that?

Comment by Peter Jones [ 11/Feb/16 ]

As per ORNL ok to close - this just happened one time a long time ago

Generated at Sat Feb 10 02:09:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.