[LU-5303] osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 Created: 08/Jul/14  Updated: 24/Aug/15  Resolved: 24/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Gregoire Pichon
Assignee: Bob Glossman (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4528 osd_trans_exec_op()) ASSERTION( oti->... Resolved
Severity: 3
Rank (Obsolete): 14808

 Description   

I hit an OSS crash when mounting its targets.
The issue is the same as LU-4528 but with Lustre 2.4.1.

<3>LustreError: 27422:0:(osd_io.c:1220:osd_ldiskfs_write_record()) loop21: error reading offset 0 (block 0): rc = -28
<3>LustreError: 27422:0:(llog_osd.c:160:llog_osd_write_blob()) fs96OST-OST003b-osd: error writing log record: rc = -28
<0>LustreError: 27422:0:(osd_internal.h:953:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0
<0>LustreError: 27422:0:(osd_internal.h:953:osd_trans_exec_op()) LBUG
<4>Pid: 27422, comm: mount.lustre
<4>
<4>Call Trace:
<4> [<ffffffffa0bfb895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0bfbe97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa161d42d>] osd_trans_exec_op+0x2ad/0x2e0 [osd_ldiskfs]
<4> [<ffffffffa162e723>] osd_attr_set+0xe3/0x540 [osd_ldiskfs]
<4> [<ffffffffa163b845>] ? osd_punch+0x1b5/0x600 [osd_ldiskfs]
<4> [<ffffffffa10e60f1>] llog_osd_write_blob+0x211/0x850 [obdclass]
<4> [<ffffffffa10e9d34>] llog_osd_write_rec+0x7d4/0x1370 [obdclass]
<4> [<ffffffffa10b5438>] llog_write_rec+0xc8/0x290 [obdclass]
<4> [<ffffffffa10b6bad>] llog_write+0x2ad/0x420 [obdclass]
<4> [<ffffffffa10b6d44>] llog_copy_handler+0x24/0x30 [obdclass]
<4> [<ffffffffa10b7e0b>] llog_process_thread+0x8fb/0xe00 [obdclass]
<4> [<ffffffffa10b6d20>] ? llog_copy_handler+0x0/0x30 [obdclass]
<4> [<ffffffffa10b9c7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
<4> [<ffffffffa10ba5a2>] llog_backup+0x3d2/0x500 [obdclass]
<4> [<ffffffff8128cd30>] ? sprintf+0x40/0x50
<4> [<ffffffffa16a38cf>] mgc_process_log+0x119f/0x18f0 [mgc]
<4> [<ffffffffa169c8ba>] ? mgc_name2resid+0x4a/0x230 [mgc]
<4> [<ffffffffa169d370>] ? mgc_blocking_ast+0x0/0x800 [mgc]
<4> [<ffffffffa1215b20>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
<4> [<ffffffffa16a5514>] mgc_process_config+0x594/0xed0 [mgc]
<4> [<ffffffffa110164c>] lustre_process_log+0x25c/0xaa0 [obdclass]
<4> [<ffffffffa112bffc>] ? server_find_mount+0xbc/0x160 [obdclass]
<4> [<ffffffffa112ebd6>] ? server_register_mount+0x516/0x8f0 [obdclass]
<4> [<ffffffffa1134467>] server_start_targets+0x5c7/0x19c0 [obdclass]
<4> [<ffffffffa0bfcb2e>] ? cfs_free+0xe/0x10 [libcfs]
<4> [<ffffffffa1104eb5>] ? lustre_start_mgc+0x4a5/0x2180 [obdclass]
<4> [<ffffffffa10fca20>] ? class_config_llog_handler+0x0/0x1890 [obdclass]
<4> [<ffffffffa113640c>] server_fill_super+0xbac/0x1660 [obdclass]
<4> [<ffffffffa1106d68>] lustre_fill_super+0x1d8/0x530 [obdclass]
<4> [<ffffffffa1106b90>] ? lustre_fill_super+0x0/0x530 [obdclass]
<4> [<ffffffff8118c7cf>] get_sb_nodev+0x5f/0xa0
<4> [<ffffffffa10fe3b5>] lustre_get_sb+0x25/0x30 [obdclass]
<4> [<ffffffff8118be2b>] vfs_kern_mount+0x7b/0x1b0
<4> [<ffffffff8118bfd2>] do_kern_mount+0x52/0x130
<4> [<ffffffff811acfdb>] do_mount+0x2fb/0x930
<4> [<ffffffff811ad6a0>] sys_mount+0x90/0xe0
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
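
The failed assertion compares a per-operation counter against zero, which suggests a declare-then-execute accounting scheme: an operation must be declared before it may run inside a transaction. The sketch below is a simplified user-space model of that idea, not the Lustre code. All `model_*` names are hypothetical, the rollback (`_rb`) bookkeeping named in the assertion is not modelled, and the scenario (a failed write followed by an undeclared attribute update) is an assumption used for illustration only. The one fact taken from the log is that `rc = -28` corresponds to `-ENOSPC` on Linux.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical operation types; the real OSD enum is not shown in this ticket. */
enum model_op { MODEL_OP_ATTR_SET, MODEL_OP_WRITE, MODEL_OP_MAX };

/* Hypothetical per-transaction bookkeeping: remaining declarations per op. */
struct model_trans {
	int declared[MODEL_OP_MAX];
};

/* Declare phase: reserve the right to run `op` once inside the transaction. */
static void model_declare_op(struct model_trans *t, enum model_op op)
{
	t->declared[op]++;
}

/* Execute phase: consume one declaration. Returning false stands in for
 * the failed assertion; the model never aborts. */
static bool model_exec_op(struct model_trans *t, enum model_op op)
{
	if (t->declared[op] <= 0)
		return false;
	t->declared[op]--;
	return true;
}

/* Assumed scenario: only a write is declared, the write fails with -ENOSPC,
 * and the error path then executes an attr_set that was never declared.
 * Returns the number of declare/execute mismatches observed. */
int model_run_scenario(void)
{
	struct model_trans t = { { 0 } };
	int rc = -ENOSPC;	/* "rc = -28" in the log above */
	int mismatches = 0;

	model_declare_op(&t, MODEL_OP_WRITE);
	if (!model_exec_op(&t, MODEL_OP_WRITE))
		mismatches++;
	if (rc < 0 && !model_exec_op(&t, MODEL_OP_ATTR_SET))
		mismatches++;
	return mismatches;
}
```

In this model the declared write passes the check, while the undeclared attribute update is the single mismatch; in the kernel an equivalent mismatch ends in LBUG rather than a return value.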

Would it be possible to backport patch http://review.whamcloud.com/#/c/10108/ to the b2_4 branch?



 Comments   
Comment by John Fuchs-Chesney (Inactive) [ 08/Jul/14 ]

Bob, can you explore this, please?

Comment by John Fuchs-Chesney (Inactive) [ 08/Jul/14 ]

Hello Gregoire,
Bob Glossman will take a look at the 'portability' of this patch to 2.4.x and will advise if this is possible.
Thanks,
~ jfc.

Comment by Bob Glossman (Inactive) [ 08/Jul/14 ]

It looks to me like a back port should be possible, but it needs the attention of somebody who really understands the code being modified to do it correctly. Trying to cherry-pick http://review.whamcloud.com/#/c/10108 back into b2_4 leaves 10 or more files that need manual editing to merge. Some appear to need more knowledge than just trying to resolve context diffs.

I note that the Author of the original master patch was Mike Pershin. Maybe it's a job for him.

Comment by John Fuchs-Chesney (Inactive) [ 08/Jul/14 ]

Mike – we've added you as a watcher on this ticket.
Can you please advise on the best way forward to resolve Gregoire's problem?
Thanks,
~ jfc.

Comment by Gregoire Pichon [ 09/Jul/14 ]

Thanks for looking.
I wonder if a back port to b2_5, which is the maintenance release, wouldn't be more appropriate in this case. This would benefit the whole community, and I am definitely going to move to a Lustre 2.5.x version at some point.

Comment by Bob Glossman (Inactive) [ 09/Jul/14 ]

It does appear a back port to b2_5 is more plausible. Fewer than half as many files need manual attention to merge, and the edits needed may not require an expert, as they do in b2_4.

Comment by John Fuchs-Chesney (Inactive) [ 09/Jul/14 ]

Gregoire, since we seem to agree that b2_5 is the better branch to fix, can you give us a rough idea when you will need the solution to be in place?
This will help us with our workload planning.

Many thanks,
~ jfc

Comment by Gregoire Pichon [ 10/Jul/14 ]

The issue did not occur on a production cluster, so it does not require immediate handling. Still, this is a node crash, and I would not like to see the same issue appear at a customer site.

Comment by Gregoire Pichon [ 01/Sep/14 ]

Would it be possible to have a patch for b2_5 worked out?
I see that version 2.5.3 is going to be released soon. It would be good to have the fix integrated in that version, since some of our customers are going to use Lustre 2.5 in September.
Thanks.

Comment by Peter Jones [ 11/Sep/14 ]

Gregoire

The b2_5 patch is being tracked under LU-4528, of which this issue is believed to be a duplicate.

Peter

Comment by Gregoire Pichon [ 19/Jan/15 ]

Hi Peter,
The patch http://review.whamcloud.com/#/c/11751/ is ready to be landed into b2_5 (two positive inspections).
Could it be included in the future 2.5.4 release?

Thanks,
Grégoire.

Comment by Peter Jones [ 24/Aug/15 ]

Duplicate of LU-4528.
