[LU-5303] osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 Created: 08/Jul/14 Updated: 24/Aug/15 Resolved: 24/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Gregoire Pichon | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 14808 | ||||||||
| Description |
|
I hit a crash of an OSS when mounting its targets.
<3>LustreError: 27422:0:(osd_io.c:1220:osd_ldiskfs_write_record()) loop21: error reading offset 0 (block 0): rc = -28
<3>LustreError: 27422:0:(llog_osd.c:160:llog_osd_write_blob()) fs96OST-OST003b-osd: error writing log record: rc = -28
<0>LustreError: 27422:0:(osd_internal.h:953:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0
<0>LustreError: 27422:0:(osd_internal.h:953:osd_trans_exec_op()) LBUG
<4>Pid: 27422, comm: mount.lustre
<4>
<4>Call Trace:
<4> [<ffffffffa0bfb895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0bfbe97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa161d42d>] osd_trans_exec_op+0x2ad/0x2e0 [osd_ldiskfs]
<4> [<ffffffffa162e723>] osd_attr_set+0xe3/0x540 [osd_ldiskfs]
<4> [<ffffffffa163b845>] ? osd_punch+0x1b5/0x600 [osd_ldiskfs]
<4> [<ffffffffa10e60f1>] llog_osd_write_blob+0x211/0x850 [obdclass]
<4> [<ffffffffa10e9d34>] llog_osd_write_rec+0x7d4/0x1370 [obdclass]
<4> [<ffffffffa10b5438>] llog_write_rec+0xc8/0x290 [obdclass]
<4> [<ffffffffa10b6bad>] llog_write+0x2ad/0x420 [obdclass]
<4> [<ffffffffa10b6d44>] llog_copy_handler+0x24/0x30 [obdclass]
<4> [<ffffffffa10b7e0b>] llog_process_thread+0x8fb/0xe00 [obdclass]
<4> [<ffffffffa10b6d20>] ? llog_copy_handler+0x0/0x30 [obdclass]
<4> [<ffffffffa10b9c7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
<4> [<ffffffffa10ba5a2>] llog_backup+0x3d2/0x500 [obdclass]
<4> [<ffffffff8128cd30>] ? sprintf+0x40/0x50
<4> [<ffffffffa16a38cf>] mgc_process_log+0x119f/0x18f0 [mgc]
<4> [<ffffffffa169c8ba>] ? mgc_name2resid+0x4a/0x230 [mgc]
<4> [<ffffffffa169d370>] ? mgc_blocking_ast+0x0/0x800 [mgc]
<4> [<ffffffffa1215b20>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
<4> [<ffffffffa16a5514>] mgc_process_config+0x594/0xed0 [mgc]
<4> [<ffffffffa110164c>] lustre_process_log+0x25c/0xaa0 [obdclass]
<4> [<ffffffffa112bffc>] ? server_find_mount+0xbc/0x160 [obdclass]
<4> [<ffffffffa112ebd6>] ? server_register_mount+0x516/0x8f0 [obdclass]
<4> [<ffffffffa1134467>] server_start_targets+0x5c7/0x19c0 [obdclass]
<4> [<ffffffffa0bfcb2e>] ? cfs_free+0xe/0x10 [libcfs]
<4> [<ffffffffa1104eb5>] ? lustre_start_mgc+0x4a5/0x2180 [obdclass]
<4> [<ffffffffa10fca20>] ? class_config_llog_handler+0x0/0x1890 [obdclass]
<4> [<ffffffffa113640c>] server_fill_super+0xbac/0x1660 [obdclass]
<4> [<ffffffffa1106d68>] lustre_fill_super+0x1d8/0x530 [obdclass]
<4> [<ffffffffa1106b90>] ? lustre_fill_super+0x0/0x530 [obdclass]
<4> [<ffffffff8118c7cf>] get_sb_nodev+0x5f/0xa0
<4> [<ffffffffa10fe3b5>] lustre_get_sb+0x25/0x30 [obdclass]
<4> [<ffffffff8118be2b>] vfs_kern_mount+0x7b/0x1b0
<4> [<ffffffff8118bfd2>] do_kern_mount+0x52/0x130
<4> [<ffffffff811acfdb>] do_mount+0x2fb/0x930
<4> [<ffffffff811ad6a0>] sys_mount+0x90/0xe0
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Would it be possible to backport patch http://review.whamcloud.com/#/c/10108/ to the b2_4 branch? |
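For readers triaging similar logs: the `rc = -28` in the first two LustreError lines follows the usual kernel convention of returning a negated errno value, and on Linux errno 28 is ENOSPC ("No space left on device"). A minimal Python sketch for decoding such return codes follows; the helper name `errno_name` is hypothetical and not part of Lustre or its tooling, and it assumes the host's errno numbering matches the node that produced the log.

```python
import errno
import os


def errno_name(rc: int) -> str:
    """Map a negative kernel-style return code to its symbolic errno name.

    Hypothetical helper for log triage; assumes this host uses the same
    errno numbering as the machine that logged the code.
    """
    return errno.errorcode.get(-rc, "UNKNOWN")


if __name__ == "__main__":
    rc = -28  # value taken from the LustreError lines in the description
    # Print the symbolic name together with the platform's message text.
    print(errno_name(rc), "-", os.strerror(-rc))
```

This only names the error code; it does not by itself explain why the later assertion in osd_trans_exec_op() fired.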
| Comments |
| Comment by John Fuchs-Chesney (Inactive) [ 08/Jul/14 ] |
|
Bob, can you explore this, please? |
| Comment by John Fuchs-Chesney (Inactive) [ 08/Jul/14 ] |
|
Hello Gregoire, |
| Comment by Bob Glossman (Inactive) [ 08/Jul/14 ] |
|
It looks to me like a backport should be possible, but it needs the attention of somebody who really understands the code being modified to do it correctly. Trying to cherry-pick http://review.whamcloud.com/#/c/10108 back into b2_4 leaves 10 or more files that need manual editing to merge. Some appear to need more knowledge than just resolving context diffs. I note that the author of the original master patch was Mike Pershin. Maybe it's a job for him. |
| Comment by John Fuchs-Chesney (Inactive) [ 08/Jul/14 ] |
|
Mike – we've added you as a watcher on this ticket. |
| Comment by Gregoire Pichon [ 09/Jul/14 ] |
|
Thanks for looking. |
| Comment by Bob Glossman (Inactive) [ 09/Jul/14 ] |
|
It does appear that a backport to b2_5 is more plausible. Fewer than half the number of files need manual attention to merge, and the edits needed may not require an expert, as they do in b2_4. |
| Comment by John Fuchs-Chesney (Inactive) [ 09/Jul/14 ] |
|
Gregoire, since we seem to agree that b2_5 is the better branch to fix, can you give us a rough idea when you will need the solution to be in place? Many thanks, |
| Comment by Gregoire Pichon [ 10/Jul/14 ] |
|
The issue did not occur on a production cluster, so it does not require immediate handling. Still, this is a node crash and I would not like to see the same issue appear at a customer site. |
| Comment by Gregoire Pichon [ 01/Sep/14 ] |
|
Would it be possible to have a patch for b2_5 worked out? |
| Comment by Peter Jones [ 11/Sep/14 ] |
|
Gregoire The b2_5 patch is being tracked under Peter |
| Comment by Gregoire Pichon [ 19/Jan/15 ] |
|
Hi Peter, thanks, |
| Comment by Peter Jones [ 24/Aug/15 ] |
|
duplicate of |