[LU-4528] osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 Created: 22/Jan/14 Updated: 16/May/19 Resolved: 09/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 12386 | ||||||||||||||||||||||||||||||||
| Description |
|
When running sanity on the latest master build, hit following LBUG on client: Lustre: DEBUG MARKER: == sanity test 60a: llog sanity tests run from kernel module ============ 15:06:20 (1390431980) Lustre: DEBUG MARKER: test_60 run 3219 - from kernel mode Lustre: 9324:0:(llog_test.c:1018:llog_test_setup()) Setup llog-test device over MGS device Lustre: 9324:0:(llog_test.c:110:llog_test_1()) 1a: create a log with name: aee6dfbf Lustre: 9324:0:(llog_test.c:127:llog_test_1()) 1b: close newly-created log Lustre: 9324:0:(llog_test.c:151:llog_test_2()) 2a: re-open a log with name: aee6dfbf Lustre: 9324:0:(llog_test.c:232:llog_test_3()) 3a: write one create_rec Lustre: 9324:0:(llog_test.c:244:llog_test_3()) 3b: write 10 cfg log records with 8 bytes bufs Lustre: 9324:0:(llog_test.c:265:llog_test_3()) 3c: write 1000 more log records Lustre: 9324:0:(llog_test.c:280:llog_test_3()) 3d: write log more than BITMAP_SIZE, return -ENOSPC Lustre: 9324:0:(llog_test.c:311:llog_test_3()) 3d: wrote 64364 more records before end of llog is reached Lustre: 9324:0:(llog_test.c:340:llog_test_4()) 4a: create a catalog log with name: aee6dfc0 Lustre: 9324:0:(llog_test.c:355:llog_test_4()) 4b: write 1 record into the catalog Lustre: 9324:0:(llog_test.c:370:llog_test_4()) 4c: cancel 1 log record Lustre: 9324:0:(llog_test.c:382:llog_test_4()) 4d: write 65536 more log records Lustre: 9324:0:(llog_test.c:398:llog_test_4()) 4e: add 5 large records, one record per block Lustre: 9324:0:(llog_test.c:418:llog_test_4()) 4f: put newly-created catalog Lustre: 9324:0:(llog_test.c:516:llog_test_5()) 5a: re-open catalog by id Lustre: 9324:0:(llog_test.c:529:llog_test_5()) 5b: print the catalog entries.. we expect 2 Lustre: 9325:0:(llog_test.c:447:cat_print_cb()) seeing record at index 1 - [0x1:0xc:0x0] in log [0xa:0xb:0x0] Lustre: 9325:0:(llog_test.c:447:cat_print_cb()) seeing record at index 2 - [0x1:0xd:0x0] in log [0xa:0xb:0x0] Lustre: 9324:0:(llog_test.c:541:llog_test_5()) 5c: Cancel 65536 records, see one log zapped Lustre: 9324:0:(llog_test.c:549:llog_test_5()) 5c: print the catalog entries.. we expect 1 Lustre: 9324:0:(llog_test.c:561:llog_test_5()) 5d: add 1 record to the log with many canceled empty pages Lustre: 9324:0:(llog_test.c:569:llog_test_5()) 5e: print plain log entries.. expect 6 Lustre: 9324:0:(llog_test.c:581:llog_test_5()) 5f: print plain log entries reversely.. expect 6 Lustre: 9324:0:(llog_test.c:595:llog_test_5()) 5g: close re-opened catalog Lustre: 9324:0:(llog_test.c:625:llog_test_6()) 6a: re-open log aee6dfbf using client API Lustre: MGS: non-config logname received: aee6dfbf Lustre: 9324:0:(llog_test.c:657:llog_test_6()) 6b: process log aee6dfbf using client API Lustre: 9324:0:(llog_test.c:661:llog_test_6()) 6b: processed 64364 records Lustre: 9324:0:(llog_test.c:668:llog_test_6()) 6c: process log aee6dfbf reversely using client API Lustre: 9324:0:(llog_test.c:672:llog_test_6()) 6c: processed 64364 records Lustre: 9324:0:(llog_test.c:820:llog_test_7()) 7a: test llog_logid_rec LustreError: 9324:0:(osd_io.c:1396:osd_ldiskfs_write_record()) loop0: error reading offset 1794048 (block 438): rc = -28 LustreError: 9324:0:(llog_osd.c:156:llog_osd_write_blob()) lustre-MDT0000-osd: error writing log record: rc = -28 LustreError: 9324:0:(osd_internal.h:970:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 LustreError: 9324:0:(osd_internal.h:970:osd_trans_exec_op()) LBUG Pid: 9324, comm: lctl Message from Call Trace: [<ffffffffa0369895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa0369e97>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa0ccdb3b>] osd_trans_exec_op+0x2ab/0x2e0 [osd_ldiskfs] [<ffffffffa0ce0583>] osd_attr_set+0xe3/0x540 [osd_ldiskfs] [<ffffffffa0ced713>] ? osd_punch+0x1d3/0x620 [osd_ldiskfs] [<ffffffffa04b5081>] llog_osd_write_blob+0x211/0x850 [obdclass] [<ffffffffa04b906d>] llog_osd_write_rec+0xacd/0x1390 [obdclass] syslogd@client- [<ffffffffa04a74f8>] llog_write_rec+0xc8/0x290 [obdclass] [<ffffffffa04a8c6d>] llog_write+0x2ad/0x420 [obdclass] [<ffffffffa04594fd>] llog_test_7_sub+0x16d/0xaf0 [llog_test] 3 at Jan 22 15:0 [<ffffffffa0459fbf>] llog_test_7+0x13f/0xa10 [llog_test] [<ffffffffa045bd4a>] llog_run_tests+0x14ba/0x1e50 [llog_test] [<ffffffffa04b179e>] ? llog_setup+0x3be/0x910 [obdclass] [<ffffffffa045ce59>] llog_test_setup+0x779/0x8c0 [llog_test] [<ffffffffa04e2ebb>] obd_setup+0x19b/0x290 [obdclass] [<ffffffff81168043>] ? kmem_cache_alloc_trace+0x1a3/0x1b0 6:22 ... kern [<ffffffffa04c601b>] ? class_new_export+0x72b/0x960 [obdclass] [<ffffffffa04e31b8>] class_setup+0x208/0x870 [obdclass] [<ffffffffa04eb72c>] class_process_config+0xc6c/0x1ad0 [obdclass] [<ffffffff811685ac>] ? __kmalloc+0x20c/0x220 [<ffffffffa04a4d6d>] ? obd_ioctl_getdata+0x12d/0x1120 [obdclass] [<ffffffffa04bd3fb>] class_handle_ioctl+0x3bb/0x1ef0 [obdclass] [<ffffffffa04a42ab>] obd_class_ioctl+0x4b/0x190 [obdclass] [<ffffffff81195382>] vfs_ioctl+0x22/0xa0 [<ffffffff81195524>] do_vfs_ioctl+0x84/0x580 [<ffffffff81182bf1>] ? __fput+0x1a1/0x210 [<ffffffff81195aa1>] sys_ioctl+0x81/0xa0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Jodi Levi (Inactive) [ 23/Jan/14 ] |
|
Mike, |
| Comment by Mikhail Pershin [ 11/Apr/14 ] |
|
yes |
| Comment by Mikhail Pershin [ 13/Apr/14 ] |
|
per Alex comment this issue may happen when transaction is trying to do undo in case of some errors during processing. It was -28 (ENOSPC) error during llog testing and current transaction starts to undo. There can be other cases but with the same scenario. I have no idea how to fix that because we cannot just include all credits for undo in the current transaction because that will increase number of credits dramatically. This is somehow expected and turned off in production code, so it is not critical failure. I wonder why to have this assertion and break test runs if we know about that issue and can't make it works properly in all cases. Maybe Alex and/or Andreas have some ideas about how to handle this gracefully. |
| Comment by Mikhail Pershin [ 13/Apr/14 ] |
|
Meanwhile, this particular case can be fixed by declaring attr_set operation in llog_osd_declare_write_rec(), that is missing and that is why it fails in case of error. |
| Comment by Mikhail Pershin [ 13/Apr/14 ] |
|
|
| Comment by Andreas Dilger [ 17/Apr/14 ] |
|
Mike, I thought that the declare for each update also allowed the "undo" update in case of error handling? Is that not the case for these llog updates because they are potentially split across multiple transaction handles? I'd definitely prefer to fix this via |
| Comment by Andreas Dilger [ 21/Apr/14 ] |
|
Closing |
| Comment by Mikhail Pershin [ 23/Apr/14 ] |
|
I agree that |
| Comment by Mikhail Pershin [ 26/Apr/14 ] |
|
http://review.whamcloud.com/#/c/10108/ - patch removes 3-steps write for llog records, eliminating the extra credits need for error case. |
| Comment by Jinshan Xiong (Inactive) [ 29/Apr/14 ] |
|
Is it possible to provide a workaround patch? I'm hitting this issue all the time by running racer. |
| Comment by Mikhail Pershin [ 30/Apr/14 ] |
|
the link above is the patch, try it. If you need something simpler then add dt_declare_attr_set() in llog_osd_declare_write_rec() right after punch declaration. |
| Comment by Jodi Levi (Inactive) [ 09/May/14 ] |
|
Patch landed to Master. Please reopen ticket if more work is needed. |
| Comment by Gregoire Pichon [ 12/May/14 ] |
|
Hi, It seems we hit the same issue in lustre 2.4.2. thanks, |
| Comment by Peter Jones [ 04/Sep/14 ] |
|
b2_5 path http://review.whamcloud.com/11751 |
| Comment by Alex Zhuravlev [ 22/Oct/14 ] |
|
Peter, this can be improved with http://review.whamcloud.com/#/c/12169/ IMO |
| Comment by Peter Jones [ 22/Oct/14 ] |
|
Alex Then could you please open a new ticket to track that work and link it to this one? Thanks Peter |
| Comment by Alex Zhuravlev [ 23/Oct/14 ] |
|
Peter, I'm not sure how to link, but that's it - thanks, Alex |
| Comment by Peter Jones [ 23/Oct/14 ] |
|
Thanks Alex. For future reference you find Link under the More tab |
| Comment by Andreas Dilger [ 31/Oct/14 ] |
|
You can also add links between patches by clicking the "+" on the right side of the Issue Links header. |