[LU-4528] osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0 Created: 22/Jan/14  Updated: 16/May/19  Resolved: 09/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-4899 osd_trans_exec_op()) ASSERTION( oti->... Closed
Related
is related to LU-5794 improve credit checks in osd-ldiskfs Resolved
is related to LU-4901 Remove triple buffer update from llog... Closed
is related to LU-6556 changelog catalog corruption if all p... Resolved
is related to LU-5303 osd_trans_exec_op()) ASSERTION( oti->... Resolved
is related to LU-4183 MDS crash when running fsx with NFS e... Resolved
Severity: 3
Rank (Obsolete): 12386

 Description   

When running sanity on the latest master build, hit following LBUG on client:

Lustre: DEBUG MARKER: == sanity test 60a: llog sanity tests run from kernel module ============ 15:06:20 (1390431980)
Lustre: DEBUG MARKER: test_60 run 3219 - from kernel mode
Lustre: 9324:0:(llog_test.c:1018:llog_test_setup()) Setup llog-test device over MGS device
Lustre: 9324:0:(llog_test.c:110:llog_test_1()) 1a: create a log with name: aee6dfbf
Lustre: 9324:0:(llog_test.c:127:llog_test_1()) 1b: close newly-created log
Lustre: 9324:0:(llog_test.c:151:llog_test_2()) 2a: re-open a log with name: aee6dfbf
Lustre: 9324:0:(llog_test.c:232:llog_test_3()) 3a: write one create_rec
Lustre: 9324:0:(llog_test.c:244:llog_test_3()) 3b: write 10 cfg log records with 8 bytes bufs
Lustre: 9324:0:(llog_test.c:265:llog_test_3()) 3c: write 1000 more log records
Lustre: 9324:0:(llog_test.c:280:llog_test_3()) 3d: write log more than BITMAP_SIZE, return -ENOSPC
Lustre: 9324:0:(llog_test.c:311:llog_test_3()) 3d: wrote 64364 more records before end of llog is reached
Lustre: 9324:0:(llog_test.c:340:llog_test_4()) 4a: create a catalog log with name: aee6dfc0
Lustre: 9324:0:(llog_test.c:355:llog_test_4()) 4b: write 1 record into the catalog
Lustre: 9324:0:(llog_test.c:370:llog_test_4()) 4c: cancel 1 log record
Lustre: 9324:0:(llog_test.c:382:llog_test_4()) 4d: write 65536 more log records
Lustre: 9324:0:(llog_test.c:398:llog_test_4()) 4e: add 5 large records, one record per block
Lustre: 9324:0:(llog_test.c:418:llog_test_4()) 4f: put newly-created catalog
Lustre: 9324:0:(llog_test.c:516:llog_test_5()) 5a: re-open catalog by id
Lustre: 9324:0:(llog_test.c:529:llog_test_5()) 5b: print the catalog entries.. we expect 2
Lustre: 9325:0:(llog_test.c:447:cat_print_cb()) seeing record at index 1 - [0x1:0xc:0x0] in log [0xa:0xb:0x0]
Lustre: 9325:0:(llog_test.c:447:cat_print_cb()) seeing record at index 2 - [0x1:0xd:0x0] in log [0xa:0xb:0x0]
Lustre: 9324:0:(llog_test.c:541:llog_test_5()) 5c: Cancel 65536 records, see one log zapped
Lustre: 9324:0:(llog_test.c:549:llog_test_5()) 5c: print the catalog entries.. we expect 1
Lustre: 9324:0:(llog_test.c:561:llog_test_5()) 5d: add 1 record to the log with many canceled empty pages
Lustre: 9324:0:(llog_test.c:569:llog_test_5()) 5e: print plain log entries.. expect 6
Lustre: 9324:0:(llog_test.c:581:llog_test_5()) 5f: print plain log entries reversely.. expect 6
Lustre: 9324:0:(llog_test.c:595:llog_test_5()) 5g: close re-opened catalog
Lustre: 9324:0:(llog_test.c:625:llog_test_6()) 6a: re-open log aee6dfbf using client API
Lustre: MGS: non-config logname received: aee6dfbf
Lustre: 9324:0:(llog_test.c:657:llog_test_6()) 6b: process log aee6dfbf using client API
Lustre: 9324:0:(llog_test.c:661:llog_test_6()) 6b: processed 64364 records
Lustre: 9324:0:(llog_test.c:668:llog_test_6()) 6c: process log aee6dfbf reversely using client API
Lustre: 9324:0:(llog_test.c:672:llog_test_6()) 6c: processed 64364 records
Lustre: 9324:0:(llog_test.c:820:llog_test_7()) 7a: test llog_logid_rec
LustreError: 9324:0:(osd_io.c:1396:osd_ldiskfs_write_record()) loop0: error reading offset 1794048 (block 438): rc = -28
LustreError: 9324:0:(llog_osd.c:156:llog_osd_write_blob()) lustre-MDT0000-osd: error writing log record: rc = -28
LustreError: 9324:0:(osd_internal.h:970:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 0
LustreError: 9324:0:(osd_internal.h:970:osd_trans_exec_op()) LBUG
Pid: 9324, comm: lctl

Message from
Call Trace:
 [<ffffffffa0369895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0369e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0ccdb3b>] osd_trans_exec_op+0x2ab/0x2e0 [osd_ldiskfs]
 [<ffffffffa0ce0583>] osd_attr_set+0xe3/0x540 [osd_ldiskfs]
 [<ffffffffa0ced713>] ? osd_punch+0x1d3/0x620 [osd_ldiskfs]
 [<ffffffffa04b5081>] llog_osd_write_blob+0x211/0x850 [obdclass]
 [<ffffffffa04b906d>] llog_osd_write_rec+0xacd/0x1390 [obdclass]
 syslogd@client- [<ffffffffa04a74f8>] llog_write_rec+0xc8/0x290 [obdclass]
 [<ffffffffa04a8c6d>] llog_write+0x2ad/0x420 [obdclass]
 [<ffffffffa04594fd>] llog_test_7_sub+0x16d/0xaf0 [llog_test]
3 at Jan 22 15:0 [<ffffffffa0459fbf>] llog_test_7+0x13f/0xa10 [llog_test]
 [<ffffffffa045bd4a>] llog_run_tests+0x14ba/0x1e50 [llog_test]
 [<ffffffffa04b179e>] ? llog_setup+0x3be/0x910 [obdclass]
 [<ffffffffa045ce59>] llog_test_setup+0x779/0x8c0 [llog_test]
 [<ffffffffa04e2ebb>] obd_setup+0x19b/0x290 [obdclass]
 [<ffffffff81168043>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
6:22 ...
 kern [<ffffffffa04c601b>] ? class_new_export+0x72b/0x960 [obdclass]
 [<ffffffffa04e31b8>] class_setup+0x208/0x870 [obdclass]
 [<ffffffffa04eb72c>] class_process_config+0xc6c/0x1ad0 [obdclass]
 [<ffffffff811685ac>] ? __kmalloc+0x20c/0x220
 [<ffffffffa04a4d6d>] ? obd_ioctl_getdata+0x12d/0x1120 [obdclass]
 [<ffffffffa04bd3fb>] class_handle_ioctl+0x3bb/0x1ef0 [obdclass]
 [<ffffffffa04a42ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff81195382>] vfs_ioctl+0x22/0xa0
 [<ffffffff81195524>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81182bf1>] ? __fput+0x1a1/0x210
 [<ffffffff81195aa1>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Jodi Levi (Inactive) [ 23/Jan/14 ]

Mike,
Could you have a look and comment on this one?
Thank you!

Comment by Mikhail Pershin [ 11/Apr/14 ]

yes

Comment by Mikhail Pershin [ 13/Apr/14 ]

per Alex comment this issue may happen when transaction is trying to do undo in case of some errors during processing. It was -28 (ENOSPC) error during llog testing and current transaction starts to undo. There can be other cases but with the same scenario.

I have no idea how to fix that because we cannot just include all credits for undo in the current transaction because that will increase number of credits dramatically. This is somehow expected and turned off in production code, so it is not critical failure. I wonder why to have this assertion and break test runs if we know about that issue and can't make it works properly in all cases. Maybe Alex and/or Andreas have some ideas about how to handle this gracefully.

Comment by Mikhail Pershin [ 13/Apr/14 ]

Meanwhile, this particular case can be fixed by declaring attr_set operation in llog_osd_declare_write_rec(), that is missing and that is why it fails in case of error.

Comment by Mikhail Pershin [ 13/Apr/14 ]

LU-4901 may solve this issue naturally in other way

Comment by Andreas Dilger [ 17/Apr/14 ]

Mike, I thought that the declare for each update also allowed the "undo" update in case of error handling? Is that not the case for these llog updates because they are potentially split across multiple transaction handles?

I'd definitely prefer to fix this via LU-4901 instead of adding more workarounds in the transaction credit accounting to handle this case. In that case, consider LU-4901 a 2.6.0 blocker in order to fix this bug.

Comment by Andreas Dilger [ 21/Apr/14 ]

Closing LU-4901 so there is only a single blocker bug for this problem. This bug will be kept open as a 2.6 blocker.

Comment by Mikhail Pershin [ 23/Apr/14 ]

I agree that LU-4901 is preferable but I didn't estimate how complex patch may be. At the same time the fix is simple and straightforward - there is just setattr declaration is missing. Therefore I will start with LU-4901 patch but switch to the simple fix just to get rid of this issue if there will be any unexpected complexity.

Comment by Mikhail Pershin [ 26/Apr/14 ]

http://review.whamcloud.com/#/c/10108/ - patch removes 3-steps write for llog records, eliminating the extra credits need for error case.

Comment by Jinshan Xiong (Inactive) [ 29/Apr/14 ]

Is it possible to provide a workaround patch? I'm hitting this issue all the time by running racer.

Comment by Mikhail Pershin [ 30/Apr/14 ]

the link above is the patch, try it. If you need something simpler then add dt_declare_attr_set() in llog_osd_declare_write_rec() right after punch declaration.

Comment by Jodi Levi (Inactive) [ 09/May/14 ]

Patch landed to Master. Please reopen ticket if more work is needed.

Comment by Gregoire Pichon [ 12/May/14 ]

Hi,

It seems we hit the same issue in lustre 2.4.2.
Would it be possible to have the patch backported in b2_4 ?

thanks,
Grégoire.

Comment by Peter Jones [ 04/Sep/14 ]

b2_5 path http://review.whamcloud.com/11751

Comment by Alex Zhuravlev [ 22/Oct/14 ]

Peter, this can be improved with http://review.whamcloud.com/#/c/12169/ IMO

Comment by Peter Jones [ 22/Oct/14 ]

Alex

Then could you please open a new ticket to track that work and link it to this one?

Thanks

Peter

Comment by Alex Zhuravlev [ 23/Oct/14 ]

Peter, I'm not sure how to link, but that's it - LU-5794

thanks, Alex

Comment by Peter Jones [ 23/Oct/14 ]

Thanks Alex. For future reference you find Link under the More tab

Comment by Andreas Dilger [ 31/Oct/14 ]

You can also add links between patches by clicking the "+" on the right side of the Issue Links header.

Generated at Sat Feb 10 01:43:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.