[LU-3732] osd_io.c:320:osd_do_bio()) ASSERTION( iobuf->dr_rw == 0 ) failed: page_idx 4, block_idx 4, i 0 Created: 09/Aug/13  Updated: 11/May/15  Resolved: 11/May/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: John Hammond
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: osd-ldiskfs, trinity
Environment:

Using current master 2.4.53-22-g295968f on CentOS 6.4 2.6.32-358.11.1.el6.lustre.x86_64.


Issue Links:
Duplicate
duplicates LU-6489 osd-ldiskfs checks s_maxbytes limits ... Resolved
Severity: 3
Rank (Obsolete): 9631

 Description   

I don't have a simple reproducer, but running trinity on a Lustre client mount triggers this easily. Even with the weird and dangerous non-filesystem-related stuff turned off, I still see it.

LustreError: 3395:0:(osd_io.c:320:osd_do_bio()) ASSERTION( iobuf->dr_rw == 0 ) failed: page_idx 4, block_idx 4, i 0
LustreError: 3395:0:(osd_io.c:320:osd_do_bio()) LBUG
Pid: 3395, comm: ll_ost_io01_001

Call Trace:
 [<ffffffffa04ec895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa04ece97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0c7b228>] osd_do_bio+0x7f8/0x800 [osd_ldiskfs]
 [<ffffffffa0bf70bb>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffffa0c2c348>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0c7dbb8>] osd_write_commit+0x328/0x610 [osd_ldiskfs]
 [<ffffffffa0e7ac84>] ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa0e7d9ed>] ofd_commitrw+0x5cd/0xbb0 [ofd]
 [<ffffffffa06397e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs]
 [<ffffffffa0dbe1e8>] obd_commitrw+0x128/0x3d0 [ost]
 [<ffffffffa0dc82d1>] ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff81282b36>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa07e2310>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0dce75e>] ost_handle+0x3a8e/0x4030 [ost]
 [<ffffffffa04f8d64>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa0832598>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa04ed54e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa04fea6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa08299a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
 [<ffffffffa083391d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
 [<ffffffffa0832e60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff81096936>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20


 Comments   
Comment by John Hammond [ 14/Aug/13 ]

Seems like an off-by-one-ish kind of error. Here is a simplified reproducer:

buf = malloc(4096);
fd = open("/mnt/lustre/Gena", O_WRONLY|O_CREAT, 0644);
pwrite(fd, buf, 4096, 0x7fffffffffff);
Comment by Alex Zhuravlev [ 15/Aug/13 ]

Check my math, please:

(gdb) p (0x7fffffffffffULL / 4096) >> 32
$5 = 7

while with ldiskfs:

/*
 * Maximum number of logical blocks in a file; ldiskfs_extent's ee_block is
 * __le32.
 */
#define EXT_MAX_BLOCKS 0xffffffff

I guess someone (ldiskfs or fsfilt) should be checking that the offset is in the supported range.
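
As an illustration of the kind of range check suggested here, the following is a minimal sketch, not the actual ldiskfs or osd-ldiskfs code. The helper names offset_to_lblock() and range_fits_extent_index() are hypothetical, the block size is assumed to be 4096 bytes (blkbits = 12), and the EXT_MAX_BLOCKS value is taken from the snippet quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value copied from the ldiskfs snippet quoted in the comment above. */
#define EXT_MAX_BLOCKS 0xffffffffULL

/* Hypothetical helper: logical block index that contains byte offset off. */
static inline uint64_t offset_to_lblock(uint64_t off, unsigned int blkbits)
{
	return off >> blkbits;
}

/*
 * Hypothetical helper: true if every block touched by [off, off + len)
 * has an index below EXT_MAX_BLOCKS, i.e. one that a 32-bit ee_block
 * can represent. This is an assumption about what "supported range"
 * means, not the check that was eventually merged.
 */
static inline bool range_fits_extent_index(uint64_t off, uint64_t len,
					   unsigned int blkbits)
{
	uint64_t last;

	if (len == 0)
		return true;
	if (off > UINT64_MAX - (len - 1))
		return false;	/* off + len - 1 would overflow 64 bits */

	last = offset_to_lblock(off + len - 1, blkbits);
	return last < EXT_MAX_BLOCKS;
}
```

With blkbits = 12, offset_to_lblock(0x7fffffffffff, 12) is 0x7ffffffff, whose upper bits are the 7 shown in the gdb output, so range_fits_extent_index(0x7fffffffffff, 4096, 12) is false.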

Comment by John Hammond [ 15/Aug/13 ]

OK, but there may be more than one supported range. Using an offset of 0x7ffffffff000 or 0x800000000000 is fine. However, 0x7ffffffff001 triggers the same assertion.
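
To make the offsets in this comment easier to compare, here is a small arithmetic sketch. It only computes which blocks a 4096-byte write touches and what those indices look like when truncated to 32 bits; it is not a diagnosis, and whether such truncation actually happens in the failing path is an assumption. The struct blk_span and write_span() names are hypothetical, and a 4096-byte block size (blkbits = 12) is assumed.

```c
#include <stdint.h>

/* Hypothetical container for the blocks touched by one write. */
struct blk_span {
	uint64_t first;		/* 64-bit index of the first block touched */
	uint64_t last;		/* 64-bit index of the last block touched */
	uint32_t first32;	/* first, truncated to 32 bits */
	uint32_t last32;	/* last, truncated to 32 bits */
};

/*
 * Hypothetical helper. The caller must pass len > 0 and an off/len
 * pair for which off + len - 1 does not overflow 64 bits.
 */
static inline struct blk_span write_span(uint64_t off, uint64_t len,
					 unsigned int blkbits)
{
	struct blk_span s;

	s.first = off >> blkbits;
	s.last = (off + len - 1) >> blkbits;
	s.first32 = (uint32_t)s.first;
	s.last32 = (uint32_t)s.last;
	return s;
}
```

For a 4096-byte write, offsets 0x7ffffffff000 and 0x800000000000 each touch a single block (0x7ffffffff and 0x800000000 respectively), while 0x7ffffffff001 and the original 0x7fffffffffff both span blocks 0x7ffffffff and 0x800000000, whose 32-bit truncations are 0xffffffff and 0.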

Comment by Henri Doreau (Inactive) [ 11/Feb/14 ]

I stumbled upon this crash as well. Offset 0x7ffffffff000 does trigger it for me, but, as in your case, 0x800000000000 works fine. It seems that ldiskfs_ext_new_extent_cb isn't even called when the crash occurs, which leaves iobuf->dr_blocks containing only zeroes. I have traced it extensively but am unsure how best to fix it.
