[LU-2099] osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) sanity test 63a Created: 07/Oct/12  Updated: 14/May/14  Resolved: 14/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

single node, 4 cpu cores. lustre files in /tmp that is tmpfs


Attachments: Text File dmesg.txt     Text File log.gz    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 4386

 Description   

Just hit this when testing with USE_OFD=yes

[  929.726136] Lustre: DEBUG MARKER: == sanity test 63a: Verify oig_wait interruption does not crash ========= 03:34:59 (1349595299)
[  964.438007] LustreError: 28767:0:(osd_io.c:107:osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: 
[  964.439525] LustreError: 28767:0:(osd_io.c:107:osd_init_iobuf()) LBUG
[  964.441022] Pid: 28767, comm: ll_ost_io00_004
[  964.442579] 
[  964.442580] Call Trace:
[  964.445464]  [<ffffffffa03f4915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[  964.446900]  [<ffffffffa03f4f27>] lbug_with_loc+0x47/0xb0 [libcfs]
[  964.448333]  [<ffffffffa0c41a82>] osd_init_iobuf+0xd2/0xe0 [osd_ldiskfs]
[  964.449770]  [<ffffffffa0c44b0b>] osd_write_prep+0x17b/0x420 [osd_ldiskfs]
[  964.451208]  [<ffffffffa0c43976>] ? osd_bufs_get+0x206/0x390 [osd_ldiskfs]
[  964.452627]  [<ffffffffa0e26daa>] ofd_preprw+0x7da/0x11c0 [ofd]
[  964.454077]  [<ffffffffa0781cf4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
[  964.455474]  [<ffffffffa0d557fc>] obd_preprw+0x12c/0x3d0 [ost]
[  964.456890]  [<ffffffffa0d5c932>] ost_brw_write+0x882/0x15f0 [ost]
[  964.458284]  [<ffffffff8127a646>] ? vsnprintf+0x2b6/0x5f0
[  964.459696]  [<ffffffffa0754cbc>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
[  964.461128]  [<ffffffffa0754e18>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
[  964.462513]  [<ffffffffa0d621f0>] ost_handle+0x3120/0x4550 [ost]
[  964.463899]  [<ffffffffa0401464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[  964.465329]  [<ffffffffa0762883>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[  964.466745]  [<ffffffffa03f566e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[  964.468163]  [<ffffffffa075b571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[  964.469568]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[  964.470993]  [<ffffffffa076541a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[  964.472399]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[  964.473806]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[  964.475201]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[  964.476610]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[  964.477937]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[  964.479240] 
[  964.481682] Kernel panic - not syncing: LBUG
[  964.482986] Pid: 28767, comm: ll_ost_io00_004 Not tainted 2.6.32-debug #6
[  964.484296] Call Trace:
[  964.485627]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[  964.486964]  [<ffffffffa03f4f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[  964.488261]  [<ffffffffa0c41a82>] ? osd_init_iobuf+0xd2/0xe0 [osd_ldiskfs]
[  964.489572]  [<ffffffffa0c44b0b>] ? osd_write_prep+0x17b/0x420 [osd_ldiskfs]
[  964.490866]  [<ffffffffa0c43976>] ? osd_bufs_get+0x206/0x390 [osd_ldiskfs]
[  964.492151]  [<ffffffffa0e26daa>] ? ofd_preprw+0x7da/0x11c0 [ofd]
[  964.493512]  [<ffffffffa0781cf4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
[  964.494817]  [<ffffffffa0d557fc>] ? obd_preprw+0x12c/0x3d0 [ost]
[  964.496116]  [<ffffffffa0d5c932>] ? ost_brw_write+0x882/0x15f0 [ost]
[  964.497432]  [<ffffffff8127a646>] ? vsnprintf+0x2b6/0x5f0
[  964.498774]  [<ffffffffa0754cbc>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
[  964.500109]  [<ffffffffa0754e18>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
[  964.501424]  [<ffffffffa0d621f0>] ? ost_handle+0x3120/0x4550 [ost]
[  964.502716]  [<ffffffffa0401464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[  964.504014]  [<ffffffffa0762883>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[  964.505319]  [<ffffffffa03f566e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[  964.506593]  [<ffffffffa075b571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[  964.507817]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[  964.509104]  [<ffffffffa076541a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[  964.510345]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[  964.511510]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[  964.512795]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[  964.513993]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[  964.515105]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

I have a crashdump for this instance too in case you need something from there.



 Comments   
Comment by Oleg Drokin [ 07/Oct/12 ]

Seems to be mostly the sane as ORI-258

Comment by Oleg Drokin [ 07/Oct/12 ]

I just hit this one two more times on two different nodes, so this one is certainly a very real problem.

Comment by Alex Zhuravlev [ 08/Oct/12 ]

do you have crash dumps for the case?

Comment by Alex Zhuravlev [ 08/Oct/12 ]

http://review.whamcloud.com/#change,4216

Comment by Oleg Drokin [ 08/Oct/12 ]

I've yet to hit it again with the patch applied, though there was one strange problem hit in the same path once.

Comment by Alex Zhuravlev [ 10/Oct/12 ]

from what I heard last time, no more hits happened with the patch.

Comment by Jinshan Xiong (Inactive) [ 18/Apr/14 ]

Seeing this issue again on latest master.

loop: Write error at byte offset 243236864, length 4096.
LustreError: 15444:0:(osd_io.c:71:__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: iobuf ffff880548d92c00, reqs 0, rw 1, line 1079
LustreError: 15444:0:(osd_io.c:71:__osd_init_iobuf()) LBUG
Pid: 15444, comm: ll_ost_io01_021

Call Trace:
[<ffffffffa09e8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa09e8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa054c8b8>] __osd_init_iobuf+0x338/0x350 [osd_ldiskfs]
[<ffffffffa1232b83>] ? lu_context_init+0xa3/0x240 [obdclass]
[<ffffffffa0550a42>] osd_write_commit+0x152/0x610 [osd_ldiskfs]
[<ffffffffa08821fa>] ofd_commitrw_write+0x59a/0x11e0 [ofd]
[<ffffffffa088336a>] ofd_commitrw+0x52a/0x8c0 [ofd]
[<ffffffffa1210411>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
[<ffffffffa14c1c8d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
[<ffffffffa14c8eae>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
[<ffffffff81059216>] ? enqueue_task+0x66/0x80
[<ffffffff8105571d>] ? check_preempt_curr+0x6d/0x90
[<ffffffff81065c5e>] ? try_to_wake_up+0x24e/0x3e0
[<ffffffffa14247c0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[<ffffffffa14c79ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
[<ffffffffa146635c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
[<ffffffffa147698a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
[<ffffffff81068370>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff81527e08>] ? schedule+0x178/0x3b2
[<ffffffffa1475c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
[<ffffffff8109aee6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ae50>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20

Comment by Jinshan Xiong (Inactive) [ 07/May/14 ]

I can't continue my testing since I'm keeping seeing this issue on 2.5 and 2.6

Comment by Alex Zhuravlev [ 12/May/14 ]

could you grab full debug log for the case please? with the following patch:

— a/lustre/osd-ldiskfs/osd_io.c
+++ b/lustre/osd-ldiskfs/osd_io.c
@@ -1139,7 +1139,9 @@ static int osd_write_commit(const struct lu_env *env, struct dt_object *dt,
rc = osd_do_bio(osd, inode, iobuf);
/* we don't do stats here as in read path because

  • write is async: we'll do this in osd_put_bufs() */
  • }
    + } else { + CERROR("NO IO: %d\n", rc); + }

if (unlikely(rc != 0)) {
/* if write fails, we should drop pages from the cache */

Comment by Jinshan Xiong (Inactive) [ 12/May/14 ]

Log and dmesg.

I have a crash dump please let me know if you want it

Comment by Alex Zhuravlev [ 13/May/14 ]

please, try with http://review.whamcloud.com/#/c/10305/
I'm still doing local testing, but I'd appreciate your help given you can reproduce this easily.
thanks in advance.

Comment by Jinshan Xiong (Inactive) [ 13/May/14 ]

it looks good in the initial testing

Comment by Peter Jones [ 14/May/14 ]

Landed for 2.6

Generated at Sat Feb 10 01:22:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.