[LU-1831] filter_direct_io()) ASSERTION( iobuf->dr_npages > 0 ) Created: 05/Sep/12  Updated: 10/Sep/12  Resolved: 10/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Alex Zhuravlev Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment: local testing in vbox

Issue Links:
Duplicate
duplicates LU-1824 Test failure on test suite obdfilter-... Resolved
Severity: 3
Rank (Obsolete): 6339

 Description   

Lustre: DEBUG MARKER: == sanityn test 15: test out-of-space with multiple writers ========================================== 18:15:33 (1346854533)
LustreError: 7743:0:(filter_io_26.c:484:filter_direct_io()) ASSERTION( iobuf->dr_npages > 0 ) failed:
LustreError: 7743:0:(filter_io_26.c:484:filter_direct_io()) LBUG
Pid: 7743, comm: ll_ost_io01_002

Call Trace:
[<00000000dfd5b7a1>] libcfs_debug_dumpstack+0x51/0x80 [libcfs]
[<00000000dfd5bf7b>] lbug_with_loc+0x3b/0xa0 [libcfs]
[<00000000e0e6e437>] filter_do_bio+0x21c7/0x2a00 [obdfilter]
[<00000000e0e7001f>] filter_commitrw_write+0x13af/0x5010 [obdfilter]
[<00000000dfd5ccd3>] ? cfs_alloc+0x23/0xf0 [libcfs]
[<00000000c109ea7a>] ? cache_alloc_debugcheck_after.isra.40+0xca/0x180
[<00000000dfd5ccd3>] ? cfs_alloc+0x23/0xf0 [libcfs]
[<00000000c109ed85>] ? __kmalloc+0xb5/0x1a0
[<00000000c10876a0>] ? kzfree+0x30/0xc0
[<00000000e0e623fc>] filter_commitrw+0x29c/0x340 [obdfilter]
[<00000000c13daa38>] ? _spin_unlock+0x8/0x10
[<00000000dff2ec4c>] ? lprocfs_counter_add+0x14c/0x1e0 [lvfs]
[<00000000e029443f>] ost_brw_write+0x17fd/0x22e2 [ost]



 Comments   
Comment by Andreas Dilger [ 05/Sep/12 ]

I'm also able to hit this problem 100% of the time on my single-node testing system (real hardware), though I see it at the end of racer.sh when it is trying to unmount the filesystem.

My first guess would be either a recent change in the RHEL 6.3 kernel bio layer that breaks an assumption of this code, or a recent modification to nearby code. This code itself has been around for a very long time and I only started seeing the problem with the update to RHEL 6.3, but there was a window of 2 months or so where I didn't run any tests locally because the ldiskfs patches wouldn't build against the RHEL 6.2 kernel I had installed.

I can make the node available for remote debugging if that is needed. However, it isn't ideal for a bug which crashes the node, since it cannot reboot automatically.

Comment by Alex Zhuravlev [ 05/Sep/12 ]

I hit this quite often and can help to collect data.

Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ]

Besides this problem, I also saw that the module refcount of obdfilter is still 2 after unmounting all OST targets.

Comment by Peter Jones [ 06/Sep/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 06/Sep/12 ]

I checked git log lustre/obdfilter/filter_io.c for changes that had been made recently to that code, and found the following commits:

commit 859e5b2d20552f8df0ed2afda0f1a7c3c7d86678
Author: Hongchao Zhang <hongchao.zhang@whamcloud.com>
Date:   Mon Aug 20 15:40:48 2012 +0800

    LU-657 obdfilter: fix bug in previous patch
    
    in the merged patch http://review.whamcloud.com/#change,3446,
    the usage of fsfilt_commit_wait is wrong, and it doesn't stop
    the journal firstly.
    
    Change-Id: I3a36edf7049466880c27c14bb7f99966aa75d4f1
    Reviewed-on: http://review.whamcloud.com/3692

commit a9597791b658ff51474c06f419162d0a0bf03c65
Author: Hongchao Zhang <hongchao.zhang@whamcloud.com>
Date:   Tue Aug 7 08:43:11 2012 +0800

    LU-657 obdfilter: commit pending journals if -ENOSPC
    
    in filter_preprw_write, if there is no enough space for this
    write operation, then commit the pending journals to get some
    more disk space and retry it.
    
    Change-Id: I46106b26443bb203eee6f01a0795b47be09170a6
    Reviewed-on: http://review.whamcloud.com/3446

Reverting these two patches has allowed me to pass both racer.sh and sanityn.sh test_15 (OOS), which failed for me this morning.

I suspect there is some kind of refcount problem in the retry code from this patch, and that it is only hit when the new code is active, i.e. when the filesystem is nearly out of space.
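
For readers unfamiliar with the pattern described in the LU-657 commit message (on -ENOSPC, commit the pending journals and retry), here is a minimal sketch in C. It is a toy model only: struct toy_target, toy_reserve(), toy_commit_pending() and toy_prepare_write() are hypothetical names invented for illustration and are not Lustre code, and the sketch does not claim to reproduce the actual cause of the assertion. It only shows that a retry has to see the same request state (here, the page count) as the first attempt.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a target's space accounting; not a Lustre type. */
struct toy_target {
    long free_blocks;     /* blocks usable right now */
    long pending_blocks;  /* blocks released once pending journals commit */
    int  commits;         /* number of forced commits so far */
};

/* Hypothetical: reserve `need` blocks, or fail with -ENOSPC. */
static int toy_reserve(struct toy_target *t, long need)
{
    if (need > t->free_blocks)
        return -ENOSPC;
    t->free_blocks -= need;
    return 0;
}

/* Hypothetical: committing pending journals makes their blocks usable. */
static void toy_commit_pending(struct toy_target *t)
{
    t->free_blocks += t->pending_blocks;
    t->pending_blocks = 0;
    t->commits++;
}

/*
 * Reserve space for a write of `npages` pages (one block per page in
 * this toy model). On -ENOSPC, force one commit and retry once.
 * `npages` is not modified between attempts, so the retry works on the
 * same non-empty request as the first attempt.
 */
static int toy_prepare_write(struct toy_target *t, long npages)
{
    int rc;

    assert(npages > 0);  /* an empty request is a caller bug */

    rc = toy_reserve(t, npages);
    if (rc == -ENOSPC) {
        toy_commit_pending(t);
        rc = toy_reserve(t, npages);  /* single retry, same request */
    }
    return rc;
}
```

In this model a 4-page write against a target with 2 free and 3 pending blocks succeeds after one forced commit, while a second 4-page write still returns -ENOSPC because nothing is left to commit.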

Comment by Peter Jones [ 06/Sep/12 ]

Hongchao

Could you please look into this?

Thanks

Peter

Comment by Andreas Dilger [ 10/Sep/12 ]

This looks like a duplicate of LU-1824, which Yu Jian is already working on, and has a patch in http://review.whamcloud.com/3913 ready for inspection.
