[LU-1831] filter_direct_io()) ASSERTION( iobuf->dr_npages > 0 ) Created: 05/Sep/12 Updated: 10/Sep/12 Resolved: 10/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alex Zhuravlev | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | local testing in vbox |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 6339 |
| Description |
|
Lustre: DEBUG MARKER: == sanityn test 15: test out-of-space with multiple writers ========================================== 18:15:33 (1346854533) Call Trace: |
| Comments |
| Comment by Andreas Dilger [ 05/Sep/12 ] |
|
I'm also able to hit this problem 100% of the time on my single-node testing system (real hardware), though I see it at the end of racer.sh when it is trying to unmount the filesystem. My first guess would be either some recent change in the RHEL 6.3 kernel bio layer that breaks an assumption of this code, or a recent modification to nearby code. This code itself has been around for a very long time and I only started seeing the problem with the update to RHEL 6.3, but there was a window of 2 months or so where I didn't run any tests locally because the ldiskfs patches wouldn't build against the RHEL 6.2 kernel I had installed. I can make the node available for remote debugging if that is needed. However, it isn't ideal for a bug which crashes the node, since it cannot reboot automatically. |
| Comment by Alex Zhuravlev [ 05/Sep/12 ] |
|
I hit this quite often and can help to collect data. |
| Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ] |
|
Besides this problem, I also saw that the module refcount of obdfilter is 2 after unmounting all OST targets. |
| Comment by Peter Jones [ 06/Sep/12 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Andreas Dilger [ 06/Sep/12 ] |
|
I checked git log lustre/obdfilter/filter_io.c for changes that had been made recently to that code, and found the following commits:

commit 859e5b2d20552f8df0ed2afda0f1a7c3c7d86678
Author: Hongchao Zhang <hongchao.zhang@whamcloud.com>
Date:   Mon Aug 20 15:40:48 2012 +0800

    LU-657 obdfilter: fix bug in previous patch

    in the merged patch http://review.whamcloud.com/#change,3446,
    the usage of fsfilt_commit_wait is wrong, and it doesn't stop
    the journal firstly.

    Change-Id: I3a36edf7049466880c27c14bb7f99966aa75d4f1
    Reviewed-on: http://review.whamcloud.com/3692

commit a9597791b658ff51474c06f419162d0a0bf03c65
Author: Hongchao Zhang <hongchao.zhang@whamcloud.com>
Date:   Tue Aug 7 08:43:11 2012 +0800

    LU-657 obdfilter: commit pending journals if -ENOSPC

    in filter_preprw_write, if there is no enough space for this
    write operation, then commit the pending journals to get some
    more disk space and retry it.

    Change-Id: I46106b26443bb203eee6f01a0795b47be09170a6
    Reviewed-on: http://review.whamcloud.com/3446

Reverting these two patches has allowed me to pass both racer.sh and sanityn.sh test_15 (OOS), which failed for me this morning. I suspect there is some kind of refcount problem in the retry code from this patch, and that it is only hit when the new code is active, that is, when the filesystem is nearly out of space. |
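
For readers unfamiliar with the pattern the LU-657 commit message describes (commit pending journals on -ENOSPC, then retry), the following is a minimal user-space sketch of a retry-once loop. It is not the Lustre code: `struct fake_ost`, `reserve_space`, `commit_pending` and `reserve_with_retry` are hypothetical names invented for illustration, and the one-page-per-block accounting is a toy assumption. It only shows why state rebuilt on each attempt (here `npages`) has to be reset before the retry, which is the kind of bookkeeping a later `npages > 0` style assertion would depend on.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an OST's space accounting; not Lustre code. */
struct fake_ost {
    long free_blocks;    /* blocks usable right now */
    long pending_blocks; /* blocks released once the journal commits */
};

/* Hypothetical: try to reserve space for a write of `blocks` blocks. */
static int reserve_space(struct fake_ost *ost, long blocks)
{
    if (blocks > ost->free_blocks)
        return -ENOSPC;
    ost->free_blocks -= blocks;
    return 0;
}

/* Hypothetical: force a commit so that pending frees become usable. */
static void commit_pending(struct fake_ost *ost)
{
    ost->free_blocks += ost->pending_blocks;
    ost->pending_blocks = 0;
}

/*
 * Reserve space, retrying exactly once after a commit if the first
 * attempt returns -ENOSPC. Per-attempt state (*npages) is reset at the
 * top of every pass so a failed first attempt cannot leak into the retry.
 */
static int reserve_with_retry(struct fake_ost *ost, long blocks, int *npages)
{
    int retried = 0;
    int rc;

retry:
    *npages = 0;
    rc = reserve_space(ost, blocks);
    if (rc == -ENOSPC && !retried) {
        retried = 1;
        commit_pending(ost);
        goto retry;
    }
    if (rc == 0)
        *npages = (int)blocks; /* toy model: one page per block */
    return rc;
}
```

In this toy model, an OST with 2 free and 8 pending blocks fails the first attempt at a 5-block write, succeeds after the commit, and ends with 5 free blocks and `npages` equal to 5. If even the commit does not free enough space, the function returns -ENOSPC with `npages` left at 0.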
| Comment by Peter Jones [ 06/Sep/12 ] |
|
Hongchao, could you please look into this? Thanks, Peter |
| Comment by Andreas Dilger [ 10/Sep/12 ] |
|
This looks like a duplicate of |