[LU-16030] sanity-pcc test_45: attach more than 2 time: 0 Created: 20/Jul/22 Updated: 01/May/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Qian Yingjin <qian@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5cddefde-7d18-45d2-9fe6-496bfef626d4

test_45 failed with the following error:

attach more than 2 time: 0 |
| Comments |
| Comment by Qian Yingjin [ 07/Dec/22 ] |
|
It seems that this failure occurred on the newer kernels: SLES 15 SP3 and Ubuntu 20.04. |
| Comment by Andreas Dilger [ 25/Jan/23 ] |
|
+3 in past week: |
| Comment by Qian Yingjin [ 17/Mar/23 ] |
|
After investigating the problem, it has been narrowed down to a generic Lustre bug on newer kernels (Ubuntu 22.04) that occurs even without PCC. The bug can easily be reproduced by the following script (without any PCC setup on the client):
#!/bin/bash
error() {
echo "$@"
exit 1
}
HSMTOOL_ARCHIVE_FORMAT=v2
LCTL=$(which lctl)
LFS=$(which lfs)
MMAP_CAT="/usr/lib/lustre/tests/mmap_cat"
DIR="/mnt/lustre"
tdir="sanity-pcc.d99"
tfile="sanity-pcc.f99"
hsm_root="/mnt/pcc"
mntpt="/mnt/pcc"
dir=$DIR/$tdir
file=$dir/$tfile
id=2
cnt=30
DEBUGFS=/sys/kernel/debug/tracing
CURRENT_TRACER="function_graph"
SET_FUNCTION_GRAPH="*:mod:lustre"
lctl set_param debug=cache+inode+iotrace+vfstrace+info+dlmtrace+trace+reada
lctl set_param subsystem_debug=llite+ldlm+osc
lctl set_param debug_mb=40000
lctl get_param debug_mb
rm -rf $mntpt/*
rm -rf $DIR/*
rm -f logC
rm -f mytrace.dat
echo "$CURRENT_TRACER" > $DEBUGFS/current_tracer
echo 12 > $DEBUGFS/max_graph_depth
#echo "$SET_FUNCTION_GRAPH" > $DEBUG_FS/set_ftrace_filter
echo '' > $DEBUG_FS/set_ftrace_filter
mkdir -p $dir
lctl clear
dd if=/dev/zero of=$file bs=1M count=$cnt || error "Write $file failed"
lctl clear
(
while [ ! -e $DIR/sanity-pcc.99.lck ]; do
dd if=/dev/zero of=$file bs=1M count=$cnt conv=notrunc ||
error "failed to write $file"
sleep 0.$((RANDOM % 4 + 1))
done
)&
wpid=$!
(
while [ ! -e $DIR/sanity-pcc.99.lck ]; do
#lctl clear
echo > $DEBUGFS/trace
echo 1 > $DEBUGFS/tracing_on
dd if=$file of=/dev/null bs=1M count=$cnt &
pid=$!
echo "$pid" > $DEBUGFS/set_ftrace_pid
wait $pid ||
{
rc=$?
cp $DEBUGFS/trace mytrace.dat
echo 0 > $DEBUGFS/tracing_on
echo 'nop' > $DEBUGFS/current_tracer
echo "failed to read $file: rc = $rc pid = $pid"
lctl dk > logA
error "failed to read $file: rc = $?"
}
sleep 0.$((RANDOM % 4 + 1))
echo 0 > $DEBUGFS/tracing_on
done
)&
rpid=$!
(
while [ ! -e $DIR/sanity-pcc.99.lck ]; do
dd if=$file of=/dev/null bs=1M count=$cnt iflag=direct ||
error "failed to mmap_cat $file"
sleep 0.$((RANDOM % 4 + 1))
done
true
)&
rpid2=$!
sleep 5
touch $DIR/sanity-pcc.99.lck
wait $wpid || error "$?: write failed"
wait $rpid || error "$?: read failed: pid=$rpid"
wait $rpid2 || error "$?: read2 failed"
The output:
30+0 records in
30+0 records out
31457280 bytes (31 MB, 30 MiB) copied, 0.353618 s, 89.0 MB/s
dd: error reading '/mnt/lustre/sanity-pcc.d99/sanity-pcc.f99': Input/output error
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.459942 s, 9.1 MB/s
30+0 records in
Three processes concurrently doing buffered reads, direct I/O reads, and writes on the same file will get -EIO. The EIO code is returned from:

static int filemap_read_page(struct file *file, struct address_space *mapping,
		struct page *page)
{
	int error;

	/*
	 * A previous I/O error may have been due to temporary failures,
	 * eg. multipath errors. PG_error will be set again if readpage
	 * fails.
	 */
	ClearPageError(page);
	/* Start the actual read. The read will unlock the page. */
	error = mapping->a_ops->readpage(file, page);
	if (error)
		return error;

	error = wait_on_page_locked_killable(page);
	if (error)
		return error;
	if (PageUptodate(page))
		return 0;
	shrink_readahead_size_eio(&file->f_ra);
	return -EIO;	/* <=== return code!! */
}
Still need more time to investigate the bug....
|
| Comment by Qian Yingjin [ 20/Mar/23 ] |
|
After investigating the bug, the problem appears to be the following: the client has two overlapping PR DLM extent locks:
A reader process holds L1 and reads data in the range [3M, 4M - 1]. L2 is being revoked due to the conflicting access. In the ->readpage() call, the page is unlocked once it is in the Uptodate state, and the lock blocking callback for L2 then clears the Uptodate state and deletes the page from the cache (a sketch of that discard is at the end of this comment). In the old kernel, the read path would then do:
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
if (unlikely(error)) {
	if (error == AOP_TRUNCATED_PAGE) {
		put_page(page);
		error = 0;
		goto find_page;
	}
	goto readpage_error;
}

if (!PageUptodate(page)) {
	error = lock_page_killable(page);
	if (unlikely(error))
		goto readpage_error;
	if (!PageUptodate(page)) {
		if (page->mapping == NULL) {	/* <===== */
			/*
			 * invalidate_mapping_pages got it
			 */
			unlock_page(page);
			put_page(page);
			goto find_page;
		}
		unlock_page(page);
		shrink_readahead_size_eio(filp, ra);
		error = -EIO;
		goto readpage_error;
	}
	unlock_page(page);
}

goto page_ok;

It checks whether the page was truncated and deleted from the page cache via page->mapping; if so, it retries the page read.
But the newer kernel deleted this check and retry, and there does not seem to be any simple solution for this bug:

static int filemap_read_page(struct file *file, struct address_space *mapping,
		struct page *page)
{
	int error;

	/*
	 * A previous I/O error may have been due to temporary failures,
	 * eg. multipath errors. PG_error will be set again if readpage
	 * fails.
	 */
	ClearPageError(page);
	/* Start the actual read. The read will unlock the page. */
	error = mapping->a_ops->readpage(file, page);
	if (error)
		return error;

	error = wait_on_page_locked_killable(page);
	if (error)
		return error;
	if (PageUptodate(page))
		return 0;
	shrink_readahead_size_eio(&file->f_ra);
	return -EIO;
}
I also found that it may cause short reads (partial reads) during the test, which may be related to DDN-3569.
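For illustration only, here is a minimal sketch of what the discard described above amounts to at the page-cache level. The function name example_discard_page is hypothetical and these are generic kernel page-cache calls, not the actual Lustre blocking-AST helpers:

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical helper: drop one cached page when the covering DLM extent
 * lock (L2 in the example above) is revoked.  Afterwards the page is
 * !PageUptodate and no longer in the mapping, which is exactly the state
 * that makes the newer filemap_read_page() return -EIO instead of
 * retrying the read.
 */
static void example_discard_page(struct page *page)
{
	lock_page(page);
	if (page->mapping) {
		/* a racing reader will now see !PageUptodate */
		ClearPageUptodate(page);
		/* remove the (locked) page from the page cache */
		delete_from_page_cache(page);
	}
	unlock_page(page);
}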
|
| Comment by Qian Yingjin [ 20/Mar/23 ] |
|
According to the Linux documentation, Documentation/filesystems/locking.rst:

Locking rules:
All except set_page_dirty and freepage may block

====================== ======================== ========= ===============
ops                    PageLocked(page)         i_rwsem   invalidate_lock
====================== ======================== ========= ===============
writepage:             yes, unlocks (see below)
readpage:              yes, unlocks                        shared
writepages:
set_page_dirty         no
readahead:             yes, unlocks                        shared
readpages:             no                                  shared
write_begin:           locks the page                      exclusive
write_end:             yes, unlocks                        exclusive
bmap:
invalidatepage:        yes                                 exclusive
releasepage:           yes
freepage:              yes
direct_IO:
isolate_page:          yes
migratepage:           yes (both)
putback_page:          yes
launder_page:          yes
is_partially_uptodate: yes
error_remove_page:     yes
swap_activate:         no
swap_deactivate:       no
====================== ======================== ========= ===============

The filesystem must exclusively acquire invalidate_lock before invalidating page cache in the truncate / hole punch path (and thus calling into ->invalidatepage) to block races between page cache invalidation and page cache filling functions (fault, read, ...).
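A rough sketch of that rule, assuming a 5.15+ kernel that provides the invalidate_lock helpers; the function names example_invalidate_range and example_fill are hypothetical and this is not Lustre code:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Invalidation side (truncate / hole punch / cache discard): take
 * invalidate_lock exclusively around removing pages from the mapping.
 */
static void example_invalidate_range(struct inode *inode, loff_t start, loff_t end)
{
	struct address_space *mapping = inode->i_mapping;

	filemap_invalidate_lock(mapping);
	truncate_inode_pages_range(mapping, start, end);
	filemap_invalidate_unlock(mapping);
}

/*
 * Page-cache filling side (read/fault): take invalidate_lock shared so
 * filling cannot race with an invalidation in progress.
 */
static void example_fill(struct address_space *mapping)
{
	filemap_invalidate_lock_shared(mapping);
	/* ... bring pages Uptodate here ... */
	filemap_invalidate_unlock_shared(mapping);
}
|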
| Comment by Qian Yingjin [ 20/Mar/23 ] |
|
I will cook a patch to verify it later. |
| Comment by Qian Yingjin [ 21/Mar/23 ] |
|
Verified that after acquiring invalidate_lock during the lock blocking AST (extent lock), the test passed without -EIO.
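A minimal sketch of the shape of such a fix, under the assumption that the blocking AST discards the cached pages covering the revoked extent; example_blocking_ast_discard and its arguments are hypothetical, not the actual patch:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical shape of the verified fix: the DLM blocking AST holds the
 * mapping's invalidate_lock exclusively while it discards the cached
 * pages covered by the revoked extent lock, so a concurrent buffered
 * read cannot observe a page that was deleted mid-read.
 */
static void example_blocking_ast_discard(struct inode *inode,
					 loff_t ext_start, loff_t ext_end)
{
	struct address_space *mapping = inode->i_mapping;

	filemap_invalidate_lock(mapping);
	truncate_inode_pages_range(mapping, ext_start, ext_end);
	filemap_invalidate_unlock(mapping);
}
|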
| Comment by Andreas Dilger [ 23/Apr/23 ] |
|
Yingjin, what is the path forward for this bug? Is there a fix in Lustre that can be made, possibly one already in flight for the other read cache issues, or is the only fix to submit a patch to Ubuntu to backport a fix from upstream to their kernel? |
| Comment by Andreas Dilger [ 23/Apr/23 ] |
I recall that when a lock was being cancelled, the client would check for an overlapping read lock and attach the page to the second lock rather than invalidating it. I believe that was stopped because it added overhead for cancelling locks, and the thought was that the page would be refreshed if it was accessed again.

In addition to patching the kernel to fix this retry issue there, another option would be to do an interval tree lookup on the other DLM locks on this resource to see if there is an overlapping lock extent, and then bulk transfer all of the pages from the old lock to the new lock, rather than doing this one page at a time. This lock could also be cached in case other pages need to be evicted. In the common case, the client resource will only have a single lock on it, so the lookup would be fast.

We could also examine why the client is requesting (or is granted) overlapping DLM extent locks, since that would also avoid the issue. |
| Comment by Patrick Farrell [ 23/Apr/23 ] |
|
"I recall that when a lock was being cancelled that the client would check for an overlapping read lock and attach the page to the second lock rather than invalidating it. I believe that was stopped because it added overhead for cancelling locks and the thought was that the page would be refreshed if it was accessed again." No, I believe we still do that, though I think we played with not doing so. We actually don't attach pages to locks in any way - they just live 'under' the DLM lock. We just keep page if there's a still lock covering them - check_and_discard_cb. There's no linkage from a page to a lock (and so also pages don't hold references on locks. It would be quite expensive to do so.) So, about the bug here, just a moment... |
| Comment by Patrick Farrell [ 23/Apr/23 ] |
|
So, I'm about 95% sure the bug here is fixed - on all kernels - by:

The described situation is identical. But that fix is a bit of a hack, and the 'right' way to do it is to use the invalidate lock as Yingjin describes here. It just doesn't exist on older kernels.

Yingjin made patches using the invalidate lock. See the series here:

That series ends by removing the hack from

But to be clear: Yingjin, please confirm and correct if I have anything badly wrong |
| Comment by Patrick Farrell [ 23/Apr/23 ] |
|
"We could also examine why the client is requesting (or is granted) overlapping DLM extent locks, since that would also avoid the issue." So, just an additional note: Consider this case: Client 1 has a write lock from 1-3 MiB on a file Also consider that server lock expansion isn't needed either, if the second read request is for 0.5 MiB to 1.5 MiB. We don't split read/write requests across dlm locks (within an OSC), for various good reasons, so this also results in a second overlapping read lock. To be fair, we also don't really protect against racing lock requests from different processes on the same client. We do a little bit of it - the OSC locking code can catch it - but it's racy, so sometimes we can get multiple identical read locks from parallel read requests (we can also get overlapping write lock requests from the same client when there are overlapping writes from userspace, resulting in a self-conflict, which is resolved safely with the usual lock callback mechanisms, just with the slightly odd sight of a client requesting two write locks on the same area of a file). I looked at fixing these overlaps and self-conflicts back when I was first working on lockahead (it's very noticeable with lockahead, particularly if you screw up in userspace and request the same lock over and over), but concluded the way it is is basically fine and not worth the trouble to fix. It happens only occasionally during parallel reads and writes and because it's rare and transient, I couldn't see a meaningful cost. (OK, I think this is really my final edit...) |