[LU-10356] CLIO simplification broke a direct IO sometimes Created: 08/Dec/17 Updated: 29/May/21 Resolved: 27/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL7 / 2.10; RHEL7 / master; # ./diotest5 -b 65536 -i 1000 -f /mnt/lustre/file2; test taken from the LTP suite (full debug needs to be enabled for me to hit it easily). |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This code is present in the 3.10 kernel and some earlier kernels (RHEL6 likely has it as well). In __generic_file_aio_write(), a direct write that stops short falls back to buffered writes:

	written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
						ppos, count, ocount);
	/*
	 * If the write stopped short of completing, fall back to
	 * buffered writes.  Some filesystems do this for writes to
	 * holes, for example.  For DAX files, a buffered write will
	 * not succeed (even if it did, DAX does not handle dirty
	 * page-cache pages correctly).
	 */
	if (written < 0 || written == count || IS_DAX(inode))
		goto out;

	pos += written;
	count -= written;
	written_buffered = generic_file_buffered_write(iocb, iov,
					nr_segs, pos, ppos, count,
					written);
This leads to a situation where the buffered IO is sent long after the direct IO for the same segment has finished:
00000080:00200000:1.0:1512485492.000543:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=393216=60000 pages 16 (max 348160)
00000080:00008000:1.0:1512485492.003959:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 1.
00000080:00008000:1.0:1512485492.003984:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 2.
00000080:00008000:1.0:1512485492.004008:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 3.
00000080:00008000:1.0:1512485492.004031:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 4.
00000080:00008000:1.0:1512485492.004054:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 5.
00000080:00008000:1.0:1512485492.004077:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 6.
00000080:00008000:1.0:1512485492.004147:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 7.
00000080:00008000:1.0:1512485492.004172:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 8.
00000080:00008000:1.0:1512485492.004195:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 9.
00000080:00008000:1.0:1512485492.004218:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 10.
00000080:00008000:1.0:1512485492.004241:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 11.
00000080:00008000:1.0:1512485492.004264:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 12.
00000080:00008000:1.0:1512485492.004287:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 13.
00000080:00008000:1.0:1512485492.004311:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 14.
00000080:00008000:1.0:1512485492.004327:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 15.
00000080:00008000:1.0:1512485492.004343:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 16.
00000080:00008000:1.0:1512485492.004491:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 17.
00000080:00008000:1.0:1512485492.004506:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 18.
00000080:00008000:1.0:1512485492.004522:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 19.
00000080:00008000:1.0:1512485492.004537:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 20.
00000080:00008000:1.0:1512485492.004553:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 21.
00000080:00008000:1.0:1512485492.004568:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 22.
00000080:00008000:1.0:1512485492.004583:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 23.
00000080:00008000:1.0:1512485492.004599:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 24.
00000080:00008000:1.0:1512485492.004614:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 25.
00000080:00008000:1.0:1512485492.004629:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 26.
00000080:00008000:1.0:1512485492.004653:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 27.
00000080:00008000:1.0:1512485492.004676:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 28.
00000080:00008000:1.0:1512485492.004699:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 29.
00000080:00008000:1.0:1512485492.004722:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 30.
00000080:00008000:1.0:1512485492.004745:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 31.
00000080:00008000:1.0:1512485492.004767:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 32.
00000080:00200000:1.0:1512485492.004969:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=589824=90000, pages 16 (max 348160)
00000080:00200000:1.0:1512485492.007839:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=655360=a0000, pages 16 (max 348160)
00000080:00200000:1.0:1512485492.010532:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=720896=b0000, pages 16 (max 348160)
00000080:00200000:1.0:1512485492.013680:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=786432=c0000, pages 16 (max 348160)
00000080:00200000:1.0:1512485492.016534:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=851968=d0000, pages 16 (max 348160)
00000080:00000001:1.0:1512485492.024648:0:6427:0:(file.c:3093:cl_sync_file_range()) Process entered
00000080:00200000:1.0:1512485492.024686:0:6427:0:(vvp_io.c:943:vvp_io_write_commit()) commit ffff8800792bd438/ffff88003cbd14a0 async pages: 128, from 0, to 4096
This causes two problems. |
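For reference, a minimal user-space sketch of the access pattern involved (this is not the LTP diotest5 source; the file name, block size and iteration count mirror the reproducer command above, and the direct-write/buffered-read interleaving follows the analysis in the comments below):

	/* Interleave O_DIRECT writes with buffered reads of the same file.
	 * Illustrative only -- not the LTP diotest5 source. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t bufsize = 65536;               /* -b 65536 */
		static char rbuf[65536];
		void *wbuf;
		int dfd, bfd, i;

		if (posix_memalign(&wbuf, 4096, bufsize))   /* O_DIRECT needs aligned memory */
			return 1;
		memset(wbuf, 'x', bufsize);

		dfd = open("/mnt/lustre/file2", O_WRONLY | O_CREAT | O_DIRECT, 0644);
		bfd = open("/mnt/lustre/file2", O_RDONLY);  /* buffered read path */
		if (dfd < 0 || bfd < 0)
			return 1;

		for (i = 0; i < 1000; i++) {                /* -i 1000 */
			off_t off = (off_t)i * bufsize;

			/* direct write of one segment */
			if (pwrite(dfd, wbuf, bufsize, off) != (ssize_t)bufsize)
				return 1;
			/* buffered read of the same region populates page-cache
			 * pages (plus readahead) that the next direct write
			 * must invalidate */
			(void)pread(bfd, rbuf, bufsize, off);
		}
		close(dfd);
		close(bfd);
		free(wbuf);
		return 0;
	}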
| Comments |
| Comment by Peter Jones [ 08/Dec/17 ] |
|
Bobijam, can you please advise? Thanks. Peter |
| Comment by Alexey Lyashkov [ 22/Dec/17 ] |
|
Peter, as I understand it, this is a side effect of the CLIO simplification. Before that work, the kernel created a single page queue into which the DIO code put pages, and the whole queue was sent to the CLIO stack at once; now it is sent as contiguous regions. The CLIO simplification broke this by removing transient pages from the common queue. As for the second part, Cray has a patch which Vladimir Saveliev should submit, but it is not the best solution from a performance perspective. |
| Comment by Vladimir Saveliev [ 26/Dec/17 ] |
|
The direct IO write may proceed as a mixture of segments written via direct IO and segments written via the page cache. Suppose a test interleaves direct IO writes and buffered reads (as in ltp/testcases/kernel/io/direct_io/diotest5.c). The buffered read creates 20 pages for readahead. The Linux kernel batches page additions to the LRU list through a per-CPU page vector of 14 (PAGEVEC_SIZE) pages (see __lru_cache_add). As a result, after the read there are 14 pages on the LRU list and 6 pages in the page vector, each holding an extra reference. When the direct IO write comes to write page #0, it tries to invalidate the page and drain the lru-add page vector:

__generic_file_aio_write() {
	if (O_DIRECT) {
		generic_file_direct_write() {
			filemap_write_and_wait_range();
			invalidate_inode_pages2_range() {
				invalidate_complete_page2() {
					ll_releasepage();
					pagevec_release() {
						lru_add_drain() {
							lru_add_drain_cpu();
						}
					}
				}
			}
			mapping->a_ops->direct_IO();
		}
	}
}
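A minimal user-space sketch of the batching behaviour described above (the constant mirrors the kernel's PAGEVEC_SIZE; the structure and function names are illustrative, not kernel source):

	/* Illustrative model of per-CPU pagevec batching -- not kernel source.
	 * Pages are staged in a small per-CPU vector and only moved to the
	 * LRU list when the vector fills up. */
	#include <stdio.h>

	#define PAGEVEC_SIZE 14       /* kernel batches LRU additions in groups of 14 */

	static int lru_pages;         /* pages actually on the LRU list */
	static int pvec_pages;        /* pages staged in the per-CPU vector */

	static void lru_cache_add(void)
	{
		pvec_pages++;                   /* page gets an extra reference here */
		if (pvec_pages == PAGEVEC_SIZE) {
			lru_pages += pvec_pages;    /* drain: extra references dropped */
			pvec_pages = 0;
		}
	}

	int main(void)
	{
		int i;

		for (i = 0; i < 20; i++)        /* 20 readahead pages from the buffered read */
			lru_cache_add();

		printf("on LRU (releasable): %d\n", lru_pages);            /* 14 */
		printf("still in pagevec (extra ref): %d\n", pvec_pages);  /* 6 */
		return 0;
	}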
Pages #0 to #13 are releasable, and pagevec_release() is supposed to drain the remaining 6 pages from the page vector to the LRU list. However, the page vector is a per-CPU variable: if the preceding buffered read ran on a different CPU, this pagevec_release() has no effect on pages #14..#19. They remain in the page vector associated with that CPU with an elevated reference count and are therefore not releasable, because ll_releasepage() fails its page reference count check:

ll_releasepage()
	...
	/* 1 for caller, 1 for cl_page and 1 for page cache */
	if (page_count(vmpage) > 3)
		return 0;
	...

The subsequent calls to invalidate_complete_page2() for pages #14 to #19 fail, so __generic_file_aio_write() performs a buffered write for those pages:

__generic_file_aio_write() {
	if (O_DIRECT) {
		generic_file_direct_write() {
		}
		/* after ll_releasepage() failures, direct IO resorts to buffered write */
		generic_file_buffered_write();
		filemap_write_and_wait_range();
		invalidate_mapping_pages();
	}
}
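A minimal sketch of the per-CPU aspect described above (not kernel or Lustre source; CPU count, variable names and the drain helper are illustrative): lru_add_drain() only empties the calling CPU's page vector, so pages parked on another CPU keep their extra reference and the quoted page_count check fails.

	#include <stdio.h>

	#define NR_CPUS 2

	static int pvec_pages[NR_CPUS];   /* pages parked in each CPU's pagevec */

	static void lru_add_drain(int cpu)
	{
		pvec_pages[cpu] = 0;          /* drains the local vector only */
	}

	/* mirrors the check quoted above: 1 ref for the caller, 1 for cl_page,
	 * 1 for the page cache; anything above 3 means someone else holds it */
	static int releasepage_ok(int page_count)
	{
		return page_count <= 3;
	}

	int main(void)
	{
		pvec_pages[0] = 6;            /* buffered read on CPU 0 left 6 pages staged */

		lru_add_drain(1);             /* direct IO write runs on CPU 1 */

		/* the 6 pages still hold their pagevec reference, so page_count is 4 */
		printf("pages #14..#19 releasable: %s\n",
		       releasepage_ok(3 + (pvec_pages[0] ? 1 : 0)) ? "yes" : "no");
		return 0;
	}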
Pages written via the buffered write remain queued by ll_write_end() for async commit until vvp_io_write_commit() runs. That is why the filemap_write_and_wait_range(); invalidate_mapping_pages(); sequence has no effect on those 6 pages. On the next round, the buffered read creates only 14 pages in the page vector (#0..#13), since pages #14..#19 still exist. The page vector is not drained, so all 14 pages are not yet on the LRU list and therefore are not releasable. On this round, pages #0, #14..#19 are queued for async commit, which hits the assertion below, which requires that only consecutive pages be queued:

vvp_io_write_commit()
	...
	LASSERT(page_list_sanity_check(obj, queue));
	... |
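A minimal sketch of the invariant that this LASSERT enforces (the helper below is illustrative, not the vvp_io.c page_list_sanity_check() source): the queued pages must have consecutive indices, which pages #0, #14..#19 do not.

	/* Illustrative check -- the commit queue is only valid if the queued
	 * page indices are consecutive. */
	#include <stdio.h>

	static int queue_is_consecutive(const int *index, int n)
	{
		int i;

		for (i = 1; i < n; i++)
			if (index[i] != index[i - 1] + 1)
				return 0;
		return 1;
	}

	int main(void)
	{
		/* the queue built on the second round: page #0 plus pages #14..#19 */
		int queued[] = { 0, 14, 15, 16, 17, 18, 19 };

		printf("sanity check passes: %s\n",
		       queue_is_consecutive(queued, 7) ? "yes" : "no");  /* no -> LASSERT fires */
		return 0;
	}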
| Comment by Gerrit Updater [ 26/Dec/17 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/30659 |
| Comment by Jinshan Xiong (Inactive) [ 05/Jan/18 ] |
|
It is a known issue that reading and writing the same file with mixed cached and direct IO can cause data corruption. We can land the patch as you proposed, but I don't think any further action is required. |
| Comment by Alexey Lyashkov [ 05/Jan/18 ] |
|
Jay, it is not mixed IO types from the user's perspective. It would be good if the DIO code put the pages into a single queue and sent all the data at once. Data corruption is unlikely, as the pages are locked the whole time and are removed from the mapping immediately after being moved to the submit queue. |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30659/ |
| Comment by Peter Jones [ 27/Feb/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31432 |
| Comment by Gerrit Updater [ 05/Apr/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31432/ |