[LU-13968] PCC read failed for io_uring Created: 16/Sep/20  Updated: 05/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During PCC-RO testing, we found that the following io_uring fio test failed:

lctl pcc add /mnt/lustre /mnt/pcc -p "projid={10001} roid=5 ropcc=1"
mkdir /mnt/lustre/pcc
lfs project -sp 10001 /mnt/lustre/pcc
cat /proc/meminfo
fio --name=seqread --create_only=1 --create_serialize=0 --bs=128K --directory=/mnt/lustre/pcc --filename_format=seqread.$jobnum.$filenum --filesize=$1 --direct=1 --iodepth=64 --numjobs=1


cat /proc/meminfo
lfs pcc state /mnt/lustre/pcc/*
cat /proc/meminfo
echo -n "\nDrop caches"
sysctl -w vm.drop_caches=3
cat /proc/meminfo
fio --name=seqread --rw=read --bs=128K --directory=/mnt/lustre/pcc --filename_format=seqread.$jobnum.$filenum --filesize=$1 --allow_file_create=0 --ioengine=io_uring --direct=1 --iodepth=4 --numjobs=1

 

The test panics the kernel when filesize=1G, but passes when filesize=1M.

The kdump output from the panic is:

[ 5051.929965] LustreError: 3136:0:(vvp_io.c:878:vvp_io_read_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
[ 5051.930069] LustreError: 3136:0:(vvp_io.c:878:vvp_io_read_start()) LBUG
[ 5051.930960] Pid: 3136, comm: fio 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020
[ 5051.930961] Call Trace TBD:
[ 5051.930962] Kernel panic - not syncing: LBUG
[ 5051.930966] CPU: 0 PID: 3136 Comm: fio Kdump: loaded Tainted: G           OE     5.4.0-47-generic #51-Ubuntu
[ 5051.930967] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 5051.930968] Call Trace:
[ 5051.931020]  dump_stack+0x6d/0x9a
[ 5051.931113]  panic+0x101/0x2e3
[ 5051.931172]  lbug_with_loc.cold+0x18/0x18 [libcfs]
[ 5051.931336]  vvp_io_read_start+0x72f/0x7a0 [lustre]
[ 5051.931673]  ? cl_lock_request+0x65/0x1c0 [obdclass]
[ 5051.931691]  cl_io_start+0x62/0x110 [obdclass]
[ 5051.931704]  cl_io_loop+0x9d/0x1f0 [obdclass]
[ 5051.931717]  ll_file_io_generic+0x929/0xde0 [lustre]
[ 5051.931726]  ll_file_read_iter+0x160/0x310 [lustre]
[ 5051.931729]  io_read+0xe5/0x240
[ 5051.931802]  ? shmem_getpage_gfp+0xef/0x940
[ 5051.931867]  ? __switch_to_asm+0x34/0x70
[ 5051.931869]  ? __switch_to_asm+0x34/0x70
[ 5051.931871]  ? __switch_to_asm+0x40/0x70
[ 5051.931873]  __io_submit_sqe+0x444/0x8e0
[ 5051.931876]  ? current_time+0x43/0x80
[ 5051.931878]  __io_queue_sqe+0x23/0x2a0
[ 5051.931879]  io_queue_sqe+0x7a/0x90
[ 5051.931881]  io_submit_sqe+0x23d/0x330
[ 5051.931882]  io_ring_submit+0xca/0x200
[ 5051.931887]  ? handle_mm_fault+0xca/0x200
[ 5051.931889]  __x64_sys_io_uring_enter+0x1e4/0x2c0
[ 5051.931929]  do_syscall_64+0x57/0x190
[ 5051.931932]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5051.931934] RIP: 0033:0x7f98c5ba070d

When testing with filesize=1M, we collected the Lustre debug log and found that the read failed in __pcc_file_read_iter() with error code -11 (-EAGAIN).

Digging into the Ext4 ->read_iter path, this is the code that returns -EAGAIN:

ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	size_t count = iov_iter_count(iter);
	ssize_t retval = 0;

	if (!count)
		goto out; /* skip atime */

	if (iocb->ki_flags & IOCB_DIRECT) {
		struct file *file = iocb->ki_filp;
		struct address_space *mapping = file->f_mapping;
		struct inode *inode = mapping->host;
		loff_t size;

		size = i_size_read(inode);
		if (iocb->ki_flags & IOCB_NOWAIT) {
			if (filemap_range_has_page(mapping, iocb->ki_pos,
						   iocb->ki_pos + count - 1))
				return -EAGAIN;
		} else {
			retval = filemap_write_and_wait_range(mapping,
						iocb->ki_pos,
					        iocb->ki_pos + count - 1);
			if (retval < 0)
				goto out;
		}

The reason is that io_uring issues the direct I/O read with the IOCB_NOWAIT flag set, and the mapping range of the PCC cached file has pages in the page cache, so the filemap_range_has_page() check returns -EAGAIN.
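
The branch quoted above can be modelled in user space to make the failing case explicit. This is only a sketch, not kernel code: model_direct_read_precheck() and its parameters are hypothetical names, with range_has_page standing in for the result of filemap_range_has_page() and writeback_err for the result of filemap_write_and_wait_range().

```c
#include <errno.h>
#include <stdbool.h>

/*
 * User-space model of the IOCB_DIRECT pre-check in generic_file_read_iter().
 * Hypothetical helper: the arguments replace the kernel state.
 *   nowait         - IOCB_NOWAIT is set on the iocb
 *   range_has_page - cached pages exist in the requested range
 *   writeback_err  - value the blocking flush of the range would return
 * Returns 0 if the direct read may proceed, or a negative errno.
 */
static int model_direct_read_precheck(bool nowait, bool range_has_page,
				      int writeback_err)
{
	if (nowait) {
		/* Flushing cached pages could block, so give up instead. */
		if (range_has_page)
			return -EAGAIN;
		return 0;
	}

	/* Blocking callers flush the range first; only a flush error aborts. */
	if (writeback_err < 0)
		return writeback_err;
	return 0;
}
```

With nowait and range_has_page both true, which is the situation described for io_uring reading a PCC cached file, the model returns -EAGAIN. A blocking caller with the same cached pages flushes the range and proceeds.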



 Comments   
Comment by Gerrit Updater [ 24/Sep/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40025
Subject: LU-13968 pcc: do not tolerate read errors -EAGAIN/-EIOCBQUEUED
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 43ef8d983cbfcd5cfb90acd68e796b2c85b453cb
