Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      During PCC-RO testing, we found that the following io_uring fio test failed:

      lctl pcc add /mnt/lustre /mnt/pcc -p "projid={10001} roid=5 ropcc=1"
      mkdir /mnt/lustre/pcc
      lfs project -sp 10001 /mnt/lustre/pcc
      cat /proc/meminfo
      fio --name=seqread --create_only=1 --create_serialize=0 --bs=128K --directory=/mnt/lustre/pcc --filename_format=seqread.$jobnum.$filenum --filesize=$1 --direct=1 --iodepth=64 --numjobs=1
      
      
      cat /proc/meminfo
      lfs pcc state /mnt/lustre/pcc/*
      cat /proc/meminfo
      echo -n "\nDrop caches"
      sysctl -w vm.drop_caches=3
      cat /proc/meminfo
      fio --name=seqread --rw=read --bs=128K --directory=/mnt/lustre/pcc --filename_format=seqread.$jobnum.$filenum --filesize=$1 --allow_file_create=0 --ioengine=io_uring --direct=1 --iodepth=4 --numjobs=1
      

       

      The test panics the kernel when filesize=1G, but passes when filesize=1M.

       

      The kdump output from the panic is:

      [ 5051.929965] LustreError: 3136:0:(vvp_io.c:878:vvp_io_read_start()) ASSERTION( vio->vui_iocb->ki_pos == pos ) failed:
      [ 5051.930069] LustreError: 3136:0:(vvp_io.c:878:vvp_io_read_start()) LBUG
      [ 5051.930960] Pid: 3136, comm: fio 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020
      [ 5051.930961] Call Trace TBD:
      [ 5051.930962] Kernel panic - not syncing: LBUG
      [ 5051.930966] CPU: 0 PID: 3136 Comm: fio Kdump: loaded Tainted: G           OE     5.4.0-47-generic #51-Ubuntu
      [ 5051.930967] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [ 5051.930968] Call Trace:
      [ 5051.931020]  dump_stack+0x6d/0x9a
      [ 5051.931113]  panic+0x101/0x2e3
      [ 5051.931172]  lbug_with_loc.cold+0x18/0x18 [libcfs]
      [ 5051.931336]  vvp_io_read_start+0x72f/0x7a0 [lustre]
      [ 5051.931673]  ? cl_lock_request+0x65/0x1c0 [obdclass]
      [ 5051.931691]  cl_io_start+0x62/0x110 [obdclass]
      [ 5051.931704]  cl_io_loop+0x9d/0x1f0 [obdclass]
      [ 5051.931717]  ll_file_io_generic+0x929/0xde0 [lustre]
      [ 5051.931726]  ll_file_read_iter+0x160/0x310 [lustre]
      [ 5051.931729]  io_read+0xe5/0x240
      [ 5051.931802]  ? shmem_getpage_gfp+0xef/0x940
      [ 5051.931867]  ? __switch_to_asm+0x34/0x70
      [ 5051.931869]  ? __switch_to_asm+0x34/0x70
      [ 5051.931871]  ? __switch_to_asm+0x40/0x70
      [ 5051.931873]  __io_submit_sqe+0x444/0x8e0
      [ 5051.931876]  ? current_time+0x43/0x80
      [ 5051.931878]  __io_queue_sqe+0x23/0x2a0
      [ 5051.931879]  io_queue_sqe+0x7a/0x90
      [ 5051.931881]  io_submit_sqe+0x23d/0x330
      [ 5051.931882]  io_ring_submit+0xca/0x200
      [ 5051.931887]  ? handle_mm_fault+0xca/0x200
      [ 5051.931889]  __x64_sys_io_uring_enter+0x1e4/0x2c0
      [ 5051.931929]  do_syscall_64+0x57/0x190
      [ 5051.931932]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 5051.931934] RIP: 0033:0x7f98c5ba070d
      

      When testing with filesize=1M, we collected the Lustre debug log and found that the read failed in __pcc_file_read_iter() with error code -11 (-EAGAIN).

      Digging into the Ext4 ->read_iter path, the code that returns -EAGAIN is in generic_file_read_iter():

      generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
      {
      	size_t count = iov_iter_count(iter);
      	ssize_t retval = 0;
      
      	if (!count)
      		goto out; /* skip atime */
      
      	if (iocb->ki_flags & IOCB_DIRECT) {
      		struct file *file = iocb->ki_filp;
      		struct address_space *mapping = file->f_mapping;
      		struct inode *inode = mapping->host;
      		loff_t size;
      
      		size = i_size_read(inode);
      		if (iocb->ki_flags & IOCB_NOWAIT) {
      			if (filemap_range_has_page(mapping, iocb->ki_pos,
      						   iocb->ki_pos + count - 1))
      				return -EAGAIN;
      		} else {
      			retval = filemap_write_and_wait_range(mapping,
      						iocb->ki_pos,
      					        iocb->ki_pos + count - 1);
      			if (retval < 0)
      				goto out;
      		}
      

      The reason is that io_uring issues the read as direct I/O with the IOCB_NOWAIT flag set, and the mapping range of the PCC cached file has pages in the page cache...
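
The check quoted above can be modeled in user space to make the failure condition concrete. The following is only a rough sketch, not kernel or Lustre code: every `model_*` and `MODEL_*` name is hypothetical, the page cache is reduced to a small boolean array, and file offsets are assumed to be non-negative. It shows that a direct read with the nowait flag is rejected with -EAGAIN as soon as any page in the requested range is resident, while the same read without the nowait flag is not rejected by this check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's IOCB_DIRECT / IOCB_NOWAIT bits. */
#define MODEL_IOCB_DIRECT (1u << 0)
#define MODEL_IOCB_NOWAIT (1u << 1)

#define MODEL_PAGE_SIZE 4096L
#define MODEL_NR_PAGES  16

/* Toy page cache: cached[i] is true when page i of the file is resident. */
struct model_mapping {
	bool cached[MODEL_NR_PAGES];
};

/*
 * Rough analogue of filemap_range_has_page(): report whether any page
 * overlapping the byte range [start, end] is resident in the toy cache.
 * Pages beyond MODEL_NR_PAGES are treated as not resident.
 */
static bool model_range_has_page(const struct model_mapping *m,
				 long start, long end)
{
	long first, last, i;

	if (end < start)
		return false;

	first = start / MODEL_PAGE_SIZE;
	last = end / MODEL_PAGE_SIZE;
	for (i = first; i <= last && i < MODEL_NR_PAGES; i++) {
		if (m->cached[i])
			return true;
	}
	return false;
}

/*
 * Models only the branch shown in the excerpt: returns -EAGAIN when a
 * direct, nowait read covers a range that has cached pages, and 0 when
 * this check does not reject the read.
 */
static long model_direct_read_precheck(const struct model_mapping *m,
					unsigned int flags,
					long pos, size_t count)
{
	if (!count)
		return 0;
	if (!(flags & MODEL_IOCB_DIRECT))
		return 0;
	if ((flags & MODEL_IOCB_NOWAIT) &&
	    model_range_has_page(m, pos, pos + (long)count - 1))
		return -EAGAIN;
	return 0;
}
```

With an empty toy cache, `model_direct_read_precheck()` returns 0 for a 128K read at offset 0 with both flags set. Once any page in that range is marked resident, the same call returns -EAGAIN, which matches the error seen from __pcc_file_read_iter() in the debug log.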

          People

            qian_wc Qian Yingjin