Lustre / LU-10356

CLIO simplification broke a direct IO sometimes

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.10.1
    • Labels: None
    • Environment: RHEL7 / 2.10; RHEL7 / master; # ./diotest5 -b 65536 -i 1000 -f /mnt/lustre/file2; test taken from the LTP suite (full debug needs to be enabled to hit it easily)
    • Severity: 3

    Description

      Kernel 3.10 and some earlier kernels (likely RHEL6 as well) have a buffered-write fallback that is used when direct IO fails partway through, after some offset:

                      written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                                              ppos, count, ocount);
                      /*
                       * If the write stopped short of completing, fall back to
                       * buffered writes.  Some filesystems do this for writes to
                       * holes, for example.  For DAX files, a buffered write will
                       * not succeed (even if it did, DAX does not handle dirty
                       * page-cache pages correctly).
                       */
                      if (written < 0 || written == count || IS_DAX(inode))
                              goto out;
                      pos += written;
                      count -= written;
                      written_buffered = generic_file_buffered_write(iocb, iov,
                                                      nr_segs, pos, ppos, count,
                                                      written);

      This caused a situation where the buffered IO is sent long after the direct IO for that segment has finished:

      00000080:00200000:1.0:1512485492.000543:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=393216=60000  pages 16 (max 348160)
      00000080:00008000:1.0:1512485492.003959:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 1.
      00000080:00008000:1.0:1512485492.003984:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 2.
      00000080:00008000:1.0:1512485492.004008:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 3.
      00000080:00008000:1.0:1512485492.004031:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 4.
      00000080:00008000:1.0:1512485492.004054:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 5.
      00000080:00008000:1.0:1512485492.004077:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 6.
      00000080:00008000:1.0:1512485492.004147:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 7.
      00000080:00008000:1.0:1512485492.004172:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 8.
      00000080:00008000:1.0:1512485492.004195:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 9.
      00000080:00008000:1.0:1512485492.004218:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 10.
      00000080:00008000:1.0:1512485492.004241:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 11.
      00000080:00008000:1.0:1512485492.004264:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 12.
      00000080:00008000:1.0:1512485492.004287:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 13.
      00000080:00008000:1.0:1512485492.004311:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 14.
      00000080:00008000:1.0:1512485492.004327:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 15.
      00000080:00008000:1.0:1512485492.004343:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 16.
      00000080:00008000:1.0:1512485492.004491:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 17.
      00000080:00008000:1.0:1512485492.004506:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 18.
      00000080:00008000:1.0:1512485492.004522:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 19.
      00000080:00008000:1.0:1512485492.004537:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 20.
      00000080:00008000:1.0:1512485492.004553:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 21.
      00000080:00008000:1.0:1512485492.004568:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 22.
      00000080:00008000:1.0:1512485492.004583:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 23.
      00000080:00008000:1.0:1512485492.004599:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 24.
      00000080:00008000:1.0:1512485492.004614:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 25.
      00000080:00008000:1.0:1512485492.004629:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 26.
      00000080:00008000:1.0:1512485492.004653:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 27.
      00000080:00008000:1.0:1512485492.004676:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 28.
      00000080:00008000:1.0:1512485492.004699:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 29.
      00000080:00008000:1.0:1512485492.004722:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 30.
      00000080:00008000:1.0:1512485492.004745:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 31.
      00000080:00008000:1.0:1512485492.004767:0:6427:0:(rw26.c:770:ll_write_end()) queued page: 32.
      00000080:00200000:1.0:1512485492.004969:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=589824=90000, pages 16 (max 348160)
      00000080:00200000:1.0:1512485492.007839:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=655360=a0000, pages 16 (max 348160)
      00000080:00200000:1.0:1512485492.010532:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=720896=b0000, pages 16 (max 348160)
      00000080:00200000:1.0:1512485492.013680:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=786432=c0000, pages 16 (max 348160)
      00000080:00200000:1.0:1512485492.016534:0:6427:0:(rw26.c:496:ll_direct_IO()) VFS Op:inode=[0x200000407:0x15:0x0](ffff88007a96b690), size=65536 (max 1426063360), offset=851968=d0000, pages 16 (max 348160)
      00000080:00000001:1.0:1512485492.024648:0:6427:0:(file.c:3093:cl_sync_file_range()) Process entered
      00000080:00200000:1.0:1512485492.024686:0:6427:0:(vvp_io.c:943:vvp_io_write_commit()) commit ffff8800792bd438/ffff88003cbd14a0 async pages: 128, from 0, to 4096
      

      This caused two problems:
      1) A non-optimal RPC was generated for the buffered part: the DIO segment was sent and committed to the OST while the buffered part was still in the client cache, only to be sent later as part of cl_sync_file_range(), so an extra seek is needed.
      2) Potentially data corruption, as the first page in this queue is not at the offset expected for the operation.

      Attachments

        Activity


          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31432/
          Subject: LU-10356 llite: have ll_write_end to sync for DIO
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set:
          Commit: 98dbdb50a9f2bd7ddcd61588106703c4df961c0a

          gerrit Gerrit Updater added a comment -

          Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31432
          Subject: LU-10356 llite: have ll_write_end to sync for DIO
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: 0cbb48ce77e8463a3b10a16b665a5bd2d2cac9e6
          pjones Peter Jones added a comment -

          Landed for 2.11

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30659/
          Subject: LU-10356 llite: have ll_write_end to sync for DIO
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 6ea9171769db602b3a2b34419bdafacd38454cb4

          shadow Alexey Lyashkov added a comment -

          Jay,

          This is not a case of mixed IO types from the user's perspective. It would be good if the DIO code put the pages into a single queue and sent all the data at once. Data corruption is unlikely, as the pages are locked the whole time and are removed from the mapping immediately after being moved to the submit queue.
          Another solution is to do the same as NFS: MM has a launder_page callback to solve the race between page read vs. write vs. invalidate.

          jay Jinshan Xiong (Inactive) added a comment -

          It is a known issue that reading/writing the same file with mixed cached and direct IO can cause data corruption. We could land the patch as you proposed, but I don't think any further action would be required.

          gerrit Gerrit Updater added a comment -

          Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/30659
          Subject: LU-10356 llite: have ll_write_end to sync for DIO
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 168c0a6bbec75370e14a8ef4cfc25559ecdb9221

          The direct IO write may proceed as a mixture of segments written via direct IO and segments written via page cache.

          Suppose a test interleaves direct IO write and buffered read (originated from ltp/testcases/kernel/io/direct_io/diotest5.c).

          The buffered read creates 20 pages in order to do readahead. The Linux kernel batches additions to the lru list via a per-cpu page vector of 14 (PAGEVEC_SIZE) pages (see __lru_cache_add). Because of that, after the read there are 14 pages on the lru list and 6 pages in the page vector with an increased reference counter.

          When the direct IO write comes to write page #0, it tries to invalidate the page and to drain the lru-add page vector:

          __generic_file_aio_write() {
              if (O_DIRECT) {
                  generic_file_direct_write() {
                      filemap_write_and_wait_range();
                      invalidate_inode_pages2_range() {
                          invalidate_complete_page2() {
                          ll_releasepage();
                          pagevec_release() {
                              lru_add_drain() {
                                  lru_add_drain_cpu();
                              }
                          }
                      }
                      mapping->a_ops->direct_IO();
                  }
              }
          }
          
          

          Pages #0 to #13 are releasable, and pagevec_release() is supposed to drain the remaining 6 pages from the page vector to the lru list. However, the page vector is a per-cpu variable: if the preceding buffered read ran on a different cpu, pagevec_release() has no effect on pages #14..#19. They remain in the page vector associated with that cpu with an increased reference counter and are therefore not releasable, as ll_releasepage() fails on the page reference count check:

          ll_releasepage()
          ...
                  /* 1 for caller, 1 for cl_page and 1 for page cache */
                  if (page_count(vmpage) > 3)
                          return 0;
          ...
          
          

          The subsequent calls to invalidate_complete_page2() for pages #14 to #19 fail, and __generic_file_aio_write() performs a buffered write for those pages:

          __generic_file_aio_write() {
              if (O_DIRECT) {
                  generic_file_direct_write() {
                  }
          /* after ll_releasepage() failures, direct IO resorts to buffered write */
                  generic_file_buffered_write();
                  filemap_write_and_wait_range();
                  invalidate_mapping_pages();
              }
          }
          
          

          Pages written via the buffered write remain queued by ll_write_end() for async commit until vvp_io_write_commit() runs. That is why the filemap_write_and_wait_range(); invalidate_mapping_pages(); sequence has no effect on those 6 pages.

          On the next round the buffered read creates only 14 pages in the page vector (#0..#13), as pages #14..#19 still exist. The page vector is not drained, so all 14 pages are not on the lru list yet and are therefore not releasable.
          The direct IO write tries to release page #0, fails, and proceeds via buffered write. However, pages #1..#13 are made releasable by the invalidate_inode_pages2_range() -> pagevec_release() -> lru_add_drain() call chain issued for page #0, and get written by direct IO. Pages #14..#19 are still not releasable and get written via buffered write.

          So, on this round pages #0 and #14..#19 are queued for async commit, which hits the assertion below, which requires the queued pages to be consecutive.

          vvp_io_write_commit()
          ...
             LASSERT(page_list_sanity_check(obj, queue));
          ...
          
          
          vsaveliev Vladimir Saveliev added a comment -

          shadow Alexey Lyashkov added a comment -

          Peter,

          If I understand correctly, this is a side effect of the CLIO simplification. Before that work, the kernel created a single page queue into which the DIO code put its pages, and the queue was sent to the CLIO stack at once; now the data is sent as contiguous regions. The CLIO simplification broke this by removing transient pages from the common queue.

          As for the second part: Cray has a patch and Vladimir Saveliev should submit it, but it is not the best solution from a performance perspective.
          pjones Peter Jones added a comment -

          Bobijam

          Can you please advise?

          Thanks

          Peter

          People

            Assignee:
            bobijam Zhenyu Xu
            Reporter:
            shadow Alexey Lyashkov
            Votes:
            0
            Watchers:
            7

            Dates

              Created:
              Updated:
              Resolved: