
[LU-11667] sanity test 317: FAIL: Expected Block 8 got 48 for f317.sanity

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.15.0
    • Affects Versions: Lustre 2.12.0, Lustre 2.12.4
    • Environment: Arch: aarch64 (client)

    Description

      sanity test 317 failed on ARM clients as follows:

      == sanity test 317: Verify blocks get correctly update after truncate ================================ 15:30:27 (1542036627)
      1+0 records in
      1+0 records out
      5242880 bytes (5.2 MB) copied, 0.467256 s, 11.2 MB/s
      /mnt/lustre/f317.sanity has size 2097152 OK
      /mnt/lustre/f317.sanity has size 4097 OK
      /mnt/lustre/f317.sanity has size 4000 OK
      /mnt/lustre/f317.sanity has size 509 OK
      /mnt/lustre/f317.sanity has size 0 OK
      2+0 records in
      2+0 records out
      8192 bytes (8.2 kB) copied, 0.0562888 s, 146 kB/s
        File: '/mnt/lustre/f317.sanity'
        Size: 24575     	Blocks: 48         IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115708605760525  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2018-11-12 15:30:29.000000000 +0000
      Modify: 2018-11-12 15:30:29.000000000 +0000
      Change: 2018-11-12 15:30:29.000000000 +0000
       Birth: -
       sanity test_317: @@@@@@ FAIL: Expected Block 8 got 48 for f317.sanity 
      

      Maloo report: https://testing.whamcloud.com/test_sets/074afc02-e7bf-11e8-815b-52540065bddc
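
      One way to see the mismatch from the shell (an illustrative sketch, not part of the test itself; it assumes the 4KB grant block size implied by the dd output above and a client mount at /mnt/lustre):

      getconf PAGE_SIZE    # 65536 on the failing aarch64 client, 4096 on x86_64
      dd if=/dev/zero of=/mnt/lustre/f317.sanity bs=4096 count=2 seek=5 conv=fsync
      stat -c 'size=%s blocks=%b (allocated = %b * 512 bytes)' /mnt/lustre/f317.sanity
      # With 4KB pages only the two written 4KB chunks are allocated on the OST; with a
      # 64KB PAGE_SIZE each sub-page write dirties a full page, so the OST allocates far
      # more 512-byte blocks than the test expects.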

          Activity


            James A Simmons added a comment:

            Workaround landed. The proper fix is being done in LU-15223.

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45395/
            Subject: LU-11667 tests: Fix sanity test 317 for 64K PAGE_SIZE OST
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 63d4d9ff2f5c8cc992ca6b2f698bb43a3257bfb3

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45395/ Subject: LU-11667 tests: Fix sanity test 317 for 64K PAGE_SIZE OST Project: fs/lustre-release Branch: master Current Patch Set: Commit: 63d4d9ff2f5c8cc992ca6b2f698bb43a3257bfb3
            Xinliang Liu added a comment (edited):

            Hi paf0186 and adilger, since partial page write is so complicated and might take a long time, let's create another Jira ticket for it and discuss there: LU-15223. I have also sent a draft patch for review.

            Patrick Farrell added a comment:

            We also need to ask: what's the benefit, what's the end goal, and how much do we have to do to get there?

            The benefit will be pretty limited if we can't also solve the RDMA issue.  The benefit would only apply to sub-page (< PAGE_SIZE) writes, and each one would have to be sent to disk by itself.

            One way to solve the RDMA problem would be to send full pages over the network, but attach extra data in the RPC telling the server the actual valid range for each page.  This would be very complicated, I think, and would involve new ways of handling writes on both the client and the server.

            And this assumes we can solve the page cache issue!

            Patrick Farrell added a comment:

            One idea Andreas and I had some time ago was to leave written pages not marked up to date (the raw page state is not-up-to-date; up to date is a flag that has to be set explicitly), so that if the page were accessed later, the client would re-read it from the server.

            This would mean the page was effectively uncached, which is a bit weird, but it could work.  The benefit is pretty limited, though, since you can't easily combine these partial pages into larger writes (the RDMA issue again).

            In any case, not setting written pages up to date turned out to be really complicated, and I decided it was unworkable.  The write code assumes pages are up to date as part of writing them, and while I was able to work around a few things, it felt like I was going very much against the intent of the code.

            Patrick Farrell added a comment:

            How do you handle the page cache?  What's in there, and how do you get the range for the clipping?  Some of these questions will be answered by the patch, of course.

            But say you write this clipped partial page: what happens when you read it back on the client that wrote it?  What is in the rest of the page?

            And, going on from there: what is in the rest of the page if the file was empty there?  And what is in the rest of the page if there was already data in the whole page when you wrote it?

            Basically, what I am saying is that unless you get very clever, this will break the page cache.

            You would also need to mark this page as non-mergeable to avoid the RDMA issue, but that's easy to do.  The real sticking point is the page cache.
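
            A minimal shell sketch of the read-back question above (paths and sizes are illustrative, assuming a client mounted at /mnt/lustre):

            # Write one 4K chunk at offset 4K, leaving bytes 0-4095 as a hole
            dd if=/dev/urandom of=/mnt/lustre/partialpage bs=4k count=1 seek=1 conv=fsync
            # Whatever the writing client keeps cached for that page, a reader must see
            # zeros in the unwritten first 4K, not stale or uninitialized page contents
            hexdump -C /mnt/lustre/partialpage | head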

            Xinliang Liu added a comment (edited):

            Hi paf0186 and Andreas Dilger, thank you for the clarification about partial page write. It really helps me a lot.

            For the ldiskfs backend filesystem, I see that if the user issues a partial page cached write, Lustre (both the client side and the server side) converts it into a full page write. I want to make Lustre do a real partial page write, i.e. one whose length is less than PAGE_SIZE regardless of whether the start offset is zero, so that Lustre can handle the partial page write in sanity test 317 below on a large PAGE_SIZE (e.g. 64 KB) client and pass the test. That's the problem I want to solve.

            sanity.sh
            test_317() {
            ...
                #
                # sparse file test
                # Create file with a hole and write actual two blocks. Block count
                # must be 16.
                #
                dd if=/dev/zero of=$DIR/$tfile bs=$grant_blk_size count=2 seek=5 \
                    conv=fsync || error "Create file : $DIR/$tfile"
            ...

            I am trying to understand all the details and limitations, including some mentioned by you, e.g. RDMA partial page write, GPU Direct write, etc.

            I have a draft patch now which makes the client side send a niobuf, containing the real (possibly non-zero) file start offset and the real file end offset, to the server. This requires clipping the page on the client side. On the server side, only the necessary range (i.e. from the real non-zero file start offset to the file end offset) is written.

            I will send the patch for review soon. Let's see if we can work out a solution.  Thanks.
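
            For reference, the failing assertion is essentially a block-count comparison; a simplified sketch of that check (not the verbatim test code; it reuses $DIR/$tfile, $grant_blk_size and error() from the excerpt above):

            blocks=$(stat -c %b $DIR/$tfile)
            expected=$((2 * grant_blk_size / 512))   # two written blocks, in 512-byte stat units
            [ "$blocks" -eq "$expected" ] ||
                error "Expected Block $expected got $blocks for $tfile"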

            Andreas Dilger added a comment:

            Patrick, I was thinking that if we can handle an uncached write from the client that is a 64KB RDMA but has a non-zero start and end offset (4KB initially), it might be generalizable to any byte offset.

            I'm aware of the RDMA limitations, but I'm wondering if those can be bypassed (if necessary) by transferring a whole page over the network, storing it into a temporary page, and copying the data as a cached/unaligned read-modify-write on the server to properly align it. The content of the start/end of the page sent from the client would be irrelevant, since it will be trimmed by the server anyway when the copy is done.

            While the copy might be expensive for very large writes, my expectation is that this would be most useful for small writes. That does raise the question of whether the data could be transferred in the RPC as a short write, but for GPU Direct we require RDMA to send the data directly from GPU RAM to the OSS. Maybe it is just a matter of generalizing the short write handling to allow copying from the middle of an RDMA page?

            Patrick Farrell added a comment:

            By the way, I am happy to keep talking about this, if you have thoughts or questions or whatever.  I've looked at sub-page I/O a few times, but you may have a different idea than what I have tried.

            "I am thinking if we should make blocks allocation aligned with BLOCK_SIZE as ext4, which could save space for large PAGE_SIZE e.g. 64K. Then no need to make change to the test case. And I have a look at the code it seems both OSC client  and OST server need to adjust for this. The client always sends no hole pages (currently page start offset is always 0) to the server for writing now. And the server side needs to adjust making blocks allocation aligned with block size. 
             "

            Can you talk more about what you're thinking?  I am not quite what the implication of changing block allocation on the server would be on the client.  Why does changing server block allocation filter back to the client like this?

            More generally, about partial page i/o:
            Generally speaking, we can't have partial pages except at the start and end of each write - that's a limitation of infiniband, but there are also page cache restrictions.

            In general, RDMA can be unaligned at the start, and unaligned at the end, but that's it.  This applies even when combining multiple RDMA regions - it's some limitation of the hardware/drivers.  So we have a truly unaligned I/O(with a partial page at beginning and end), but then we can't combine it with other I/Os.

            There is also a page cache limitation here.  The Linux page cache insists on working with full pages - It will only allow partial pages at file_size.  So, eg, a 3K file is a single page with 3K in it, and we can write just 3K.  But if we want to write 3K in to a large 'hole' in a file, Linux will enforce writing PAGE_SIZE.  This is not a restriction we can easily remove, it is an important part of the page cache.

            paf0186 Patrick Farrell added a comment - "I am thinking if we should make blocks allocation aligned with BLOCK_SIZE as ext4, which could save space for large PAGE_SIZE e.g. 64K. Then no need to make change to the test case. And I have a look at the code it seems both OSC client  and OST server need to adjust for this. The client always sends no hole pages (currently page start offset is always 0) to the server for writing now. And the server side needs to adjust making blocks allocation aligned with block size.   " Can you talk more about what you're thinking?  I am not quite what the implication of changing block allocation on the server would be on the client.  Why does changing server block allocation filter back to the client like this? More generally, about partial page i/o: Generally speaking, we can't have partial pages except at the start and end of each write - that's a limitation of infiniband, but there are also page cache restrictions. In general, RDMA can be unaligned at the start, and unaligned at the end, but that's it.  This applies even when combining multiple RDMA regions - it's some limitation of the hardware/drivers.  So we have a truly unaligned I/O(with a partial page at beginning and end), but then we can't combine it with other I/Os. There is also a page cache limitation here.  The Linux page cache insists on working with full pages - It will only allow partial pages at file_size.  So, eg, a 3K file is a single page with 3K in it, and we can write just 3K.  But if we want to write 3K in to a large 'hole' in a file, Linux will enforce writing PAGE_SIZE.  This is not a restriction we can easily remove, it is an important part of the page cache.

            People

              Assignee: WC Triage
              Reporter: Jian Yu
              Votes: 0
              Watchers: 9
