Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10405

FIEMAP on object with no stripe should not return ENODATA

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.11.0
    • Lustre 2.7.0, Upstream
    • 3
    • 9223372036854775807

    Description

      The FIEMAP ioctl is used by many tools like cp(1) to optimize sparse files, but any error is treated as failure and such tools usually fall back by reading the whole file (so, zeroes if there is no stripe)
      lov_object_fiemap returns -ENODATA if it could not get the object's lov_stripe_md. We should just fill the structure with no extent and return success instead.

      We have many files with no stripe here because our copytool destripes released files, so they get assigned new OSTs on restore (instead of being restored where they originally were created, which might be bad balance) ; this mostly only impact admins copying released data around very carefully so this is somewhat minor for us, but still a bug imo.

       

      FWIW we also have actual sparse files with data (e.g. VM images) on lustre with non-trivial striping so fiemap returning ENOTSUPP on call without FIEMAP_FLAG_DEVICE_ORDER, and cp would just read everything. I think we could support this better as well... But that's another issue

       

      It's actually not so easy to create a file without stripe for testing, so I attached a simple program that does.

      $ gcc -o create-nostripe create-nostripe.c
      $ ./create-nostripe /mnt/lustre/testfile
      $ strace -e ioctl,read cp -a /mnt/lustre/testfile /tmp |& head -n 20
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300j\0\0\0\0\0\0"..., 832) = 832
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\37\0\0\0\0\0\0"..., 832) = 832
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\23\0\0\0\0\0\0"..., 832) = 832
      read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\34\2\0\0\0\0\0"..., 832) = 832
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\25\0\0\0\0\0\0"..., 832) = 832
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\16\0\0\0\0\0\0"..., 832) = 832
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240l\0\0\0\0\0\0"..., 832) = 832
      read(3, "nodev\tsysfs\nnodev\trootfs\nnodev\tr"..., 1024) = 360
      read(3, "", 1024)                       = 0
      ioctl(3, FS_IOC_FIEMAP, 0x7ffd374e88e0) = -1 ENODATA (No data available)
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
      ....
      
      
      

      I will submit a patch for this right away.

      Attachments

        Activity

          [LU-10405] FIEMAP on object with no stripe should not return ENODATA
          pjones Peter Jones added a comment -

          Landed for 2.11

          pjones Peter Jones added a comment - Landed for 2.11

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30591/
          Subject: LU-10405 lov: fill no-extent fiemap on object with no stripe.
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 113fa79b7dc9b598c615d4cbfa6e3513d2c6d35b

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30591/ Subject: LU-10405 lov: fill no-extent fiemap on object with no stripe. Project: fs/lustre-release Branch: master Current Patch Set: Commit: 113fa79b7dc9b598c615d4cbfa6e3513d2c6d35b

          Hi and happy new year,

          I have update the patch with a new test, that checks fiemap on unstripped files with filefrag -e like the other fiemap tests (and your multiop line, thanks! Hadn't noticed 'o' could take a list of flags as arguments, this help output made it sound like 'o' is always just O_CREAT)

          Regarding hsm-files, I've double-checked this should be fine. I wasn't clear that a released file would always have a `lsm` so now this is cleared up my patch doesn't need to change, this won't cause any new false-positive of not copying data when we should.

           

          As a side-note, I do notice that released files have st_blocks set to 1 (presumably to tell tar and friends to look at the content), but I could not find in the code where that is set. cl_glimpse_lock() and vvp_object_glimpse() have a comment about that for dirty_cnt() but this function doesn't do any HSM-related check... Well, I guess this works so I don't really need to look further.

          Thanks,

          martinetd Dominique Martinet (Inactive) added a comment - Hi and happy new year, I have update the patch with a new test, that checks fiemap on unstripped files with filefrag -e like the other fiemap tests (and your multiop line, thanks! Hadn't noticed 'o' could take a list of flags as arguments, this help output made it sound like 'o' is always just O_CREAT) Regarding hsm-files, I've double-checked this should be fine. I wasn't clear that a released file would always have a `lsm` so now this is cleared up my patch doesn't need to change, this won't cause any new false-positive of not copying data when we should.   As a side-note, I do notice that released files have st_blocks set to 1 (presumably to tell tar and friends to look at the content), but I could not find in the code where that is set. cl_glimpse_lock() and vvp_object_glimpse() have a comment about that for dirty_cnt() but this function doesn't do any HSM-related check... Well, I guess this works so I don't really need to look further. Thanks,
          jhammond John Hammond added a comment -

          BTW, the multiop command line to use would be:

          multiop f0  oO_RDWR:O_CREAT:O_LOV_DELAY_CREATE:T10737418240c
          

          Unfortunately multiop does not support 64 bit file sizes. (This would be easy to fix however.)

          jhammond John Hammond added a comment - BTW, the multiop command line to use would be: multiop f0 oO_RDWR:O_CREAT:O_LOV_DELAY_CREATE:T10737418240c Unfortunately multiop does not support 64 bit file sizes. (This would be easy to fix however.)
          jhammond John Hammond added a comment -

          Dominique,

          You are correct about the truncate.

          jhammond John Hammond added a comment - Dominique, You are correct about the truncate.

          Hi John.

          As I said in a previous comment, the truncate command does an open followed by ftruncate so it allocates stripes for the file, unlike my test example. I did reproduce it on master (80515fa LU-10232 lov: call cl_object_attr_get under cl_attr lock) so I'm fairly sure of myself.

          I've also checked multiop and I don't see how to open a file with O_LOV_DELAY_CREATE. I can do mknod but then the file isn't opened so ftruncate doesn't help. I also double-checked the truncate(1) command but can't make it call truncate() with a path instead of ftruncate()... I'll do more checking we can do with normal lustre test programs, but please use my create-nostripe.c for now.

           

          I see the code part about lhsm released file now, I hadn't really looked at code past the FIEMAP_FLAG_DEVICE_ORDER check though because cp does not handle that. Your example shows a working fiemap because you likely have a stripecount of 1, which is not the case for our LHSM filesystem... I can't use lsm->lsm_is_released in my patch though, since lsm is returned as NULL, I'll need another way to check for released state? Will do some more digging, I'm not very familiar with this code.

          I'll also add new tests on refresh, it'll have to wait for new year though...

          martinetd Dominique Martinet (Inactive) added a comment - Hi John. As I said in a previous comment, the truncate command does an open followed by ftruncate so it allocates stripes for the file, unlike my test example. I did reproduce it on master (80515fa LU-10232 lov: call cl_object_attr_get under cl_attr lock) so I'm fairly sure of myself. I've also checked multiop and I don't see how to open a file with O_LOV_DELAY_CREATE. I can do mknod but then the file isn't opened so ftruncate doesn't help. I also double-checked the truncate(1) command but can't make it call truncate() with a path instead of ftruncate()... I'll do more checking we can do with normal lustre test programs, but please use my create-nostripe.c for now.   I see the code part about lhsm released file now, I hadn't really looked at code past the FIEMAP_FLAG_DEVICE_ORDER check though because cp does not handle that. Your example shows a working fiemap because you likely have a stripecount of 1, which is not the case for our LHSM filesystem... I can't use lsm->lsm_is_released in my patch though, since lsm is returned as NULL, I'll need another way to check for released state? Will do some more digging, I'm not very familiar with this code. I'll also add new tests on refresh, it'll have to wait for new year though...
          jhammond John Hammond added a comment - - edited

          In both 2.7 and master, we do return a "minimal FIEMAP" for released files. See lov_fiemap() and lov_object_fiemap(). And this works (at least on master) but we do not seem to have a test for it.

          We also return a non zero block count for release files.

          When you update your patch, please include a test that fails before the code change and passes after.

          jhammond John Hammond added a comment - - edited In both 2.7 and master, we do return a "minimal FIEMAP" for released files. See lov_fiemap() and lov_object_fiemap() . And this works (at least on master) but we do not seem to have a test for it. We also return a non zero block count for release files. When you update your patch, please include a test that fails before the code change and passes after.

          Ok, this does make sense, I see what's done here. The two functions here look insufficient for an HSM released file as far as I can see, because dirty_cnt will also be 0; will likely need to add a second check "if blocks is still 0 after dirty_cnt check for hsm state and set an arbitrary value (size/blocksize?)"

          I'll see if I can reproduce the problem with tar (cp no longer bases itself on i_blocks) and open a new ticket likely early next year.

           

          I'm thinking something similar is needed here, if there is no stripe then also check hsm state and either return NODATA as currently if released or return an empty map. I won't have time to add and test that this year so giving my own patch a -1 to hold it off.

          martinetd Dominique Martinet (Inactive) added a comment - Ok, this does make sense, I see what's done here. The two functions here look insufficient for an HSM released file as far as I can see, because dirty_cnt will also be 0; will likely need to add a second check "if blocks is still 0 after dirty_cnt check for hsm state and set an arbitrary value (size/blocksize?)" I'll see if I can reproduce the problem with tar (cp no longer bases itself on i_blocks) and open a new ticket likely early next year.   I'm thinking something similar is needed here, if there is no stripe then also check hsm state and either return NODATA as currently if released or return an empty map. I won't have time to add and test that this year so giving my own patch a -1 to hold it off.

          Yes, we have a special check in the client code to return a non-zero blocks count for stat() of the inode if there are dirty pages on the client (LU-417, see comment in cl_glimpse_lock() and vvp_object_glimpse()), even if no pages have been written to disk yet, to avoid problems with cp and tar.

          It makes sense to do the same for HSM released files if that isn't done already.

          adilger Andreas Dilger added a comment - Yes, we have a special check in the client code to return a non-zero blocks count for stat() of the inode if there are dirty pages on the client ( LU-417 , see comment in cl_glimpse_lock() and vvp_object_glimpse()), even if no pages have been written to disk yet, to avoid problems with cp and tar. It makes sense to do the same for HSM released files if that isn't done already.

          I had tried truncate -s, it seems to open + ftruncate so that assigns a striping. multiop would work, I'll have a look

           

          And yes, admin copying (HSM) released files is tricky... But we're trying to be careful. Migrating from one filesystem to another without reading all the tapes is an interesting operation... Might only work for us with our hybrid lustre/HSM and old CEA hsm monster still crawling around and helping around the edges.

          I actually hadn't thought of the impact of this for user cp, since lustre/hsm restores on first read it will wrongly think there is nothing to read, and this will cause real problems. We will apparently already have the same problem with tar that just looks at stat.st_blocks and skips anything with 0 with a lstat() without ever even opening the file. That's going to be a tricky problem, makes me think we will want to drop this patch and open a new issue about faking st_blocks for tar...

          martinetd Dominique Martinet (Inactive) added a comment - I had tried truncate -s, it seems to open + ftruncate so that assigns a striping. multiop would work, I'll have a look   And yes, admin copying (HSM) released files is tricky... But we're trying to be careful. Migrating from one filesystem to another without reading all the tapes is an interesting operation... Might only work for us with our hybrid lustre/HSM and old CEA hsm monster still crawling around and helping around the edges. I actually hadn't thought of the impact of this for user cp, since lustre/hsm restores on first read it will wrongly think there is nothing to read, and this will cause real problems. We will apparently already have the same problem with tar that just looks at stat.st_blocks and skips anything with 0 with a lstat() without ever even opening the file. That's going to be a tricky problem, makes me think we will want to drop this patch and open a new issue about faking st_blocks for tar...

          People

            martinetd Dominique Martinet (Inactive)
            cealustre CEA
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: