[LU-10405] FIEMAP on object with no stripe should not return ENODATA Created: 19/Dec/17  Updated: 05/Aug/20  Resolved: 25/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Upstream
Fix Version/s: Lustre 2.11.0

Type: Improvement Priority: Minor
Reporter: CEA Assignee: Dominique Martinet (Inactive)
Resolution: Fixed Votes: 0
Labels: cea, patch

Attachments: File create-nostripe.c    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The FIEMAP ioctl is used by many tools like cp(1) to optimize sparse files, but any error is treated as failure and such tools usually fall back by reading the whole file (so, zeroes if there is no stripe)
lov_object_fiemap returns -ENODATA if it could not get the object's lov_stripe_md. We should just fill the structure with no extent and return success instead.

We have many files with no stripe here because our copytool destripes released files, so they get assigned new OSTs on restore (instead of being restored where they originally were created, which might be bad balance) ; this mostly only impact admins copying released data around very carefully so this is somewhat minor for us, but still a bug imo.

 

FWIW we also have actual sparse files with data (e.g. VM images) on lustre with non-trivial striping so fiemap returning ENOTSUPP on call without FIEMAP_FLAG_DEVICE_ORDER, and cp would just read everything. I think we could support this better as well... But that's another issue

 

It's actually not so easy to create a file without stripe for testing, so I attached a simple program that does.

$ gcc -o create-nostripe create-nostripe.c
$ ./create-nostripe /mnt/lustre/testfile
$ strace -e ioctl,read cp -a /mnt/lustre/testfile /tmp |& head -n 20
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300j\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\37\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\23\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\34\2\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\25\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\16\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240l\0\0\0\0\0\0"..., 832) = 832
read(3, "nodev\tsysfs\nnodev\trootfs\nnodev\tr"..., 1024) = 360
read(3, "", 1024)                       = 0
ioctl(3, FS_IOC_FIEMAP, 0x7ffd374e88e0) = -1 ENODATA (No data available)
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
....


I will submit a patch for this right away.



 Comments   
Comment by Gerrit Updater [ 19/Dec/17 ]

Dominique Martinet (dominique.martinet@cea.fr) uploaded a new patch: https://review.whamcloud.com/30591
Subject: LU-10405 lov: fill no-extent fiemap on object with no stripe.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bbf2fcfe575de341f31074f8190d1a310c18f5cf

Comment by Andreas Dilger [ 19/Dec/17 ]

FYI, the "mcreate" program in the lustre-tests RPM is typically used in test scripts to create a file without any striping.  It uses mknod(2) syscall to create a regular file without opening it, since the mknod(1) utility does not allow the creation of regular files.

Comment by Dominique Martinet (Inactive) [ 19/Dec/17 ]

This doesn't work for my specific example here, as cp only uses fiemap on files given a certain size – my create-nostripe is misnamed and also ftruncate()s the file after creating it to reflect a bigger size like we have on our filesystems.

Thanks for the info though, I wasn't aware of it.

Comment by Andreas Dilger [ 19/Dec/17 ]

Two comments here:

  • the “truncate -s” command can change the file size without opening it, so that should work together with “mcreate” without adding another tool. There is also “mulitop” that should allow similar kinds of operations already.
  • you mention “copying (HSM) released files”, but in that case this could result in data loss to return an empty extent map to cp for a released file, since cp would think there is no data in the file, when in fact it is on the archive.
Comment by Dominique Martinet (Inactive) [ 20/Dec/17 ]

I had tried truncate -s, it seems to open + ftruncate so that assigns a striping. multiop would work, I'll have a look

 

And yes, admin copying (HSM) released files is tricky... But we're trying to be careful. Migrating from one filesystem to another without reading all the tapes is an interesting operation... Might only work for us with our hybrid lustre/HSM and old CEA hsm monster still crawling around and helping around the edges.

I actually hadn't thought of the impact of this for user cp, since lustre/hsm restores on first read it will wrongly think there is nothing to read, and this will cause real problems. We will apparently already have the same problem with tar that just looks at stat.st_blocks and skips anything with 0 with a lstat() without ever even opening the file. That's going to be a tricky problem, makes me think we will want to drop this patch and open a new issue about faking st_blocks for tar...

Comment by Andreas Dilger [ 20/Dec/17 ]

Yes, we have a special check in the client code to return a non-zero blocks count for stat() of the inode if there are dirty pages on the client (LU-417, see comment in cl_glimpse_lock() and vvp_object_glimpse()), even if no pages have been written to disk yet, to avoid problems with cp and tar.

It makes sense to do the same for HSM released files if that isn't done already.

Comment by Dominique Martinet (Inactive) [ 21/Dec/17 ]

Ok, this does make sense, I see what's done here. The two functions here look insufficient for an HSM released file as far as I can see, because dirty_cnt will also be 0; will likely need to add a second check "if blocks is still 0 after dirty_cnt check for hsm state and set an arbitrary value (size/blocksize?)"

I'll see if I can reproduce the problem with tar (cp no longer bases itself on i_blocks) and open a new ticket likely early next year.

 

I'm thinking something similar is needed here, if there is no stripe then also check hsm state and either return NODATA as currently if released or return an empty map. I won't have time to add and test that this year so giving my own patch a -1 to hold it off.

Comment by John Hammond [ 21/Dec/17 ]

In both 2.7 and master, we do return a "minimal FIEMAP" for released files. See lov_fiemap() and lov_object_fiemap(). And this works (at least on master) but we do not seem to have a test for it.

We also return a non zero block count for release files.

When you update your patch, please include a test that fails before the code change and passes after.

Comment by Dominique Martinet (Inactive) [ 21/Dec/17 ]

Hi John.

As I said in a previous comment, the truncate command does an open followed by ftruncate so it allocates stripes for the file, unlike my test example. I did reproduce it on master (80515fa LU-10232 lov: call cl_object_attr_get under cl_attr lock) so I'm fairly sure of myself.

I've also checked multiop and I don't see how to open a file with O_LOV_DELAY_CREATE. I can do mknod but then the file isn't opened so ftruncate doesn't help. I also double-checked the truncate(1) command but can't make it call truncate() with a path instead of ftruncate()... I'll do more checking we can do with normal lustre test programs, but please use my create-nostripe.c for now.

 

I see the code part about lhsm released file now, I hadn't really looked at code past the FIEMAP_FLAG_DEVICE_ORDER check though because cp does not handle that. Your example shows a working fiemap because you likely have a stripecount of 1, which is not the case for our LHSM filesystem... I can't use lsm->lsm_is_released in my patch though, since lsm is returned as NULL, I'll need another way to check for released state? Will do some more digging, I'm not very familiar with this code.

I'll also add new tests on refresh, it'll have to wait for new year though...

Comment by John Hammond [ 21/Dec/17 ]

Dominique,

You are correct about the truncate.

Comment by John Hammond [ 21/Dec/17 ]

BTW, the multiop command line to use would be:

multiop f0  oO_RDWR:O_CREAT:O_LOV_DELAY_CREATE:T10737418240c

Unfortunately multiop does not support 64 bit file sizes. (This would be easy to fix however.)

Comment by Dominique Martinet (Inactive) [ 09/Jan/18 ]

Hi and happy new year,

I have update the patch with a new test, that checks fiemap on unstripped files with filefrag -e like the other fiemap tests (and your multiop line, thanks! Hadn't noticed 'o' could take a list of flags as arguments, this help output made it sound like 'o' is always just O_CREAT)

Regarding hsm-files, I've double-checked this should be fine. I wasn't clear that a released file would always have a `lsm` so now this is cleared up my patch doesn't need to change, this won't cause any new false-positive of not copying data when we should.

 

As a side-note, I do notice that released files have st_blocks set to 1 (presumably to tell tar and friends to look at the content), but I could not find in the code where that is set. cl_glimpse_lock() and vvp_object_glimpse() have a comment about that for dirty_cnt() but this function doesn't do any HSM-related check... Well, I guess this works so I don't really need to look further.

Thanks,

Comment by Gerrit Updater [ 25/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30591/
Subject: LU-10405 lov: fill no-extent fiemap on object with no stripe.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 113fa79b7dc9b598c615d4cbfa6e3513d2c6d35b

Comment by Peter Jones [ 25/Jan/18 ]

Landed for 2.11

Generated at Sat Feb 10 02:34:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.