[LU-2182] Add llapi_file_get_layout() function in liblustreapi Created: 15/Oct/12  Updated: 11/Aug/15  Resolved: 06/Oct/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: Sebastien Buisson (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-2809 fix ll_setxattr() to always ignore ll... Resolved
is related to LU-4209 O_LOV_DELAY_CREATE conflict with __O_... Resolved
is related to LU-3840 llapi_layout API design discussion Closed
is related to LU-1707 Add llapi_* function to return the OS... Open
is related to LU-3602 make utility functions availalbe to u... In Progress
is related to LU-6135 improved support for selecting specif... Resolved
Severity: 3
Rank (Obsolete): 5220

 Description   

Add llapi_file_get_layout() function in liblustreapi, and deprecate llapi_file_get_stripe(), as it does not take a 'size' parameter to pass in the buffer size information.



 Comments   
Comment by Sebastien Buisson (Inactive) [ 15/Oct/12 ]

The patch is available here:
http://review.whamcloud.com/4272

Could you please review it?

Comment by Peter Jones [ 15/Oct/12 ]

Thanks Sebastien.

Bob

Could you please look into this one?

Peter

Comment by Bob Glossman (Inactive) [ 15/Oct/12 ]

Just recently broken out from LU-1857 and made into a separate bug and patch. Don't think there is anything new here.

Comment by Christopher Morrone [ 15/Oct/12 ]

I applaud the effort to add a size parameter to the striping lookup info, but I think if we are going to add a new API function we need to consider going further than that. It just seems like adding a size parameter doesn't go far enough to fixing the problems with llapi_file_get_stripe(), and I would rather not add a new API function that we'll need to replace again in the near future.

What happens when the the IOC_MDC_GETFILESTRIPE is passed a buffer too small to fit the striping information? Do we get an error back from the ioctl or does the app segfault?

This new function now takes a size option, but there is no description of what happens when the size is too small. And since we don't actually pass the size option through the ioctl(), it seems to be of questionable value. We should be returning an error that lets the user know that they need to size the buffer larger and try again.

But going even further, if we are making a new function to get striping info, maybe we should make it even better. The current method of exposing the multiple versions of the struct to users is Not Good.

I've had users that were very confused about how to even allocate memory for the lov_user_md structure.

I think we would probably be much better off hiding these wire protocol details from the user. Perhaps make a function to lookup the stripe info, and return an opaque object to the user. Then provide iterators and other functions to access the info in the opaque object.

The lookup function would allocate the needed buffers, and then we would provide a function to free the opaque object. Or something along those lines.

Comment by Bob Glossman (Inactive) [ 16/Oct/12 ]

> What happens when the the IOC_MDC_GETFILESTRIPE is passed a buffer too small to fit the striping information? Do we get an error back from the ioctl or does the app segfault?

From my reading of the code handling IOC_MDC_GETFILESTRIPE, when the buffer is too small for all the striping info we get an EOVERFLOW error. The app doesn't segfault. I think the intent is a user can make the call once with a buffer just big enough for a struct lov_user_md, see the EOVERFLOW error, use the lmm_stripe_count in the returned struct to know how big an array of lmm_objects[] he needs, then reallocate and make another call. Or a user can just allocate a buffer big enough for a max number of stripes. In either case the ioctl() call is safe.

Comment by Christopher Morrone [ 16/Oct/12 ]

I would have guessed EFAULT...I think EOVERFLOW is generally used when a number is too large to fit into the range of a data type.

But my general comment still stands that this would only be a slightly better api. How about something along these lines:

/* return an opaque data type containing layout for the file at "path" */
lustre_layout_t *llapi_layout_lookup_bypath(const char *path);
/* return an opaque data type containing layout for the file referenced
   by open file descriptor "fd" */
lustre_layout_t *llapi_layout_lookup_byfd(int fd);

/* free the memory allocated in the opaque type lustre_layout_t */
void llapi_layout_free(lustre_layout_t *layout);

/* functions to access data in opaque layout type */
unsigned llapi_layout_get_stripe_count(lustre_layout_t *layout);
unsigned llapi_layout_get_stripe_size(lustre_layout_t *layout);
unsigned llapi_layout_get_nth_ost_number(lustre_layout_t *layout, int n);
const char *llapi_layout_get_pool_name(lustre_layout_t *layout);

We really need to hide the details of the wire-protocol stuff in lustre_idl.h from the user. By using an opaque data type and accessor functions in the API, we hide the complexity of allocating buffers of various sizes, dealing with magic numbers and lov_mds_md_v1 versus lov_mds_md_v3. The accessor functions automatically handle version numbers, so for instance the "llapi_layout_get_pool_name" can just return the appropriate error if the user asks for the pool name from a v1 struct.

And then we could further consider the "set" equivalent of the "get" functions above. With some minor tweaking, what I suggest could make a better api to allow the user to easily create an arbitrary file layout. Of course, we would need ioctl and probably protocol change to allow that, but we might as well design the api with that in mind.

Comment by Andreas Dilger [ 07/Dec/12 ]

Chris, I agree that having a new API as you propose would be useful. As with anything, someone needs to take up the effort to implement the new API.

We're also looking at creating a new llapi_file_open_unlinked() function, which would be nice if it could use this API. We would need to add new "set" operations to your proposed API, something like:

/* allocate a new opaque layout for creating a file */
lustre_layout_t *llapi_layout_create(int stripe_pattern, int stripe_count);
/* set the size of each stripe */
int llapi_layout_set_stripe_size(lustre_layout_t *layout, int stripe_size);
/* set the pool of servers from which file objects will be allocated */
int llapi_layout_set_pool_name(lustre_layout_t *layout, const char *pool_name);
/* set the OST index for the specified file stripe.
 * NB: this only works for stripe_number=0 today. */
int llapi_layout_set_ost_index(lustre_layout_t *layout, int stripe_number, int ost_index);
/* create a file with the specified layout */
int llapi_layout_file_create(lustre_layout_t *layout, int mode, int flags)

I was wondering whether llapi_layout_create() should take no arguments at all, and then the stripe_pattern and stripe_count would be set afterward? The layout would be created with default values initially, and possibly need to be reallocated after the fact. For llapi_layout_set_ost_index() we are probably going to add support to allow specifying exact OST indexes for all stripes, so I'd prefer to have a generic API even though it would only currently be possible to specify the first stripe (it should return -EOPNOTSUPP for stripe_number != 0 until that is implemented).

Comment by Christopher Morrone [ 07/Dec/12 ]

As with anything, someone needs to take up the effort to implement the new API.

We have users on BG/Q that really want to use striping, but the implementation of the current lustre striping api uses ioctl's, which are notoriously difficult to function ship (we'd need custom implementations for lustre in IBM's function shipping system). Doable, but not particularly easy, and has a continuing support element.

But if I remember correctly, striping can now be set through xattrs. That is how the tar command supports restoring lustre striping settings, correct? So perhaps I could implement the new api to use xattrs instead of ioctls.

If that is really workable, that would motivate me to get something done before the end of January.

I was wondering whether llapi_layout_create() should take no arguments at all, and then the stripe_pattern and stripe_count would be set afterward?

Yes, I agree, I like that better.

For llapi_layout_set_ost_index() we are probably going to add support to allow specifying exact OST indexes for all stripes, so I'd prefer to have a generic API

Yes, I completely agree!

Comment by Andreas Dilger [ 07/Dec/12 ]

Yes, it is possible to set the layout through setxattr() before the file is opened, using a struct lov_mds_md header. Tar does it something like the following framework:

        rc = mknod(name, mode, 0);
        if (rc != 0)
                return rc;
        rc = setxattr(name, XATTR_NAME_LOV, lmm, sizeof(lmm), XATTR_CREATE);
        if (rc == 0)
                fd = open(name, flags);

This, however, breaks the ability to have O_EXCL creates with layouts. That isn't a problem for tar, but it would probably make applications unhappy.

It might be possible (haven't tried before, but would work in theory) to do:

        fd = open(name, flags | O_LOV_DELAY_CREATE, mode);
        if (fd < 0)
                return errno;
        rc = fsetxattr(fd, XATTR_NAME_LOV, lmm, sizeof(lmm));

since the O_LOV_DELAY_CREATE will have prevented the layout from being instantiated. This would allow O_EXCL to work on the file, though it would still be racy w.r.t. other threads accessing this file without the layout. I think that is still true with llapi_file_open() today, so it isn't necessarily a new limitation.

If this works and doesn't need client-side changes, then it would likely be compatible back to 1.8.something at least, where the tar/setxattr code was first added...

Comment by Christopher Morrone [ 12/Dec/12 ]

Andreas, you are correct, open+fsetxattr works. I had to use "lustre.lov" in user space instread of XATTR_NAME_LOV (which is "trusted.lov"), but otherwise it pretty much just worked.

Testing was on 2.1.2+ clients and servers.

Comment by Ned Bass [ 07/Feb/13 ]

Initial implementation of API discussed above:

http://review.whamcloud.com/#change,5302

Test program for the API:

https://gist.github.com/nedbass/4736037

One problem I ran into is that setting the pool name using fsetxattr() doesn't seem to work, while passing in the same lov data through ioctl does. That's likey a Lustre bug and not something I can fix in the API.

Comment by Andreas Dilger [ 08/Feb/13 ]

Ned, thanks for moving this forward. I agree there is likely a bug in ll_setxattr() that it doesn't copy the full pool information (using lov_mds_md_v1 instead of _v3 or something). Could you please file a bug for tracking that issue separately from the feature addition in this bug.

Comment by Ned Bass [ 08/Feb/13 ]

Could you please file a bug for tracking that issue separately

See LU-2786.

Comment by Robert Read (Inactive) [ 21/Mar/14 ]

This new API looks great. I suggest we consider putting this into a new library, though, instead of adding it to the existing liblustreapi. Ideally, we would also move hsm functions into the new library as well, and update the posix copytool (and other users of the hsm api) to use the new library. On first glance it looks like only a few functions such as fd2fid, path2fid will need to be reimplemented for the copytool to stop using old liblustreapi completely, but those pretty trivial.

Comment by James A Simmons [ 09/Jun/14 ]

Is this so the new libraries can be LGPL? Understand I'm not against moving to LGPL. The reason I ask is because I'm involved in the LU-5030 which is a rework of handling the parameters code. This would impact me as well.

Not to confuse users we should try to stick to the a similar library name. Can you create 2.0 version of the library of the same name and change the license? This way we can end up with a liblustreapi2.so with a LGPL license.

Comment by Andy Nelson [ 09/Jun/14 ]

I guess I would have though the best reason for a new library not mixed with the old is that the old one contains

1) redundant functionality that
2) is implemented in a much less user friendly way than the new api and
3) had/has problems with maintenance and bitrot over the years

From my end, as a user of the api, I'd rather only have to maintain one set of code
calling one set of functionality. The transition over to the new api will have its own pains
and such, and the old one is already causing me problems between the requirements that
my code compile/run on lustre 1.8x (crays) and on lustre 2.x (most everything else) and sill
accomodate the fact that none of these versions have a lustre_idl.h which compiles
in userspace at all. That means I have to hack the system various system ones for each version
and carry them around with me until I can just forget about that old api altogether.

The sooner the old api and the libraries implementing it die completely, the better.

Comment by James A Simmons [ 20/Jun/14 ]

Andy I brought up the discussion on hpdd-discuss about liblustreapi2.0. Feel free to voice your opinion.

Comment by Robert Read (Inactive) [ 08/Aug/14 ]

I agree with Andy, and would also like the new library be LGPL, if possible. Unfortunately, liblustreapi is a poor name for any library, not to mention It is confusingly similar to liblustre, and it isn't the Lustre API. If everything else is changing, then it makes sense to also choose a more suitable name for the new library as well, and then maybe we will be able to forget the old api, someday.

Comment by James A Simmons [ 29/Sep/14 ]

There is a bug in test 18 for llapi_layout_test.c which causes it to fail in my test setup. The error is due to the pool being set to lustre.* which is only valid if the file system name is lustre. It is possible to use a file system with a different name. Best way to handle this is make fsname in main() a global variable.

Comment by Ned Bass [ 29/Sep/14 ]

Works fine for me.

$ sudo FSNAME=ertsul OSTCOUNT=2 FSTYPE=zfs ./lustre/tests/llmount.sh
[...]
Starting client: t222:  -o user_xattr,flock t222@tcp:/ertsul /mnt/ertsul
Using TIMEOUT=20
seting jobstats to procname_uid
Setting ertsul.sys.jobid_var from disable to procname_uid
Waiting 90 secs for update
Updated after 6s: wanted 'procname_uid' got 'procname_uid'
disable quota as required
$ sudo ./lustre/utils/lctl pool_new ertsul.testpool
Pool ertsul.testpool created
$ sudo ./lustre/utils/lctl pool_add ertsul.testpool OST[0-1]
OST ertsul-OST0000_UUID added to pool ertsul.testpool
OST ertsul-OST0001_UUID added to pool ertsul.testpool
$ sudo LFS=/home/bass6/lustre-release/lustre/utils/lfs ./lustre/tests/llapi_layout_test  -d /mnt/ertsul
 test  0: Read/write layout attributes then create a file ... pass
 test  1: Read test0 file by path and verify attributes ..... pass
 test  2: Read test0 file by FD and verify attributes ....... pass
 test  3: Read test0 file by FID and verify attributes ...... pass
 test  4: Verify compatibility with 'lfs setstripe' ......... pass
 test  5: llapi_layout_get_by_path ENOENT handling .......... pass
 test  6: llapi_layout_get_by_fd EBADF handling ............. pass
 test  7: llapi_layout_get_by_path EACCES handling .......... pass
 test  8: llapi_layout_get_by_path ENODATA handling ......... pass
 test  9: llapi_layout_pattern_set() EOPNOTSUPP handling .... pass
 test 10: stripe_count error handling ....................... pass
 test 11: stripe_size error handling ........................ pass
 test 12: pool_name error handling .......................... pass
 test 13: ost_index error handling .......................... pass
 test 14: llapi_layout_file_create error handling ........... pass
 test 15: Can't change striping attributes of existing file . pass
 test 16: Default stripe attributes are applied as expected . pass
 test 17: LLAPI_LAYOUT_WIDE is honored ...................... pass
 test 18: Setting pool with fsname.pool notation ............ pass
 test 19: Maximum length pool name is NULL-terminated ....... pass
 test 20: LLAPI_LAYOUT_DEFAULT is honored ................... pass
 test 21: llapi_layout_file_create fails for non-Lustre file  pass
 test 22: llapi_layout_file_create applied mode correctly ... pass
 test 23: llapi_layout_get_by_path fails for non-Lustre file  pass
 test 24: LAYOUT_GET_EXPECTED works with existing file ...... pass
 test 25: LAYOUT_GET_EXPECTED works with directory .......... pass
 test 26: LAYOUT_GET_EXPECTED partially specified parent .... pass
 test 27: LAYOUT_GET_EXPECTED with non existing file ........ pass
 test 28: LLAPI_LAYOUT_WIDE returned as expected ............ pass
Comment by Jodi Levi (Inactive) [ 06/Oct/14 ]

Is any additional work begin tracked under this ticket, or can it be closed?

Comment by Ned Bass [ 06/Oct/14 ]

It can be closed. Thanks

Generated at Sat Feb 10 01:23:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.