[LU-3840] llapi_layout API design discussion Created: 27/Aug/13 Updated: 03/Feb/15 Resolved: 03/Feb/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Ned Bass | Assignee: | Ned Bass |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9942 | ||||||||||||||||
| Description |
|
Extensions to liblustreapi have been posted for review here. http://review.whamcloud.com/5302 The goal of the API extensions is to provided a user-friendly interface for interacting with the layout of files in Lustre filesystems, and to hide the wire-protocol details behind an opaque data type. This issue will provide a forum for potentials users and developers to discuss and critique the design of the proposed API extensions. |
| Comments |
| Comment by Andy Nelson [ 27/Aug/13 ] |
|
Hello folks, I am Andy Nelson. I am a user of the current Lustre API and expect As a bit more background, I maintain the IO capabilities in a large Anyway, before the new API gets implemented, I want to voice my input, Andy Nelson On the API itself I have seen the traffic on various linux lists about the merging Seems to me that it would be a good idea to float the Lustre interface On llapi_layout_t The idea of an opaque object to hold all of the various filesystem The problem I see is that you are handing the user a pointer that points to Given all that, it seems to me that a better solution is to provide the user On the llapi_layout man page While having a 'master' man page that holds all of the api functions On some of the llapi functions There appears to be much more functionality available via lfs than by I see no function for getting the number of OSTs in the filesystem. Near as I can tell, the llapi_layout_by_fid takes as its an argument llapi_layout_by_path: "likely, but not guaranteed to fail" on non-Lustre llapi_search_fsname is mentioned in the description to llapi_layout_by_path llapi_layout_stripe_size needs to have its get/set value be a 64bit integer. llapi_layout_pool_name[_set]: I am not sure whether or not pools can be On the error handling infrastructure of the API functions These comments are reproductions of some comments I made in a ticket The current error handling interface is not well put together. Get Ned mentioned in correspondence to me, the possibility of having Please, just go back to the unix philosophy that has existed for On the asymmetries between the get/set functions argument lists In current form, the 'get' functions return whatever it is they get Instead, they should both have arguments that define what it is they There is other precedent for gotten values being in the argument Speaking of versions and extensibility: there does not seem to be any naming The get and set functions in the api are paired together according int llapi_layout_stripe_count(const llapi_layout_t *layout); int llapi_layout_stripe_count_set(llapi_layout_t *layout, But there is naming asymmetry. The 'set' variant includes the 'set' string On an aesthetic level, when I read the 'llapi_layout_stripe_count_set', Ned made an argument to me for omitting 'get', based on the idea Here's why: Anyone using these api functions is with little doubt fairly sophisticated, The point that I'm getting to here is that every one of these interfaces is The thin shim interface does only the get/set stuff. All of the actual That makes having a 'get' in the name a useful thing, since the function |
| Comment by Andy Nelson [ 27/Aug/13 ] |
|
BTW: I should probably mention that my comments just above are based on what I |
| Comment by Andy Nelson [ 27/Aug/13 ] |
|
Doh. Just discovered a typo (not particularly important though). It |
| Comment by Ned Bass [ 27/Aug/13 ] |
|
Andy, thanks for your detailed input. Some of your comments go beyond the scope of the current effort, so I'm going to limit my response to the points we can move forward on.
The llapi_layout_t * points to user-space memory, and we're dealing with C here, so a user can scribble over the data no matter how many layers of abstraction are added. A file descriptor-like object would add complexity and overhead for no real additional protection. However, to help detect corruption we may want to add an internal magic value to the layout. The interface between the library and the kernel are well-established input-sanitizing system calls, namely getxattr and setxattr, so no new attack vector is introduced.
Agreed.
Actually, lfs uses llapi to implement its functionality. The problem is that llapi is largely undocumented, and if you think the proposed extensions are bad you would be horrified by the existing API. What we are trying to do here is implement a sane, portable, and well-documented set of functions to let users read and configure striping attributes. Overhauling and documenting the rest of llapi is sorely needed but well beyond the current scope. That said, there are existing ways to do the things you mention, but some of them rely on ioctl() so won't work on BGQ systems. I'll briefly mention the relevant llapi functions.
This is the API function. It uses an ioctl() that IBM claims to have implemented function-shipping support for on BGQ. However, it didn't work in my tests, so we need to continue working with them to fix it. /* @mnt - contains a path within the Lustre filesystem. * @count - value returned here * @is_mdt - get number of OSTs if 0, else get number of MDTs */ int llapi_get_obd_count(char *mnt, int *count, int is_mdt);
A function exists: int llapi_path2fid(const char *path, lustre_fid *fid); However, it relies on ioctl() so isn't portable. Typical users need not be concerned with lustre FIDs. They are primarily used by developers, filesystem monitoring software, or Hierarchical Storage Management (HSM) systems. It is the exactly the type of internal implementation detail that a proper API should hide from users. I did not include this function in my original design, but an Intel reviewer requested it.
It is explictly not a guarantee. In other words, don't rely on this, and instead use llapi_search_fsname() (which is admittedly undocumented, but we can provide an example). This lets the application check once on the directory, rather than the library blindly checking on every single file lookup, which could get rather expensive.
Lustre stipe size is a 32-bit type with a maximum valid value of 2GB. See the definition of lov_user_md in /usr/include/lustre/lustre_user.h.
No, assuming that by "from userspace" you mean "by an unprivileged user". Currently only system administrators can manage pools in Lustre.
Quoting the Lustre manual: "OST pools allows the administrator to associate a name with an arbitrary subset of OSTs in a Lustre cluster. A group of OSTs can be combined into a named pool with unique access permissions and stripe characteristics." LLNL does not use pools, and AFAIK it is not a commonly used feature in general.
This is the biggest flaw in the current design from my perspective. However, I think it can be addressed without abandoning the convention of getting values via the function return value and while still gracefully handling invalid user inputs.
In my view, the only good case for returing data via a pointer argument is to get a compound data type like a struct stat. In our case we are dealing with scalar integer types, and I don't forsee the need to change them to compound types in the future. After all, the point of the API was to abstract away the compound type struct lov_user_md behind simple accessor functions. Adding a pointer to the argument list increases the potential for user error and complicates the API specification without any real added utility. The real source of messiness is not that the return value can be overloaded with error information if something goes wrong (a well-established convention). Rather, the problem is that the API exposes the literal value (-1) of the exceptional return codes, rather than abstracting them behind macros. In error conditions the API can return the macro LLAPI_ERROR. Similarly, macros LLAPI_USE_ALL_OSTS and LLAPI_MDT_CHOOSES_OFFSET can be use in place of -1 to indicate those special cases.
I am willing to makes the names symmetric as you suggest. |
| Comment by Andy Nelson [ 28/Aug/13 ] |
|
More responses from me later, but one quick one now: I wrote: llapi_layout_stripe_size needs to have its get/set value be a 64bit integer. As far as I understand, even the current API permits stripe sizes up to 4GB and these would overflow a 32bit integer. As filesystems get bigger, I can imagine someone somewhere wanting bigger stripes than this too. You responded: Lustre stipe size is a 32-bit type with a maximum valid value of 2GB. See the definition of lov_user_md in /usr/include/lustre/lustre_user.h. and now I respond, to this with: The current documentation on the old API has this to say in section 32.3.2: stripe_size "Specifies stripe size (in bytes). Should be multiple of 64KB, not exceeding 4GB. and this to say in section 18.2.1 "The maximum stripe size is 4 GB" So is this documentation wrong, or is the new Lustre in versions >2.4 and its llapi reduced in functionality from the Lustre (the manual I got this from claims to be for version "2.x") and the old API? |
| Comment by Ned Bass [ 28/Aug/13 ] |
|
I was mistaken, the maximum stripe size is 4g - 64k, but the type is 32-bit and always has been, AFAIK. This is not a new limitation in Lustre 2.4. The documentation is inaccurate. |
| Comment by Andy Nelson [ 28/Aug/13 ] |
|
How is (4G-64k) possible with a signed integer type? |
| Comment by Ned Bass [ 28/Aug/13 ] |
|
Good point. It should be type size_t. |
| Comment by Andreas Dilger [ 30/Aug/13 ] |
|
Andy, thanks for your input. We often don't get enough feedback from end users, so I thank you for the time spent to write down this information. Some comments:
|
| Comment by Andy Nelson [ 30/Aug/13 ] |
|
Andreas, I've got more comments in draft form, but as far as "fetching the layout from the kernel Each of the get calls takes, as an argument from the user, the layout pointer. How are It is a very bad answer to say "the code will go "splat!" somewhere in the lustre library function, rather than simply handling the user footgun case and returning an error code that |
| Comment by John Hammond [ 03/Sep/13 ] |
|
Andy, Thanks again for your input. Can you give some more detail about which high level operations you would like? Especially any you see missing here. I expect that you want:
What would you add to this list? I am not familiar with the function shipping implementation used on BG. (But I believe that I can imagine how it works.) Could you provide a short description? Based on Ned's patch I assume that functions based on extended attributes are much better suited to function shipping that those based on ioctls(). (Again I can imagine why this might be.) Is that so? Are there any unusual limitations of function shipping that we should account for here? |
| Comment by Christopher Morrone [ 03/Sep/13 ] |
The compute nodes (CN) of BG/Q runs a light-weight OS. It provides many, but not all, of the common library and system calls found on Linux systems. The calls are basically converted into RPCs. The parameters are sent over the network to a node (called the I/O Node (ION)) that really runs Linux. The function/system call is made there on behalf of the user. When the function returns, the return code, error code, and associated buffers are copied back to the CN that made the call. The extended attribute functions have the pointer to the associated buffer and the buffer size clearly listed in the function parameter list. That makes it easy for the function shipping system to copy the complete buffer back and forth between the nodes. ioctl(), on the other hand, has special behavior for every ioctl request number. There is no single standard, which means that there is no generic function shipping code that can be written to handle all ioctl() calls. Special purpose code would need to be written for many of the ioctls. Therefore most ioctls, including all of the Lustre ioctls, are unsupported by the function shipping system. |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
John, Chris filled you in on the function shipping stuff better than I could. I'll just note that this type of thing is not at all specific to BG machines. Cray also has such for example, and I would guess that there are others. Basically any machine that runs some sort of reduced OS kernel is susceptible...someone has to make a choice somewhere along the lines of "what is in/what is out" and there is always something that someone wants that ends up on the wrong side of the line. The rest of your email I'll get to shortly. |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
All, Rather than respond to Ned's (and others) comments in one big response where things get lost, "The llapi_layout_t * points to user-space memory, and we're dealing with Re scribbling: that is true no matter which language if its in user |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
Re: ioctls: It appears that at least some of the "new" API still will require ioctls (e.g. the I thought that the whole point of the new API was to get rid of them altogether, due |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
"llapi_layout_by_path: "likely, but not guaranteed to fail" on non-Lustre It is explictly not a guarantee. In other words, don't rely on this, and Actually it is a guarantee: It is a guarantee that the result of the |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
"llapi_layout_stripe_size needs to have its get/set value be a 64bit" I'm glad you and Andreas now see this as a problem and will change to a 64bit type. |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
Re: lfs/llapi and such "Actually, lfs uses llapi to implement its functionality. The problem is ...well I guess you've got your work cut out for you then "What we are trying to do here is implement a sane, portable, and ok, then my comments should be taken as an indicator that some It seems to me that John Hammond (comment above) was also pushing in this |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
"I see no function for getting the number of OSTs in the filesystem. This is the API function. It uses an ioctl() that IBM claims to have /* @mnt - contains a path within the Lustre filesystem.
Some function that tells me how many OSTs (or whatever other things) I have The only other way to get that information (a messy hack) from where I work |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
"Near as I can tell, the llapi_layout_by_fid takes as its an argument A function exists: int llapi_path2fid(const char *path, lustre_fid *fid); Hmmm...that indicates that there are multiple customers of this API, For the kinds of stuff I do and for what I need, the FID stuff is |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
"llapi_layout_pool_name[_set]: I am not sure whether or not pools can be No, assuming that by "from userspace" you mean "by an unprivileged Also, it would be of some benefit to have some understanding of what a Quoting the Lustre manual: "OST pools allows the administrator to LLNL does not use pools, and AFAIK it is not a commonly used feature in This pool stuff exactly the sort of stuff I'd leave out of the API, if it were only |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
Below is a list of the current API functions as exposed in the llapi_layout llapi_layout_t *llapi_layout_by_path(const char *path); llapi_layout_t *llapi_layout_by_fd(int fd); These two functions are each requirements to have in an API that I use. llapi_layout_t *llapi_layout_by_fid(const char *lustre_dir, This function requires a fidstr pointer, which cannot be gotten from anything llapi_layout_t *llapi_layout_alloc(); void llapi_layout_free(llapi_layout_t *layout); These are useful/required functions under some conditions. There is a problem my_layout = llapi_layout_alloc(); rc = llapi_layout_stripe_count_get(my_layout,count); (pointer-ify as needed to get the syntax correct-my C is terrible): int llapi_layout_stripe_count(const llapi_layout_t *layout); int llapi_layout_stripe_count_set(llapi_layout_t *layout, int llapi_layout_stripe_size(const llapi_layout_t *layout); int llapi_layout_stripe_size_set(llapi_layout_t *layout, int llapi_layout_pattern(const llapi_layout_t *layout); int llapi_layout_pattern_set(llapi_layout_t *layout, int pattern); Must have all three pairs of these functions. The layout pattern function int llapi_layout_ost_index(const llapi_layout_t *layout, int llapi_layout_ost_index_set(llapi_layout_t *layout, const char *llapi_layout_pool_name(const llapi_layout_t *layout); int llapi_layout_pool_name_set(llapi_layout_t *layout, I cannot imagine any use for these functions and would remove them if it int llapi_layout_file_create(const llapi_layout_t *layout, Something like this is a requirement...have to be able to create files according I'd change the order of the arguments though, to match the order of open(2) with |
| Comment by Andy Nelson [ 03/Sep/13 ] |
|
"The current error handling interface is not well put together. Get This is the biggest flaw in the current design from my perspective. In my view, the only good case for returing data via a pointer argument As I understand your arguments, there are basically two reasons for 1) "Adding a pointer to the argument list increases the potential for user I would rebut this by saying that in my experience this is exactly the opposite As far as complicating the API, it actually simplifies it by making the get/set 2) People might want to use the return'd value in an expression as you wrote in In other words, you want to enable this kind of thing: bytes_in_a_full_stripe = llapi_get_stripe_count(layout)*llapi_get_stripe_size(layout); But what happens when you give the 'get' functions an invalid layout? And what happens when you have two or more of these functions in the same Better to design out this use case. Also, I reiterate here my argument about how I would use the gotten data: "The real source of messiness is not that the return value can be Perhaps in some contexts, but I'd claim your argument about 'using "Rather, the problem is that the API I don't really understand what this means. So what if the literal value |
| Comment by Ned Bass [ 05/Sep/13 ] |
|
John and Andreas, do you have a preference as to getting data via return value versus pointer argument? Valid arguments can be made for and against either style, so I'm happy go with the consensus as long as we're consistent throughout the API. |
| Comment by John Hammond [ 05/Sep/13 ] |
|
Ned, Either is fine be me as long as we're consistent and unambiguous about the error cases. I'd still like to see a list of the high-level operations that are needed from this simplified API. Much of what exists in llapi is there to support the administration and testing of Lustre and is unlikely to go away or be reformed at this time. We can however develop a simplified API (in it's own header or just highlighted in some way) tailored to application developers. To do that however, we need to enumerate the high-level use case operations(e.g. get stripe count of file, get stripe size for file, get max stripe count for FS, ...). Then we questions about ioctl vs xattr, return method, ... will be easier to answer. Note also that depending on the operations and the desire for backwards compatibility, we are not necessarily restricted to using the existing ioctls and xattrs. We could instead define a ioctl with no buffer but that just returns a single quantity (stripe size, count, ...) for the file or FS. This approach would bring fewer issues about function shipping, byte swapping, .... I assume that no real scientific applications use the lmm_objects[] part of lov_mds_md_vx and so if we can make thing easier by not handling it then all the better. Operation that would seem to require ioctls may not. They can be replaced with virtual xattrs and probably other low-level implementations. |
| Comment by Christopher Morrone [ 06/Sep/13 ] |
I think is a worthwhile discussion, but too broad in scope for this ticket. For this ticket, we need to remain focused on the final details of the llapi_layout functions so we can get this finished by the end of the month. |
| Comment by Andy Nelson [ 07/Sep/13 ] |
|
John, Here are some direct responses to your queries. • Functions to determine if a file descriptor or path (FDoP) belongs to a Lustre FS. These are already available by way of fstatfs/statfs…which has an entry for filesystem "super magic". Of course, • Functions to get the stripe size and stripe count for a FDoP. Yep, and any other fs parameters, such as raid type (though currently lustre can only do raid0 of course, • Functions to determine the max and min stripe size and count for files on a given Lustre FS. Yep. Presumably the max for count is the filesystem max…basically the number OST's or some such? • Functions to create files with given stripe size and count. Yep. Also, assuming lustre ever grows something beyond raid0, something to be able to set/get What would you add to this list? I'd say you covered what I need pretty well. There may also be a few other sorts of functions which The other thing that might be of some future importance (not now), would be if there suddenly Things I specifically don't need or want are things that have to do with "is this OST full or not" and Cheers, Andy |
| Comment by Ned Bass [ 02/Nov/13 ] |
|
I've attached manual pages for the revised API based on the discussion above. I've combined functions in a single man page where logical, i.e. free and alloc, _get and _set pairs, etc. Please review the new specification and provide feedback. llapi_layout.txt |
| Comment by Ned Bass [ 02/Nov/13 ] |
|
To highlight some of the changes from the previous version:
|
| Comment by Andy Nelson [ 05/Nov/13 ] |
|
Hi Ned et al., Here are some comments on the new API draft. Overall, this version looks Cheers, Andy 1) llapi_file_create/open I thought there was discussion to the effect that the layout argument would 2) It is implied by the code shown in the example in the llapi_layout manpage, but 3) Point out in a few more relevant places that llapi_layout_t points to an opaque 4) In the llapi_layout_pool_name_get/set page, it says that pools can be specified 5) I still see no function for getting the filesystem dimensions, in terms of |
| Comment by Ned Bass [ 05/Nov/13 ] |
|
Thanks for the feedback Andy.
You did make that suggestion, but I'm not convinced it's the right way to go. Consistency within the API should take precedence over consistency with external library functions. The llapi_layout_t handle is the unifying element across the API, so in my view it is natural for it to be the first argument for all functions that require it.
OK
OK
The current wording seems to be generating a lot confusion around OST pools. I'll try to make it more clear, but to avoid cluttering the man page I'll defer to the Lustre Operations Manual to provide the detailed background material.
I'll tackle this problem in a separate patch. |
| Comment by Ned Bass [ 05/Nov/13 ] |
|
Man pages revised based on above feedback, and other minor cleanup. llapi_layout.txt |
| Comment by Andy Nelson [ 05/Nov/13 ] |
|
Hi Ned, Here is one response to our remaining point of discussion, Andy You see the ordering as better the way it is now, with the My argument otherwise, is basically twofold: 1) There are actually two APIs that have to be considered here. One More than any other functions, the open/create functions My vote is still with the system API, largely to strengthen the 2) In my mind, the layout argument plays a very similar conceptual role 1)open 'filename' Basically, the 'open' and the 'filename' strings in the code, do (what) to <this> with <conditions> Separating 'do what' from <this>, breaks that grammar and readability do (what) with <conditions> to <this> with <more conditions> That isn't how I'd say it in words--its grammatically more complex. The get/set functions have a rather different sentence for me, For <this> tell me about <that> characteristic |
| Comment by Ned Bass [ 05/Nov/13 ] |
|
Andy, I can see you point so I've made the change: |
| Comment by Andy Nelson [ 05/Nov/13 ] |
|
with that change, and in the sense of the lkml patch submission process, Reviewed-By: Andy Nelson <andy.nelson@lanl.gov> When can I expect to see this implemented on sequoia? Thanks, Andy |
| Comment by Ned Bass [ 06/Nov/13 ] |
|
Andy, a compute node build is now installed on vulcan, rzuseq, and sequoia, under /usr/local/tools/liblustre. We haven't updated the packages installed on the LAC nodes yet, so you'll have to refer to the man pages attached to this issue for now. On vulcan, I was able to build and run the example program from the llapi_layout man page like this: $ mpicc -Wl,-rpath=/usr/local/tools/liblustre/lib -L/usr/local/tools/liblustre/lib -I/usr/local/tools/liblustre/include -llustreapi -dynamic -o liblustre_test liblustre_test.c $ srun -p pdebug -N 1 ./liblustre_test /p/lscratchv/`whoami`/`mktemp -u XXXX` Let's take further discussion unrelated to API design to an email thread, if needed. |
| Comment by Ned Bass [ 20/Nov/13 ] |
|
Andy, you may want to weigh in here. Andreas made this comment in the patch review system about llapi_layout_file_create() semantics. Do you have a preference regarding the current behavior versus what Andreas proposes? /**
* Create a file with the specified \a layout with the name \a path
* using permissions in \a mode and open() \a flags. Return an open
* file descriptor for the new file.
*/
int llapi_layout_file_create(const char *path, int flags, int mode,
const llapi_layout_t *layout)
This is inconsistent with the llapi_file_create() call, which just creates the
file but does not actually return the open file handle.
I don't think it makes sense to have this extra call just to avoid the
application needing to pass O_CREAT|O_EXCL to llapi_layout_file_open(),
but it is useful to have a version that does not return the open file handle.
|
| Comment by Andy Nelson [ 20/Nov/13 ] |
|
Hi Ned, I've gotten sidetracked on other things, so haven't fully implemented the new As to Andreas's comments: I cannot figure out any reason why I might want to "have a version that does not return the open Panasas has a similar thing in their API, which I also never understood. It is just needless As to having both llapi_layout_file_create/open, I like having both. That way makes it more nearly As to the open(2)/creat(2) calls, I just noticed another small skew between those calls and the lustre All that said, again I'll say that it makes more sense, and I prefer from a 'fullup API' As for looking the same as the old lustre api...the more different the better. I'd claim having two |
| Comment by Ned Bass [ 20/Nov/13 ] |
|
Thanks Andy. We have some discrepancies between what you think is useful versus what the patch reviewers at Intel think. I provided both open() and create() interfaces to try to appease both camps. I figured what you wanted was a guarantee that your file was created with the requested layout on a successful function return. That's why I chose those particular open() flags, as compared to creat(). O_EXCL because, as you say, the function doesn't make sense for existing files. Not O_WRONLY because I figured you might want the flexibility to do reads, but I'm happy to add it if you think consistency with creat() is more important. Not O_TRUNC because it doesn't make sense to combine with O_EXCL. On the other hand, some patch reviewers took exception to forcing those flags on in llapi_layout_file_open(), preferring to leave it up to the application. So, I settled on the providing both versions, which seemed like a reasonable middle ground. |
| Comment by Andy Nelson [ 20/Nov/13 ] |
|
Seems right to me. I expect the create function to create the file for me with the layout I specified. As to the O_WRONLY and O_TRUNC etc, I have no strong opinions about that. Just noticed it when I was |
| Comment by Ned Bass [ 20/Nov/13 ] |
|
Andy, also please be aware the names of the macros LLAPI_USE_FS_DEFAULT and LLAPI_USE_ALL_OSTS may change to LLAPI_LAYOUT_USE_FS_DEFAULT and LLAPI_LAYOUT_USE_ALL_OSTS. I left out LAYOUT to keep the names reasonably short, but Andreas requested the longer names for better consistency. Andreas, are you willing to keep the llapi_layout_file_create() interface as-is, in light of Andy's comments? What use case do you have in mind for a version that doesn't return an open file handle? |
| Comment by Andreas Dilger [ 16/Jul/14 ] |
|
A concern I have about the current patch that James thought would be better discussed in this ticket is the amount of layout validity checking that is part of the API. While some basic sanity checking is desirable in any API, it seems to me that there is now an excessive amount of sanity checking in userspace, which all needs to be replicated in the kernel anyway (because the kernel cannot depend on userspace to be correct and only populated with non-malicious users). A number of the sanity checks are "expensive" in that they open external /proc files and iterate over e.g. all mounted filesystems to check if the pathname belongs to a Lustre filesystem, or all OSTs to verify the OST pool specification, which can impact performance when creating millions of files or concurrently by hundreds of threads on a system with thousands of OSTs. My proposal is to have a separate function like llapi_layout_verify() or similar that can be called once to verify a layout after it has been created, and then the layout can be used repeatedly to create files. It could (optionally) store a flag in the opaque layout structure to indicate if the layout has been verified and this could be done automatically on the first use of the layout, and/or set by the implicit verification done by the kernel returning success. I think this handles both the desire for applications to get detailed error feedback if necessary, while not adding permanent runtime overhead to a commonly-used function. Applications which only care about success/failure can omit all checking and just get an error back from the kernel. |
| Comment by Ned Bass [ 16/Jul/14 ] |
|
I think this is a reasonable approach. Regarding a "verified" flag, we'd also need to store which filesystem(s) the layout was verified against. I don't see enough utility in saving the verification status to justify the added complexity, however. I say let the application track verification status. I also considered using a caching strategy to mitigate the validity checking overhead. The library could remember Lustre mount points, pool membership, OST lists, etc. But I don't care for the complexity of that approach either. |
| Comment by James A Simmons [ 16/Jul/14 ] |
|
Should it be int llapi_layout_verfiy(const struct llapi_layout *layout, char *fsname) or int llapi_layout_verfiy(const struct llapi_layout *layout, char *path) I personally tend to favor the fsname. approach. Once verified against a file system that layout can be used for any path. The other option is to change llapi_layout_create to have fsname has a option in creation. Of course the draw back is some layouts are created with only using the fd .Also we want to expose the validation function. |
| Comment by Ned Bass [ 16/Jul/14 ] |
|
It should take a path argument. Users should not have to bother figuring out the fsname from a path. They will just want to know if the layout is valid for the directory they're writing to. |
| Comment by Andy Nelson [ 16/Jul/14 ] |
|
I agree with Ned. I seem to recall times in my past when the name of the filesystem was obscured in some odd way, so that I could /scratch8/afn/blah being an ailas for something like this: /lustre/scratch8/afn/blah and so forth, done by the systems folk, always confuse me. They will confuse others too, I am sure. As far as the create function taking an fsname argument, I don't like that for exactly the same reason. Add to that, All that said, putting a "verified" flag into the opaque layout structure seems a good idea with one big flaw. Namely, In consequence, the layout structure can get corrupted in whole or in part, and its the 'in part' portion that is Splat. Or it has to go do all the verification over again, which is what the whole "verified" flag is supposed to short circuit. Given that, is it a better option to have the layout thingy (whatever its form), be some sort of entity that is more Perhaps a solution would be to make the "verified" flag some sort of checksum of the rest of the struct and if Another issue with using a path as the argument is probably the reason James likes the fsname alternative. Namely, |
| Comment by James A Simmons [ 16/Jul/14 ] |
|
Okay my local patch uses a path argument : int llapi_layout_verify(const struct llapi_layout *layout, const char *path) For llapi_file_open what I'm doing currently is {
/* If the user verifed the layout we still need to ensure that the
* requested file is located on the same lustre file system as the
* layout. */
if (strlen(layout->llot_fsname) != 0) {
/* Verify that the file path belongs to a lustre filesystem. */
rc = llapi_search_fsname(path, fsname);
if (rc < 0 || (strlen(fsname) == 0)) {
errno = ENOTTY;
return -1;
}
/* Verify the file system of the user supplied path is the
* same one the layout was created for. */
if (strcmp(fsname, layout->llot_fsname)) {
errno = ENOTTY;
return -1;
}
} else if (llapi_layout_verify(layout, path) < 0)
return -1;
So it forces a verify if it doesn't see llot_fsname. If verified we still need to make sure that the file is located on the same file system. A subdirectory could be another file system mount. So I handle my fsname paranoia. As for corruption, well that is harder to deal with. For checksum we have to write a check sum algothrim since userland libcfs is going away. Also with that small amount of data their is greater chance of checksum value collisons. Maybe we could get away with a hashtable. Again with libcfs going away that might be more challanging. Have to see if we can use just the header. Currently struct llapi_layout is opaque but I guess nothing stops the user from casting and scribbling. |
| Comment by John Hammond [ 16/Jul/14 ] |
|
What is the use case of llapi_layout_verfiy()? |
| Comment by Ned Bass [ 16/Jul/14 ] |
|
At some point I added a llot_magic canary field to use as a minimal sanity check. But I somehow lost that change in all the refreshes. I think we should add a magic field, but I'm opposed to further measures such as checksums to detect or prevent corruption. The application is ultimately responsible for not trashing its memory. |
| Comment by James A Simmons [ 16/Jul/14 ] |
|
The llot_magic is still there. Users will always find a way to break things :-/ |
| Comment by Ned Bass [ 16/Jul/14 ] |
|
I have to voice a concern that lack of a formal design process is threatening to derail this API. There is too much implementation going on with too little understanding of requirements, as John hinted at. There should be one architect/implementer to preserve unity of design and conceptual integrity, with others providing feedback, requirements, and code review. New public interfaces such as llapi_layout_verify() should be fully documented in man page format and reviewed here before an implementation is submitted for review in gerrit. So, let's precisely nail down exactly what our validity checking requirements are before charging ahead with implementation. We need concrete answers to at least:
|
| Comment by John Hammond [ 16/Jul/14 ] |
|
Let me rephrase. What are the use cases of validity checking in user space? I'm in favor of good interfaces that answer specific questions like: "How many stripes can I have?" or "What are the minimum and maximum stripe sizes?" These are easy to answer. On the other hand, offering an API to answer "Is this striping valid?" makes me a bit uncomfortable. It's a bit like being asked "Where babies come from?" by someone else's kid. There are too many details to ensure that striping that passes this function will be always be accepted and created by Lustre. Setting implementation aside, how do I use such a function? Do I create various hypothetical striping and pass them to verify? This seems like a parameter search to answer the questions from the first paragraph. |
| Comment by Ned Bass [ 16/Jul/14 ] |
|
Yes. John's comments cut to the heart of the matter are in line with what Andy has been asking for. The ultimate test of a layout's validity is whether Lustre accepts and creates it. But the API should provide interfaces that allow the user to determine valid layout values. At this point I think http://review.whamcloud.com/#/c/5302/ should be abandoned and replaced with a new change ID based on patch set 23. That was the last stable point in the revision series (aside from errno stomping bugs caught by Frank), and there has been too much churn since then for me to review the changes with any confidence. Future revisions should be based strictly on design changes agreed to here, and avoid unnecessary refactoring of code that has already been reviewed, tested, and debugged. For now, I respectfully request to be the only person to push changes to the review. Others are of course welcome to contribute dependent patches as separate reviews. I'm grateful that James took the initiative to move this forward, but I think it's important for consistency's sake to just have one chef in the kitchen. Please communicate any new requirements for To summarize, I propose that we do the following, in order.
Are others on board with this plan? |
| Comment by James A Simmons [ 16/Jul/14 ] |
|
Okay. I will take all the changes I did in 5302 and place it in the I promise to break the work I do in Perhaps Ned you could consider breaking your patch up into smaller pieces. It is a lot of work. |
| Comment by Ned Bass [ 16/Jul/14 ] |
|
That is one reason I want to start over from patch set 23. It was pretty thoroughly reviewed at that stage so I'd like to see it land with only minor changes and do follow-up work in separate patches. I don't see much benefit to breaking up what's already been reviewed though. The main obstacle to landing it is the expensive validity checks that Andreas objects to. But those checks can simply be removed for now, and replaced with improved interfaces as discussed above in subsequent patches. How would that sit with you, Andreas? |
| Comment by Andreas Dilger [ 17/Jul/14 ] |
|
I don't see any benefit to abandoning 5302and creating a new change over just pushing a new patch which drops the changes you don't want. Making a new change just means more places to look for information about this change. As for checks, I agree with John. Simple checks against hard (constant) limits are fine, but I'd prefer to do any complex checks (e.g. opening /proc files and iterating) in a separate function. The kernel has to do all of these checks itself anyway. The main drawback is that the xattr interface that is needed for BG/L is clumsy because it might "succeed" on any filesystem with xattr support, but not actually create the striped file as expected. The ioctl() interface was less troublesome in this regard since it is unlikely that any filesystem would handle the Lustre ioctl command. It is worthwhile to verify if setxattr("lustre.lov") will work on other filesystems or if they will refuse the "lustre.lov" xattr because it is not in one if the normal namespaces ("user", "system", "trusted", or "security") that are handled by the kernel. |
| Comment by Ned Bass [ 17/Jul/14 ] |
We can keep working there. I just find it becomes hard to navigate when the comment and revision history gets too long.
I have not found that to be the case in practice (ext4, zfs, tmpfs, and nfs). I believe (but haven't verified in the code) that a filesystem must explicitly register support for an xattr namespace beyond the standard ones, otherwise the kernel will return EOPNOTSUP. |
| Comment by Ned Bass [ 04/Aug/14 ] |
|
Updated attached man pages to reflect recent API changes. In particular
Please review the documentation of the new flags parameter in llapi_layout_get_by_fd.txt |
| Comment by Andreas Dilger [ 27/Aug/14 ] |
|
One thing that snuck past my review in these patches was that the new llapi_layout_*() functions are all returning "-1" to the caller and returning the error codes via "errno" instead of returning the negative error numbers directly to the callers. IMHO, "errno" is the domain of the kernel and libc and should not be used by application libraries. This is a global variable that could be touched by many parts of the process, and there is the danger that errno gets clobbered by other parts of the code, leading to ugliness like: if (lum == NULL) { tmp = errno; close(fd); errno = tmp; return -1; } and every piece of code that is returning an error having to do it twice: if (path == NULL || (layout != NULL && layout->llot_magic != LLAPI_LAYOUT_MAGIC)) { errno = EINVAL; return -1; } My preference would be to fix the new llapi_layout_*() functions to return the negative error number directly and avoid errno entirely. |
| Comment by Ned Bass [ 27/Aug/14 ] |
|
I'm on board with this. I just had a hallway discussion about the various approaches for returning errors, and negative errno return values were generally agreed to be the least evil of the following possibilities:
|
| Comment by Christopher Morrone [ 28/Aug/14 ] |
I think that the lustre library should be considered a system library, not an "application library". From a normal application's perspective, the lustre library is as system as they come: you are using library the interacts on your behalf directly with the kernel to influence a service offered by the kernel. In that respect I would argue that errno is entirely reasonable to use. I would argue that this kind of error handling is exactly what user-space C developers have come to expect from system level libraries. After all, if it isn't OK for use to use errno, is it really OK for us to reuse all of the standard error codes (EIO, EINVAL, etc.)? Shouldn't we have to invent our own error names and values if those things are only the purview of the kernel and Lib C? Granted, using temporary variable to implement the use of error is mildly annoying. But is that really enough justification to violate the principle of least surprise for the user-space developers who will be consuming our library functions? I think that the "Return negated errno values" approach is probably the least desirable of those proposed. This is a kernel-ism; the result of a clever hack that recognized that those memory values would never be valid so hey why not throw the error value in there. In user space, the programmers are going to think we have lost our minds if we force them to check for an negative version of an error code that is always positive everywhere else. At the very least, we would need to create macros or functions that the users would need to use to check the return code and another to translate the error code into the correct value. We would be shifting the annoying error code shuffling from the library writer to all of the library consumers. errno is defined by the C language standard. Most, if not all, of the values of errno that we use (EACCESS, EAGAIN, EIO, EISDIR, etc.) are defined by POSIX.1-2001 or C99, not inventions of the Linux kernel. If we use the same values as errno, users are going to want to use standard functions like perror() that assume the use of errno. Granted, strerror() also exists, but it is more difficult to use than perror(). It is already difficult to get users to check error codes, so I think it is important to keep things simple when reasonable to do so. |
| Comment by Andreas Dilger [ 28/Aug/14 ] |
|
If there is an insistence on setting errno to return errors, it would still be possible to also return the negative errno instead of "-1" all the time. I'm not suggesting to return PTR_ERR() instead of NULL in case of memory allocation failures, as I agree that this is not very common for userspace programs. |
| Comment by Andy Nelson [ 28/Aug/14 ] |
|
FWIW, I have already implemented the version of the api that uses errno stuff, and put strerror calls in various rc = llapi_layout_stripe_count_get(layout,&num_comps ); if(rc!=0){*ierr=-1;goto writeerrout;}; ... writeerrout: This api is consistent with how I do things in other parts of the code as well, e.g. with stat calls and such. As an implementer of userland code that uses the llapi functionality, I strongly prefer the errno approach. |
| Comment by Ned Bass [ 28/Aug/14 ] |
We could, but what would be the benefit? It would be awkward and potentially confusing for the API to specify more than one authoritative source of error codes. And as Andy points out, applications using both standard library and llapi_layout calls would prefer to use common error handling constructs. |
| Comment by Andreas Dilger [ 28/Aug/14 ] |
|
Because it would keep the same API that the rest of liblustreapi has today, and there is no drawback to doing so. Virtually every application I've seen checks if (rc < 0) instead of if (rc == -1), so as it is clearly documented in the man pages that the functions return a negative value on error instead of "-1" there shouldn't be any problem. That allows applications to choose which behaviour they want to use for programming, and I don't see it introducing any significant complexity into the library - at worst it would mean return -errno instead of return -1 in some places. |
| Comment by Ned Bass [ 28/Aug/14 ] |
If the man page merely specifies a negative value on error, than a properly written application shouldn't rely on the specific value returned. Therefore in order for this behavior to be useful, it most become a formal part of the API and documented as such. Simplicity is arguably the most important aspect of a well-designed API, and having two alternative means of returning error codes would add complexity for little gain. I'm not persuaded we should do it just because the legacy llapi functions do it. We should see this as an opportunity to get things right this time, not to carry over all the old baggage. |
| Comment by Andy Nelson [ 29/Aug/14 ] |
|
I am all for abandoning the old api as soon as possible. It is a maintenance nightmare for my applicaiton code to handle. There is ugly and otherwise unneeded ifdef goo for different machines all over the place. Please dump that api. What Ned says is exactly right: there is a chance to do it right/better this time around. As far as "negative value returned" vs "specific negative value returned (i.e. rc=-ETHE_ERROR_CODE), again, this leads to confusion and complexity that is just not needed. What happens when someone makes a commit sometime and gets them out of sync, such that the return code and errno are not the same any more, for example? Yes, thats a bug that |
| Comment by Andreas Dilger [ 29/Aug/14 ] |
|
Andy, while I'm all for updating applications to the new API, yours is not the only application in the world that uses it, so we can't remove the old API very quickly. Also, this new API has only just been landed into a development branch and would need to be backported into the maintenance releases before it even has a chance to be used by regular users. I think the best that can be done is to include this into all of the maintenance releases, update the old APIs to use this new code (James has started on that) and then it can be deprecated in a few years after it is available in those releases. It is also possible to mark those functions as deprecated in the headers so that application developers learn this before the API is removed. It would also make sense to update the user manual once this API is available in the maintenance releases. |
| Comment by James A Simmons [ 29/Aug/14 ] |
|
I will hold off on my patches until this is resolved. I sent out a email to some people in our applications department to see what they say. I will report on that feedback. |
| Comment by Andy Nelson [ 29/Aug/14 ] |
|
I fully understand the constraints of backwards compatibility and "can't get rid of that yet" cruft. That was very much the point of my previous comment: I've got to carry around backwards compatibility stuff for the different lustre api's until such time as the new one is available on the oldest machine I have to compile code on. I'm expecting that I can remove the "old API" functionality from my own code in something like 5 years or so, if the "new API" stuff lands in a distributed version right now. And that isn't even the case yet, so the date keeps getting pulled further and further out. An example of that pain is the lustre_idl.h file, which I have to hack by hand and carry around with my code since it doesn't compile in user space and I need stuff out of there. The example codes in the lustre manual include it for example too. All the more complicated since that file changes in odd ways across different versions and I can't just use a 1.8x lustre_idl.h file on a lustre 2.x installation and expect it to work... So this is why my emphasis on the "start now" and get the clock ticking. As far as new api format (errno and such discussions), we've gone over that, but here is a reiteration: My position as an applications developer is to have one way to access the error information, and for that way to be the errno setting stuff. |
| Comment by Christopher Morrone [ 04/Sep/14 ] |
The old API is seriously deficient in its design. Yes, we need to keep it around for some time, but I don't see a great deal of value in maintaining compatibility with poor design. This is our chance to make a clean break and make something that applications can really use.
Granted, that is not uncommon. But the man page usually says:
So it is perhaps not unreasonable to guess there are are also folks out there that are used to explicitly checking for -1. Lustre returning only -1 could be considered the least likely to trip anyone up because it will work for checks of either rc == -1 or rc < 0. |
| Comment by James A Simmons [ 11/Sep/14 ] |
|
So I got feedback from our users and middle ware developers. For our users they only care about lfs setstripe and getstripe working. They will most likely never program the striping api themselves. For the ones that do program the most common case is they test for success which means test for zero and then bomb out the app. They normally could care less about returned error values. What is most important to them is that a zero is returned for all functions on success. For our middle ware guys they also want zero to returned for all cases. Now for the error reporting they varied a bit on opinion. What I did get is they don't like negative errno values. They also were not fans of having to read errno itself after a function call. So they asked why not just return a positive errno instead since 0 is success and something else is failure. Also I found out in the discussion return 0 on success and a positive errno is defined in the POSIX.1c standard. |
| Comment by Andreas Dilger [ 12/Sep/14 ] |
|
I would much rather stick with returning -1 and errors in errno than having positive error numbers returned from the functions. Many programs I've seen check "if (rc < 0)" for errors instead of "if (rc == -1)" , so I don't think it makes sense to break these. |
| Comment by Jodi Levi (Inactive) [ 03/Feb/15 ] |
|
Duplicate of |