[LU-3840] llapi_layout API design discussion Created: 27/Aug/13  Updated: 03/Feb/15  Resolved: 03/Feb/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Ned Bass Assignee: Ned Bass
Resolution: Duplicate Votes: 0
Labels: None

Attachments: Text File llapi_layout.txt     Text File llapi_layout_alloc.txt     Text File llapi_layout_file_create.txt     Text File llapi_layout_get_by_fd.txt     Text File llapi_layout_ost_index_get.txt     Text File llapi_layout_pattern_get.txt     Text File llapi_layout_pool_name_get.txt     Text File llapi_layout_stripe_count_get.txt     Text File llapi_layout_stripe_size_get.txt    
Issue Links:
Related
is related to LU-2182 Add llapi_file_get_layout() function ... Closed
is related to LU-3480 Layout Enhancement Closed
is related to LU-4665 utils: lfs setstripe to specify OSTs Resolved
Severity: 3
Rank (Obsolete): 9942

 Description   

Extensions to liblustreapi have been posted for review here.

http://review.whamcloud.com/5302

The goal of the API extensions is to provided a user-friendly interface for interacting with the layout of files in Lustre filesystems, and to hide the wire-protocol details behind an opaque data type.

This issue will provide a forum for potentials users and developers to discuss and critique the design of the proposed API extensions.



 Comments   
Comment by Andy Nelson [ 27/Aug/13 ]

Hello folks,

I am Andy Nelson. I am a user of the current Lustre API and expect
to be a user of the proposed new API in the future. Probably more
than anyone else, my complaints about the previous version are the
reason it is getting attention at all In that context, I want to
make sure it is done right, so as to meet my needs.

As a bit more background, I maintain the IO capabilities in a large
code project at Los Alamos National Lab. This code has to run portably
(and for the most part, does so) across many very large systems, on
top of Lustre, Panasas, GPFS and anything else we happen to get handed
when a new machine is procured or we go someplace else. As an indicator
of scale, we have various tracking profiling in the codes we manage, which
indicate that our users typically consume around 500-1500 CPU centuries per
week and do IO at a sustained rate (i.e. averaged over all jobs during a
week or some such) of 500MB-a few GB/s, all the time. Specific jobs get
IO rates up to 32GB/s or so, to single files on Lustre 1.8, where we
are rate limited because 1.8 supports striping across only 160 OSTs.
I'm working with the LLNL folk to implement the new API on our codes
as they get ported to sequoia, and I'm hoping we can get much higher
rates there. Early on, Chris Morrone did a bit of profiling using
a synthetic, and as I recall guessed that we could get up to perhaps
200GB/s to single files there, once various bugs/limitations/etc were
worked out. We'll see.

Anyway, before the new API gets implemented, I want to voice my input,
so that it can be made to work for me in its deployed form. I had been
providing input to LLNL bug tracking tickets, but I think the stuff
below would benefit a broader audience and so am pushing it here as well.
As new AP draft versions come out, I'm guessing I'll have more comments
and inputs, but what you see below is what I'm able to do for this revision.
The punchline is: "Not ready yet, needs more work"

Andy Nelson

On the API itself
=================

I have seen the traffic on various linux lists about the merging
of Lustre into linux. Given this fact, the Lustre API will become
a part of the Linux API as well, which has its own benefits/constraints.
Right now, as a filesystem specific interface, the fact that
it is filesystem specific may not be particularly problematic.
As time passes though, there will probably be various thoughts along the
lines of providing a generic interface to the same/similar functionality
that exists in other fs's, that either exists now or may grow
in the future. Various bits of such things as these have historically
been provided by ioctls, which we have seen on sequoia get rather
problematic, not to mention fs specific and riddled with inconsistencies
and hacks to get someone/something going (not talking about Lustre
only here...I think its pretty ubiquitous).

Seems to me that it would be a good idea to float the Lustre interface
to some list like linux-fsdevel, for feedback. If they aren't interested
now, you at least want to be well positioned to say "I told you so" and
"You're late to the party, the wheel has already been invented" 5-10 years
down the line when they do start having that interest. No matter when they
express interest, its a guarantee that they will complain about something
though. Better to be early and prepared than late and left out. Then at
least you can be justified when you insult them back for being behind the
times and not thinking of things that are in your bread and butter land,
after they insult you for being clueless idiots who can't write code and
such (you do know that is exactly what they will do don't you???).

On llapi_layout_t
=================

The idea of an opaque object to hold all of the various filesystem
specific information for a given 'layout' is a good idea and the right way
to go. I have a suspicion that this implementation is not going to fly
very well though.

The problem I see is that you are handing the user a pointer that points to
some opaque blob of memory. I don't know whether this is kernel memory or
user memory but either way it is not a good option. If it is kernel memory,
its is problematic on all sorts of levels. It gives malicious users not just
a hole, but an actual, wide open door to stroll on through to get access
to/change OS data. While I am certainly not qualified to comment wisely on
the sundry details, it seems dangerous to me. If the layout data are in the
user's own space, it still seems like a more dangerous thing than it needs to
be. In userspace, a code can accidentally scribble on the layout data, so
you'd have wierd bugs cropping up due to invalid layout data being pointed
at by a valid layout pointer.

Given all that, it seems to me that a better solution is to provide the user
with some sort of thing that looks and behaves like a file descriptor.
Basically an int or something that system code can use in its own way
to point to its own opaque data structure for what it needs, but which the
user never ever sees directly. For the sake of having a name, call it a
layout_fd for the moment. It would be valid after being created and can be
applied to any future file creations on the filesystem for as long as the
job runs, or it is destroyed by some 'layout_fd free' function. Of course,
that means a bit of cleanup infrastructure to handle cases where the job
either aborts or forgets to free the layout_fd itself.

On the llapi_layout man page
============================

While having a 'master' man page that holds all of the api functions
is good, it seems to me to be a much better thing to have individual
man pages for each of the different functions, with only a summary
man page in place of the current llapi_layout page, that includes a
"see also" section at the bottom for the rest, like many other system
functions have. Seems a better idea to have individual pages for each
function, with the the get/set variants of each function in the same page.
That way the user doesn't have to pore through lots of useless-in-the-moment
stuff, just to find the one function that is needed in that moment.

On some of the llapi functions
==============================

There appears to be much more functionality available via lfs than by
llapi. Much of this functionality would be useful to various people from
llapi as well. In fact, some of the lfs produced information appears
pretty much required to be able to use some of the current llapi functions.
What plans do you have to extend llapi to include this stuff?

I see no function for getting the number of OSTs in the filesystem.
This is an absolutely essential function to have: How can I set valid
'stripe count' values if I don't know how many stripes I can have?

Near as I can tell, the llapi_layout_by_fid takes as its an argument
some sort of lustre internal-ish name (the 'fid')...but there is no
way to get this from any llapi function. Need to have that, if
llapi_layout_by_fid is to be useful to anyone.

llapi_layout_by_path: "likely, but not guaranteed to fail" on non-Lustre
filesystems???? It is unacceptable to provide guarantees that aren't
guarantees like this. Make it fail all the time on non-Lustre filesystems.

llapi_search_fsname is mentioned in the description to llapi_layout_by_path
but nowhere else. Describe it.

llapi_layout_stripe_size needs to have its get/set value be a 64bit integer.
As far as I understand, even the current API permits stripe sizes
up to 4GB and these would overflow a 32bit integer. As filesystems get
bigger, I can imagine someone somewhere wanting bigger stripes than
this too.

llapi_layout_pool_name[_set]: I am not sure whether or not pools can be
defined from userspace. Is it correct that they can be? Also, it would
be of some benefit to have some understanding of what a pool actually
is, why I might want one, and how to apply a pool to my IO. At least somewhere
if not in the man pages. I'm a bit confused on the concept right now.

On the error handling infrastructure of the API functions
=========================================================

These comments are reproductions of some comments I made in a ticket
to the LLNL JIRA system for dealing with Sequoia Blue Gene issues,
reproduced here for the broader audience.

The current error handling interface is not well put together. Get
functions return the value gotten, except when they don't, and when
they don't their return value is negative, except that some valid
values are negative so it isn't always true that a negative value
is an error, and so go check the errno to be sure. That is just a
broken by design way of doing an interface.

Ned mentioned in correspondence to me, the possibility of having
the get functions never fail. This is not an acceptable alternative
because the get call requires an argument from user code, typically
the pointer to the layout thing. Since this comes from the user,
it may be invalid: users and applications coders are inevitably
more inventive than you can ever hope to predict at design time (i.e. now).
Without an error/fail path, the code will end up in some sort
of unconditional "Splat!" in the lustre library code itself, which
is a very poor result.

Please, just go back to the unix philosophy that has existed for
decades and have one thing do one thing ONLY, but do it well, rather
than overload those return values with error characteristics in some
cases, but not even consistently for that.

On the asymmetries between the get/set functions
================================================

argument lists
==============

In current form, the 'get' functions return whatever it is they get
as a return value rather than as a function argument. The 'set'
functions take an argument of whatever it is they are setting. This
is asymmetric, and not a good thing. The way it is now, the get
functions have their return value be the thing that was gotten,
and the set values having it as an argument. In other words, it is
intrinsically a messier interface, exactly as noted above with the
error handling business.

Instead, they should both have arguments that define what it is they
are get/setting.

There is other precedent for gotten values being in the argument
list of the function itself. Namely stat/statfs, which each take a
struct argument that aggregates getting a bunch of stuff into one
call. It would be a good thing to follow this same model, though
obviously with the variation that here getting a non-opaque struct
of lots of stuff together is replaced by getting a non-opaque single
value. That leave the possibility for extensibility a lot more open,
since you no longer have to worry about new/old versions of some struct.

Speaking of versions and extensibility: there does not seem to be any
query function for getting the API version. This would be a good thing
to have.

naming
======

The get and set functions in the api are paired together according
to purpose. One half of the pair gets, the other half sets. Here
is one example:

int llapi_layout_stripe_count(const llapi_layout_t *layout);

int llapi_layout_stripe_count_set(llapi_layout_t *layout,
int stripe_count);

But there is naming asymmetry. The 'set' variant includes the 'set' string
in its name, the 'get' variant does not. This is not good.

On an aesthetic level, when I read the 'llapi_layout_stripe_count_set',
I immediately know what it does...it sets stripe count. The asymmetry
of not having the get variant say 'get' in its name, means that I don't
have that same self-documentation, which is jarring somehow to me. It
leaves me with some sort of a `what does it do?' hole in my expectations
or something.

Ned made an argument to me for omitting 'get', based on the idea
that the get functions are or should be most easily usable as 'inline'
things that can appear in various mathematical expressions and such.
In some sense, I see this as an implication that the functions are to
be read as actually being the thing they are getting rather than
being somehow separate from the thing they are getting. This is certainly
inconsistent with the way I already use various get/set functionality for
lustre and other other filesystems, and what I'd expect others in my
position to do as well.

Here's why:

Anyone using these api functions is with little doubt fairly sophisticated,
in terms of wanting to get the best performance etc. That means that they
also have a fairly large codebase to maintain and-most importantly-support
on lots of different platforms, of which Lustre is only one. Panasas is
another for which it is possible to do raid/striping/fs parameter optimization
and setting from inside a job. I think there are probably others as well.
GPFS in my experience has a bit, but not much usable by the likes of me (mostly
for sysadmins and such). XFS? Haven't used it except on my linux box
at home, but I know it runs on large machines in various places and has an
extensive API interface by way of something called 'xfsctl'. Etc.

The point that I'm getting to here is that every one of these interfaces is
different and requires different code and functions to get the various
parameters that are needed from the filesystem and to set them. Why is that
important? It is important because it means that the "right" way
for an application to code to that sort of portability requirement
is to have comparatively thin 'shim' interfaces to each filesystem, with
the bulk of the code in some common, platform neutral form that can
be used on all or most of the different platforms without change.

The thin shim interface does only the get/set stuff. All of the actual
manipulation gets done in the application's platform independent code layer
because once in that form, the application can do all of its own
manipulation in one place on its own data structures, rather than having
many duplications of the same thing, one for each platform. So in that model,
the get functions aren't doing anything except the actual get itself,
and so won't be used inline in expressions, as you were assuming.

That makes having a 'get' in the name a useful thing, since the function
is getting something other than itself, rather than actually being the
thing gotten as is assumed in Ned's model.

Comment by Andy Nelson [ 27/Aug/13 ]

BTW: I should probably mention that my comments just above are based on what I
believe is patch version 15, as enumerated in the review.whamcloud site.

Comment by Andy Nelson [ 27/Aug/13 ]

Doh. Just discovered a typo (not particularly important though). It
should be: 5-15 CPU centuries per week, not 500-1500CPU centuries per week.
The latter is rather a lot

Comment by Ned Bass [ 27/Aug/13 ]

Andy, thanks for your detailed input. Some of your comments go beyond the scope of the current effort, so I'm going to limit my response to the points we can move forward on.

The problem I see is that you are handing the user a pointer that points to some opaque blob of memory.
...
Given all that, it seems to me that a better solution is to provide the user with some sort of thing that looks and behaves like a file descriptor.

The llapi_layout_t * points to user-space memory, and we're dealing with C here, so a user can scribble over the data no matter how many layers of abstraction are added. A file descriptor-like object would add complexity and overhead for no real additional protection. However, to help detect corruption we may want to add an internal magic value to the layout.

The interface between the library and the kernel are well-established input-sanitizing system calls, namely getxattr and setxattr, so no new attack vector is introduced.

While having a 'master' man page that holds all of the api functions is good, it seems to me to be a much better thing to have individual man pages for each of the different functions, with only a summary man page in place of the current llapi_layout page, that includes a "see also" section at the bottom for the rest, like many other system

Agreed.

There appears to be much more functionality available via lfs than by llapi. Much of this functionality would be useful to various people from llapi as well. In fact, some of the lfs produced information appears pretty much required to be able to use some of the current llapi functions. What plans do you have to extend llapi to include this stuff?

Actually, lfs uses llapi to implement its functionality. The problem is that llapi is largely undocumented, and if you think the proposed extensions are bad you would be horrified by the existing API. What we are trying to do here is implement a sane, portable, and well-documented set of functions to let users read and configure striping attributes. Overhauling and documenting the rest of llapi is sorely needed but well beyond the current scope.

That said, there are existing ways to do the things you mention, but some of them rely on ioctl() so won't work on BGQ systems. I'll briefly mention the relevant llapi functions.

I see no function for getting the number of OSTs in the filesystem.

This is the API function. It uses an ioctl() that IBM claims to have implemented function-shipping support for on BGQ. However, it didn't work in my tests, so we need to continue working with them to fix it.

/* @mnt    - contains a path within the Lustre filesystem.
 * @count  - value returned here
 * @is_mdt - get number of OSTs if 0, else get number of MDTs
 */
int llapi_get_obd_count(char *mnt, int *count, int is_mdt);

Near as I can tell, the llapi_layout_by_fid takes as its an argument some sort of lustre internal-ish name (the 'fid')...but there is no way to get this from any llapi function. Need to have that, if llapi_layout_by_fid is to be useful to anyone.

A function exists:

int llapi_path2fid(const char *path, lustre_fid *fid);

However, it relies on ioctl() so isn't portable. Typical users need not be concerned with lustre FIDs. They are primarily used by developers, filesystem monitoring software, or Hierarchical Storage Management (HSM) systems. It is the exactly the type of internal implementation detail that a proper API should hide from users. I did not include this function in my original design, but an Intel reviewer requested it.

llapi_layout_by_path: "likely, but not guaranteed to fail" on non-Lustre filesystems???? It is unacceptable to provide guarantees that aren't guarantees like this. Make it fail all the time on non-Lustre filesystems.

It is explictly not a guarantee. In other words, don't rely on this, and instead use llapi_search_fsname() (which is admittedly undocumented, but we can provide an example). This lets the application check once on the directory, rather than the library blindly checking on every single file lookup, which could get rather expensive.

llapi_layout_stripe_size needs to have its get/set value be a 64bit integer. As far as I understand, even the current API permits stripe sizes up to 4GB and these would overflow a 32bit integer. As filesystems get bigger, I can imagine someone somewhere wanting bigger stripes than this too.

Lustre stipe size is a 32-bit type with a maximum valid value of 2GB. See the definition of lov_user_md in /usr/include/lustre/lustre_user.h.

llapi_layout_pool_name[_set]: I am not sure whether or not pools can be defined from userspace. Is it correct that they can be?

No, assuming that by "from userspace" you mean "by an unprivileged user". Currently only system administrators can manage pools in Lustre.

Also, it would be of some benefit to have some understanding of what a pool actually is, why I might want one, and how to apply a pool to my IO. At least somewhere if not in the man pages. I'm a bit confused on the concept right now.

Quoting the Lustre manual: "OST pools allows the administrator to associate a name with an arbitrary subset of OSTs in a Lustre cluster. A group of OSTs can be combined into a named pool with unique access permissions and stripe characteristics."

LLNL does not use pools, and AFAIK it is not a commonly used feature in general.

The current error handling interface is not well put together. Get functions return the value gotten, except when they don't, and when they don't their return value is negative, except that some valid values are negative so it isn't always true that a negative value is an error, and so go check the errno to be sure. That is just a broken by design way of doing an interface.

This is the biggest flaw in the current design from my perspective. However, I think it can be addressed without abandoning the convention of getting values via the function return value and while still gracefully handling invalid user inputs.

There is other precedent for gotten values being in the argument list of the function itself. Namely stat/statfs, which each take a struct argument that aggregates getting a bunch of stuff into one call. It would be a good thing to follow this same model, though

In my view, the only good case for returing data via a pointer argument is to get a compound data type like a struct stat. In our case we are dealing with scalar integer types, and I don't forsee the need to change them to compound types in the future. After all, the point of the API was to abstract away the compound type struct lov_user_md behind simple accessor functions. Adding a pointer to the argument list increases the potential for user error and complicates the API specification without any real added utility.

The real source of messiness is not that the return value can be overloaded with error information if something goes wrong (a well-established convention). Rather, the problem is that the API exposes the literal value (-1) of the exceptional return codes, rather than abstracting them behind macros. In error conditions the API can return the macro LLAPI_ERROR. Similarly, macros LLAPI_USE_ALL_OSTS and LLAPI_MDT_CHOOSES_OFFSET can be use in place of -1 to indicate those special cases.

But there is naming asymmetry. The 'set' variant includes the 'set' string in its name, the 'get' variant does not. This is not good.

I am willing to makes the names symmetric as you suggest.

Comment by Andy Nelson [ 28/Aug/13 ]

More responses from me later, but one quick one now:

I wrote:

llapi_layout_stripe_size needs to have its get/set value be a 64bit integer. As far as I understand, even the current API permits stripe sizes up to 4GB and these would overflow a 32bit integer. As filesystems get bigger, I can imagine someone somewhere wanting bigger stripes than this too.

You responded:

Lustre stipe size is a 32-bit type with a maximum valid value of 2GB. See the definition of lov_user_md in /usr/include/lustre/lustre_user.h.

and now I respond, to this with:

The current documentation on the old API has this to say in section 32.3.2:

stripe_size "Specifies stripe size (in bytes). Should be multiple of 64KB, not exceeding 4GB.

and this to say in section 18.2.1

"The maximum stripe size is 4 GB"

So is this documentation wrong, or is the new Lustre in versions >2.4 and its llapi reduced in functionality from the Lustre (the manual I got this from claims to be for version "2.x") and the old API?

Comment by Ned Bass [ 28/Aug/13 ]

I was mistaken, the maximum stripe size is 4g - 64k, but the type is 32-bit and always has been, AFAIK. This is not a new limitation in Lustre 2.4. The documentation is inaccurate.

Comment by Andy Nelson [ 28/Aug/13 ]

How is (4G-64k) possible with a signed integer type?

Comment by Ned Bass [ 28/Aug/13 ]

Good point. It should be type size_t.

Comment by Andreas Dilger [ 30/Aug/13 ]

Andy, thanks for your input. We often don't get enough feedback from end users, so I thank you for the time spent to write down this information.

Some comments:

  • Function return codes: I've got mixed feelings about this. I think as long as fetching the layout from the kernel works without errors, then extracting the actual values from the structure shouldn't be able to fail?
  • 4GB stripe_size limit: I agree this is a limit of the current kernel data structure, but if we are working on a new API it makes sense to use an explicit 64-bit value (not size_t) for this API. If someone specifies a value that is too large for the current data structures it can always return -EOVERFLOW or similar. I don't think the API needs to be too tied/limited to the current implementation, especially since we will be adding new types of layouts in the near future (mirrors, data on MDT, etc).
  • API for OST count: I suspect that it might be possible to implement a non-ioctl method find this information by digging around in /proc, but this would itself need to map the supplied path to a specific Lustre filesystem by comparing the supplied path to the mountpoint in /etc/fstab, which is neither robust nor efficient.
  • Symmetry between get/set: I'm also a fan of symmetry between function names in APIs (get/set, get/put, add/del, etc), so I'm also in favour of making these match.
Comment by Andy Nelson [ 30/Aug/13 ]

Andreas,

I've got more comments in draft form, but as far as "fetching the layout from the kernel
works without errors" goes, I have one question:

Each of the get calls takes, as an argument from the user, the layout pointer. How are
you going to handle the case when the user passes in a null/invalid pointer? This could easily
happen because someone has code where the layout 'setup' is simply bypassed or forgotten
or whatever.

It is a very bad answer to say "the code will go "splat!" somewhere in the lustre library function, rather than simply handling the user footgun case and returning an error code that
tells them they screwed up.

Comment by John Hammond [ 03/Sep/13 ]

Andy,

Thanks again for your input. Can you give some more detail about which high level operations you would like? Especially any you see missing here. I expect that you want:

  1. Functions to determine if a file descriptor or path (FDoP) belongs to a Lustre FS.
  2. Functions to get the stripe size and stripe count for a FDoP.
  3. Functions to determine the max and min stripe size and count for files on a given Lustre FS.
  4. Functions to create files with given stripe size and count.

What would you add to this list?

I am not familiar with the function shipping implementation used on BG. (But I believe that I can imagine how it works.) Could you provide a short description? Based on Ned's patch I assume that functions based on extended attributes are much better suited to function shipping that those based on ioctls(). (Again I can imagine why this might be.) Is that so? Are there any unusual limitations of function shipping that we should account for here?

Comment by Christopher Morrone [ 03/Sep/13 ]

I am not familiar with the function shipping implementation used on BG. (But I believe that I can imagine how it works.) Could you provide a short description?

The compute nodes (CN) of BG/Q runs a light-weight OS. It provides many, but not all, of the common library and system calls found on Linux systems. The calls are basically converted into RPCs. The parameters are sent over the network to a node (called the I/O Node (ION)) that really runs Linux. The function/system call is made there on behalf of the user. When the function returns, the return code, error code, and associated buffers are copied back to the CN that made the call.

The extended attribute functions have the pointer to the associated buffer and the buffer size clearly listed in the function parameter list. That makes it easy for the function shipping system to copy the complete buffer back and forth between the nodes.

ioctl(), on the other hand, has special behavior for every ioctl request number. There is no single standard, which means that there is no generic function shipping code that can be written to handle all ioctl() calls. Special purpose code would need to be written for many of the ioctls. Therefore most ioctls, including all of the Lustre ioctls, are unsupported by the function shipping system.

Comment by Andy Nelson [ 03/Sep/13 ]

John,

Chris filled you in on the function shipping stuff better than I could. I'll just note that this type of thing is not at all specific to BG machines. Cray also has such for example, and I would guess that there are others. Basically any machine that runs some sort of reduced OS kernel is susceptible...someone has to make a choice somewhere along the lines of "what is in/what is out" and there is always something that someone wants that ends up on the wrong side of the line.

The rest of your email I'll get to shortly.

Comment by Andy Nelson [ 03/Sep/13 ]

All,

Rather than respond to Ned's (and others) comments in one big response where things get lost,
I'll cut things into one topic at a time chunks. Here is a first chunk, re the layout_t thing:

"The llapi_layout_t * points to user-space memory, and we're dealing with
C here, so a user can scribble over the data no matter how many layers
of abstraction are added. A file descriptor-like object would add
complexity and overhead for no real additional protection. However, to
help detect corruption we may want to add an internal magic value to the
layout."

Re scribbling: that is true no matter which language if its in user
space, but the point I wanted to make is that it is very directly
available for scribbling if the pointer is not obfuscated in some
manner. Just write to what it points to and you've done it: even though
the pointer is still valid, the data it points to no longer are. I don't
know what this magic is that you are talking about, but if it adds to
the difficulty of scribbling all to the good. Anyway, I am willing to
live with it the way it is, but brought it up as a point of danger for
the robustness of a public interface.

Comment by Andy Nelson [ 03/Sep/13 ]

Re: ioctls:

It appears that at least some of the "new" API still will require ioctls (e.g. the
obd count function Ned notes)?

I thought that the whole point of the new API was to get rid of them altogether, due
to the implementation difficulties on various arches...

Comment by Andy Nelson [ 03/Sep/13 ]

"llapi_layout_by_path: "likely, but not guaranteed to fail" on non-Lustre
filesystems???? It is unacceptable to provide guarantees that aren't
guarantees like this. Make it fail all the time on non-Lustre
filesystems.

It is explictly not a guarantee. In other words, don't rely on this, and
instead use llapi_search_fsname() (which is admittedly undocumented, but
we can provide an example). This lets the application check once on the
directory, rather than the library blindly checking on every single file
lookup, which could get rather expensive."

Actually it is a guarantee: It is a guarantee that the result of the
function cannot be relied upon to be correct. It is not acceptable for
any code, particularly a library, to generate unreliable output.
Guaranteed correct and slow beats fast but unguaranteed and possibly
invalid, every time. The function MUST fail when given non-Lustre
information...what does a layout even mean for a non-Lustre path?

Comment by Andy Nelson [ 03/Sep/13 ]

"llapi_layout_stripe_size needs to have its get/set value be a 64bit"

I'm glad you and Andreas now see this as a problem and will change to a 64bit type.

Comment by Andy Nelson [ 03/Sep/13 ]

Re: lfs/llapi and such

"Actually, lfs uses llapi to implement its functionality. The problem is
that llapi is largely undocumented, and if you think the proposed
extensions are bad you would be horrified by the existing API."

...well I guess you've got your work cut out for you then

"What we are trying to do here is implement a sane, portable, and
well-documented set of functions to let users read and configure
striping attributes. Overhauling and documenting the rest of llapi is
sorely needed but well beyond the current scope."

ok, then my comments should be taken as an indicator that some
additional thought needs to be put into which parts to document and
which parts to omit. The point behind my comment was basically that some
of the llapi functions in the man page require information that is not
available from any of the other functions that are listed in the man
page, and other information that is omitted, but is needed to make the
functions that are there actually useful.

It seems to me that John Hammond (comment above) was also pushing in this
same direction--"what should the API look like?" I've got several different
responses to this, which will follow.

Comment by Andy Nelson [ 03/Sep/13 ]

"I see no function for getting the number of OSTs in the filesystem.

This is the API function. It uses an ioctl() that IBM claims to have
implemented function-shipping support for on BGQ. However, it didn't
work in my tests, so we need to continue working with them to fix it.

/* @mnt - contains a path within the Lustre filesystem.

  • @count - value returned here
  • @is_mdt - get number of OSTs if 0, else get number of MDTs
    */
    int llapi_get_obd_count(char *mnt, int *count, int is_mdt);"

Some function that tells me how many OSTs (or whatever other things) I have
available in the filesystem I'm running on is essential to being able to set
that number in the functions I use to set the stripe count.

The only other way to get that information (a messy hack) from where I work
would be to create a temporary file with "-1" stripes (aka let the fs stripe
it over everything) and then query for how many stripes it got striped over.

Comment by Andy Nelson [ 03/Sep/13 ]

"Near as I can tell, the llapi_layout_by_fid takes as its an argument
some sort of lustre internal-ish name (the 'fid')...but there is no way
to get this from any llapi function. Need to have that, if
llapi_layout_by_fid is to be useful to anyone.

A function exists:

int llapi_path2fid(const char *path, lustre_fid *fid);
However, it relies on ioctl() so isn't portable. Typical users need not
be concerned with lustre FIDs. They are primarily used by developers,
filesystem monitoring software, or Hierarchical Storage Management (HSM)
systems. It is the exactly the type of internal implementation detail
that a proper API should hide from users. I did not include this
function in my original design, but an Intel reviewer requested it."

Hmmm...that indicates that there are multiple customers of this API,
with very different needs. Would appear to be important to get a
list of requirements from that other community as well as the one
I am trying to represent then...exactly the sort of things that John
Hammond was just querying me for, for that other community.

For the kinds of stuff I do and for what I need, the FID stuff is
not useful, and is something to remove entirely. But if that other
community needs it, then I'd claim that there needs to be more in
the API than there currently is published, in order to make it at all
useful.

Comment by Andy Nelson [ 03/Sep/13 ]

"llapi_layout_pool_name[_set]: I am not sure whether or not pools can be
defined from userspace. Is it correct that they can be?

No, assuming that by "from userspace" you mean "by an unprivileged
user". Currently only system administrators can manage pools in Lustre.

Also, it would be of some benefit to have some understanding of what a
pool actually is, why I might want one, and how to apply a pool to my
IO. At least somewhere if not in the man pages. I'm a bit confused on
the concept right now.

Quoting the Lustre manual: "OST pools allows the administrator to
associate a name with an arbitrary subset of OSTs in a Lustre cluster. A
group of OSTs can be combined into a named pool with unique access
permissions and stripe characteristics."

LLNL does not use pools, and AFAIK it is not a commonly used feature in
general."

This pool stuff exactly the sort of stuff I'd leave out of the API, if it were only
for me. But if it is useful to other communities, then again, I'd say it needs to be
expanded to some extent, in order to expose a self consistent set of functionality.

Comment by Andy Nelson [ 03/Sep/13 ]

Below is a list of the current API functions as exposed in the llapi_layout
man page on the LLNL machine "vulcan". I have put in comments on each. The
oteher half of this comment is a "whats missing" statement. I'll put together
something like that asap.


llapi_layout_t *llapi_layout_by_path(const char *path);

llapi_layout_t *llapi_layout_by_fd(int fd);

These two functions are each requirements to have in an API that I use.

llapi_layout_t *llapi_layout_by_fid(const char *lustre_dir,
const char *fidstr);

This function requires a fidstr pointer, which cannot be gotten from anything
in the current API. It is therefore useless to include it in the API until
that defect is fixed. Further, I do not expect it would ever be a useful
thing to me, though Ned suggests that it might be to various fs monitoring
tools. You'll have to get that information from that user base though.


llapi_layout_t *llapi_layout_alloc();

void llapi_layout_free(llapi_layout_t *layout);

These are useful/required functions under some conditions. There is a problem
however, in that it is not clear to me whether the opaque blob of layout data
is already populated with something, or whether it is something I have to set
by various explicit llapi calls. If the latter is the case, then how do I
know what calls are required? And what happens if I don't set everything I
have to (and how could I ever ensure that I set everything anyway, with
an opaque blob?)? In other words, what happens when I do this:

my_layout = llapi_layout_alloc();

rc = llapi_layout_stripe_count_get(my_layout,count);
...

(pointer-ify as needed to get the syntax correct-my C is terrible):


int llapi_layout_stripe_count(const llapi_layout_t *layout);

int llapi_layout_stripe_count_set(llapi_layout_t *layout,
int stripe_count);

int llapi_layout_stripe_size(const llapi_layout_t *layout);

int llapi_layout_stripe_size_set(llapi_layout_t *layout,
int stripe_size);

int llapi_layout_pattern(const llapi_layout_t *layout);

int llapi_layout_pattern_set(llapi_layout_t *layout, int pattern);

Must have all three pairs of these functions. The layout pattern function
is currently sort of meaningless though since lustre only has raid0, but
perhaps someday it'll grow some other patterns.


int llapi_layout_ost_index(const llapi_layout_t *layout,
int stripe_number);

int llapi_layout_ost_index_set(llapi_layout_t *layout,
int stripe_number, int ost_index);

const char *llapi_layout_pool_name(const llapi_layout_t *layout);

int llapi_layout_pool_name_set(llapi_layout_t *layout,
const char *pool_name);

I cannot imagine any use for these functions and would remove them if it
were my API to design.


int llapi_layout_file_create(const llapi_layout_t *layout,
char *path, int flags, int mode);

Something like this is a requirement...have to be able to create files according
to the stripe characteristics I want to set.

I'd change the order of the arguments though, to match the order of open(2) with
the layout thing tacked onto the end.

Comment by Andy Nelson [ 03/Sep/13 ]

"The current error handling interface is not well put together. Get

This is the biggest flaw in the current design from my perspective.
...

In my view, the only good case for returing data via a pointer argument
is to get a compound data type like a struct stat. In our case we are
dealing with scalar integer types, and I don't forsee the need to change
them to compound types in the future. After all, the point of the API
was to abstract away the compound type struct lov_user_md behind simple
accessor functions. Adding a pointer to the argument list increases the
potential for user error and complicates the API specification without
any real added utility."

As I understand your arguments, there are basically two reasons for
having the function return as its return value (i.e. not as an argument),
the thing it is getting:

1) "Adding a pointer to the argument list increases the potential for user
error and complicates the API specification without any real added utility."

I would rebut this by saying that in my experience this is exactly the opposite
of what would happen because the return'd value is not overloaded with
two different meanings for valid/invalid result. The return'd value has only
success/fail information, while the argument has only the gotten datum.
There is never any ambiguity of which is which. This is particularly true in the
use case you advocate: expressions--see point 2 below.

As far as complicating the API, it actually simplifies it by making the get/set
functions be symmetric.

2) People might want to use the return'd value in an expression as you wrote in
other email to me.

In other words, you want to enable this kind of thing:

bytes_in_a_full_stripe = llapi_get_stripe_count(layout)*llapi_get_stripe_size(layout);

But what happens when you give the 'get' functions an invalid layout?
Suddenly, the user code has this expression that uses an error return code
rather than a valid value to calculate something. Then they end up with some derived
variable with a nonsense value...say the return code is '-1'...then the
value of 'bytes_in_a_full_stripe' possibly ends up negative, which ain't good. The
user code ends up going on with its life, only to train wreck at some completely
other place when that invalid data is used for something down the line. Check
the error code before going on you say? Can't do that in an expression! Only
in an out of line context, which completely nullifies your starting argument.
Better to design in the guidance to generate an applicaiton code error on the
spot, rather than later.

And what happens when you have two or more of these functions in the same
expression? Even worse, if someone uses 'layout1' in one of the functions
and 'layout2' in the other and only one of them is invalid? Presumably errno
gets set, but to what? The error it should have for the first function in
the expression or for the second? (why would someone do this? Helifino, but
I can speculate about trying to build optimal IO parameters or something).

Better to design out this use case.

Also, I reiterate here my argument about how I would use the gotten data:
immediately load it all into my own filesystem independent structure so that
I can deal with all different filesystems in a portable way. This makes
a return value version superfluous: all the calculations are done in the
fs independent parts of the code, abstracted completely away from the llapi
functions you presume will be in those expressions.

"The real source of messiness is not that the return value can be
overloaded with error information if something goes wrong (a
well-established convention)."

Perhaps in some contexts, but I'd claim your argument about 'using
the return'd data in an expression' points to the fact that the
convention is not universally valid...

"Rather, the problem is that the API
exposes the literal value (-1) of the exceptional return codes, rather
than abstracting them behind macros. In error conditions the API can
return the macro LLAPI_ERROR. Similarly, macros LLAPI_USE_ALL_OSTS and
LLAPI_MDT_CHOOSES_OFFSET can be use in place of -1 to indicate those
special cases."

I don't really understand what this means. So what if the literal value
is provided? So what if its a macro? I don't get what implications
you are getting at here.

Comment by Ned Bass [ 05/Sep/13 ]

John and Andreas, do you have a preference as to getting data via return value versus pointer argument? Valid arguments can be made for and against either style, so I'm happy go with the consensus as long as we're consistent throughout the API.

Comment by John Hammond [ 05/Sep/13 ]

Ned,

Either is fine be me as long as we're consistent and unambiguous about the error cases. I'd still like to see a list of the high-level operations that are needed from this simplified API. Much of what exists in llapi is there to support the administration and testing of Lustre and is unlikely to go away or be reformed at this time. We can however develop a simplified API (in it's own header or just highlighted in some way) tailored to application developers. To do that however, we need to enumerate the high-level use case operations(e.g. get stripe count of file, get stripe size for file, get max stripe count for FS, ...). Then we questions about ioctl vs xattr, return method, ... will be easier to answer.

Note also that depending on the operations and the desire for backwards compatibility, we are not necessarily restricted to using the existing ioctls and xattrs. We could instead define a ioctl with no buffer but that just returns a single quantity (stripe size, count, ...) for the file or FS. This approach would bring fewer issues about function shipping, byte swapping, .... I assume that no real scientific applications use the lmm_objects[] part of lov_mds_md_vx and so if we can make thing easier by not handling it then all the better.

Operation that would seem to require ioctls may not. They can be replaced with virtual xattrs and probably other low-level implementations.

Comment by Christopher Morrone [ 06/Sep/13 ]

We can however develop a simplified API (in it's own header or just highlighted in some way) tailored to application developers.

I think is a worthwhile discussion, but too broad in scope for this ticket. For this ticket, we need to remain focused on the final details of the llapi_layout functions so we can get this finished by the end of the month.

Comment by Andy Nelson [ 07/Sep/13 ]

John,

Here are some direct responses to your queries.

• Functions to determine if a file descriptor or path (FDoP) belongs to a Lustre FS.

These are already available by way of fstatfs/statfs…which has an entry for filesystem "super magic". Of course,
I wouldn't argue against a more standard/user friendly/universally implemented thing than that
super magic stuff (which is rather a mess if you go and google for definitive ways to query
filesystem type--just dive into a function called 'human_fstype' which lives deep in the bowels
of the gnu core utils package if you have any doubts), but I think what exists is sufficient for my needs.

• Functions to get the stripe size and stripe count for a FDoP.

Yep, and any other fs parameters, such as raid type (though currently lustre can only do raid0 of course,
it might grow something else in the future).

• Functions to determine the max and min stripe size and count for files on a given Lustre FS.

Yep. Presumably the max for count is the filesystem max…basically the number OST's or some such?
This one is certainly a requirement, as I wrote above. Others are similarly important, in terms of being
able to determine things like "what constraints do you have on values of stripe size or similar quantities,
that I need to be aware of when I choose my file/raid/IOsize parameters?". Some versions of statfs contain
an entry for 'optimal transfer size', which has rather ambiguous meaning in my experience. It would be of
some benefit to have something of that sort that was actually useful. Typically though, it is not
reliable enough in terms of its meaning on different fs's to use by itself.

• Functions to create files with given stripe size and count.

Yep. Also, assuming lustre ever grows something beyond raid0, something to be able to set/get
raid type (this exists in Ned's current draft).

What would you add to this list?

I'd say you covered what I need pretty well. There may also be a few other sorts of functions which
would be useful to return things like the fs block size and similar information. These would be for
those trying to do O_DIRECT, which has constraints on IO sizes and alignments and such. In the
little bit that I've tried O_DIRECT on lustre (mounted through some network interface as it typically
is), it doesn't seem to be very effective for me though, in spite of the fact that I do all of my own
buffering and the lack of a copy through kernel space "should" be superfluous. Have never followed
up on that, but anyway, I mention the need because someone else may have different experience.

The other thing that might be of some future importance (not now), would be if there suddenly
got to be some other sort of storage medium, like SSDs or some such. Assuming such possibilities,
they are likely to have some other odd constraints that would be useful to know about in an application.
What I would want at that point is obviously unclear, but the point is to make sure that the api is
extensible in a sensible way to make it possible to get/set such things as are relevant then.

Things I specifically don't need or want are things that have to do with "is this OST full or not" and
"start the file on OST number # and round robin from there" etc. All that stuff to do with pools too.

Cheers,

Andy

Comment by Ned Bass [ 02/Nov/13 ]

I've attached manual pages for the revised API based on the discussion above. I've combined functions in a single man page where logical, i.e. free and alloc, _get and _set pairs, etc. Please review the new specification and provide feedback.

llapi_layout.txt
llapi_layout_alloc.txt
[^llapi_layout_by_fd.txt]
llapi_layout_file_create.txt
llapi_layout_ost_index_get.txt
llapi_layout_pattern_get.txt
llapi_layout_pool_name_get.txt
llapi_layout_stripe_count_get.txt
llapi_layout_stripe_size_get.txt

Comment by Ned Bass [ 02/Nov/13 ]

To highlight some of the changes from the previous version:

  • _get and _set naming symmetry.
  • Attribute values are returned via pointer arguments, not function return
    values.
  • Added a llapi_layout_file_open() counterpart to
    llapi_layout_file_create(). _create() is just a wrapper to _open()
    with open flags O_EXCL|O_CREAT forced on.
  • llapi_layout_pool_name_get now copies the pool name into a
    user-supplied buffer, rather than returning a pointer into the
    layout.
  • Separate man pages.
  • Use 64-bit types for all integer attributes.
  • Added macros LLAPI_USE_FS_DEFAULT and LLAPI_USE_ALL_OSTS to use in
    place of "magic" meanings of 0 and -1. The library will internally
    translate these as appropriate.
  • Explicitly verify in llapi_layout_file_open() that path is a Lustre
    file. Same for llapi_layout_by_path().
  • Verify in llapi_layout_file_open() if the layout uses an OST pool that
    it exists and is not empty.
Comment by Andy Nelson [ 05/Nov/13 ]

Hi Ned et al.,

Here are some comments on the new API draft. Overall, this version looks
very, very much better than the previous version that I saw a couple of
months ago. Good work there, and thanks for that.

Cheers,

Andy

1) llapi_file_create/open

I thought there was discussion to the effect that the layout argument would
be moved to be the last argument in the call rather than the first. This
ordering would suit me a lot better in terms of its symmetry with open(2).
Then it would be exactly the same ordering, except for one more argument
on the end.

2) It is implied by the code shown in the example in the llapi_layout manpage, but
it should also appear and be noted explicitly with some verbiage in the
llapi_file_create/open man page too, that you can close a file created/opened
with the llapi functions with a close(2) call.

3) Point out in a few more relevant places that llapi_layout_t points to an opaque
entity. For example, I found that comment in the overall llapi_layout page, but
not the llapi_layout_alloc page. In this particular case, being redundant is a
good thing. It short circuits the question people will ask about "what is in that
entity?", and the inevitable fruitless searches for that information.

4) In the llapi_layout_pool_name_get/set page, it says that pools can be specified
by way of lctl if you are a sysadmin. It wasn't clear whether or not pools can
also be specified by this get/set pair, and if not, what the purpose of these
functions is otherwise. Also, iiuc, pools can be manipulated to have arbitrary
ost inventory. It is not at all clear to me how to set that, given the interface
that exists...it looks like I (as a sysadmin) could only specify a pool name
and that it had n OSTs in it, by way of previously calling llapi_layout_count_set
on some layout that I later give to the pool function. So I don't really see
how to fully use these yet. Of course, I can't see a reason for me (app guy)
to use them at all, but I'm commenting on behalf of some anonymous sysadmin
down the road who might want to.

5) I still see no function for getting the filesystem dimensions, in terms of
number of OSTs etc. If I may predict your answer (that it needs ioctls and
those are hard to make work everywhere), I will respond "yes, I understand, but
I still want this functionality when you can arrange it". For now, I suspect
there is a workaround, though clumsy. Namely, to create a test file with
a layout such that the number of stripes is set to LLAPI_USE_ALL_OSTS and then
query the file to see how wide it is.

Comment by Ned Bass [ 05/Nov/13 ]

Thanks for the feedback Andy.

I thought there was discussion to the effect that the layout argument would be moved to be the last argument in the call rather than the first.

You did make that suggestion, but I'm not convinced it's the right way to go. Consistency within the API should take precedence over consistency with external library functions. The llapi_layout_t handle is the unifying element across the API, so in my view it is natural for it to be the first argument for all functions that require it.

[state] in the llapi_file_create/open man page that you can close a file created/opened with the llapi functions with a close(2) call.

OK

Point out in a few more relevant places that llapi_layout_t points to an opaque entity.

OK

... OST pools ...

The current wording seems to be generating a lot confusion around OST pools. I'll try to make it more clear, but to avoid cluttering the man page I'll defer to the Lustre Operations Manual to provide the detailed background material.

I still see no function for getting the filesystem dimensions,

I'll tackle this problem in a separate patch.

Comment by Ned Bass [ 05/Nov/13 ]

Man pages revised based on above feedback, and other minor cleanup.

llapi_layout.txt
llapi_layout_alloc.txt
[^llapi_layout_by_fd.txt]
llapi_layout_file_create.txt
llapi_layout_ost_index_get.txt
llapi_layout_pattern_get.txt
llapi_layout_pool_name_get.txt
llapi_layout_stripe_count_get.txt
llapi_layout_stripe_size_get.txt

Comment by Andy Nelson [ 05/Nov/13 ]

Hi Ned,

Here is one response to our remaining point of discussion,
about the llapi_file_create/open argument ordering. In terms of
a global picture, the issue is rather small I think, so things
are looking pretty good there. In practice, I expect I will be
able to live with either of the orderings, but want to make my
thoughts known now, for consideration re why I still think its better
my way.

Andy

You see the ordering as better the way it is now, with the
layout argument first, based on API consistency and such.

My argument otherwise, is basically twofold:

1) There are actually two APIs that have to be considered here. One
is the LLAPI semantics (layout first) in which llape_file_create/open
are directly a part, and the other is the system API, which
you see as an external library.

More than any other functions, the open/create functions
can be thought of as a bridge between those two APIs--you can't use
any of the other system API calls (read/write/seek/close/etc) without
the llapi_file_open/create calls and, in that sense, they play a
substitute role for the open(2) system call. That makes the question
of their ordering a bit less clear than "I'm being consistent with the
API". The question has to be a bit broader in the sense of "which API
should I retain most consistency with?" The system API of which the
open call is a part and which the llapi overrides with its own
routines, or the rest of the llapi functions.

My vote is still with the system API, largely to strengthen the
bridge and for this additional reason:

2) In my mind, the layout argument plays a very similar conceptual role
to the 'mode' argument. The latter sets various file permission
characteristics. The former sets various filesystem characteristics.
In words and thinking as I'm reading the function itself, the argument
ordering then becomes

1)open 'filename'
2)with conditions on my access (read/write/whatever) given by 'flags',
3)and with permissions for my and others' future access given by 'mode'
4)and with filesystem characteristics given by 'layout'

Basically, the 'open' and the 'filename' strings in the code,
appearing adjacent to each other gives me a very big readability
lift as I go along, about what I'm doing. In terms of pseudo-English
grammar, it looks like this to me

do (what) to <this> with <conditions>

Separating 'do what' from <this>, breaks that grammar and readability
connection for me. More specifically, I read the layout first
ordering that you prefer, as

do (what) with <conditions> to <this> with <more conditions>

That isn't how I'd say it in words--its grammatically more complex.

The get/set functions have a rather different sentence for me,
that does feel right with the layour argument first in mind.
Basically this sentence:

For <this> tell me about <that> characteristic
or
For <this> specify <that> characteristic

Comment by Ned Bass [ 05/Nov/13 ]

Andy, I can see you point so I've made the change:

llapi_layout_file_create.txt

Comment by Andy Nelson [ 05/Nov/13 ]

with that change, and in the sense of the lkml patch submission process,
I will say for the API as I've seen in in these man page patches:

Reviewed-By: Andy Nelson <andy.nelson@lanl.gov>

When can I expect to see this implemented on sequoia?

Thanks,

Andy

Comment by Ned Bass [ 06/Nov/13 ]

Andy, a compute node build is now installed on vulcan, rzuseq, and sequoia, under /usr/local/tools/liblustre. We haven't updated the packages installed on the LAC nodes yet, so you'll have to refer to the man pages attached to this issue for now.

On vulcan, I was able to build and run the example program from the llapi_layout man page like this:

$ mpicc -Wl,-rpath=/usr/local/tools/liblustre/lib -L/usr/local/tools/liblustre/lib -I/usr/local/tools/liblustre/include -llustreapi -dynamic -o liblustre_test liblustre_test.c
$ srun -p pdebug -N 1 ./liblustre_test /p/lscratchv/`whoami`/`mktemp -u  XXXX`

Let's take further discussion unrelated to API design to an email thread, if needed.

Comment by Ned Bass [ 20/Nov/13 ]

Andy, you may want to weigh in here. Andreas made this comment in the patch review system about llapi_layout_file_create() semantics. Do you have a preference regarding the current behavior versus what Andreas proposes?

       /**
        * Create a file with the specified \a layout with the name \a path
	* using permissions in \a mode and open() \a flags.  Return an open
	* file descriptor for the new file.
	*/
	int llapi_layout_file_create(const char *path, int flags, int mode,
                                     const llapi_layout_t *layout)

This is inconsistent with the llapi_file_create() call, which just creates the
file but does not actually return the open file handle.

I don't think it makes sense to have this extra call just to avoid the
application needing to pass O_CREAT|O_EXCL to llapi_layout_file_open(),
but it is useful to have a version that does not return the open file handle.
Comment by Andy Nelson [ 20/Nov/13 ]

Hi Ned,

I've gotten sidetracked on other things, so haven't fully implemented the new
api in my code yet, so that's why I went dark all of a sudden.

As to Andreas's comments:

I cannot figure out any reason why I might want to "have a version that does not return the open
file handle" as Andreas writes. What use would that be, since I can't do anything with it, afaict?
I never even understood whether or not it even created the file I asked for, from what I could
make of the documentation etc in the old API.

Panasas has a similar thing in their API, which I also never understood. It is just needless
obfuscation and hoop jumping from where I sit, but I suppose I could be missing a part of the picture.

As to having both llapi_layout_file_create/open, I like having both. That way makes it more nearly
symmetric with the open/creat system calls. If there is an unstoppable effort to delete
one or the other (llapi_layout_file_create/llapi_layout_file_open), I'd actually go exactly
the other way, and zap the 'open' version. The reason is that it makes no sense to pass layout
stuff to an already existing file, which is the thing I would use the open version for. In fact
those arguments are documented to be ignored except for create, if I remember correctly. And
without the creation characteristics, why wouldn't I simply use a plain old 'open' call anyway?

As to the open(2)/creat(2) calls, I just noticed another small skew between those calls and the lustre
ones...the llapi_layout_file_create call adds O_CREAT|O_EXCL to the flags and then goes on
to pass those in to the llapi open function, while the creat(2) call adds O_CREAT|O_WRONLY|O_TRUNC,
as documented in the open(2) man page I'm looking at on a linux box here. Why the difference?
Just looking at that, I can't see a specific reason why that skew makes sense, but I could
easily be missing something. Something to fix? No? hmmm...

All that said, again I'll say that it makes more sense, and I prefer from a 'fullup API'
standpoint, to keep both.

As for looking the same as the old lustre api...the more different the better. I'd claim having two
APIs with almost the same appearance, but which are still quite different in actuality, is a very bad
thing. It leads to confusion. It leads to bother and it leads to maintenance headaches for me downstream.
The sooner you get rid of the old API, after instituting the new one, the better it will be for me.
I won't have to maintain two different code interfaces to it, to account for 'this machine has the
new thing, but that machine is still on the old', for nearly as long as if you keep them both
around for a long time.

Comment by Ned Bass [ 20/Nov/13 ]

Thanks Andy. We have some discrepancies between what you think is useful versus what the patch reviewers at Intel think. I provided both open() and create() interfaces to try to appease both camps. I figured what you wanted was a guarantee that your file was created with the requested layout on a successful function return. That's why I chose those particular open() flags, as compared to creat(). O_EXCL because, as you say, the function doesn't make sense for existing files. Not O_WRONLY because I figured you might want the flexibility to do reads, but I'm happy to add it if you think consistency with creat() is more important. Not O_TRUNC because it doesn't make sense to combine with O_EXCL.

On the other hand, some patch reviewers took exception to forcing those flags on in llapi_layout_file_open(), preferring to leave it up to the application. So, I settled on the providing both versions, which seemed like a reasonable middle ground.

Comment by Andy Nelson [ 20/Nov/13 ]

Seems right to me. I expect the create function to create the file for me with the layout I specified.
If it can't do that, it should return an error. One such error would be if the file already exists
because it will have some preexisting layout which may or may not be the one I wanted to specify. In
that case, it is entirely correct, in my mind, to return a "Can't do that" error. Perhaps this is
a point to clarify in the man page, re the create behavior...that it will fail if the file exists, because
the layout stuff can't be guaranteed. It is there implicitly because you note the O_EXCL flag, but
that doesn't hit me over the head quite so hard as is needed compared to an explicit, English text
statement of the same thing too.

As to the O_WRONLY and O_TRUNC etc, I have no strong opinions about that. Just noticed it when I was
reading the man pages very closely as I put together my response above. You might perhaps want to
put in a small note in the man page about the fact that the flags added and passed from the llapi
create to the llapi open calls are slightly different than what are passed from the creat(2) to
the open(2) call, just so that the skew is explicitly noted, but even that may be more than is
really needed or important to have in the man page.

Comment by Ned Bass [ 20/Nov/13 ]

Andy, also please be aware the names of the macros LLAPI_USE_FS_DEFAULT and LLAPI_USE_ALL_OSTS may change to LLAPI_LAYOUT_USE_FS_DEFAULT and LLAPI_LAYOUT_USE_ALL_OSTS. I left out LAYOUT to keep the names reasonably short, but Andreas requested the longer names for better consistency.

Andreas, are you willing to keep the llapi_layout_file_create() interface as-is, in light of Andy's comments? What use case do you have in mind for a version that doesn't return an open file handle?

Comment by Andreas Dilger [ 16/Jul/14 ]

A concern I have about the current patch that James thought would be better discussed in this ticket is the amount of layout validity checking that is part of the API. While some basic sanity checking is desirable in any API, it seems to me that there is now an excessive amount of sanity checking in userspace, which all needs to be replicated in the kernel anyway (because the kernel cannot depend on userspace to be correct and only populated with non-malicious users).

A number of the sanity checks are "expensive" in that they open external /proc files and iterate over e.g. all mounted filesystems to check if the pathname belongs to a Lustre filesystem, or all OSTs to verify the OST pool specification, which can impact performance when creating millions of files or concurrently by hundreds of threads on a system with thousands of OSTs.

My proposal is to have a separate function like llapi_layout_verify() or similar that can be called once to verify a layout after it has been created, and then the layout can be used repeatedly to create files. It could (optionally) store a flag in the opaque layout structure to indicate if the layout has been verified and this could be done automatically on the first use of the layout, and/or set by the implicit verification done by the kernel returning success.

I think this handles both the desire for applications to get detailed error feedback if necessary, while not adding permanent runtime overhead to a commonly-used function. Applications which only care about success/failure can omit all checking and just get an error back from the kernel.

Comment by Ned Bass [ 16/Jul/14 ]

I think this is a reasonable approach. Regarding a "verified" flag, we'd also need to store which filesystem(s) the layout was verified against. I don't see enough utility in saving the verification status to justify the added complexity, however. I say let the application track verification status.

I also considered using a caching strategy to mitigate the validity checking overhead. The library could remember Lustre mount points, pool membership, OST lists, etc. But I don't care for the complexity of that approach either.

Comment by James A Simmons [ 16/Jul/14 ]

Should it be

int llapi_layout_verfiy(const struct llapi_layout *layout, char *fsname)

or

int llapi_layout_verfiy(const struct llapi_layout *layout, char *path)

I personally tend to favor the fsname. approach. Once verified against a file system that layout can be used for any path.

The other option is to change llapi_layout_create to have fsname has a option in creation. Of course the draw back is some layouts are created with only using the fd .Also we want to expose the validation function.

Comment by Ned Bass [ 16/Jul/14 ]

It should take a path argument. Users should not have to bother figuring out the fsname from a path. They will just want to know if the layout is valid for the directory they're writing to.

Comment by Andy Nelson [ 16/Jul/14 ]

I agree with Ned. I seem to recall times in my past when the name of the filesystem was obscured in some odd way, so that I could
never figure out what the actual, definitive name for the filesystem was. Things like this:

/scratch8/afn/blah

being an ailas for something like this:

/lustre/scratch8/afn/blah

and so forth, done by the systems folk, always confuse me. They will confuse others too, I am sure.

As far as the create function taking an fsname argument, I don't like that for exactly the same reason. Add to that,
also the fact that the call loses its symmetry with the normal system file create call. Search above for my comment
of 6:34 Nov 5 2013, for what I mean about symmetry stuff.

All that said, putting a "verified" flag into the opaque layout structure seems a good idea with one big flaw. Namely,
that that structure, opaque or not, is still in the hands of a user, who may be malicious, or worse, inept. The latter
adjective applies to me, quite often. There are frequently times when I pass around some pointer and end up scribbling on
its data because of some call/callee error in dereferencing or some such.

In consequence, the layout structure can get corrupted in whole or in part, and its the 'in part' portion that is
the biggest problem. What happens when you have a partially corrupted layout, which still has the 'verified'
portion of it, uncorrupted? Then llapi goes ahead with the incorrect idea that the layout is valid, while it isn't.

Splat.

Or it has to go do all the verification over again, which is what the whole "verified" flag is supposed to short circuit.

Given that, is it a better option to have the layout thingy (whatever its form), be some sort of entity that is more
like a file descriptor, that indexes into a list of layouts that are kept somewhere in kernel/library space? I know I
commented about this in previous iterations of this discussion (search above somewhere), but things went a
different way at the time.

Perhaps a solution would be to make the "verified" flag some sort of checksum of the rest of the struct and if
the checksum is invalid when given to the open/create/whatever/other/call, then revalidate with all the costs
involved?

Another issue with using a path as the argument is probably the reason James likes the fsname alternative. Namely,
the question of what the "validity" applies to. Just that path alone? Everything under it in the directory tree? For example,
the path is /lustre/afn/blah, and it applies to /lustre/afn/blah/all/the/directories/and/files/under/it? Just to /lustre/afn/blah?
The option of "just that path" has simplicity, because of the complex possibility of one layout at one level of the tree, and another
layout specified for some subset of directories under it. But then what about "just that path" applying to a file? there is
no under, and when you have such a case, perhaps the thing you want to be doing to it is to take that layout and
replicate it on some other file you create somewhere, where it would not apply due to invalid verification status.

Comment by James A Simmons [ 16/Jul/14 ]

Okay my local patch uses a path argument :

int llapi_layout_verify(const struct llapi_layout *layout, const char *path)

For llapi_file_open what I'm doing currently is

{
       /* If the user verifed the layout we still need to ensure that the
         * requested file is located on the same lustre file system as the
         * layout. */
        if (strlen(layout->llot_fsname) != 0) {
                /* Verify that the file path belongs to a lustre filesystem. */
                rc = llapi_search_fsname(path, fsname);
                if (rc < 0 || (strlen(fsname) == 0)) {
                        errno = ENOTTY;
                        return -1;
                }

                /* Verify the file system of the user supplied path is the 
                 * same one the layout was created for. */
                if (strcmp(fsname, layout->llot_fsname)) {
                        errno = ENOTTY;
                        return -1;
                }
        } else if (llapi_layout_verify(layout, path) < 0)
                return -1;

So it forces a verify if it doesn't see llot_fsname. If verified we still need to make sure that the file is located on the same file system. A subdirectory could be another file system mount. So I handle my fsname paranoia.

As for corruption, well that is harder to deal with. For checksum we have to write a check sum algothrim since userland libcfs is going away. Also with that small amount of data their is greater chance of checksum value collisons. Maybe we could get away with a hashtable. Again with libcfs going away that might be more challanging. Have to see if we can use just the header. Currently struct llapi_layout is opaque but I guess nothing stops the user from casting and scribbling.

Comment by John Hammond [ 16/Jul/14 ]

What is the use case of llapi_layout_verfiy()?

Comment by Ned Bass [ 16/Jul/14 ]

At some point I added a llot_magic canary field to use as a minimal sanity check. But I somehow lost that change in all the refreshes. I think we should add a magic field, but I'm opposed to further measures such as checksums to detect or prevent corruption. The application is ultimately responsible for not trashing its memory.

Comment by James A Simmons [ 16/Jul/14 ]

The llot_magic is still there. Users will always find a way to break things :-/

Comment by Ned Bass [ 16/Jul/14 ]

I have to voice a concern that lack of a formal design process is threatening to derail this API. There is too much implementation going on with too little understanding of requirements, as John hinted at. There should be one architect/implementer to preserve unity of design and conceptual integrity, with others providing feedback, requirements, and code review. New public interfaces such as llapi_layout_verify() should be fully documented in man page format and reviewed here before an implementation is submitted for review in gerrit.

So, let's precisely nail down exactly what our validity checking requirements are before charging ahead with implementation. We need concrete answers to at least:

  • Is the return status of fsetxattr() a sufficient indicator of validity?
  • If supplemental validity checking is needed (beyond fsetxattr()), how much control over that checking needs to be exposed through the API?
  • What specific error conditions need to be communicated?
Comment by John Hammond [ 16/Jul/14 ]

Let me rephrase. What are the use cases of validity checking in user space? I'm in favor of good interfaces that answer specific questions like: "How many stripes can I have?" or "What are the minimum and maximum stripe sizes?" These are easy to answer.

On the other hand, offering an API to answer "Is this striping valid?" makes me a bit uncomfortable. It's a bit like being asked "Where babies come from?" by someone else's kid. There are too many details to ensure that striping that passes this function will be always be accepted and created by Lustre.

Setting implementation aside, how do I use such a function? Do I create various hypothetical striping and pass them to verify? This seems like a parameter search to answer the questions from the first paragraph.

Comment by Ned Bass [ 16/Jul/14 ]

Yes. John's comments cut to the heart of the matter are in line with what Andy has been asking for. The ultimate test of a layout's validity is whether Lustre accepts and creates it. But the API should provide interfaces that allow the user to determine valid layout values.

At this point I think http://review.whamcloud.com/#/c/5302/ should be abandoned and replaced with a new change ID based on patch set 23. That was the last stable point in the revision series (aside from errno stomping bugs caught by Frank), and there has been too much churn since then for me to review the changes with any confidence. Future revisions should be based strictly on design changes agreed to here, and avoid unnecessary refactoring of code that has already been reviewed, tested, and debugged.

For now, I respectfully request to be the only person to push changes to the review. Others are of course welcome to contribute dependent patches as separate reviews. I'm grateful that James took the initiative to move this forward, but I think it's important for consistency's sake to just have one chef in the kitchen. Please communicate any new requirements for LU-4665 integration here so that work can move forward.

To summarize, I propose that we do the following, in order.

  • Abandon change 5302
  • Submit a new gerrit review based on 5302 patch set 23
  • Identify the questions we need the API to answer as suggested by John
  • Post draft man pages for new interfaces for review here, revise as needed
  • Refresh the patch with implementation of new interfaces, along with complete test cases and documentation

Are others on board with this plan?

Comment by James A Simmons [ 16/Jul/14 ]

Okay. I will take all the changes I did in 5302 and place it in the LU-4665 patch. Andreas we will need to submit a different patch on top of Ned's new patch with your idea of a llapi_layout_verify. That is assuming people will accept your idea. Ned new base patch will be fine since it will not be the back end of anything. Further testing of the LU-4665, which will use the layout api with lfs getstripe and setstripe, on my part will expose any problems with Ned's design. Plus the new base patch will not handle the case of DNE directories so that work will have to be developed as well. Much work to be done.

I promise to break the work I do in LU-4665 into small incremental patch so people can properly review them.

Perhaps Ned you could consider breaking your patch up into smaller pieces. It is a lot of work.

Comment by Ned Bass [ 16/Jul/14 ]

That is one reason I want to start over from patch set 23. It was pretty thoroughly reviewed at that stage so I'd like to see it land with only minor changes and do follow-up work in separate patches. I don't see much benefit to breaking up what's already been reviewed though. The main obstacle to landing it is the expensive validity checks that Andreas objects to. But those checks can simply be removed for now, and replaced with improved interfaces as discussed above in subsequent patches. How would that sit with you, Andreas?

Comment by Andreas Dilger [ 17/Jul/14 ]

I don't see any benefit to abandoning 5302and creating a new change over just pushing a new patch which drops the changes you don't want. Making a new change just means more places to look for information about this change.

As for checks, I agree with John. Simple checks against hard (constant) limits are fine, but I'd prefer to do any complex checks (e.g. opening /proc files and iterating) in a separate function.

The kernel has to do all of these checks itself anyway. The main drawback is that the xattr interface that is needed for BG/L is clumsy because it might "succeed" on any filesystem with xattr support, but not actually create the striped file as expected. The ioctl() interface was less troublesome in this regard since it is unlikely that any filesystem would handle the Lustre ioctl command. It is worthwhile to verify if setxattr("lustre.lov") will work on other filesystems or if they will refuse the "lustre.lov" xattr because it is not in one if the normal namespaces ("user", "system", "trusted", or "security") that are handled by the kernel.

Comment by Ned Bass [ 17/Jul/14 ]

I don't see any benefit to abandoning 5302

We can keep working there. I just find it becomes hard to navigate when the comment and revision history gets too long.

the xattr interface ... might "succeed" on any filesystem with xattr support, but not actually create the striped file as expected.

I have not found that to be the case in practice (ext4, zfs, tmpfs, and nfs). I believe (but haven't verified in the code) that a filesystem must explicitly register support for an xattr namespace beyond the standard ones, otherwise the kernel will return EOPNOTSUP.

Comment by Ned Bass [ 04/Aug/14 ]

Updated attached man pages to reflect recent API changes. In particular

  • removed llapi_layout_expected()
  • llapi_layout_by_{fd,fid,path}() renamed to llapi_layout_get_by_{fd,fid,path}()
  • Added flags parameter to llapi_layout_get_by_{fd,fid,path}()
  • Implemented flag LAYOUT_GET_EXPECTED which is accepted by llapi_layout_get_by_path() to implement functionality formerly provided by llapi_layout_expected()

Please review the documentation of the new flags parameter in llapi_layout_get_by_fd.txt.

Comment by Andreas Dilger [ 27/Aug/14 ]

One thing that snuck past my review in these patches was that the new llapi_layout_*() functions are all returning "-1" to the caller and returning the error codes via "errno" instead of returning the negative error numbers directly to the callers. IMHO, "errno" is the domain of the kernel and libc and should not be used by application libraries. This is a global variable that could be touched by many parts of the process, and there is the danger that errno gets clobbered by other parts of the code, leading to ugliness like:

        if (lum == NULL) {
                tmp = errno;
                close(fd);
                errno = tmp;
                return -1;
        }

and every piece of code that is returning an error having to do it twice:

        if (path == NULL ||
            (layout != NULL && layout->llot_magic != LLAPI_LAYOUT_MAGIC)) {
                errno = EINVAL;
                return -1;
        }

My preference would be to fix the new llapi_layout_*() functions to return the negative error number directly and avoid errno entirely.

Comment by Ned Bass [ 27/Aug/14 ]

I'm on board with this. I just had a hallway discussion about the various approaches for returning errors, and negative errno return values were generally agreed to be the least evil of the following possibilities:

  • Use errno.
    This is the current approach. It's evil for the reasons given above by Andreas.
  • Use a library-specific version of errno.
    e.g. llapi_errno. It wouldn't get stomped on, but we'd have to handle thread safety. Ick.
  • Implement our own class of error codes.
    This might be cleaner and more flexibile, but with a higher implementation and maintenance cost, plus UNIX programmers will be already familiar errno values.
  • Return negated errno values.
    This is the proposed approach. The only real downside is there's little precedent outside the kernel and llapi. But it's thread safe and cleaner than the current approach.
Comment by Christopher Morrone [ 28/Aug/14 ]

IMHO, "errno" is the domain of the kernel and libc and should not be used by application libraries.

I think that the lustre library should be considered a system library, not an "application library". From a normal application's perspective, the lustre library is as system as they come: you are using library the interacts on your behalf directly with the kernel to influence a service offered by the kernel. In that respect I would argue that errno is entirely reasonable to use.

I would argue that this kind of error handling is exactly what user-space C developers have come to expect from system level libraries. After all, if it isn't OK for use to use errno, is it really OK for us to reuse all of the standard error codes (EIO, EINVAL, etc.)? Shouldn't we have to invent our own error names and values if those things are only the purview of the kernel and Lib C?

Granted, using temporary variable to implement the use of error is mildly annoying. But is that really enough justification to violate the principle of least surprise for the user-space developers who will be consuming our library functions?

I think that the "Return negated errno values" approach is probably the least desirable of those proposed. This is a kernel-ism; the result of a clever hack that recognized that those memory values would never be valid so hey why not throw the error value in there. In user space, the programmers are going to think we have lost our minds if we force them to check for an negative version of an error code that is always positive everywhere else. At the very least, we would need to create macros or functions that the users would need to use to check the return code and another to translate the error code into the correct value. We would be shifting the annoying error code shuffling from the library writer to all of the library consumers.

errno is defined by the C language standard. Most, if not all, of the values of errno that we use (EACCESS, EAGAIN, EIO, EISDIR, etc.) are defined by POSIX.1-2001 or C99, not inventions of the Linux kernel.

If we use the same values as errno, users are going to want to use standard functions like perror() that assume the use of errno. Granted, strerror() also exists, but it is more difficult to use than perror().

It is already difficult to get users to check error codes, so I think it is important to keep things simple when reasonable to do so.

Comment by Andreas Dilger [ 28/Aug/14 ]

If there is an insistence on setting errno to return errors, it would still be possible to also return the negative errno instead of "-1" all the time. I'm not suggesting to return PTR_ERR() instead of NULL in case of memory allocation failures, as I agree that this is not very common for userspace programs.

Comment by Andy Nelson [ 28/Aug/14 ]

FWIW, I have already implemented the version of the api that uses errno stuff, and put strerror calls in various
error paths. For example:

rc = llapi_layout_stripe_count_get(layout,&num_comps ); if(rc!=0){*ierr=-1;goto writeerrout;};

...

writeerrout:
sprintf(cerr,"Lustre layout definition error: %s\n",strerror(errno));

This api is consistent with how I do things in other parts of the code as well, e.g. with stat calls and such.

As an implementer of userland code that uses the llapi functionality, I strongly prefer the errno approach.

Comment by Ned Bass [ 28/Aug/14 ]

If there is an insistence on setting errno to return errors, it would still be possible to also return the negative errno instead of "-1" all the time.

We could, but what would be the benefit? It would be awkward and potentially confusing for the API to specify more than one authoritative source of error codes. And as Andy points out, applications using both standard library and llapi_layout calls would prefer to use common error handling constructs.

Comment by Andreas Dilger [ 28/Aug/14 ]

Because it would keep the same API that the rest of liblustreapi has today, and there is no drawback to doing so. Virtually every application I've seen checks if (rc < 0) instead of if (rc == -1), so as it is clearly documented in the man pages that the functions return a negative value on error instead of "-1" there shouldn't be any problem. That allows applications to choose which behaviour they want to use for programming, and I don't see it introducing any significant complexity into the library - at worst it would mean return -errno instead of return -1 in some places.

Comment by Ned Bass [ 28/Aug/14 ]

so as it is clearly documented in the man pages that the functions return a negative value on error instead of "-1" there shouldn't be any problem.

If the man page merely specifies a negative value on error, than a properly written application shouldn't rely on the specific value returned. Therefore in order for this behavior to be useful, it most become a formal part of the API and documented as such. Simplicity is arguably the most important aspect of a well-designed API, and having two alternative means of returning error codes would add complexity for little gain. I'm not persuaded we should do it just because the legacy llapi functions do it. We should see this as an opportunity to get things right this time, not to carry over all the old baggage.

Comment by Andy Nelson [ 29/Aug/14 ]

I am all for abandoning the old api as soon as possible. It is a maintenance nightmare for my applicaiton code to handle. There is ugly and otherwise unneeded ifdef goo for different machines all over the place. Please dump that api. What Ned says is exactly right: there is a chance to do it right/better this time around.

As far as "negative value returned" vs "specific negative value returned (i.e. rc=-ETHE_ERROR_CODE), again, this leads to confusion and complexity that is just not needed. What happens when someone makes a commit sometime and gets them out of sync, such that the return code and errno are not the same any more, for example? Yes, thats a bug that
someone introduced, but that is the whole point. You can remove that bug years ahead of time by making it impossible to do that by design.

Comment by Andreas Dilger [ 29/Aug/14 ]

Andy, while I'm all for updating applications to the new API, yours is not the only application in the world that uses it, so we can't remove the old API very quickly. Also, this new API has only just been landed into a development branch and would need to be backported into the maintenance releases before it even has a chance to be used by regular users.

I think the best that can be done is to include this into all of the maintenance releases, update the old APIs to use this new code (James has started on that) and then it can be deprecated in a few years after it is available in those releases. It is also possible to mark those functions as deprecated in the headers so that application developers learn this before the API is removed. It would also make sense to update the user manual once this API is available in the maintenance releases.

Comment by James A Simmons [ 29/Aug/14 ]

I will hold off on my patches until this is resolved. I sent out a email to some people in our applications department to see what they say. I will report on that feedback.

Comment by Andy Nelson [ 29/Aug/14 ]

I fully understand the constraints of backwards compatibility and "can't get rid of that yet" cruft. That was very much the point of my previous comment: I've got to carry around backwards compatibility stuff for the different lustre api's until such time as the new one is available on the oldest machine I have to compile code on. I'm expecting that I can remove the "old API" functionality from my own code in something like 5 years or so, if the "new API" stuff lands in a distributed version right now. And that isn't even the case yet, so the date keeps getting pulled further and further out.

An example of that pain is the lustre_idl.h file, which I have to hack by hand and carry around with my code since it doesn't compile in user space and I need stuff out of there. The example codes in the lustre manual include it for example too. All the more complicated since that file changes in odd ways across different versions and I can't just use a 1.8x lustre_idl.h file on a lustre 2.x installation and expect it to work...

So this is why my emphasis on the "start now" and get the clock ticking. As far as new api format (errno and such discussions), we've gone over that, but here is a reiteration: My position as an applications developer is to have one way to access the error information, and for that way to be the errno setting stuff.

Comment by Christopher Morrone [ 04/Sep/14 ]

Because it would keep the same API that the rest of liblustreapi has today, and there is no drawback to doing so.

The old API is seriously deficient in its design. Yes, we need to keep it around for some time, but I don't see a great deal of value in maintaining compatibility with poor design. This is our chance to make a clean break and make something that applications can really use.

Virtually every application I've seen checks if (rc < 0) instead of if (rc == -1), so as it is clearly documented in the man pages that the functions return a negative value on error instead of "-1" there shouldn't be any problem.

Granted, that is not uncommon. But the man page usually says:

On error, -1 is returned, and errno is set appropriately.

So it is perhaps not unreasonable to guess there are are also folks out there that are used to explicitly checking for -1. Lustre returning only -1 could be considered the least likely to trip anyone up because it will work for checks of either rc == -1 or rc < 0.

Comment by James A Simmons [ 11/Sep/14 ]

So I got feedback from our users and middle ware developers. For our users they only care about lfs setstripe and getstripe working. They will most likely never program the striping api themselves. For the ones that do program the most common case is they test for success which means test for zero and then bomb out the app. They normally could care less about returned error values. What is most important to them is that a zero is returned for all functions on success. For our middle ware guys they also want zero to returned for all cases. Now for the error reporting they varied a bit on opinion. What I did get is they don't like negative errno values. They also were not fans of having to read errno itself after a function call. So they asked why not just return a positive errno instead since 0 is success and something else is failure. Also I found out in the discussion return 0 on success and a positive errno is defined in the POSIX.1c standard.

Comment by Andreas Dilger [ 12/Sep/14 ]

I would much rather stick with returning -1 and errors in errno than having positive error numbers returned from the functions. Many programs I've seen check "if (rc < 0)" for errors instead of "if (rc == -1)" , so I don't think it makes sense to break these.

Comment by Jodi Levi (Inactive) [ 03/Feb/15 ]

Duplicate of LU-2182

Generated at Sat Feb 10 01:37:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.