Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3
-
None
-
3
-
17483
Description
Ran into a problem that uncovers a potentially catastrophic issue.
The patch http://review.whamcloud.com/10393 made some non-backward-compatible changes to the liblustreapi interface. The offending changes are:
-extern int libcfs_ukuc_start(lustre_kernelcomm *l, int groups); +extern int libcfs_ukuc_start(lustre_kernelcomm *l, int groups, int rfd_flags);
extern int llapi_hsm_copytool_register(struct hsm_copytool_private **priv, - const char *mnt, int flags, - int archive_count, int *archives); + const char *mnt, int archive_count, + int *archives, int rfd_flags);
The second change is most serious, since it is rearranging the arguments for the function llapi_hsm_copytool_register() for no particularly good reason.
Running an OLD copytool linked against the old liblustreapi.so (Lustre 2.5.2 or earlier) with the NEW liblustreapi.so dynamic library (Lustre 2.5.3+ or Lustre 2.6.0+) exposes this problem. Because all of the required symbol references were found in the library, the linking was successful, so the function llapi_hsm_copytool_register() was called WITH BAD ARGUMENTS.
In this case, the result was harmless – the registration call simply failed. However, it's easy to imagine scenarios where swapping arguments could result in wiping out critical user data – switching the length and offset in a write operation, for instance. At the very best, this results in problems that are difficult to diagnose as in this case.
It appears that http://review.whamcloud.com/12836 landed a series of API check in lustre/tests/llapi_hsm_test.c, but this was only landed for Lustre 2.6.93 so the API change was unnoticed.
To allow changes like this to happen safely in the future there needs to be a version in the lib*.so files. There are basically two ways to do this:
1) Least burden on developer, but inconveniences users:
When you create the library, you use:
gcc -o liblustreapi.so.1 -Wl,soname,liblustreapi.so.1 ...
This creates a file named liblustreapi.so.1, with a dynamic linker name of liblustreapi.so.1. The application is compiled against liblustreapi.so.1, and will find the file liblustreapi.so.1 containing the correct linker references.
Library fixes that are fully backward-compatible simply "update" the old liblustreapi.so.1 by overwriting it. The burden on the library provider is merely to ensure that all changes are, in fact, backward-compatible.
When you change the API in a non-backward-compatible way (as in this case), you roll the version number:
gcc -o liblustreapi.so.2 -Wl,soname,liblustreapi.so.2 ...
The OLD application will no longer be able to resolve references to liblustreapi.so.1, and fails with a link error, which is easy to diagnose. The NEW application will link with liblustreapi.so.2, and will work properly using the new syntax and the new dynamic library.
The downside of this is that users need time to adapt to the incompatible changes in the API, and could be without any workable solution until that is finished. It also prevents the same binary from working on different versions of Lustre, which is the whole intent of liblustreapi.
2) More flexible solution
In this case, when you introduce an non-backward-compatible change to the library, you release TWO new libraries:
gcc -o liblustreapi.so.1 -Wl,soname,liblustreapi.so.1 ... gcc -o liblustreapi.so.2 -Wl,soname,liblustreapi.so.2 ...
The installation overwrites the old library, and adds the new library.
You must guarantee strict backward-compatibility in the latest 1 version (preserving syntax and functionality), and you provide the new API in the 2 version. Both work, so the OLD application (compiled against liblustreapi.so.1) and the NEW application (compiled against liblustreapi.so.2) both work.
If application is using the old API, or chooses to not move forward until the new library is thoroughly tested, or encounters a difficult problem with the new code, they can continue to compile against the old headers and liblustreapi.so.1 until all issues are resolved, and the user is not left without a solution.
Version 1 is now in a deprecated state, and can eventually be obsoleted and removed in the RPM %post scriptlet.
This ensures continuity for the user and less pressure on the users.
=================
Suggested solution:
Please implement 1) at the very least to prevent user behavior from creating a disaster. To summarize:
- Append a version number to liblustreapi.so.
- Roll the version number forward any time non-backward-compatible changes to the API are introduced.
- Add a %post scriptlet to the API to remove old versions of liblustreapi.so
This may open a gap where the customer will be unable to run after upgrading Lustre, but it will at least prevent catastrophic problems.
solution 2) above is probably the best in the long term. It is a bit more work for the developer, but it greatly enhances the user experience. It also imposes a counter-incentive to making arbitrary minor changes to the API.