lustre build system improvments
(LU-3953)
|
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Technical task | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Christopher Morrone |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||||||
| Description |
|
There are numerous oddities about Lustre's system for applying version numbers in the configuration and packaging system. At this point, it is looking strongly like an overhaul is in order. |
| Comments |
| Comment by Christopher Morrone [ 23/Jan/16 ] |
|
Why do we have both lustre_ver.h and lustre_build_version.h? So silly. |
| Comment by Gerrit Updater [ 23/Jan/16 ] |
|
Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/18107 |
| Comment by Christopher Morrone [ 23/Jan/16 ] |
|
I have a first rough version of the first overhaul patch. It is a net reduction in almost 300 lines, so I must be doing something right. I think the patch is already decent, but I haven't really considered how lbuild is going to react to the changes, or looked at how the rpm or dpkg packaging systems are going to handle it. It may need further work to get along with all of that. But it is worth running this version through the build farm as a fortestonly patch to see what blows up first. There is quite a bit of additional cleanup that can be done, but I want to try to avoid putting too much change in one patch. I'll add additional cleanup in a follow-on patch. Some things that I'll look into (just so I don't forget):
|
| Comment by Christopher Morrone [ 23/Jan/16 ] |
|
Two more revisions to fix easy problems. The first major issue is now what I expected: I need to teach lbuild about the new versioning scheme. It also looks like the ppc64 builder isn't happy either. I'm not sure that it provides enough information to do anything about it though...maybe I can make a good enough guess to get it working. But that will all have to wait until next week. |
| Comment by Gerrit Updater [ 23/Jan/16 ] |
|
Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/18108 |
| Comment by Gerrit Updater [ 23/Jan/16 ] |
|
Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/18110 |
| Comment by Gerrit Updater [ 24/Jan/16 ] |
|
Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/18111 Project: fs/lustre-release |
| Comment by Gerrit Updater [ 24/Jan/16 ] |
|
Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/18112 |
| Comment by Christopher Morrone [ 25/Jan/16 ] |
|
OK, I think I'm ready for the first round of reviews on the current four patches: |
| Comment by Christopher Morrone [ 17/Feb/16 ] |
|
It is now just three patches: |
| Comment by Christopher Morrone [ 15/Mar/16 ] |
|
In a http://review.whamcloud.com/18107 review, Oleg pointed out a potential issue with lack of a special-case for RC releases. I admit that I totally missed how RC releases are currently being done. To make a long story short, I am going to argue that we really shouldn't do it that way any more. Correct me if I am wrong, but as I now understand it, even though a tag might have a name of "2.8.0-RC1", the build system is currently stripping off the "-RC1" and building the code as if it is a final "2.8.0" release. As Oleg explained it, the intention here has been that when an RC is tested and agreed to be the final release, the packages from the RC build are already named with the final release version and can be released as-is. The advantage here is that we don't need to trust our build procedures to be consistent, we can simply release the packages from the most recent RC. The down side should be pretty clear though too: RC releases are things that we explicitly want to release and have the general community test and vet. So takeing 2.8.0 as an example, which to date has five RC releases (-RC1 through -RC5), there are exactly five different versions of the code now out there in the wild with no way to tell them apart. I would argue that this downside of having multiple different codes out in the wild with exactly the same version vastly out weighs the possibility that we might not compile and generate packages the same way twice. So I would argue that not special-casing RC releases is a feature of change 18017, not a defect. We will need to come up with a new way of versioning RC releases. I would suggest that we use the obvious method already established (but not actually fully employeed): Release candidates will use the previous release's number, but have the thrid number start at 90. For example, release candidates for the 2.8.0 release would have version numbers like: 2.7.90, 2.7.91, 2.7.92, 2.7.93, etc. This also has the advantage that it works well with packaging systems like RPM. Many rpm-based distros explicitly advise against using substrings like "RC1" in version strings, because rpm just doesn't handle them, and assiciated package upgrades in any particularly intelligent way. Granted, we somewhat avoided that problem by stripping off the RC altogether. But I think it is very bad form to have multiple different versions of the code to have the same version string. |
| Comment by Oleg Drokin [ 15/Mar/16 ] |
|
2.7.90, 2.7.91, ... are already used. they mean code drops in code freeze - i.e. when only blockers remain. RCs (release candidates) on the other hand are generated when there are no more blockers left (At the time of tagging, anyway). I guess it's possible to continue the versioning as is, but we might run our of digits quite soon in some cases. The "five versions of code" issue is real, I guess, but on the other hand it could be solved with other means. E.g. once we transition to signed packages, the real releases might be signed with a different key than the unsigned (or signed with a "build bot" only) key. Also we don't widely distribute these RCs and so nobody could get them. We can also destroy interim RCs from download site if needed (or it happens automatically over a relatively short time). And we can tell them apart by a build date too. The "we don't need to trust our build procedures to be consistent" is a real risk on the other hand on many levels. There might have been a system update between the latest RC and the release that updated some library breaking something in process (updating gcc, whatever), a glitch during building for other reasons producing a dud (ram error, disk error), somebody might have compromised the build server meanwhile (or inserted some code into the git tree), ... In short, there's certain appeal to release code that was actually tested vs code that was only built even when it's supposedly the same as the one that was tested because it was supposedly built from the same source. |
| Comment by Christopher Morrone [ 15/Mar/16 ] |
OK, fine, then they could start at .120, or .500, or whatever. I don't care so much about the details as long as we follow some simple rules that make the versioning work well for the various packaging systems like RPM. For instance, each release (including release candidate releases) should have a unique version number. The version number should be mostly numerical, and atomically increasing. Letters should be used sparingly, and preferably not at all, in the main part of the version string. (Where main part is the first four period-separated numbers. We allow for alphanumeric strings for per-organization branding after that)
With all due respect, I do no think that will be an acceptable solution. The version string is specifically designed to make it easy to tell different builds and packages apart. Unique version numbers are something packaging systems like rpm, and dpkg understand. The relationship between different keys and how they effect install precedence, is not. You say that there are other reasonable methods, but I do not really think we are likely to find one. The version string is very explicitly the universal solution for this problem. We are introducing larger problems by trying to avoid that universal solution.
Right, and no doubt that is how this has flown under the radar for so long. Because this part of your development phase is kept pretty secret. But this is pretty much the opposite of how a good open source development project, with an open development process, should work. If we had a more open process with good communication, then every RC release would have an announcement on mailing lists letting people know where to get the packages and asking for help in testing. This announcement shouldn't just go to lustre-devel, it should also got to lustre-discuss. We want adventurous users/sysadmins that can carve out some time to help with testing to do so. They should not have to go to git to do that, they should be able to download packages. If we make those packages reasonably well known and easy to access (which we should definitely do for 2.9.0 RCs and into the future), then it will not be at all acceptable to have them all use exactly the same version number. They have to be uniquely and clearly versioned. The "we don't need to trust our build procedures to be consistent" is a real risk on the other hand on many levels. There might have been a system update between the latest RC and the release that updated some library breaking something in process (updating gcc, whatever) I agree, that is a more reasonable concern. But it is one that is pretty much entirely manageable with a small amount of process and good communication. When RC releases are about to start, we send an announcment out to the build farm admins saying "RCs are about to begin. Please freeze all changes to builders for branch X until further notice". If you are really worried, you call them on the phone and get a verbal confirmation that they understand. Many, many, many software projects are able to handle that level of release management in the real world. I believe that we can on Lustre as well. I'm putting my money where my mouth is too, by spending LLNL manpower on setting up a new buildfarm for Lustre. We'll present the prototype at LUG this year. Furthermore, there are those that would argue that if Lustre can't compile against multiple different libraries with the same ABI, then Lustre is just broken. Remember that Lustre does not own the entire OS. It really needs to work correctly anywhere the packages install without being forced. But I think keeping the builders static during the RC phase is totally reasonable.
Thats a one-in-a-billion type of problem. I don't think that in even remotely as much of a problem as releasing several packages with different contents, but versioning them identically.
That isn't a problem specific to release condidates, so I don't think it is relevant to the conversation.
I actually think it is fairly reasonable to build the same commit with no other change but the version string and not do any additional testing. But if people want to do more testing, that is certainly fine with me. I would argue that 8-24 hours of automated-only testing would be more than sufficient as a sanity check that nothing went horribly wrong between the two tags (of the exact same commit).
Yes, I understand the appeal and your points. But I don't think they are strong enough to justify the offsetting bad packing practice that you have introduced. We need to stop having multiple packages of different contents with the same exact version. That is simply not acceptable. |
| Comment by Gerrit Updater [ 23/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18107/ |
| Comment by James A Simmons [ 11/Apr/16 ] |
|
I tried a spin with the latest lustre on Ubuntu 15.04 and I get this error: dpkg-buildpackage: warning: debian/changelog(l1): version '2.8.51_17_g5bd8f72-1' is invalid: version number contains illegal character '_' So I did some digging and I found that '_' is actually an invalid character for fedora as well. See this link: https://fedoraproject.org/wiki/Packaging:NamingGuidelines#Separators So basically we will need to fix the lustre version not to contain any '_' for proper packaging on rpm and deb systems. |
| Comment by Christopher Morrone [ 11/Apr/16 ] |
Actually, that page says the following:
We are not using underscores in the name parts, i.e. the Name field. Underscores are only used in the Version. But yes, it looks like we'll need to work something out for dpkg systems. The main reason that we use underscores in the version is because Lustre currently has a variable number of version fields, which makes it difficult to tell the delineation between the upstream version and the following part of the version, whether third-party or development (git describe) information. We can't use a dash (-) in the version string on rpm systems. That pretty much leaves a period as the only other sane option if we decide to stop using underscores. But maybe we stick with underscores on rpm systems, and munge the string in some acceptable way for dpkg? |
| Comment by James A Simmons [ 11/Apr/16 ] |
|
I discovered the |
| Comment by Christopher Morrone [ 11/Apr/16 ] |
Yeah, that is true. For now, you will just need to rerun autogen.sh to make any version changes. This could be automated in the future, but we pretty much need to fix autoreconf (LU-7700) to do it sanely. |
| Comment by James A Simmons [ 12/Apr/16 ] |
|
I tracked down what needs to be changed for make debs. We have: lversion=$$(sed -ne 's/^#define VERSION "(.*)"$$/\1/p' config.h); The sed command needs to be modified to change '_' to '-'. I'm no sed master so Chris what needs to be done to make that so. |
| Comment by Christopher Morrone [ 12/Apr/16 ] |
|
Thanks for the legwork on that, James! We don't even need sed any more; I'll just use tr. |
| Comment by Gerrit Updater [ 12/Apr/16 ] |
|
Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/19488 |
| Comment by James A Simmons [ 12/Apr/16 ] |
|
Yeah more simplification. Debian platforms are back in business. Thanks Chris. |
| Comment by Giuseppe Di Natale (Inactive) [ 13/Apr/16 ] |
|
Already mentioned this to Chris, but this may have broken SUSE builds on the Lustre Buildbot. The error I'm seeing on builds is "error: line 86: Empty tag: Release:". Example log is here. |
| Comment by Christopher Morrone [ 13/Apr/16 ] |
I don't think so. I think that this is a problem in the buildbot configuration. I will open a github ticket shortly. |
| Comment by Gerrit Updater [ 17/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18108/ |
| Comment by Gerrit Updater [ 21/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18111/ |
| Comment by Gerrit Updater [ 02/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19488/ |
| Comment by James A Simmons [ 03/May/16 ] |
|
Chris is everything for this ticket landed. Can it be closed? |
| Comment by Christopher Morrone [ 03/May/16 ] |
|
Yes, I think now is a good time to close this one. |