lustre build system improvments (LU-3953)

[LU-7643] Remove kernel version string from Lustre release field Created: 08/Jan/16  Updated: 06/Jun/17  Resolved: 06/Dec/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Technical task Priority: Critical
Reporter: Christopher Morrone Assignee: Minh Diep
Resolution: Fixed Votes: 0
Labels: llnl, patch

Issue Links:
Blocker
is blocking LU-7645 Stop controlling the RPM Release fiel... Closed
is blocked by LU-5614 use %kernel_module_package for weak-u... Closed
is blocked by LU-7699 Overhaul lustre's versioning Closed
Related
is related to LU-8343 build failures on SLES Resolved
is related to LU-20 patchless server kernel Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Right now when RPM packages are built, we insert into Lustre's release field the version string from the kernel against which Lustre was built. For instance:

$ rpm -qpi lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm 
Name        : lustre
Version     : 2.7.0
Release     : 2.6.32_504.8.1.el6_lustre.x86_64

Side note: A sysadmin is going to (and have in the past) think we messed up because of the ".x86_64.x86_64" in the file name, but the reason for it is that the first one is part of the Linux kernel version string, as we can see in the Release field above. The second .x86_64 is Lustre's.

The reason for including the kernel's version string in Lustre's Release field because Lustre has traditionally been packaged to work with one, and only one, specific version of a kernel. If you have two very slightly different kernel versions "2.6.32_504.8.1.el6" and "2.6.32_504.8.2.el6", for instance, then you currently need to compile lustre against both kernels individually. While the "rpm -requires" should also list the specific required version number, because there are so many very closely compatible kernels for which we need to juggle lustre builds, it was simpler for sysadmins and developers alike to add the kernel's version string into Lustre's release field.

But fortunately, this need to build lustre for every specific kernel is a self-imposed restriction, and work is under way to lift that restriction in LU-5614.

For many years, it has been possible to compile kernel modules once and then use them with any kernel that is ABI compatible. The Linux distro mechanism that allows this is often called "weak modules". LU-5614 should bring Lustre into the year 2006 and get it working with weak modules.

Once that is done, we can finally drop the kernel version string.

This is especially fortuitous for anyone using koji as a build system, because koji makes this sort of abuse of standard packaging practice pretty close to impossible. koji is used by fedora and its cousins, and it has also been adopted by LLNL for its RHEL-based TOSS distribution.



 Comments   
Comment by James A Simmons [ 08/Jan/16 ]

This will be an important step to support patchless servers.

Comment by Christopher Morrone [ 11/Apr/16 ]

I have the patch for this waiting in the wings. I'm not going to bother pushing it to gerrit yet because it will need frequent refreshes until the LU-5614 patch lands.

Comment by Gerrit Updater [ 03/May/16 ]

Christopher J. Morrone (morrone2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/19954
Subject: LU-7643 build: Remove Linux version string from RPM release field
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b14c0b91f0623f301ea00fc7b2a053912e93c42d

Comment by Christopher Morrone [ 03/May/16 ]

I decided to push change 19954 after all. I just won't assign reviewers until it looks like LU-5614 change 12063 is closer to landing.

Comment by Christopher Morrone [ 21/Jun/16 ]

With LU-5614 close to landing, I rebased this issues' patch. It is ready to go through the normal review process.

Comment by Christopher Morrone [ 29/Jun/16 ]

LU-5614 is done. Patch 19954 for this ticket was already based on that ticket's patch, so no rebase should be necessary. We just need to get the review process under way on this one.

Comment by Minh Diep [ 06/Jul/16 ]

Chris,

while it's fine on client rpm where we can move to different kernel version, it's difficult for lustre server rpm to move to different kernel version since it requires patched kernel. Without kernel version string in the name, it's not easy to find out the kernel it's built on.

Thanks
-Minh

Comment by Christopher Morrone [ 06/Jul/16 ]

Lustre server rpms do not require a patched kernel.

The only thing that needs a patched kernel at this point is ldiskfs testing. That should be fixed soon in LU-684. If for some reason that fails, we can take the approach suggested by James Simmons in LU-20.

Comment by Bob Glossman (Inactive) [ 06/Jul/16 ]

I strongly disagree. LU-684 doesn't eliminate the need for a lustre build with ldiskfs to be tightly tied to a specific linux kernel version. As long as we build ldiskfs by patching a particular upstream ext4 on the fly during a lustre server build we are subject to variations in the upstream ext4 that aren't represented by just having a compatible advertised kernel ABI that is constant. upstream ext4 changes occur unpredictably and in an unscheduled way in RHEL and SLES kernel updates.

Comment by Christopher Morrone [ 06/Jul/16 ]

Compatibility problems that don't change the ABI won't necessarily be addressed by freshly applying patches and recompiling either. We have had problems in the past where ext4 internal semantics changed without changing the API and without breaking ldiskfs patch application. If you care that much, and since those problems have actually hit, you should probably stop using the ldiskfs-as-patches approach altogether.

At least with the ldiskfs module fully compiled in the past, we eliminate the problem of overlooking ext4 internal semantic changes for a single version of the packages. The ldiskfs module is going to be in a known good frozen state. It will only be when larger semantic changes happen between the larger OS and filesystems that a recompile will be needed. And hopefully those types of changes are as rare as the issues inherent to the ldiskfs-as-patches approach within a stable OS kernel release series.

But if folks still insist on ldiskfs being tied to a single kernel, we can certainly do that and while also removing the kernel string from the lustre packages. The kernel string does not belong in the lustre package name. That is incorrect packaging and needs to stop.

The proper way would be to add a Requires to only the osd-ldiskfs subpackage. Is that going to be required to land this patch? It will probably cause us some trouble, but I'm willing to compromise and try adding that.

Comment by Bob Glossman (Inactive) [ 06/Jul/16 ]

I think it would be an acceptable solution if only the osd-ldiskfs rpm has a Requires for the particular and specific kernel version it was built on. That would directly enforce and tie it to the upstream ext4 version source it was built from. This is only my opinion. I think we need buy in from all concerned. Would really like to see comment from Minh, Andreas, Dmitry, or other experts.

Not sure how that is properly carried thru with the Provides exported by osd-ldiskfs or the Requires in other lustre rpms.

Comment by Christopher Morrone [ 06/Jul/16 ]

Not sure how that is properly carried thru with the Provides exported by osd-ldiskfs or the Requires in other lustre rpms.

I'm not entirely sure what you mean, but I'll take a stab at explaining. If the various other parts are working now, they will continue to work. The "lustre-osd" requirement is currently supplied by either the osd-zfs or osd-ldiskfs packages. One or both of them must be installed to install in order to install the main "lustre" package. With the proposed new specific kernel requirement in the osd-ldiskfs package, if the osd-zfs package is selected, everthing will install fine with any kernel that supplies the required versions of various symbols. If osd-ldiskfs is selected, it can only be installed if the correct kernel is installed.

Of course, multiple kernels can be installed at the same time, so there is no reason that the admin needs to boot the required kernel. Only that it be installed. But that can already happen now with the packages that contain the kernel version string.

If you think that too is too much of a problem, you are basically arguing that weak modules can't ever be used with lustre.

Comment by Bob Glossman (Inactive) [ 06/Jul/16 ]

Your outline makes sense to me, but I'm a little worried about something unexpected creeping in if people start trying to mix and match pieces from different builds. Up until now with all lustre rpms tied to a specific kernel they all had to come from the same build. I'm not at all arguing that it was correct or even good that they were all tied that way, but it did have the beneficial side effect of blocking mix & match.

With osd-ldiskfs exporting only a generic "lustre-osd" Provides and other lustre packages having only that as Requires, there is now the possibility of taking an osd-ldiskfs from one build and installing it with lustre rpms from a different build. While I don't know that this would cause problems I'm very much worried that it might.

Comment by Christopher Morrone [ 06/Jul/16 ]

With osd-ldiskfs exporting only a generic "lustre-osd" Provides and other lustre packages having only that as Requires, there is now the possibility of taking an osd-ldiskfs from one build and installing it with lustre rpms from a different build.

If that issue exists it is independent of the topic in this ticket. You can open a new ticket for that if you like.

Comment by Bob Glossman (Inactive) [ 06/Jul/16 ]

It isn't independent. Without the changes proposed here it couldn't happen. With the changes proposed here it can happen. That makes it dependent on exactly this topic.

I may agree that it could be covered as a separate ticket from this one in spite of that. Let me think about it a bit.

Comment by James A Simmons [ 06/Jul/16 ]

I see the discuss is moving in the direction of ldiskfs <> osd_ldiskfs <> lustre having strong dependencies on each other which is needed. This is orthogonal to the dependency on the external linux kernel. What both lustre and ldiskfs depend on is that the kernel's VFS api staying stable. If that doesn't happen then weak module support is not possible. Am I missing something if this is not the discussion?

This brings up interesting point. Do we need version hand shaking between ldiskfs and lustre? Anyways this is indeed a separate ticket.

Comment by Bob Glossman (Inactive) [ 06/Jul/16 ]

James,
I think what you are saying is correct, but it isn't just the VFS api that must stay stable. It is all the internal kernel APIs that aren't part of the well defined ABI that must stay stable. lustre kernel modules both ldiskfs and not use lots of calls to symbols that are EXPORTs, but may or may not be part of the ABI and can and do occasionally change from time to time within the same major linux version from an upstream vendor.

Comment by Christopher Morrone [ 06/Jul/16 ]

Now you are definitely arguing that one should never use weak modules. I don't even see why it would be acceptable use them on clients if you can't trust the semantics to stay constant with the same symbol version. If you can't trust them, you can't trust them.

As to APIs that are not part of the ABI...in the kernel I don't believe that there is any such distinction. A linux distro vendor may choose to advertise a subset of the kernel ABI that it considers stable and safe. Red Hat has a kernel symbol whitelist. Suse, in contrast, does not.

Yes, the osd-ldiskfs package has a long list of RHEL whitelist violations. Otherwise, the usage is pretty small. In a recent random build for master on CentOS 7.2, I see only the following non-osd-ldiskfs related off-whitelist symbols (in other words, osd-ldiskfs uses many off-whitelist symbols that I am not listing here):

PDE_DATA
__fentry__
__free_pages
__stack_chk_fail
kernel_stack
kstrtoull
seq_lseek
seq_open
seq_read
remove_wait_queue

Some of those Red Hat might be amenable to adding to the whitelist. Some maybe we can choose a different symbol. Some we might not care and decide the level of risk is completely acceptable.

I don't see an issue with weak-modules use with zfs. Sure, maybe the way ldiskfs is currently produced makes it more vulnerable. If you want to add extra restrictions and high barriers to usage to ldiskfs then, speaking as an all-zfs house, I don't have too much of a concern about that. Some concern though...we do have some labs that might still be using ldiskfs from TOSS's lustre.

Anyhow, I think James is right about this getting off topic for this ticket.

Comment by James A Simmons [ 07/Jul/16 ]

So the question related to the patch for this ticket is does having the kernel name string in the rpm provide any gain. I would say no. Not landing this patch will not change the kernel version dependency issues. Even when we resolve the dependency issues does having the kernel version string in the release field improve anything. Again I would say no. You can figure out kernel dependency using rpm queries instead.

Comment by Bob Glossman (Inactive) [ 07/Jul/16 ]

That's just it. These changes also delete Requires and Provides in packages too. Can't figure out dependencies using rpm queries either.

Comment by Christopher Morrone [ 07/Jul/16 ]

Bob, I think you may be thinking of another change. Change 19954 does not delete any Requires or Provides.

Comment by Bob Glossman (Inactive) [ 11/Jul/16 ]

Christopher,
You are probably correct. I may have been thinking of some other change.

If you go ahead and add a kernel version Requires for osd-ldiskfs as discussed in earlier comments I will be satisfied with that.

Comment by Christopher Morrone [ 11/Jul/16 ]

Yeah, but as we got to later in the discussion, that isn't very useful. I think I have to retract that offer.

Just having that one kernel installed doesn't stop the osd-ldiskfs package's modules from being used in any of the other kernels that are also installed. The weak-updates system will still symlink the modules into all kernels with compatible symbols.

Comment by Bob Glossman (Inactive) [ 11/Jul/16 ]

even if having a proper Requires permits (incorrect) weak-updates links of osd-ldiskfs in other not really matching kernels that happen to be installed, having the Requires at least gives better tracking of exactly which kernel version the particular osd-ldiskfs was derived from and is intended for.

While I would really prefer enforcement I want to at least have visibility into the dependency.

Comment by Christopher Morrone [ 11/Jul/16 ]

I'm not really on board with adding a dependency that won't function correctly.

Comment by Gerrit Updater [ 20/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19954/
Subject: LU-7643 build: Remove Linux version string from RPM release field
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 28c17d40e5a597a3d2f10f1f43039ef92425954e

Comment by Joseph Gmitter (Inactive) [ 06/Dec/16 ]

reopening to fix fields to show up in 2.9.0 changelog

Comment by Cory Spitz [ 06/Jun/17 ]

FYI:
http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Improved-Versioning_Morrone_DiNatale.pdf
https://www.youtube.com/watch?v=50VN9Q7BHYA&index=5&list=PLA5dHg1_l3V8r6eqrUnl8DiB-KH1K5yLH

Generated at Sat Feb 10 02:10:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.