[LU-1199] lustre build system overhaul Created: 08/Mar/12  Updated: 11/Mar/15  Resolved: 16/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0

Type: Improvement Priority: Minor
Reporter: Christopher Morrone Assignee: Minh Diep
Resolution: Fixed Votes: 1
Labels: None

Attachments: File lustre.spec.in    
Issue Links:
Duplicate
is duplicated by LU-2290 build to support automake-1.12 Resolved
is duplicated by LU-2841 rpm: Clean up spec file Resolved
is duplicated by LU-3953 lustre build system improvments Closed
Related
Sub-Tasks:
Key
Summary
Type
Status
Assignee
LU-3408 client-only "make rpms" ldiskfs detec... Technical task Closed Minh Diep  
LU-3462 Eliminate ldiskfs recursive & indepen... Technical task Resolved Minh Diep  
LU-3463 Create mock-based build farm capability Technical task Closed Minh Diep  
LU-3464 Create lustre-client and lustre-serve... Technical task Closed Minh Diep  
LU-3465 Reevaluate method for installing rpms... Technical task Closed Minh Diep  
LU-3466 Make lustre-iokit a subpackage of lustre Technical task Resolved Minh Diep  
LU-3476 Eliminate bad macros in lustre spec file Technical task Closed Minh Diep  
LU-3673 lustre/lvfs/fsfilt-ldiskfs.c: No such... Technical task Closed Minh Diep  
LU-3724 Fix horrible file names in standard path Technical task Closed Minh Diep  
Rank (Obsolete): 5154

 Description   

I think that we need to take a step back and address some fundamental problems with the lustre spec file.

My number one complaint is that lustre source rpms are not buildable (using rpmbuild) in any sane fashion. This has been a constant source of work for folks here at Livermore, because our entire automated build infrastructure is built around the concept of taking source rpms as input and generating binary rpms.

Because the lustre srpms are not buildable, no one tries to do that (except for us). So we fix things for us, but then almost every time the spec file is touched, it breaks something for us again.

I think we need to get on the same page, and finally clean up the lustre spec file. I think the requirements here are:

1) lustre srpm must be rebuildable
2) spec file must take sane options, to allow configuration of lustre at srpm rebuild time
3) must make no attempt to parse options out of configure_args

I think the greatest area of pain will be dealing with "make rpms". The current way that the build system passes options from the initial configuration step down through make rpms and into the spec file is, at best, jerry-rigged. Passing a single giant "configure_args" define into rpmbuild and then parsing out options to set rpm variables is just backwards from the way that an rpm spec file SHOULD work.

A spec file should take options (rpmbuild --with/--without options when applicable), and then assemble a configure line based on THAT.

Another example of the spec file having it the wrong way around is "is_client". In order to perform a build that leaves out the server code and rpm dependencies, you set the package NAME! That's crazy, not least because now no one can choose to name the package differently when they want a client-only build.

A better approach is to have an rpmbuild "--without servers" option. That sets a variable in the spec file. The spec file uses that flag to conditionally include Requires directives and such, and the lustre package name is changed to "lustre-client" IFF the lustre_name variable has not already been set. This would allow me, for instance, to generate a set of binary rpms with a package name of "lustre-orion", and also make it a client-only build. I cannot accomplish that with the current spec file.
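
A minimal sketch of that approach (the %bcond names, the lustre_name override, the lustre-osd dependency, and the %{kdir} macro are illustrative assumptions, not a finished interface):

# Build servers by default; "rpmbuild --without servers" turns them off.
%bcond_without servers
# Build ZFS support only when "rpmbuild --with zfs" is given.
%bcond_with zfs

# Let the packager override the base name, e.g.
# "--define 'lustre_name lustre-orion'"; otherwise default it.
%if %{with servers}
%{!?lustre_name: %global lustre_name lustre}
%else
%{!?lustre_name: %global lustre_name lustre-client}
%endif

Name: %{lustre_name}

%if %{with servers}
# server-only dependencies go here (name illustrative)
Requires: lustre-osd
%endif

%build
./configure %{!?with_servers:--disable-server} \
            %{?with_zfs:--with-zfs} \
            --with-linux=%{kdir}
make %{?_smp_mflags}

With something like this, "rpmbuild --rebuild --without servers --define 'lustre_name lustre-orion' lustre-*.src.rpm" would produce a client-only build under a custom package name.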

So now the big question: Do we need "make rpms" at all? Where is it really needed? I suspect that Whamcloud's build farm uses it, but I think that you guys should really switch to generating binary rpms from a single source rpm once we get that working reliably.

If we do keep it, I think there are ways that we can make that work with a spec file and source rpm that are done the Right Way, but if we can just drop it altogether that would certainly be easier.



 Comments   
Comment by Christopher Morrone [ 08/Mar/12 ]

See the attached lustre.spec.in file. This is my first pass at cleaning it up, but it is by no means complete. I have not yet removed all instances of the spec file peeking into the configure_args string, for instance.

But take a look at how I added "--with zfs" and "--without server" options. That's the template for how I want to proceed.

Comment by Christopher Morrone [ 08/Mar/12 ]

Another problem with the spec file is that we've completely punted on the idea of correctly setting BuildRequires and Requires lines. I think that Brian Behlendorf's spl-modules.spec file demonstrates the way to make that work for something that builds against the kernel (which is packaged differently for every distro), and makes a source rpm that is buildable on any rpm-based distro. AND it still allows building against a kernel source tree in a non-standard location. He's really done an impressive packaging job with ZFS and SPL (and actually tests them all in a VM build farm), and I'd like to bring some of that to Lustre.

https://github.com/zfsonlinux/spl/blob/master/spl-modules.spec.in

Comment by Andreas Dilger [ 09/Mar/12 ]

Chris,
I agree that the "make rpms" support is shaky at best. The Lustre build system actually uses the "lbuild" script, but I'm personally not very fond of this and endeavour to keep "make rpms" working since this is what most users use.

If you have updated the lustre.spec file, that is great, though I'd prefer if we still kept a "make rpms" target that calls "rpmbuild {options} lustre.spec" (or similar as appropriate) for compatibility reasons since this is what is documented and what other sites use (AFAIK). It wouldn't be a terrible idea to pose this question on lustre-discuss and wc-discuss to get feedback on how other sites build Lustre.

We'll definitely need some feedback from Brian Murrell about the current Lustre build system to see how this change will affect our own build scripts. I don't know what our plans are for "lbuild", but it is definitely not trivial to maintain, and perhaps we can do some cleanup at this time.

Comment by James A Simmons [ 09/Mar/12 ]

I started some of that work in ticket LU-281. With the recent testing I have not gotten around to finishing it up. http://review.whamcloud.com/#change,508 is the initial patch. I initially started it so I could build lustre rpms that respect the --prefix option to configure, which they currently don't.

Comment by Peter Jones [ 09/Mar/12 ]

Chris

I agree that this does continue to bite us. This will be a major piece of development that will need to be carefully managed. Perhaps this is something to raise at the OpenSFS TWG to assess the level of interest in this initiative and funding it as a formal project.

Peter

Comment by Christopher Morrone [ 09/Mar/12 ]

Andreas,

I would like to see lbuild go away as well.

And ok, we'll need to keep the "make rpms" target. It will need rewriting to work with the new spec file options, but I agree that it is the right thing to do. It's very convenient for developers, and possibly for end users that get the source instead of a .src.rpm.

Brian Behlendorf again has some examples in zfs of how to create more sane srpm and rpm make targets. Here's a pointer to the m4 for rpm creation in zfs:

https://github.com/zfsonlinux/zfs/blob/master/config/rpm.am
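
The gist of that approach, heavily simplified (this is a sketch of the pattern, not the actual zfs code; PACKAGE, VERSION and RPMBUILD_OPTS are placeholders):

# Makefile.am fragment: build an srpm from the dist tarball, then
# rebuild the binary rpms from that srpm, so the source rpm gets
# exercised on every "make rpms" instead of bitrotting.
srpm: dist
	rpmbuild -bs \
		--define "_sourcedir $(CURDIR)" \
		--define "_srcrpmdir $(CURDIR)" \
		$(PACKAGE).spec

rpms: srpm
	rpmbuild --rebuild \
		--define "_rpmdir $(CURDIR)" \
		$(RPMBUILD_OPTS) \
		$(PACKAGE)-$(VERSION)-*.src.rpm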

Comment by Brian Murrell (Inactive) [ 12/Mar/12 ]

I would like to see lbuild go away as well.

lbuild can't really "go away". lbuild does much more than simply build Lustre RPMs (which is all that make rpms does). lbuild also builds the kernel, OFED and any other ancillary products/drivers required.

That said, I fully support the direction this ticket appears to be headed in. In fact James A Simmons and I have already been discussing similar issues in LU-281, where I had already indicated that writing configure options directly into lustre.spec by way of substitution really needed to be improved upon using rpm macro definitions (i.e. --with/--without rpmbuild options).

But back to lbuild... I have agreed for a long time in principle with Andreas that it would be nice if lbuild would just use make rpms to do the lustre building part as it would force us to eat our own (make rpms) dogfood and ensure that that code does not bitrot. And yes, some less-than-elegant hacks have been made to the lustre.spec.in and make rpms to try to head in that direction but I agree that it requires a more comprehensively designed solution than time for this task in the past has ever afforded.

It seems to me that there needs to be some kind of two way mapping, or an algorithmic connection between (the lustre-specific) configure options and the rpm --with/--without options since we will typically need to convert in both directions. We need to be able to create an rpmbuild list of --with/--without options that reflects the user's configure options (i.e. for configure ... && make rpms) and we also need to convert from a list of rpmbuild (i.e. --with/--without) options back into configure options inside the lustre.spec.in.

I suppose each .m4 macro that implements a configure option needing to be turned into an rpmbuild option could build up a string of options to give to rpmbuild, but some kind of table that exists in one place only seems easier to maintain than having option-list building scattered throughout the autoconf macro files. A wrapper macro around such macros (i.e. the ones that present configure options) that includes the rpmbuild option it creates could also work, I guess.

Comment by Christopher Morrone [ 12/Mar/12 ]

Maybe lbuild lives on in some other form, but in my opinion it can and should be removed from the lustre tree.

With lustre finally becoming patchless on the server side, having scripts in the lustre tree to build a kernel becomes obviously inappropriate.

Having scripts in lustre to build ofed seems similarly inappropriate. I would need an explanation of why one would want to do that. But even if that is useful, it does not belong in the lustre source tree.

Comment by Christopher Morrone [ 12/Mar/12 ]

I don't think the spec file and configure script interactions are quite as complex as you are implying. Or at least not the solution.

The spec file needs to clearly document which options it takes in comments at the top. The "make rpms" target needs to convert configure options into spec file options where appropriate.

These operations are both one-way. A lookup table isn't really needed (and is probably too much additional complexity).

I don't think a new two-way translation tool will be reasonable to write or maintain. If you think that will work out well, I think you need to point to someone else who has done it that way successfully in the past.

I think you'll get yourself into trouble when you start thinking about it as a complex two-way interaction. It really is just two one-way interfaces.

The spec file needs to know how to build a configure line based on its input.
The configure script needs to know how to construct rpmbuild options based on its input. Those are both fairly straightforward tasks. Let's not overcomplicate it.
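
A rough sketch of those two one-way interfaces (the option names and the RPMBUILD_OPTS variable are hypothetical, just to show the shape of it):

# configure.ac: record the rpmbuild flags that correspond to the
# options the user gave to configure.
AC_ARG_ENABLE([server],
	AS_HELP_STRING([--disable-server], [build client-only packages]))
AS_IF([test "x$enable_server" = xno],
	[RPMBUILD_OPTS="$RPMBUILD_OPTS --without servers"])

AC_ARG_WITH([zfs],
	AS_HELP_STRING([--with-zfs], [enable the ZFS backend]))
AS_IF([test "x$with_zfs" = xyes],
	[RPMBUILD_OPTS="$RPMBUILD_OPTS --with zfs"])

AC_SUBST([RPMBUILD_OPTS])

# Makefile.am: "make rpms" just hands those flags to rpmbuild; the
# spec file independently turns them back into a configure line.
rpms: srpm
	rpmbuild --rebuild $(RPMBUILD_OPTS) $(distdir)-*.src.rpm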

Comment by Brian Murrell (Inactive) [ 12/Mar/12 ]

I don't think the spec file and configure script interactions are quite as complex as you are implying. Or at least not the solution.

Right, not necessarily complex for somebody who knows how to write configure macros and rpmbuild macros, etc.

What I am trying to avoid is having 3 different places where one has to make changes every time a new configure option is added, while also increasing the self-maintainability of the code.

But of course, I am not signing up to do all of the work so it's easy for me to armchair quarterback.

The spec file needs to clearly document which options it takes in comments at the top. The "make rpms" target needs to convert configure options into spec file options where appropriate.

Right, and these are the sorts of changes that non-build developers classically forget [how] to do.

I think you'll get yourself into trouble when you start thinking about it as a complex two-way interaction. It really is just two one-way interfaces.

Those I suspect will wind up being 1:1 with each other. I could be wrong, but that's just my gut feeling. It might be as simple as a flat text file with two lists, one of the configure options and one of the rpmbuild options, and grep would (inefficiently) be your mapping tool.

But since you are implementing it, I will of course defer to your implementation.

The spec file needs to know how to build a configure line based on its input.

Right. And I suspect that the above list could make the rpm macro argument processing and argument list building automatic. But again, that's just a gut feeling.

The configure script needs to know how to construct rpmbuild options based on its input.

Right, again, something I think could be written once to use the above list.

Those are both fairly straightforward tasks. Let's not overcomplicate it.

Again, since you are implementing it, I will defer. I was just throwing out ideas to minimize future confusion, code review iterations and implementation details.

Comment by Brian Murrell (Inactive) [ 12/Mar/12 ]

Maybe lbuild lives on in some other form, but in my opinion it can and should be removed from the lustre tree.

Why remove it? If people want to see how we build the binary packages, it's all there for them to see. No secret "build system" sauce. On the other hand, leaving it there does nothing in terms of usability of the rest of the tree. If people want to roll their own build scripts, having lbuild there does nothing to prevent that.

With lustre finally becoming patchless on the server side,

Is that "there" yet? I'm not quite in the development loop of lustre lately but I didn't think that was a reality yet.

Having scripts in lustre to build ofed seems similarly inappropriate.

Why? If nothing else it serves to document the process of building OFED and then building lustre with it. Again, what harm is leaving it in the tree doing?

I would need an explanation of why one would want to do that.

Do what? Build OFED and Lustre with OFED? People frequently want to use OFED rather than the kernel supplied I/B stack, typically because they have hardware that is newer than the support their vendor's kernel is providing them. We, therefore, need to be able to test our builds with OFED to ensure we have not broken something.

But even if that is useful, it does not belong in the lustre source tree.

Again, I humbly disagree since it provides no harm to have it there.

Comment by Christopher Morrone [ 13/Mar/12 ]

Why remove it? If people want to see how we build the binary packages, it's all there for them to see. No secret "build system" sauce. On the other hand, leaving it there does nothing in terms of usability of the rest of the tree. If people want to roll their own build scripts, having lbuild there does nothing to prevent that.

lbuild is not documentation. Documentation is documentation. And let me tell you, lbuild is already pretty inscrutable secret sauce, public or not. The harm is the additional development burden that it places on the community.

So let me turn that question around to you: Why should lbuild be in-tree?

It is a script that is pretty specific to WC's build farm. Make the argument for why the community should support it.

Everyone's build farm needs a little scripting glue. But we can't possibly predict what that will be, and we really shouldn't try. What we should have is a build system with clearly defined and well documented options. That is something that we can all agree on, and is beneficial to all.

Comment by James A Simmons [ 11/Apr/12 ]

Posted my first run at this patch.

http://review.whamcloud.com/#change,508

It's a first run, so it's far from complete, but with feedback it can be made to work for everyone.

Comment by Christopher Morrone [ 14/May/12 ]

This is on hold at the moment. The Livermore team is swamped with 2.1 bugs and Orion branch work.

Comment by Christopher Morrone [ 26/Jun/12 ]

I took a look at my spec file work again. Here's what I have so far:

  • Mostly works on RHEL6
  • "make rpms" works. It has the "srpm" target as a prerequisite, and therefore help ensure that we are always testing that the lustre source rpm is rebuildable (which is great, because it never was in the past).

But I've hit a bit of a roadblock in the clean up. Namely: ldiskfs.

Every time LLNL tries to clean up the build system we are thwarted by ldiskfs. To go any further, I really need to break the two-way interconnections between the ldiskfs and lustre build systems. build/autoMakefile.am.toplevel, for instance, is an abomination.

Comment by Christopher Morrone [ 16/Jul/12 ]

I've pushed a couple of patches to show my work-in-progress:

http://review.whamcloud.com/3420
http://review.whamcloud.com/3421

The former is just an obvious cleanup that can land now. The 3421 patch is the meat of the changes, but it is not finished. It works for LLNL's needs, but has some obvious work still needed:

1) It breaks the recursive builds. So, for instance, no ldiskfs at the moment.
2) Only RHEL support (possibly only RHEL6) at the moment.

Comment by Christopher Morrone [ 12/Oct/12 ]

Some more thoughts on where this needs to go: recursive rpm builds need to go away. There is no sane way to make them work for most people. The rpm spec format is designed to allow multiple packages to be built, but all from ONE spec file.

So I would suggest that we:

1) remove ldiskfs from the lustre tree. Make it a stand-alone project with its own build system and spec file (LLNL already does this, so it won't be too technically difficult)

2) Either make libsysio and lustre-iokit separate repos, or just merge them more cleanly into lustre's build system

Comment by Christopher Morrone [ 09/Nov/12 ]

The following change is ready for review:

http://review.whamcloud.com/4409

This makes ldiskfs's build system almost entirely independent of lustre's build system. With this patch, ldiskfs will build even when copied out of the lustre tree.

Comment by Christopher Morrone [ 26/Nov/12 ]

Following Andreas's recommendation, I created a few patches before 4409 that attempt to improve formatting and remove obsolete hunks in some of the build system files. The new patches are:

4677, 4678, 4679, 4680

Comment by Christopher Morrone [ 05/Dec/12 ]

Patch 4409 is stuck in autotest/maloo purgatory, and could use someone holding down the "rerun" button. Or we could just land it without waiting for the irrelevant autotest failures.

I have a new patch 4728 that builds on the previous patch. This one is a bit more radical, stripping out much of the unnecessary complexity that was left over from the Lustre build system. It also moves ldiskfs's build system much closer to what LLNL does in our external ldiskfs repo. There is even more that can be cleaned up, but I'll hold some of that for a future patch.

Comment by Christopher Morrone [ 17/Dec/12 ]

4409 is in the next level of purgatory: gerrit final landing.

Comment by Christopher Morrone [ 28/Dec/12 ]

Note that I have a patch in the works to overhaul the ldiskfs spec file. Among other things, it adds an ldiskfs-devel package, which Brian Murrell requested in another patch. We've been making a -devel package in LLNL's external ldiskfs package for years. This latest patch basically pulls more things from LLNL's package into lustre's ldiskfs tree. Mostly I just need to rework lustre itself to use the new packaging in a sane way before I can push it up.

Comment by Christopher Morrone [ 07/Jan/13 ]

4409 landed, yay! There was much rejoicing!

Still waiting for 4728.

Comment by James A Simmons [ 08/Jan/13 ]

Chris can you post links to all your pending patches here.

Comment by Peter Jones [ 08/Jan/13 ]

James

You can get a dynamic list of the patches relating to this ticket (and one false positive) using http://review.whamcloud.com/#q,status:open+message:1199,n,z

Peter

Comment by Christopher Morrone [ 08/Jan/13 ]

I've already been noting the ones that are ready for general consumption as I go along. Like Peter said, the gerrit search will normally find things if the commit message is correct (and I try to keep them so). I would search on "message:LU-1199".

4728 landed yesterday.

I may not have mentioned 4693, but it is more minor. Although there was another patch somewhere that was marked down in preference to this one.

3421 is marked "fortestonly", and hasn't been updated in some time. We use it at LLNL, so it is kind of a preview of some things I plan to do. But there is much to do between where master stands and getting 3421 landed. Pretty soon I am going to butt up against the upstream build system, and things will get significantly more difficult. But I'm working to avoid rewriting the lbuild scripts until really necessary.

Right now I am working on a script to make the ldiskfs source rpm buildable, and to make the "make rpm" target use the srpm as the source. Like I said in this comment, I mainly need to rework Lustre's part of things to work with the new ldiskfs spec file and build method. But that's easier said than done. Every time I open a build file in lustre I see ten things that need fixing.

Comment by Christopher Morrone [ 08/Jan/13 ]

Added a simple, self-contained cleanup patch 4975.

Comment by Andreas Dilger [ 10/Jan/13 ]

Chris, can you please look at the Ubuntu build failure introduced by the http://review.whamcloud.com/4409 patch. We don't build Ubuntu for each review, only for the daily full builds.

http://build.whamcloud.com/job/lustre-master/arch=x86_64,build_type=client,distro=ubuntu1004,ib_stack=inkernel/1154/changes

+ mkdir BUILD
+ cd BUILD
+ rc=0
+ [[ ubuntu1004 = ubuntu* ]]
+ ln ../lustre-2.3.58.tar.gz lustre_2.3.58.orig.tar.gz
+ tar zxf ../lustre-2.3.58.tar.gz
+ for dir in ldiskfs libsysio build . lustre-iokit
+ cp ../ldiskfs/autogen.sh lustre-2.3.58/ldiskfs
+ for dir in ldiskfs libsysio build . lustre-iokit
+ cp ../libsysio/autogen.sh lustre-2.3.58/libsysio
+ for dir in ldiskfs libsysio build . lustre-iokit
+ cp ../build/autogen.sh lustre-2.3.58/build
cp: cannot stat `../build/autogen.sh': No such file or directory

Comment by Christopher Morrone [ 10/Jan/13 ]

Ah, I didn't realize the Ubuntu builder was only daily. I'll take a look.

Comment by Christopher Morrone [ 10/Jan/13 ]

Andreas: This one is on you guys. Your build system is doing bad things. I can't even fathom why it is doing what you see above.

At some time in the past, you guys must have had some patch file that you maintained outside of the lustre tree. Something I only know as "../../debian.diff", and I've never seen its contents. Does that thing even still exist? If so, can you tell me what it is? In any event this has "Bad Idea" written all over it.

I did, in fact, modify your "jenkins script 2" for you guys to fix this problem already. But for some reason my version of the script is no longer running in your build farm. Maybe Gearing only applied part of my changes? Or the script wasn't put in a permanent location and got overwritten with an old version when you moved or restored from backups or something?

So someone on your side needs to fix this.

But this is a bit of foreshadowing for the problems to come. Problems in your builders are really slowing progress on this ticket. I can see the way forward for Lustre's build system, but leaving ldiskfs in tree, AND doing it in a way that your buildfarm will be happy with, probably adds a couple hundred lines of difficult-to-understand recursive autotools magic to the lustre build system.

I'm either going to need a buildfarm person working closely with me in a month or so when I am further along, or I'm going to need hands-on access to the builder's internals.

Comment by Andreas Dilger [ 10/Jan/13 ]

Chris (M) I've added Chris (G) to the ticket, since he is the guy in charge of the build systems, though he only inherited the scripts. I personally don't know much about it at all.

I think everyone would be happy if we had less complex build scripts. If you are willing to do that I think everyone would be grateful (hopefully not fearful, as my autocorrect wanted to write). I'm not sure of the details of the scripts, but there may be some complexity due to having the build system continue to be able to build the older releases of Lustre (1.8 and 2.1 at least). I don't know how easy or hard that will be.

It probably will be easier to leave the old building code alone and only change the scripts for the master branch.

Comment by Christopher Morrone [ 16/Jan/13 ]

Since your internal build system is still largely opaque to me, it is difficult to judge just how difficult it will be to fix. But I would definitely like to get access to its internals so that I can be better informed.

One would naively think that using one script to build the b1_8 and b2_1 branches, and another to build master+ would not be a terribly difficult thing to do.

Comment by Christopher Morrone [ 16/Jan/13 ]

Could someone please explain what "../build/lbuild.old_school" is? How do I work around it?

I ran afoul of it in patch 5035. I pretty much expected my first revision to fail (hence "fortestonly"), but I'm not sure I could have predicted this particular failure.

We're seriously violating the Principle of Least Astonishment in this build system.

Comment by Christopher Morrone [ 17/Jan/13 ]

Sorry, my bad! I must have flubbed the grep or something, because lbuild.old_school is in the lbuild script in the tree.

Comment by Christopher Morrone [ 17/Jan/13 ]

Status update.

These two patches are pretty simple, and ready for landing:

http://review.whamcloud.com/4975
http://review.whamcloud.com/5004

This one just has Andreas's approval and needs a second (this change already happened to ldiskfs's autogen.sh as part of an earlier patch, so I rebased to bring forward the lustre autogen.sh change):

http://review.whamcloud.com/4693

This one is new, just starting the review process (removal of HAVE_{EXT4,JBD2}_JOURNAL_CALLBACK_SET, which was at Andreas's suggestion):

http://review.whamcloud.com/#change,5055

I made significant updates to the following patch. We'll see if my new scheme survives all of the builders, and then it will need to go through full reviews (new scheme so lustre doesn't have to guess ldiskfs's compilation options):

http://review.whamcloud.com/5005

I got tired of looking for the needle in the haystack in the build/ directory, so I decided to clean it out. This new patch may be more controversial (and is still "fortestonly" until I get it to build reliably, so it is just a preview at the moment):

http://review.whamcloud.com/5035

The last patch in gerrit is a preview of where I am going with the lustre spec file. It doesn't work in the Intel build farm, and there are probably several patches more needed between the above patches landing and this next one landing.

http://review.whamcloud.com/3421

Not seen in gerrit is a work-in-progress that overhauls the ldiskfs spec file. But lustre's build system needs a fair number of changes to work with the new scheme, so I am still working on that. It basically works for me at the command line, and will probably work with LLNL's mock-based build farm shortly. Working with Intel's build farm will take a little more work.

Comment by Christopher Morrone [ 18/Jan/13 ]

Change 5035 made it through the build system, so I have dropped the "fortestonly" designation and added review requests.

Comment by Christopher Morrone [ 11/Feb/13 ]

Status update.

There have been more landings. Most of the above listed patches have landed, with only these two outstanding:

  • 5005 - Cleanup ldiskfs kernel config defines
  • 5035 - Clean out the build directory

Both were at least partially reviewed, but both had to be rebased and are back at 0.

I have pushed one of my work-in-progress branches to github. I figured it would be good for folks to see what I am working on, even though Intel's build farm isn't able to build it (and therefore it is pointless to push to gerrit). See:

Note that this branch will be rebased and reworked quite a bit. Two or three of the smaller patches will be pushed to gerrit this week.

Comment by Christopher Morrone [ 20/Feb/13 ]

brian, can you give me an overview of how you use lbuild? Perhaps an introduction to how you test it on your own machine, so I can emulate that?
Maybe I can get started on reworking parts in a VM until I get access to the Intel build system to figure out what I need to deal with.

By the way, I assume that the CVSROOT bits in lbuild are ancient history, and can be removed?

Comment by Brian Murrell (Inactive) [ 21/Feb/13 ]

Chris,

I don't really do much (any) lbuild hacking any more but when I did, I used to simply do what Jenkins did. I even had a copy of the Jenkins build script which I slightly modified to run outside of the Jenkins environment. It is unfortunate that there is a script in Jenkins that's not in our source tree. It was always a goal of mine to get that into the source tree so that Jenkins had a one-line call to it but I transitioned out of build maintenance before that could happen.

Before one can effectively use lbuild though one probably needs to do a bit of prep of one's local work environment like creating a "cache" dir, etc. Before I transitioned out though I did quite a bit of work to make lbuild use a reasonable set of defaults so that one didn't have to use much more than "./lbuild", assuming one's environment was set up correctly. I agree though it would be nice to get this environment documented so that others can set one up easily.

Comment by Christopher Morrone [ 21/Feb/13 ]

brian, did you have to always give lbuild the "--release" option? As far as I can tell, lbuild won't run without one of --tag or --release specified, and at first glance it looks like --tag is completely irrelevant since we stopped using cvs.

Oh, wait. I think I see. jenkins is hard coded to always do "--tag foo". And then as long as you also specify --linux, the fact that --tag is bogus is basically missed by lbuild (because it was only really employed for CVS). But it satisfies the requirement of using --release or --tag.

Comment by Brian Murrell (Inactive) [ 22/Feb/13 ]

morrone: you shouldn't have to use --release. --tag or --release are only required if you don't specify --lustre. Current use of lbuild by Jenkins is to create a tarball and give it to lbuild with --lustre, IIRC. That process helps ensure that "make dist" is working. The other option of course is to build right out of the tree, but given the current Jenkins flow, I certainly would not be surprised to see some bitrot in that mode.

Comment by Christopher Morrone [ 22/Feb/13 ]

Shouldn't have to, but am required. And actually, I had that wrong about --linux. If you do not specify --linux, you must specify either --release or --tag, which explains why the jenkins script does "--tag foo" (literally "foo", I am not using that as a place-holder).

I've fixed that in my branch.

Comment by Christopher Morrone [ 12/Mar/13 ]

Does anyone know why the lustre-source-*.rpm exists? Will anyone miss it if I get rid of it? After all, we have a lustre-*.src.rpm so lustre-source seems strange and redundant.

Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ]

I know certain distributions used to provide a kernel-source*.rpm (it would make a nice fully patched kernel source tree in /usr/src/linux-); I think now they also just do a kernel-.src.rpm. I would guess this is a legacy issue.

Comment by Christopher Morrone [ 29/Mar/13 ]

I'm resigned to the fact that the spec file overhaul work won't land for 2.4. I'll push hard on it again for 2.5.

In the mean time, I'll continue to work some smaller changes into 2.4.

One thing that might be nice is if we could rename the lustre-ldiskfs package to simply ldiskfs in the 2.4 time frame. A bit of change, but not too much? What do folks think?

Comment by Andreas Dilger [ 29/Mar/13 ]

Chris, could you please explain what the benefit of renaming to ldiskfs would gain, besides confusion for the users? At one point we considered making ldiskfs available as a general purpose filesystem, but since then ext4 has most of the improvements, and effort would better be spent on merging ldiskfs changes into ext4 upstream rather than making ldiskfs ready for general usage.

Comment by Christopher Morrone [ 29/Mar/13 ]

It is mainly part of my quest to make ldiskfs a stand-alone package so we can simplify Lustre's build system. Because ldiskfs is really a prerequisite of lustre. LLNL has been doing this for years, and we've long named that external version of ldiskfs simply "ldiskfs".

My thinking is that having the ldiskfs rpm share the "lustre-" prefix seems to imply that it is built by the lustre rpm spec file, since all of the packages from the lustre spec use that prefix. Generally the way that a spec file works is that there is a base name declared in the name field, something like this:

Name: lustre

Then individual packages are declared with the package macro like so:

%package modules

or

%package osd-ldiskfs

rpm combines the Name field and the package field with a dash to make the beginning of the package names. That would be lustre-modules and lustre-osd-ldiskfs respectively for the examples above.

So a package name of "lustre-ldiskfs" makes it look like ldiskfs is part of the lustre rpm spec. But it is not. ldiskfs is packaged independently with its own spec file.

Granted, much of this can be overridden one way or another, so you can do just about anything. But I would suggest that following common conventions will make Lustre's packaging easier to maintain and understand. It is my opinion that just naming the ldiskfs packages with the rpm Name "ldiskfs" would be more straightforward and conventional than what we currently have.

If we are worried about folks using ldiskfs for purposes other than lustre, we can have a sternly worded warning in the package "description" section, and even make it clear in the rpm Summary with a line like "Backend File System for Lustre".

I think that would be pretty unambiguous, and doesn't leave any implications that the ldiskfs packages are built from the lustre spec file.
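
For illustration only (not taken from any existing spec file): with a base name of "ldiskfs" the sub-package names fall out of the Name + %package convention automatically, whereas keeping the current base name means overriding the convention with "%package -n" for every sub-package.

Name: ldiskfs

# Conventional sub-packages: ldiskfs-devel, ldiskfs-debuginfo, ...
%package devel
Summary: Development files for the ldiskfs backend filesystem for Lustre

# versus spelling out the full name per package:
#   %package -n lustre-ldiskfs-devel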

Comment by Christopher Morrone [ 04/Apr/13 ]

Ok, I'm getting pushback on the lustre-ldiskfs to ldiskfs rename. I don't completely understand the resistance to improving things. You need to break an egg to make an omelet. (Insert other cliche here).

Help me understand the scope of the resistance.

I have made it clear that I think that ldiskfs needs to move into its own source tree, in its own git repo. Am I going to get resistance on that as well? If not, do you envision that repository name being lustre-ldiskfs?

Why is the osd package named lustre-osd-ldiskfs instead of lustre-osd-lustre-ldiskfs? After all, that would be more consistent with the lustre-osd-zfs naming, and make it lustre-osd-{backend package name}.

And then there is ldiskfsprogs. Is that incorrectly named? Should that be lustre-ldiskfsprogs?

I know you guys don't use ldiskfsprogs upstream, but we think you should. After all, you are making it quite clear that ldiskfs should never be used without lustre. Why then should we taint the system-wide ext tools with lustre-only changes?

If you want to get adoption of lustre from Linux distros, which is becoming a very real possibility as we get close to patchless kernels, you need to do things in a way they will be comfortable with. After patchless kernels, the next thing preventing more uptake of Lustre is our completely horrible packaging system.

Since LLNL makes our own Linux distribution, we have some expertise in this area. And Brian has further packaged zfs for every distro under the sun, and that has almost all of the same packaging issues that lustre has.

Here's a case in point: Some gentoo guys have started packaging Lustre. But the ldiskfs stuff is so horrible that the gentoo packages will be zfs-only.

That is the current state of things.

So if you guys are going to dig in your heels against the ldiskfs rename...well, fine. That is not the end of the world. As long as this isn't foreshadowing of a larger resistance to continued reworking of the ldiskfs packaging, I can live with the current poorly chosen package name.

Comment by Andreas Dilger [ 05/Apr/13 ]

I don't completely understand the resistance to improving things. You need to break an egg to make an omelet.
Help me understand the scope of the resistance.

I'm all for cleaning up the build system, but in this case I don't see what functional benefit renaming this package provides that outweighs the inconvenience to users.

I have made it clear that I think that ldiskfs needs to move into its own source tree, in its own git repo. Am I going to get resistance on that as well? If not, do you envision that repository name being lustre-ldiskfs?

Probably not, and yes, respectively. You've been running with a separate ldiskfs repo for some time, so you know better whether it is practical from a code maintenance POV to coordinate changes between lustre and ldiskfs. Are they always modified and installed in lockstep? If yes, then is there a benefit to having them in separate repos?

As for naming, we also have lustre-iokit, and other related packages, even if they are not built by the same .spec file. This is because these packages are largely only useful as part of lustre, and it (IMHO) makes life easier for the admin to know what a package is used for.

Why is the osd package named lustre-osd-ldiskfs instead of lustre-osd-lustre-ldiskfs? After all, that would be more consistent with the lustre-osd-zfs naming, and make it lustre-osd-{backend package name}.

The name is "lustre-osd-

{filesystem name}

", which is consistent. If we needed to modify ZFS for Lustre, we might indeed have a lustre-zfs or lustre-lzfs package.

And then there is ldiskfsprogs. Is that incorrectly named? Should that be lustre-ldiskfsprogs?

IMHO, I think you are making too big a deal of the RPM package names. The naming convention (or occasional lack thereof) is somewhat historical, but the "lustre- prefix means it was built from the lustre.spec file" convention probably only matters to a very small number of people (i.e. people working with the packaging) and not at all for end users. Conversely, (IMHO) the "lustre- prefix means this package is used for Lustre" convention is useful for every user.

I know you guys don't use ldiskfsprogs upstream, but we think you should. After all, you are making it quite clear that ldiskfs should never be used without lustre. Why then should we taint the system-wide ext tools with lustre-only changes?

I'm not wholly against moving over to a separate ldiskfsprogs, but it would also need some way to transition. That said, IMHO some of the changes in the Lustre e2fsprogs are useful for regular ext4 users (e.g. ibadness or your lost+found changes), but I've had difficulty getting them included upstream for various reasons.

Here's a case in point: Some gentoo guys have started packaging Lustre. But the ldiskfs stuff is so horrible that the gentoo packages will be zfs-only.

I think this is entirely to do with the fact that ldiskfs needs many kernel-specific patches, while ZFS does not. I am not aware that this has anything to do with the packaging (which would likely be completely different for Gentoo anyway?)

So if you guys are going to dig in your heels against the ldiskfs rename...well, fine. That is not the end of the world. As long as this isn't foreshadowing of a larger resistance to continued reworking of the ldiskfs packaging, I can live with the current poorly chosen package name.

I've stated repeatedly that I'm in favour of packaging cleanup, and am very appreciative of your efforts in this area. I personally don't see much real benefit of this rename, and didn't realize it was such a big deal for you. I wanted to make sure I'm not reading things incorrectly from the end-user perspective here, which is why I also asked John and James to comment on it.

While I definitely have an opinion on this and other topics, they are very rarely set in stone, and I'm always open to reasonable arguments.

It would be possible to rename the ldiskfs package and have a "Provides: lustre-ldiskfs" and/or "Obsoletes: lustre-ldiskfs" for upgrades, I just don't see the end-user benefit in this case.

Comment by Christopher Morrone [ 05/Apr/13 ]

You've been running with a separate ldiskfs repo for some time, so you know better whether it is practical from a code maintenance POV to coordinate changes between lustre and ldiskfs. Are they always modified and installed in lockstep? If yes, then is there a benefit to having them in separate repos?

Yes, it is practical to coordinate changes between lustre and ldiskfs. With the changes that James and I worked on to have ldiskfs explicitly list its features in an exported header file, lustre can now easily check to see if the ldiskfs that it is compiling against has the required features.
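
To sketch what that looks like (the header and macro names here are made up for illustration, not the exact ones from the patches): ldiskfs installs a header that advertises its features, and lustre's configure checks for them directly.

# Installed by the ldiskfs package, e.g. <ldiskfs/ldiskfs_config.h>:
#   #define LDISKFS_HAS_JOURNAL_CALLBACK 1

# In lustre's configure.ac:
AC_CHECK_DECL([LDISKFS_HAS_JOURNAL_CALLBACK],
	[AC_DEFINE([HAVE_LDISKFS_JOURNAL_CALLBACK], [1],
		[ldiskfs provides journal commit callbacks])],
	[AC_MSG_ERROR([installed ldiskfs lacks a feature lustre requires])],
	[[#include <ldiskfs/ldiskfs_config.h>]])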

Our experience is that the problems with maintaining an external ldiskfs stemmed almost entirely from having to maintain a solution that upstream makes no effort to support. Those problems go away when we're able to use the upstream solution.

No, lustre and ldiskfs are not always modified and installed in lockstep. There are a great deal more changes made to lustre than there are to ldiskfs. We probably build lustre dozens of times for each build of ldiskfs.

Off the top of my head, some reasons to separate them are:

  • Current recursive spec file usage is not compatible with standard mock-based rpm build systems (see the mock example after this list).
  • Trying to build lustre's prerequisites (e.g. ldiskfs) in the middle of the lustre build process instead of before building lustre is both conceptually confusing, and more importantly makes the build system several times more complicated than it needs to be.
  • ldiskfs is not properly versioned. We've been stuck at the same version since 2010! Granted, this is just developer sloppiness, and isn't something that is forced by leaving ldiskfs in the lustre tree. But I believe that mistakes like that would be more difficult to miss if ldiskfs were in its own stand-alone tree.
  • We've had frequent violations of layering of the lustre build system into ldiskfs and vice versa. Leaving ldiskfs in-tree enables that kind of bad practice. Granted, I've got most of that cleaned up, with help from James most recently with the creation of lustre_config.h. But a decade of evidence says that we're not able to keep things clean with ldiskfs in tree.
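
To make the mock point concrete (the chroot config and srpm name below are just examples): a mock-based build farm takes a single source rpm and rebuilds it in a clean chroot, so anything the build needs (ldiskfs-devel, kernel headers, etc.) has to be expressed as BuildRequires rather than built recursively by the spec itself.

mock -r epel-6-x86_64 --rebuild lustre-2.4.0-1.src.rpm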

As for naming, we also have lustre-iokit, and other related packages, even if they are not built by the same .spec file. This is because these packages are largely only useful as part of lustre, and it (IMHO) makes life easier for the admin to know what a package is used for.

Yes, lustre-iokit is on my mental list of things to fix. Unlike ldiskfs, lustre-iokit is not a lustre prerequisite, so it is less of a complication. Its configure system and spec file are pretty simple, so I was thinking that it should just have its separate build system gutted and be merged into the lustre build system properly. I.e., no independent autogen.sh, configure, or spec file.

Comment by Christopher Morrone [ 12/Jun/13 ]

I have broken some of the remaining work out into subtasks.

I very much want to see LU-3462 completed for Lustre 2.5.0. I also think it is reasonable to get LU-3464 and LU-3466 done the same time frame.

I would also like to see LU-3463 begun as soon as possible, but I am much less optimistic that it can be completed in time for 2.5.0. Perhaps it can stand next to the current buildfarm and be in an experimental usage mode at that time.

Comment by Christopher Morrone [ 13/Jun/13 ]

While some early patches landed for 2.4.0, the main thrust of this ticket was not addressed for Lustre 2.4.0. I'm not sure that adding a Fix Version of Lustre 2.4.0 is appropriate.

Comment by Andreas Dilger [ 13/Jun/13 ]

We track all of the bugs that have patches landed in each release so that it is possible to query in Jira. If you want to split this bug into smaller sub-tasks it would be easier to track what parts landed in each release.

Comment by Christopher Morrone [ 14/Jun/13 ]

Yes, I had already started that split, since this ticket has kind of evolved into a build system catch-all, and I don't want to lose sight of my main goal.

Comment by Andreas Dilger [ 05/Sep/13 ]

Minh, could you please start to refresh the lustre.spec.in cleanup patches starting with http://review.whamcloud.com/5486.

Comment by Jodi Levi (Inactive) [ 16/Sep/13 ]

Remaining subtasks and patches for this ticket will be tracked under LU-3953 for 2.6. Any additional patches for this work should be tracked under the new ticket.

Comment by Christopher Morrone [ 24/Sep/13 ]

I'm not a fan of the whole ticket cloning thing that happened here. LU-1199 is well known as the "random little build system cleanup" ticket number for minor patches. I assume this was done because LU-1199 has Fix Version/s listed for 2.5.0, and you want to close as many 2.5.0 tickets as possible.

I would recommend instead that we remove all Fix Version/s from LU-1199, and just let the sub-tasks be more targeted to specific versions and closed as they are complete.

We also have the problem that in-flight patches now have the wrong jira ticket numbers. For instance, Minh's "Make lustre-iokit a subpackage of lustre" patch. I'd rather not blow away all of the reviews and testing on that patch just to correct the ticket number...but we really should make the ticket number in the patch match the ticket that it goes with.

Comment by Jodi Levi (Inactive) [ 24/Sep/13 ]

Ok Chris. Will do.

Comment by Christopher Morrone [ 25/Sep/13 ]

I meant to suggest that we undo the cloned tickets and restore LU-1199 and its subtasks to active duty.

Comment by Andreas Dilger [ 05/Oct/13 ]

Chris, I don't think there is a huge problem if patches that have LU-1199 land in 2.6 (assuming that they do not need to be updated between now and when they land), since we can just mark this ticket with a fix version for 2.6.0 as well. What is trying to be avoided is releasing 2.5.0 with open tickets. Conversely, we don't want to move tickets with landed patches out from fix version 2.5.0 since that makes it harder to track patches/bugs that were fixed in 2.5.0.

I think cloning the ticket and linking it as a duplicate ensures both that tickets landed in either release can be tracked properly, and doesn't require changing all of the patches in flight.

Comment by Christopher Morrone [ 07/Oct/13 ]

I see little value in having the tracking ticket marked fixed in a release where the work is clearly unfinished. Further, what bothers me more than duplicating the tracking ticket is the cloning of all of the subtasks, which serves no clear purpose. The subtask cloning just makes everything very hard to understand in jira.

A subtask is usually much simpler and cleaner. There is no good reason to have it appear under two different tracking tickets. We have two main subtask states: finished and unfinished.

Finished subtasks should not be cloned forward to the new tracker. There is no value there. They were finished and closed under the old tracker, no need to clone.

Then there are the unfinished tasks; there is no reason for them to remain connected to the old tracker since they were not completed there. There is very little value (none that I can see) in closing the unfinished ticket under the old tracker and having a new clone ticket opened under the new tracker.

Ultimately I don't really understand the desire to close the tracking ticket against a particular Fix Version when the work is incomplete. I would argue that simply leaving the Fix Version field empty for tracking tickets that are long lived makes more sense.

But if I can't get you folks to agree to that, hopefully you can change your procedures to handle the subtasks more sanely next time.

Comment by Cory Spitz [ 07/Oct/13 ]

Sounds like a topic for the OpenSFS CDWG. Can we move the discussion to cdwg@lists.opensfs.org?

Comment by Andreas Dilger [ 08/Oct/13 ]

I think part of the problem is that there were patches landed to both 2.4.0 and 2.5.0 using LU-1199 as the tracking bug, instead of a subtask. I agree that the cloned subtasks were a mistake, and that was just a side-effect of how Jira clones a ticket with subtasks rather than the intended outcome.

I guess the lessons to be learned here are:

  • patches shouldn't be landed directly against the tracking bug
  • don't clone a tracking bug with subtasks

The reason that we want to close the ticket when it has patches landed under a particular release is largely to ensure that tickets with the 2.5.0 "Fix Version" field are tracked properly. Since this is the first release that has used "Fix Version" in such a manner, it is not unexpected that there will be some mistakes made and learning to be done. To be honest, I think the importance of this issue is being overemphasized. The old and new bugs are linked together, so I don't think it will be too confusing to track the issues.
