[LU-9745] dkms-lustre does not install all modules on initial autoinstall Created: 07/Jul/17 Updated: 18/Aug/17 Resolved: 18/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Brian Murrell (Inactive) | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When I run the following command to install DKMS-built Lustre:
  # yum install kernel-devel-[0-9]*_lustre lustre lustre-dkms kmod-lustre-osd-ldiskfs zfs
the result after the installation is that only the lustre.ko (built from dkms-lustre) module is in /lib/modules/3.10.0-514.21.1.el7_lustre.x86_64/extra/:
  # ls -l /lib/modules/3.10.0-514.21.1.el7_lustre.x86_64/extra/
  total 4836
  -rw-r--r-- 1 root root 1615824 Jul 7 14:01 lustre.ko
  drwxr-xr-x 3 root root 16 Jul 7 13:55 lustre-osd-ldiskfs
  -rw-r--r-- 1 root root 353632 Jul 7 13:55 splat.ko
  -rw-r--r-- 1 root root 170024 Jul 7 13:55 spl.ko
  -rw-r--r-- 1 root root 14016 Jul 7 13:57 zavl.ko
  -rw-r--r-- 1 root root 75848 Jul 7 13:57 zcommon.ko
  -rw-r--r-- 1 root root 2205152 Jul 7 13:57 zfs.ko
  -rw-r--r-- 1 root root 132488 Jul 7 13:57 znvpair.ko
  -rw-r--r-- 1 root root 34000 Jul 7 13:57 zpios.ko
  -rw-r--r-- 1 root root 330920 Jul 7 13:57 zunicode.ko
Notice that all of the other supporting modules are missing. After the above, if I then remove the module with
  dkms uninstall -m lustre/2.10.0_RC1 -k 3.10.0-514.21.1.el7_lustre.x86_64
and then run
  /etc/kernel/postinst.d/dkms 3.10.0-514.21.1.el7_lustre.x86_64
to emulate what happens during the yum installation above, /lib/modules/3.10.0-514.21.1.el7_lustre.x86_64/extra/ contains all of the necessary Lustre modules. So there seems to be some subtle issue with the lustre-dkms RPM that only occurs during the initial installation. |
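One way to confirm the symptom described above is to compare the modules DKMS built with the modules it installed. The following is a minimal sketch, not part of lustre-dkms or DKMS: the function name `missing_modules` is hypothetical, both directories are supplied by the caller, and it assumes uncompressed `*.ko` files as in the listing above.

```shell
# missing_modules BUILT_DIR INSTALLED_DIR   (hypothetical helper)
# Print the file name of every *.ko found (recursively) under BUILT_DIR
# that has no file of the same name anywhere under INSTALLED_DIR.
missing_modules() {
    built_dir=$1
    installed_dir=$2
    find "$built_dir" -type f -name '*.ko' | while IFS= read -r path; do
        name=${path##*/}    # strip the directory part, keep e.g. "lustre.ko"
        # Report the module when no installed file carries that name.
        if [ -z "$(find "$installed_dir" -type f -name "$name" | head -n 1)" ]; then
            printf '%s\n' "$name"
        fi
    done | sort -u
}
```

On an affected host it could be called with the DKMS tree for the package as the first argument (for example a directory under /var/lib/dkms/lustre/) and /lib/modules/3.10.0-514.21.1.el7_lustre.x86_64/extra/ as the second. Empty output would mean every built module was also installed.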
| Comments |
| Comment by Peter Jones [ 07/Jul/17 ] |
|
Nathaniel Can you please work on this one? Thanks Peter |
| Comment by Malcolm Cowe (Inactive) [ 10/Jul/17 ] |
|
Brian,
How does your procedure compare to that in the following doc: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/125
My equivalent command is:
  yum --nogpgcheck install \
    lustre-dkms \
    spl-dkms \
    zfs-dkms \
    kmod-lustre-osd-ldiskfs \
    lustre-osd-ldiskfs-mount \
    lustre-osd-zfs-mount \
    lustre \
    lustre-resource-agents
Note that I ended up using DKMS for all the kmods except LDISKFS. |
| Comment by Nathaniel Clark [ 10/Jul/17 ] |
|
Which version of dkms is this with? |
| Comment by Brian Murrell (Inactive) [ 10/Jul/17 ] |
Whichever is in EPEL, which seems to be dkms-2.3-5.20170523git8c3065c.el7 currently. |
| Comment by Nathaniel Clark [ 18/Jul/17 ] |
|
I've tried reproducing, but it installed fine for me (installing all zfs/spl then installing lustre). I'm trying to reproduce by installing everything in one yum call. Do you have logs? EDIT: I could not reproduce installing everything at once. Can you reproduce? If an old lustre-dkms version was installed, that may be the cause (dkms is really crappy at cleaning up after itself) |
| Comment by Brian Murrell (Inactive) [ 18/Jul/17 ] |
|
Here is a transcript. The system this was run on is still available if you want any more information from it. |
| Comment by Brian Murrell (Inactive) [ 19/Jul/17 ] |
|
utopiabound: Was my transcript helpful? |
| Comment by Nathaniel Clark [ 19/Jul/17 ] |
|
Yes. Usually dkms is installed prior to packages that need dkms. Do you have any logs in /var/lib/dkms/lustre/2.10.0/*/x86_64/log/ ? |
| Comment by Nathaniel Clark [ 19/Jul/17 ] |
|
I tried reproducing this and I always get a full set of kernel modules. |
| Comment by Peter Jones [ 19/Jul/17 ] |
|
Why not do a Webex or similar to Brian's environment? |
| Comment by Brian Murrell (Inactive) [ 19/Jul/17 ] |
I don't think you can (or should) make that a requirement with RPM. It should not be a requirement. People familiar with RPM are going to do exactly as I have done.
I'm afraid I don't. I needed to recycle the machine for other tests.
Did you try reproducing exactly as I had done? If so, could you send me a transcript so that I can try to compare it to what I am doing to see if I can find any differences? |
| Comment by Nathaniel Clark [ 19/Jul/17 ] |
|
After trying again: if there is no kernel-devel installed prior to the yum command, not all lustre modules are installed. If even an older kernel-devel is installed, all lustre modules are installed. Looking at why. WORKAROUND |
| Comment by Malcolm Cowe (Inactive) [ 19/Jul/17 ] |
|
Actually, kernel-devel is the essential pre-req – DKMS will get pulled in properly without needing to be pre-installed. I'm working on the upgrade / migration process for EE to Lustre 2.10 and have a reliable procedure, where I have documented the installation of the kernel packages as a standalone step. This was done to separate the steps into units that can be easily audited, and to incorporate the required reboot early in the process. I also thought it would be a way to avoid unnecessary rebuilds of the modules for other kernel versions, which seems to have been borne out. However, I'll review this today as I go through the upgrades of the other servers in my testbed. |
| Comment by Peter Jones [ 20/Jul/17 ] |
|
Brian Have you had a chance to test the proposed workaround? Peter |
| Comment by Brian Murrell (Inactive) [ 21/Jul/17 ] |
I have not. I was not aware that that was going to be the solution for IML. Having to stage the installation into multiple steps goes in entirely the opposite direction to the one we are trying to go in with IML. It is designed around having a single install target per profile, and having to split that up into multiple steps will require a not insignificant redesign of the host deployment process. |
| Comment by Malcolm Cowe (Inactive) [ 21/Jul/17 ] |
|
I have tried several combinations of installing Lustre 2.10.0 and ZFS 0.6.5.11 with DKMS on RHEL 7.3. Packages are taken from official public repositories. The only successful method I have been able to incorporate is to introduce a reboot between adding the kernel packages and installing the ZFS and Lustre packages with DKMS. If all the packages are included on the same command line or even if they are separated such that the kernel is installed first, the process fails. My starting point is a basic RHEL 7.3 install with the 3.10.0-514.el7 kernel. For example:
  yum --nogpgcheck --disablerepo=base,extras,updates \
    --enablerepo=lustre-server-2.10.0 install \
    kernel kernel-devel kernel-headers kernel-tools \
    kernel-tools-libs kernel-tools-libs-devel
  yum --nogpgcheck --enablerepo=lustre-server-2.10.0 install \
    lustre-dkms spl-dkms zfs-dkms \
    kmod-lustre-osd-ldiskfs \
    lustre-osd-ldiskfs-mount lustre-osd-zfs-mount \
    lustre lustre-resource-agents
  ...
  [root@ct68-oss4 ~]# dkms status
  lustre, 2.10.0: added
  spl, 0.6.5.11: added
  zfs, 0.6.5.11: added
At this stage, the DKMS packages are added, but DKMS has not installed the modules. After reboot, DKMS builds and installs the modules as a background task:
  [root@ct68-oss4 ~]# dkms status
  lustre, 2.10.0: added
  spl, 0.6.5.11, 3.10.0-514.21.1.el7_lustre.x86_64, x86_64: installed
  zfs, 0.6.5.11: added
On completion, the state is as follows:
  [root@ct68-oss4 ~]# dkms status
  lustre, 2.10.0, 3.10.0-514.21.1.el7_lustre.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)
  spl, 0.6.5.11, 3.10.0-514.21.1.el7_lustre.x86_64, x86_64: installed
  zfs, 0.6.5.11, 3.10.0-514.21.1.el7_lustre.x86_64, x86_64: installed
The Lustre modules are built in /var/lib/dkms, but do not get copied into /lib/modules/.../extra/. I've tried a few variations on this instruction. The only successful outcome was to install the kernel packages, reboot on the new kernel, then continue installation. This process works with a high degree of confidence.
|
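The `dkms status` checks in the comment above can be automated. The sketch below is an illustration only: it assumes the `name, version[, kernel, arch]: state` line format quoted in this ticket (other DKMS versions may print a different layout), and the function name `not_cleanly_installed` is hypothetical. It flags every entry whose state is anything other than a plain `installed`, which covers both the `added` entries and the `installed (WARNING! ...)` case.

```shell
# not_cleanly_installed   (hypothetical helper)
# Read `dkms status` output on stdin and print "name version: state" for each
# entry whose state is not exactly "installed".
not_cleanly_installed() {
    awk '
    {
        pos = index($0, ": ")          # the state follows the first ": "
        if (pos == 0) next             # skip lines that do not look like status entries
        id = substr($0, 1, pos - 1)    # "name, version[, kernel, arch]"
        state = substr($0, pos + 2)
        if (state != "installed") {
            split(id, f, /, */)
            print f[1], f[2] ": " state
        }
    }'
}
```

A typical call would be `dkms status | not_cleanly_installed`. Empty output would indicate that every listed module reports a clean install.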
| Comment by Brian Murrell (Inactive) [ 21/Jul/17 ] |
|
Also don't forget the work-around in my original issue description:
Maybe this helps in debugging the root issue. The problem with this work-around, of course, is that it results in a second, unnecessary build of the Lustre modules, even though they actually already exist in /var/lib/dkms/.... I have found no way to ask DKMS to simply install the already built modules instead of rebuilding them first. This second, unnecessary rebuild has an end-user impact: deploying IML storage servers takes even longer than the already long time users perceive it to take. We have had users complain about the amount of time it takes to deploy storage nodes as it is. I am a little confused about malkolm's work-around though. I believe DKMS' auto-install option was enabled for all of the modules, but lustre in particular. That should mean that after installation, even before booting to the lustre kernel, the DKMS modules will be built for it since it was installed with the DKMS modules. In fact I cannot see what is different between malkolm's work-around and my original reproducer except that I did not examine the kernel module installation state after the reboot. Maybe the reboot always fixes this problem, again at a cost of building the modules twice. |
| Comment by Nathaniel Clark [ 21/Jul/17 ] |
|
It all works the first time, if you have the booted kernel's kernel-devel package installed. |
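The condition above can be tested before lustre-dkms is installed. This is a minimal sketch under one assumption: that kernel-devel places its tree under /usr/src/kernels/<release>, as it does on RHEL/CentOS 7. The function name `has_kernel_devel` and its optional root-prefix argument are hypothetical.

```shell
# has_kernel_devel [RELEASE] [ROOT]   (hypothetical helper)
# Succeed if a kernel-devel tree exists for RELEASE (default: the running
# kernel, as reported by `uname -r`). ROOT is an optional path prefix, useful
# for inspecting a chroot or an image; it defaults to the live system.
has_kernel_devel() {
    release=${1:-$(uname -r)}
    root=${2:-}
    [ -d "$root/usr/src/kernels/$release" ]
}
```

For example, `has_kernel_devel || echo "kernel-devel for $(uname -r) is missing"` would warn before the yum transaction starts.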
| Comment by Nathaniel Clark [ 21/Jul/17 ] |
|
Okay, this worked, and it satisfies the "only build once" requirement, and it does a single reboot (which is required anyway):
  yum install kernel-devel-[0-9]*_lustre kernel-[0-9]*_lustre
  reboot
  wait...
  yum -y install lustre lustre-dkms kmod-lustre-osd-ldiskfs zfs
|
| Comment by Brian Murrell (Inactive) [ 21/Jul/17 ] |
|
Right. But it doesn't fit IML's current process of "install (a) package(s), then reboot", which is the current "hardcoded" (for all intents and purposes) process. It's not trivial for us to change that process from a (data driven – meaning the data structure describing that process as well as the code to implement the data structures' instruction) single (installation) step process to multiple (installation) steps separated by a reboot. Certainly not at this stage of trying to get a 4.0 release of IML out there. This work-around also becomes "knowledge" that everyone else trying to deploy DKMS-built Lustre systems needs to know about. Perhaps it could be in the Release Notes or something, but does anyone really read those? Particularly people already familiar with Lustre? So back to the root cause... Have we raised this issue with the DKMS maintainers? Perhaps this is a trivial issue that somebody intimately familiar with the DKMS code base will immediately recognise. |
| Comment by Nathaniel Clark [ 21/Jul/17 ] |
|
It's weird that zfs/spl both install all of their modules, but lustre doesn't. It feels like a dkms issue, but there might be a way to mitigate on our end. |
| Comment by Malcolm Cowe (Inactive) [ 23/Jul/17 ] |
|
The process Nathaniel outlined is the one I also described, and it is the only mechanism that I was successfully able to incorporate (in my command line, the first yum command installs lustre-patched kernel packages from a lustre 2.10 repo). Adding the lustre kernel packages and then continuing without a reboot, regardless of combination, leaves the system in the incomplete state when the host is eventually rebooted, where the modules get compiled, but not installed. I'd also point out that the modules are "added" but not installed when the attempt is made without rebooting first. To build on Nathaniel's comment, one possible compromise could be to install the kernel-devel package of the "starting" kernel, as well as the desired kernel, to see if that alters the behaviour prior to reboot. This will cause the modules to be built more than once, but it might be a way to meet the requirement of installing all packages prior to reboot. |
| Comment by Brian Murrell (Inactive) [ 24/Jul/17 ] |
|
But, as utopiabound points out, it's only the Lustre DKMS module that has this problem. The ZFS/SPL DKMS modules don't have the same problem. So this would appear to be some subtle issue with our module, or at least the way we have written our module is triggering a bug that the ZFS/SPL modules are not. The next step is probably to examine our module and the ZFS/SPL modules closely to see what kind of differences exist between them. |
| Comment by Nathaniel Clark [ 25/Jul/17 ] |
|
I think I know what is going wrong: lustre-dkms uses a "temporary" dkms.conf (which only lists lustre.ko) and then rebuilds it during the dkms process. I'm not sure why, but I aim to fix it. |
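DKMS takes the list of modules to install from the `BUILT_MODULE_NAME[n]` entries in dkms.conf, so a dkms.conf that lists only lustre.ko would be consistent with the symptom in the description. The sketch below shows one way to inspect which modules a given dkms.conf declares. The function name `declared_modules` is hypothetical, and the parsing is purely textual: it assumes literal assignments, one per line, and will not see names that the file computes with shell code when it is sourced.

```shell
# declared_modules CONF   (hypothetical helper)
# Print the value of every BUILT_MODULE_NAME[n] assignment in the dkms.conf
# file CONF, one per line, with surrounding quotes removed.
declared_modules() {
    sed -n 's/^[[:space:]]*BUILT_MODULE_NAME\[[0-9][0-9]*]=\(.*\)$/\1/p' "$1" |
        tr -d "\"'"
}
```

Running it against the dkms.conf in place at install time and against the regenerated one would show whether the two module lists differ.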
| Comment by Gerrit Updater [ 25/Jul/17 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/28210 |
| Comment by Nathaniel Clark [ 26/Jul/17 ] |
|
Using https://review.whamcloud.com/#/c/28210/3/ works. It was the bad dkms.conf. |
| Comment by Brian Murrell (Inactive) [ 26/Jul/17 ] |
|
utopiabound: Do you mind if I cherry-pick this to b2_10 so that I can test it? |
| Comment by Gerrit Updater [ 26/Jul/17 ] |
|
Brian J. Murrell (brian.murrell@intel.com) uploaded a new patch: https://review.whamcloud.com/28224 |
| Comment by Gerrit Updater [ 18/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28224/ |
| Comment by Gerrit Updater [ 18/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28210/ |
| Comment by Peter Jones [ 18/Aug/17 ] |
|
Landed for 2.11 |