[LU-9564] Support for Lustre Servers on Ubuntu 14.04/16.04 Kernel 4.4.0 Created: 26/May/17 Updated: 23/Dec/18 Resolved: 06/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Martin Schröder | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | ubuntu |
| Environment: |
Ubuntu 14.04.5 with Backport Kernel 4.4.0 from Ubuntu 16.04 |
| Epic/Theme: | ubuntu |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Currently, only SuSE or RedHat machines can be used as Lustre servers; Ubuntu can only be used as a client. Since Lustre has recently started supporting SLES12 with Kernel 4.4.0 (the very same version used by Ubuntu 16.04, and by 14.04 via HWE kernels), we had the idea of porting Lustre over, since our future usage of Lustre would greatly benefit from that.
We made good progress in that respect and have adjusted the kernel patches for "ldiskfs" to also work for Ubuntu's flavour of Kernel 4.4.0. Additionally, we have extended the Debian build system provided by Lustre to be able to create both client and server tools and modules. You can find the patches against Lustre 2.9.57 attached to this ticket. The kernel patches target version 4.4.0-45.66.
The compilation and creation of the packages works well and produces the needed modules and tools. Both Debian packages install cleanly and the modules (lnet, ldiskfs, lustre) load correctly.
Unfortunately, once we come to the creation of the Lustre file system, we run into a curious error:
root@musxbeo050:~# mkfs.lustre --mgs --mdt --fsname=lustre --backfstype=ldiskfs --index=0 /dev/sda7
mkfs.lustre FATAL: unhandled/unloaded fs type 1 'ldiskfs'
mkfs.lustre FATAL: unable to prepare backend (22)
mkfs.lustre: exiting with 22 (Invalid argument)
This is despite the fact that "ldiskfs" is properly registered with the kernel:
root@musxbeo050:~# uname -a
Linux musxbeo050 4.4.0-45-generic #66~14.04.1lustre SMP Mon May 8 18:23:05 CEST 2017 x86_64 x86_64 x86_64 GNU/Linux
root@musxbeo050:~# grep "ldiskfs\|lustre" /proc/filesystems
ldiskfs
lustre
The only unusual message in the kernel log is a complaint by the "ldiskfs" module that it can't register itself under the "ext3" alias:
May 26 14:16:11 musxbeo050 kernel: [159891.600363] LDISKFS-fs: Unable to register as ext3 (-16)
So far, we have not yet tested the ZFS backend, but since no kernel changes are needed for that one at all, we don't think that it will be a great issue once this one is solved.
Thanks! |
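The `grep` against /proc/filesystems shown in the description can be wrapped in a small helper. This is only a sketch: the function name `fs_registered` and its optional file argument are made up for illustration. Note that it only confirms kernel-side registration; as the later comments show, `mkfs.lustre` can still fail if its userspace pieces are missing.

```shell
#!/bin/sh
# fs_registered FSTYPE [FILE]  (hypothetical helper, not part of Lustre)
# Succeeds if FSTYPE is listed in FILE (default: /proc/filesystems).
# Each line there is an optional "nodev" flag followed by the filesystem
# name, so the name is always the last whitespace-separated field.
fs_registered() {
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}
```

For example, `fs_registered ldiskfs && fs_registered lustre` would succeed on the machine described above.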
| Comments |
| Comment by Andreas Dilger [ 27/May/17 ] |
|
Please submit patches to Gerrit, so that they can be reviewed. The ldiskfs error appears to be related to the userspace tools (mount_utils_ldiskfs) dynamically loaded, rather than kernel issues. |
| Comment by Peter Jones [ 27/May/17 ] |
|
Bob/Others: any advice to give here? Peter |
| Comment by Peter Jones [ 27/May/17 ] |
|
Also, given the ZFS support in Ubuntu 16.04, I would think that would be more interesting than ldiskfs - http://open-zfs.org/wiki/Distributions#Ubuntu |
| Comment by Bob Glossman (Inactive) [ 27/May/17 ] |
|
I suspect you are missing any lustre capable version of e2fsprogs. Needed for mkfs.lustre of ldiskfs. We have versions available & downloadable for RHEL and SLES, but since we've never done server support for Ubuntu we have none for that. Don't even know if the master-lustre branch of e2fsprogs is capable of building .debs for Ubuntu. |
| Comment by Thomas Stibor [ 27/May/17 ] |
|
I have built the Lustre specific e2fsprogs for Debian Jessie. Probably it can be used for your testing Cheers Thomas |
| Comment by Martin Schröder [ 29/May/17 ] |
|
Hi everyone.
@Peter: Yes, we also want to look at using ZFS at a later point, but for now we want to make sure that we can test both file systems.
@Bob, Thomas: I compiled and deployed the ldiskfs-enabled e2fsprogs for Ubuntu already. I probably should have mentioned that in my original submission.
[-bash-4.3]$ dpkg -l "e2fs*" | grep e2fs
un e2fsck-static <none> <none> (no description available)
ii e2fslibs:amd64 1.42.13-1-intel amd64 ext2/ext3/ext4 file system libraries
ii e2fsprogs 1.42.13-1-intel amd64 ext2/ext3/ext4 file system utilities
This is e2fsprogs-1.42.13.wc5, just with a tiny modification to successfully create a DEB on Ubuntu (the file "ext2_types-wrapper.h" is in the wrong location).
[10:24:55][mhschroe@musxbeo022:/local/mhschroe/e2fsprogs-1.42.13.wc5]
[-bash-4.3]$ git log --oneline -3
f77990b Fixed issue with compiling on Ubuntu
00c3728 LU-7867 e2fsprogs: update build version to 1.42.13.wc5
595b511 LU-7867 debugfs: fix check for out-of-bound xattr value
Anyway, the same error also occurs with the modified e2fsprogs.
@Andreas: I had hoped to upload a working set of changes to Gerrit – especially since my changes to the Automake system and debian/rules will most likely (currently) break your Jenkins config for the Ubuntu Client builds. |
| Comment by Gerrit Updater [ 29/May/17 ] |
|
Martin Schroeder (martin.h.schroeder@intel.com) uploaded a new patch: https://review.whamcloud.com/27323 |
| Comment by Martin Schröder [ 31/May/17 ] |
|
Hi everyone. I have now also checked the tool with the ZFS backend, and it has the same depressing result:
root@musxbeo050:~# dpkg -s "zfsutils" | grep Version
Version: 0.6.5.9-1~trusty
root@musxbeo050:~# grep "zfs" /proc/filesystems
nodev zfs
root@musxbeo050:~# modinfo osd_zfs
filename: /lib/modules/4.4.0-45-generic/updates/fs/osd_zfs.ko
license: GPL
version: 2.9.55_78_g1bb015f
root@musxbeo050:~# mkfs.lustre --mgs --mdt --fsname=lustre --backfstype=zfs --index=0 /dev/sda7
mkfs.lustre FATAL: unhandled/unloaded fs type 5 'zfs'
So the tools are all present and – as far as I can tell – properly compiled and installed. For example, I can create a ZFS pool manually:
root@musxbeo050:~# zpool create pool-lustre /dev/sda7
root@musxbeo050:~# zpool status
pool: pool-lustre
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
pool-lustre ONLINE 0 0 0
sda7 ONLINE 0 0 0
errors: No known data errors
By now, I presume that there is actually a problem in the source code of "mkfs.lustre" which prevents it from realizing that, yes, its desired file-systems are indeed available on that Ubuntu Server. I'll try to compile the tool with debugging symbols and then try to check why that error is raised. |
| Comment by Martin Schröder [ 31/May/17 ] |
|
Good news, everyone. The debugging session helped and showed me that the Debian packages were missing the "osd_<fstype>.so" files that need to be deployed into /usr/lib. After a quick adjustment of the "debian/*.install" files, a recompile and a reinstall, I now get:
root@musxbeo050:~# mkfs.lustre --mgs --mdt --fsname=lustre --backfstype=ldiskfs --index=0 /dev/sda7
Permanent disk data:
Target: lustre:MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
[...]
formatting backing filesystem ldiskfs on /dev/sda7
[...]
This means the Ubuntu support should be good now. I've uploaded PatchSet #3 to Gerrit that includes the needed changes for you to review. In the meantime, I'll try seeing how well the system performs.
Thanks, everyone for your input! |
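The root cause above (backend plugin files missing from the installed packages) is easy to check for before running `mkfs.lustre`. The sketch below assumes the file naming given in the comment, "osd_<fstype>.so"; the function name `missing_osd_plugins` is hypothetical, and the actual directory and file names may differ between Lustre versions, so the directory is passed in rather than hard-coded.

```shell
#!/bin/sh
# missing_osd_plugins LIBDIR FSTYPE...  (hypothetical helper)
# Prints every FSTYPE for which LIBDIR/osd_<FSTYPE>.so does not exist.
# Returns 0 if all plugins are present, 1 if at least one is missing.
# The "osd_<fstype>.so" pattern is taken from the comment above and is
# an assumption about the installed layout.
missing_osd_plugins() {
    libdir=$1
    shift
    status=0
    for fstype in "$@"; do
        if [ ! -e "$libdir/osd_$fstype.so" ]; then
            printf '%s\n' "$fstype"
            status=1
        fi
    done
    return "$status"
}
```

A call such as `missing_osd_plugins /usr/lib ldiskfs zfs` would then list the backends whose plugin file was not installed.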
| Comment by Martin Schröder [ 06/Jul/17 ] |
|
Hi everyone. The change in Gerrit (https://review.whamcloud.com/#/c/27323/) has been ready to merge for 12 days now. It has the needed V+1 from Jenkins and Maloo, as well as two CR+1 votes from James and Thomas. What else needs to be done to get this merged? |
| Comment by Peter Jones [ 06/Jul/17 ] |
|
> What else needs to be done to get this merged? The 2.10 code freeze to end |
| Comment by Gerrit Updater [ 19/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27323/ |
| Comment by Peter Jones [ 19/Jul/17 ] |
|
The patch has landed - is that all that is required to complete this task or is further work needed? |
| Comment by Martin Schröder [ 19/Jul/17 ] |
|
Hi Peter. The only work that needs to be done is to finish the Wiki documentation (it is 95% done). I should be done with that within the week and will then migrate it to the lustre.org Wiki, too.
Beyond that, the only change that needs to be done is to occasionally align the LDISKFS patches with the Ubuntu Kernel releases. Currently it looks like that will only be necessary once every few months, given the rate of changes on the Ubuntu Linux Kernel's 4.4.0 branch. |
| Comment by Gerrit Updater [ 19/Jul/17 ] |
|
Martin Schroeder (martin.h.schroeder@intel.com) uploaded a new patch: https://review.whamcloud.com/28100 |
| Comment by Minh Diep [ 20/Jul/17 ] |
|
It seems that this patch caused the lustre-master build to fail:
dh_gencontrol: No packages to build.
dh_md5sums -p lustre-client-modules-4.4.0-71-generic
dh_builddeb --destdir=/var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/ubuntu1604/ib_stack/inkernel/.. -p lustre-client-modules-4.4.0-71-generic
dpkg-deb: error: failed to open package info file 'debian/lustre-client-modules-4.4.0-71-generic/DEBIAN/control' for reading: No such file or directory
dh_builddeb: dpkg-deb --build debian/lustre-client-modules-4.4.0-71-generic /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/ubuntu1604/ib_stack/inkernel/.. returned exit code 2
debian/rules:424: recipe for target 'binary-modules' failed
make[2]: *** [binary-modules] Error 1
make[2]: Leaving directory '/var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/ubuntu1604/ib_stack/inkernel/debian/tmp/modules-deb/usr_src/modules/lustre'
/var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/ubuntu1604/ib_stack/inkernel/debian/tmp/modules-deb/usr_share_modass/include/common-rules.make:56: recipe for target 'kdist_build' failed
make[1]: *** [kdist_build] Error 2
make[1]: Leaving directory '/var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/ubuntu1604/ib_stack/inkernel/debian/tmp/modules-deb/usr_src/modules/lustre'
tput: No value for $TERM and no -T specified
BUILD FAILED!
tput: No value for $TERM and no -T specified
See /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/ubuntu1604/ib_stack/inkernel/debian/tmp/modules-deb/var_cache_modass/lustre.buildlog.4.4.0-71-generic.1500438566 for details.
autoMakefile:1142: recipe for target 'debs' failed
make: *** [debs] Error 7 |
| Comment by Martin Schröder [ 20/Jul/17 ] |
|
Hi Minh Diep.
Curious that the tests started from Gerrit for the change have not picked that issue up. I will try to replicate it on our Ubuntu servers and get back to y'all. I do admit that in my test cases, I only looked at the "make debs" target in isolation, and never ran the "make dist" target as done by this Jenkins test. |
| Comment by James A Simmons [ 20/Jul/17 ] |
|
I'm building my debs against only ZFS, but when I run make dist I see:
tardir=lustre-2.10.50_68_gb2c8846_dirty && tar -
This is on Ubuntu 17. |
| Comment by Gerrit Updater [ 01/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/28293 |
| Comment by Gerrit Updater [ 01/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28293/ |
| Comment by Oleg Drokin [ 01/Aug/17 ] |
|
Please resubmit this patch in a form that actually works in our build system (mix in the changelog/which_patch/... update too please) Thanks. |
| Comment by Martin Schröder [ 01/Aug/17 ] |
|
Will do that, yes.
Of course, this only raises the point that the build system that covers the uploads does not actually cover all the build types that you use. From what I can tell, it covers only the "make debs" path on Ubuntu (which works) – but not the "make dist" call (which does not).
Indeed, I never even knew that there was an alternative make rule for building Debs to begin with.
So, my question would be: What would it take to have both "make debs" and "make dist" be checked by your Jenkins builds that trigger on Gerrit upload?
EDIT: Typo correction – it is "make dist", not "make site". |
| Comment by Peter Jones [ 01/Aug/17 ] |
|
Martin Yes it will be possible to make these checks routine once they are reliably working Peter |
| Comment by Martin Schröder [ 26/Sep/17 ] |
|
Hi everyone. I have finally found the time to look at this, and I can't replicate the issue that was reported. I installed two VMs (one with Ubuntu 14.04 and one with 16.04) and ran the DEB creation successfully on both of them, following the build steps used by the HPDD build server, namely these commands:
$ sh autogen.sh && ./configure --enable-dist
$ make dist
$ rm -rf BUILD
$ mkdir BUILD
$ cd BUILD
$ tar -zxf ../lustre-2.10.53_1_ga949a00.tar.gz
$ cd lustre-2.10.53_1_ga949a00
$ ./configure --with-linux=${BUILDPATH}/ubuntu-kernel --disable-server
$ make debs
The kernel that was used to compile Lustre against was 4.4.0-71 in both cases – the same that was used by the failing run above.
Hence, I can only conclude that it was a one-off failure on the build system, or something specific to the environment / kernel sources used by the HPDD build server. I will shortly upload a new changeset to Gerrit, so that it can be tested on the build server.
Thanks, |
| Comment by Gerrit Updater [ 26/Sep/17 ] |
|
Martin Schroeder (martin.h.schroeder@intel.com) uploaded a new patch: https://review.whamcloud.com/29215 |
| Comment by Martin Schröder [ 27/Sep/17 ] |
|
Hi everyone. Just as a heads-up: I narrowed the bug down. The code builds perfectly fine on Ubuntu 14, but fails on Ubuntu 16 while assembling the DEB for the client-modules. Fun times.
{EDIT}
Build-Depends: debhelper (>= 7.0.0), bzip2
The "7.0.0" must not be used. It seems that the current way of building modules (via module-assistant) confuses newer debhelper variants. Once I reverted this to request version "5.0.0", the build works on 16.04 now, too. |
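Since the failure hinged on the minimum debhelper version in the Build-Depends line quoted above, a quick way to see what a given control file requests might look like the following. This is a rough sketch: the function name `debhelper_min_version` is made up, and it only handles the case where "debhelper (>= X)" sits on the same line as "Build-Depends:", as in the quoted example. Continuation lines are not handled.

```shell
#!/bin/sh
# debhelper_min_version CONTROL_FILE  (hypothetical helper)
# Prints the version from "Build-Depends: ... debhelper (>= X) ..."
# when that clause is on the Build-Depends line itself.
# Prints nothing if no such clause is found.
debhelper_min_version() {
    sed -n 's/^Build-Depends:.*debhelper (>= *\([0-9.]*\)).*/\1/p' "$1"
}
```

Running it on a control file that contains the line from the comment would print `7.0.0`.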
| Comment by James A Simmons [ 27/Sep/17 ] |
|
Have you tried upping the debhelper version? I found on my Ubuntu17 system it complains about debhelper using protocol 7 instead of 9, which is the supported default. |
| Comment by Martin Schröder [ 27/Sep/17 ] |
|
Hi James. Thanks, as I said above (in the edit), it works when requesting an even older version, namely 5.0.0. Since the original code uses 5.0.0 I'll keep it at that version, but I'll also check it with "9.0.0" and update this post, once it's done. {EDIT} One never stops learning. |
| Comment by Martin Schröder [ 16/Oct/17 ] |
|
Hi everyone. The Gerrit change which fixes all outstanding issues was uploaded 3 weeks ago and has passed all automated tests. I wonder what else is needed to entice users that have CR rights to review and ultimately merge the changes?
Would it help if I got out of the spaceship and pushed it? |
| Comment by James A Simmons [ 16/Oct/17 ] |
|
Sorry, I saw previous comments but didn't realize you updated the patch. Will look now. I will ask the proper people to review. |
| Comment by Bob Glossman (Inactive) [ 16/Oct/17 ] |
|
the latest kernel version on Ubuntu 16 appears to be 4.10, not 4.4. Is that supported? |
| Comment by Peter Jones [ 16/Oct/17 ] |
|
We officially support the 4.4 kernel version because that is a known quantity and allows us to maintain support for Ubuntu 16 regardless of which kernel version is currently the cutting-edge one. We do attempt to keep as abreast of the latest kernel versions as possible, so most of the time I would expect us to be able to support either option. James will be able to comment authoritatively, but I think that we're able to support up to 4.11 and are very close to 4.12, so I would expect 4.10 to be ok. |
| Comment by Martin Schröder [ 16/Oct/17 ] |
|
Hi everyone. I can only assent to what Peter said. I chose Kernel 4.4.0, because of several reasons:
In other words: I went the path of least resistance.
From what I have seen when adjusting the patches for Ubuntu's 4.4.0 version, I do not think it would be a great problem to target 4.10 -- eventually.
But to get this patch merged, I'd vote for sticking with 4.4.0 for a while, until someone with more experience than me regarding Lustre Kernel patches can find the time to port them to 4.10. |
| Comment by James A Simmons [ 16/Oct/17 ] |
|
Martin, you are going to need to rebase the patch :-). It failed to apply when I went to try it. As for newer kernel support, we are one patch away from 4.12 kernel support for the client. I have updated the patch, which is https://review.whamcloud.com/#/c/28511. For lustre 2.10 we are one patch away from 4.11 kernel support. This also assumes the GSS keyring is disabled, which is stuck at pre-4.6 kernels. Also, the latest ZFS + lustre with a 4.12 kernel works fine. It's ldiskfs which I haven't touched for a 4.12 kernel. I haven't tried to port ldiskfs to any newer kernels. |
| Comment by Bob Glossman (Inactive) [ 16/Oct/17 ] |
|
Latest patch isn't working for me at all.
# Create the module-source tarball.
cd debian/lustre-source/usr/src && tar jcf lustre.tar.bz2 modules
rm -rf debian/lustre-source/usr/src/modules
dh_install -plustre-source
dh_installchangelogs -p lustre-source lustre/ChangeLog
dh_installdocs -p lustre-source
dh_link -p lustre-source /usr/share/modass/packages/default.sh /usr/share/modass/overrides/lustre-source
dh_compress -p lustre-source
dh_installdeb -p lustre-source
dh_fixperms -p lustre-source
dh_gencontrol -p lustre-source
dh_md5sums -p lustre-source
dh_builddeb -p lustre-source
dpkg-deb: building package 'lustre-source' in '../lustre-source_2.10.54-14-g65983ee-1_all.deb'.
dh_testdir
dh_testroot
dh_installdirs -p lustre-server-utils
dh_installdocs -p lustre-server-utils
dh_installman -p lustre-server-utils
dh_install -p lustre-server-utils
dh_install: lustre-server-utils missing files: debian/tmp/usr/lib/lustre/*.la
dh_install: missing files, aborting
debian/rules:239: recipe for target 'binary-lustre-server-utils' failed
make[1]: *** [binary-lustre-server-utils] Error 255
make[1]: Leaving directory '/home/bogl/lustre-release'
dpkg-buildpackage: error: fakeroot debian/rules binary gave error exit status 2
autoMakefile:1142: recipe for target 'debs' failed
make: *** [debs] Error 2 |
| Comment by James A Simmons [ 16/Oct/17 ] |
|
It needs a rebase as well. |
| Comment by Bob Glossman (Inactive) [ 16/Oct/17 ] |
|
That failure was with a rebase on the latest master. |
| Comment by Martin Schröder [ 17/Oct/17 ] |
|
Okay, I'll rebase and see where the problem might be.
I do remember the "missing files" issue, which happened when certain configurations would not build shared libraries, only static ones. But I believed I had fixed all of them. Let me check. |
| Comment by James A Simmons [ 17/Oct/17 ] |
|
It's a libtool thing. Looking on my Ubuntu system, I don't see any *.la files installed. I noticed we are installing them. |
| Comment by Martin Schröder [ 16/Jan/18 ] |
|
Hi everyone. I have rebased the code, fixed the build-time issues and changed the DEB file creation to allow more than one "lustre-*-modules-<KVERS>" package to be installed simultaneously. I have checked the compilation on both Ubuntu 14.04 and 16.04, against the most recent Linux Kernel version available for each. If you have the time, please review the Gerrit change under: https://review.whamcloud.com/#/c/29215
Do note especially the small workaround for "IB_DEVICE_SG_GAPS_REG" and "IB_MR_TYPE_SG_GAPS" that is needed to compile against the Ubuntu Kernels, which do not have those macros. The change to "lnet/klnds/o2iblnd/o2iblnd.c" should be broadly compatible with both older and newer kernels.
Thanks! |
| Comment by Chris Hunter (Inactive) [ 16/Jan/18 ] |
|
For "IB_MR_TYPE_SG_GAPS" you might want to look at |
| Comment by Martin Schröder [ 16/Jan/18 ] |
|
Hi Chris. I've seen that commit (and an earlier one that was already merged) when I encountered the compile-time issue. The problem here is that neither "IB_DEVICE_SG_GAPS_REG" nor "IB_MR_TYPE_SG_GAPS" is available at all on the 4.4.0-series Kernels as used by Ubuntu. So any code that mentions them outside of a suitable preprocessor guard will fail to compile. In this case, the guard matched even when the symbols were not present.
The reason for their absence appears to be, that there is a general warning against using these symbols: So either they removed them, or they never used them on the 4.4.0 Kernel.
I changed the code so that Lustre will use them, if the symbols are present. If missing, it will revert to the old pre-patch behaviour.
|
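To see quickly whether a given kernel source tree carries the two symbols discussed above, one can simply search the relevant header. The sketch below is not how the patch itself works (the patch changes "lnet/klnds/o2iblnd/o2iblnd.c"); it is only a rough pre-check. The function name `sg_gaps_defines` and the `HAVE_...` define names are hypothetical, and the header path is left to the caller because its location is an assumption about the kernel tree.

```shell
#!/bin/sh
# sg_gaps_defines HEADER  (hypothetical helper)
# For each of the two symbols named in the thread, prints a
# "-DHAVE_<SYMBOL>" flag if the symbol appears as a whole word in HEADER.
# The HAVE_* names are invented for illustration only.
# A text search is only an approximation of a real compile test.
sg_gaps_defines() {
    header=$1
    for sym in IB_DEVICE_SG_GAPS_REG IB_MR_TYPE_SG_GAPS; do
        if grep -qw -- "$sym" "$header"; then
            printf '%s\n' "-DHAVE_$sym"
        fi
    done
}
```

On a kernel tree without these symbols the function prints nothing, which corresponds to the case where the code has to fall back to the old pre-patch behaviour.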
| Comment by Bob Glossman (Inactive) [ 16/Jan/18 ] |
|
I would prefer to see those o2iblnd.c changes as a separate patch. That fix isn't specific to server builds. It is needed for any build on current Ubuntu kernel versions, including client builds. |
| Comment by Martin Schröder [ 17/Jan/18 ] |
|
Okay, I will extract the o2iblnd.c changes into a separate changeset and then adjust the dependencies of the current one, so that it depends on the other. |
| Comment by Martin Schröder [ 17/Jan/18 ] |
|
Add reference to blocking subtask for o2iblnd.c changes. |
| Comment by Martin Schröder [ 17/Jan/18 ] |
|
@Bob Glossman: Please review: https://review.whamcloud.com/#/c/30893/ |
| Comment by Martin Schröder [ 23/Jan/18 ] |
|
Okay everyone. Both changes have CR+1/V+1. Now, I guess all that is needed is to find someone with the needed permissions to push the code into the mainline. So... any volunteers on the front of merging the two commits?
Once the code is merged and shown to not cause any problems (like last time), it would also be a good idea to add an Ubuntu Server build to the Jenkins config. The Client build is already there, after all.
Thanks, |
| Comment by Peter Jones [ 23/Jan/18 ] |
|
Martin The commits will be landed by the gatekeeper when he has completed necessary reviews and tests. One of the patches is in the batch being processed ATM - https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=shortlog;h=refs/heads/master-next - and the other will likely be in the next batch Peter |
| Comment by Gerrit Updater [ 06/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29215/ |
| Comment by Peter Jones [ 06/Feb/18 ] |
|
Landed for 2.11 |