ko2iblnd not built with kernel 2.6.32-504.1.3

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 1.8.7
    • None
    • Scientific Linux 6.5
    • 3
    • 16616

    Description

      Hello Lustre team!

      We're seeing an issue with the ko2iblnd module not being built properly any longer with our lustre 1.8.7 source with the latest 2.6.32-504.x Linux kernel.

      We have successfully been building RPM's utilizing the command below for several kernel revisions: 2.6.32_220.x, 2.6.32_279.x, 2.6.32_358.x, and 2.6.32_431.x.

      ./configure --with-linux=/lib/modules/KERNEL_VERSION/build --disable-lru-resize --enable-ext4 --disable-server

      However, with the release of 2.6.32_504.x we are now seeing that the ko2iblnd module isn't built, and as a result we get I/O errors when loading the lustre module:

      LustreError: 2897:0:(api-ni.c:1081:lnet_startup_lndnis()) Can't load LND o2ib, module ko2iblnd, rc=256
      LustreError: 2897:0:(events.c:725:ptlrpc_init_portals()) network initialisation failed
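The "Can't load LND o2ib, module ko2iblnd" message is what one would expect if the module file is simply not installed for the running kernel. A minimal sketch for checking that, assuming modules are installed under /lib/modules/<kernel release> as in the package listings in this ticket. The function name `find_lnd_module` and its optional root argument are illustrative and not part of Lustre.

```shell
#!/bin/sh
# find_lnd_module KERNEL_RELEASE MODULE [ROOT]
# Prints the path of MODULE.ko below ROOT/lib/modules/KERNEL_RELEASE and
# returns 0 if such a file exists; prints nothing and returns 1 otherwise.
# ROOT is a hypothetical convenience for inspecting an unpacked tree.
find_lnd_module() {
    moddir="${3:-}/lib/modules/$1"
    hit=$(find "$moddir" -type f -name "$2.ko" 2>/dev/null | head -n 1)
    [ -n "$hit" ] && printf '%s\n' "$hit"
}
```

On a client this could be called as `find_lnd_module "$(uname -r)" ko2iblnd || echo "ko2iblnd is not installed for this kernel"`.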

      Looking over Google and other Jira tickets, we found that using "--with-o2ib=yes" produces the error (which is probably just a red herring):

      configure: error: can't compile with kernel OpenIB gen2 headers
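Since configure records each failed test program and the compiler output in config.log, the lines around the o2ib check usually show the actual compile error behind this message. A rough sketch for pulling out that part of the log; the search patterns `o2ib` and `openib` are an assumption about how the check is worded in the log, and `o2ib_excerpt` is a made-up helper name.

```shell
#!/bin/sh
# o2ib_excerpt LOGFILE [CONTEXT]
# Prints every line of LOGFILE that mentions o2ib or OpenIB
# (case-insensitive), with line numbers, followed by CONTEXT
# lines of trailing context (default: 15).
o2ib_excerpt() {
    grep -n -i -A "${2:-15}" -e 'o2ib' -e 'openib' "$1"
}
```

For example, `o2ib_excerpt config.log 30 | less` run in the build directory after a failed `./configure --with-o2ib=yes`.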

      Given that the other builds were successful, I am fairly certain that this is a kernel issue, but I wanted to double-check here first. I've attached the config.log for perusal.

      Here is an RPM package query comparing the latest build with the two most recent previous builds:

      # pwd
        /root/rpmbuild/RPMS/x86_64
      # ls -1t lustre-modules-1.8.7-2.6.32_*|head -3
        lustre-modules-1.8.7-2.6.32_504.1.3.el6.x86_64.x86_64.rpm
        lustre-modules-1.8.7-2.6.32_431.29.2.el6.x86_64.x86_64.rpm
        lustre-modules-1.8.7-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
      # ls -1t lustre-modules-1.8.7-2.6.32_*|head -3|xargs -I'{}' rpm -qlp {} |grep ko2iblnd
        /lib/modules/2.6.32-431.29.2.el6.x86_64/updates/kernel/net/lustre/ko2iblnd.ko
        /lib/modules/2.6.32-431.23.3.el6.x86_64/updates/kernel/net/lustre/ko2iblnd.ko
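The query above only prints the packages that do contain the module, so the 504.1.3 package is conspicuous by its absence. A small sketch that reports each package explicitly, assuming `rpm -qlp` is available on the build host; `has_module` and `check_rpm` are hypothetical helper names.

```shell
#!/bin/sh
# has_module MODULE
# Reads a package file listing on stdin (one path per line, as printed by
# `rpm -qlp`) and succeeds if a file named MODULE.ko is listed.
has_module() {
    grep -q "/$1\.ko\$"
}

# check_rpm MODULE RPMFILE
# Prints one line stating whether RPMFILE ships MODULE.ko.
check_rpm() {
    if rpm -qlp "$2" 2>/dev/null | has_module "$1"; then
        printf '%s: %s present\n' "$2" "$1"
    else
        printf '%s: %s MISSING\n' "$2" "$1"
    fi
}
```

Usage in the RPM output directory: `for f in lustre-modules-1.8.7-2.6.32_*.rpm; do check_rpm ko2iblnd "$f"; done`.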

      Thank you for any guidance and/or additional information.

      John DeSantis

          Activity

            [LU-5952] ko2iblnd not built with kernel 2.6.32-504.1.3

            Martin,

            I was able to perform some testing.

            With your patch and 1.8.7, we're able to mount Lustre without any error and we're seeing expected performance between nodes.

            John DeSantis

            mrfusion John DeSantis (Inactive) added a comment

            Martin,

            I completely agree with striping, not just for I/O but also because we recently ran into a case where the 2.0 TB file size limit was reached because the file wasn't striped over all available OSTs.

            I was able to get the 1.8.7 client RPM's built (using 1.8.7_wc1) with the ko2iblnd module present using your patch, the LU-1116 full patch (https://jira.hpdd.intel.com/secure/attachment/10860/LU-1116-full.patch), and the patch from LU-2800 (http://review.whamcloud.com/#/c/8607/5).

            I will be doing some testing over the next few days to see if there are any differences in performance with "default" values between the client versions 1.8.7 and 1.8.9.

            Again, without your patch we would have been stuck. So I, too, can confirm to the @Developers that your patch has worked.

            John DeSantis

            mrfusion John DeSantis (Inactive) added a comment
            martin_hecht Martin Hecht added a comment - - edited

            Hi John,

            I'm not aware of performance problems when both servers and clients are on the 1.8 branch. There were quite a few changes to the code for the 1.8.9 release, but I'm not a developer, just a user of Lustre. Commenting on your performance report: for large files (at least if you have large chunks of I/O) it makes sense to set striping in order to make use of the parallelism in the file system. However, this also depends on the use case. If your application does random access, the better choice may be not to stripe the file even when it is large. So it is always difficult to tell whether a change in performance is good or bad, because it is a trade-off between different use cases. A decrease in performance for bulk I/O to a large non-striped file might go along with an improvement in performance for small files or for random I/O.

            Actually, I have been using the patch from http://review.whamcloud.com/#/c/8607/5 as well for quite a while now (probably since the upgrade from SL6.4 to 6.5 or so... the other one has been needed since SL6.6 came out).

            Just a remark since we are looking at llite: If you plan to connect to servers with lustre > 2.4, you should also include
            http://review.whamcloud.com/#/c/5971/ as a fix for LU-3067 (this is a correction to commit http://review.whamcloud.com/#/c/5285/ which is included in 1.8.9).

            In the meantime my build for kernel 2.6.32-504.1.3 is done as well, and ko2iblnd.ko has been built successfully.

            @Developers: could someone pick up the patch for the kernel-2.6.32-504-series and submit it to the 1.8 branch, please? And maybe the other one as well?

            best regards,
            Martin


            Martin,

            Too bad you're not local, I'd get you some pizza and beer.

            Yesterday, I did not clone the git repository correctly. So, after applying your patch and the patch from: http://review.whamcloud.com/#/c/8607/5 (basically the same instructions from https://lists.01.org/pipermail/hpdd-discuss/2014-November/001409.html) I was able to get the RPM's built and verified that the missing module was present.

            The remaining issue that I'm seeing is degraded performance. I've attached a log file if you wish to review it. Before I start looking into this deeper, do you know if there are any known problems with performance when server and client versions do not match?

            Again, thank you very, very much. Without your help we would not have progressed far.

            John DeSantis

            mrfusion John DeSantis (Inactive) added a comment
            martin_hecht Martin Hecht added a comment -

            Hi John,

            Basically, your attempt 4) should be working. This works fine for me:
            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout --track -b b1_8 origin/b1_8
            cd ..
            mv lustre-release lustre-1.8.9
            wget http://lists.01.org/pipermail/hpdd-discuss/attachments/20141107/95bf4389/attachment.patch
            cd lustre-1.8.9
            patch -p1 < ../attachment.patch
            sh autogen.sh

            The patch applies cleanly in my case. I just noticed that I had last updated my local git copy some time before the latest commits when I built Lustre with kernel 2.6.32-504.
            Based on that build I sent the patch to the HPDD list. I'm now updating my build system and retrying with the latest kernel 2.6.32-504.1.3 and an updated git checkout. I'll keep you updated...
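Whether a patch applies cleanly depends on the tree it is applied to, so it can help to test it first without touching any files. A minimal sketch using the `--dry-run` option of GNU patch; `applies_cleanly` is a hypothetical helper, and the `-p1` strip level is taken from the commands in this comment.

```shell
#!/bin/sh
# applies_cleanly SOURCE_DIR PATCH_FILE
# Returns 0 if PATCH_FILE would apply to SOURCE_DIR with -p1 without any
# failed hunks. --dry-run leaves the tree unmodified; -s suppresses output
# unless an error occurs.
applies_cleanly() {
    patch -d "$1" -p1 --dry-run -s < "$2"
}
```

For example: `applies_cleanly lustre-1.8.9 attachment.patch && echo "safe to apply"`. If the check fails, the failing hunks point at files that differ from the tree the patch was made against.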

            best regards, Martin

            mrfusion John DeSantis (Inactive) added a comment - - edited

            Martin,

            Thanks for supplying the patch and the discussion thread!

            Unfortunately, the module still isn't getting built. I tried several different approaches:

            1.) Applied the patch to our 1.8.7 "build" directory; there were some hunk errors applying the patch, but otherwise the RPM's were built. Querying the module package showed that the ko2iblnd module was still missing when compiled against 2.6.32-504.1.3.

            2.) Repeated step #1 above, but this time I cherry-picked the sections of your patch which applied to the files in question (Makefile, autoMakefile.am.toplevel, and build/autoconf/lustre-build-linux.m4) so as to avoid the hunk errors while patching. This resulted in RPM build errors.

            3.) Downloaded the lustre 1.8.7 source from http://downloads.lustre.org/public/lustre/v1.8/lustre_1.8.7/source/ and attempted a rebuild after applying the patch http://review.whamcloud.com/#/c/8607/ and your recent patch. Once I attempted to "make rpms", I received build errors.

            4.) Downloaded the latest git repo for b1_8 and applied your patch. I then ran autogen.sh and attempted to build. I did not receive any errors, but the ko2iblnd module was missing as with step #1.

            I've attached another file which shows the patch errors which led to cherry picking and some of the errors experienced when "make rpms" was used.

            If any of the information seems incomplete, I'll redo the steps above and save all of the output for perusal. During my testing today it's possible that I may have done something out of order.

            Again, thanks so far for your help.

            John DeSantis


            @John: I had the same problem - I had to backport a commit from the lustre 2-branch (LU-1337).
            Please find the patch and some discussion here:
            https://lists.01.org/pipermail/hpdd-discuss/2014-November/001370.html

            @Team: could you pick up the patch from there for review?
            Today I received the confirmation of an independent build with that patch.

            martin_hecht Martin Hecht added a comment

            Forgot to initially include the config.log.

            mrfusion John DeSantis (Inactive) added a comment

            People

              wc-triage WC Triage
              mrfusion John DeSantis (Inactive)