
Can't compile lustre client against MLNX OFED-5.2-1.0.4 on Centos 7.8

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.7
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: Dell and Lenovo hardware. MLNX OFED-5.2-1.0.4. Lustre 2.12.5. OS is Centos 7.8. Kernel is 3.10.0-1127.19.1.el7.x86_64

    Description

      Hello, I am trying to install Lustre with dkms on our LNet routers, which have ConnectX-5 cards, on Centos 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64. Also, Mellanox just released their latest driver version, OFED-5.2-1.0.4, yesterday (Jan 4, 2021). When dkms tries to compile Lustre, it fails with the following at the end:

      configure: LNet kernel checks
      ==============================================================================
      checking whether to enable CPU affinity support... yes
      checking if Linux kernel has cpu affinity support... yes
      checking whether to enable tunable backoff TCP support... yes
      checking if Linux kernel has tunable backoff TCP support... no
      checking whether to use Compat RDMA... /bin/ofed_info
      no
      configure: error: no OFED nor kernel OpenIB gen2 headers present
      configure error, check /var/lib/dkms/lustre-client/2.12.5/build/config.log

      Building module:
      cleaning build area...(bad exit status: 2)
      make -j8 KERNELRELEASE=3.10.0-1127.19.1.el7.x86_64...(bad exit status: 2)
      Error! Bad return status for module build on kernel: 3.10.0-1127.19.1.el7.x86_64 (x86_64)
      Consult /var/lib/dkms/lustre-client/2.12.5/build/make.log for more information.
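
The configure error above only names the check that failed; the details are recorded in config.log. A minimal sketch for pulling out the relevant lines, assuming a POSIX shell and `grep`. The function name `o2ib_probe_lines` and the search pattern are illustrative choices, not part of Lustre or DKMS:

```shell
# Print the lines of a configure log that mention the OFED / o2ib probe,
# prefixed with their line numbers.
# $1: path to config.log (defaults to the DKMS build path from the error above).
o2ib_probe_lines() {
    log="${1:-/var/lib/dkms/lustre-client/2.12.5/build/config.log}"
    grep -n -i -E 'ofed|o2ib|openib|compat rdma' "$log"
}

# Usage: o2ib_probe_lines            # reads the default DKMS path
#        o2ib_probe_lines ./config.log
```

The line numbers make it easier to open config.log at the failing probe and read the compiler output that surrounds it.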

      Also, I verified that the MLNX rpms that are supposed to be installed are installed.
      On the machine I am trying to install on, ibstat reports both cards as Active with physical state LinkUp:

      [root@lnet08 ~]# ibstat
      CA 'mlx5_0'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.26.1040
      Hardware version: 0
      Node GUID: 0xb8599f03002f8318
      System image GUID: 0xb8599f03002f8318
      Port 1:
      State: Active
      Physical state: LinkUp
      Rate: 100
      Base lid: 1522
      LMC: 0
      SM lid: 1434
      Capability mask: 0x2651e848
      Port GUID: 0xb8599f03002f8318
      Link layer: InfiniBand
      CA 'mlx5_1'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.26.1040
      Hardware version: 0
      Node GUID: 0xb8599f03002f8319
      System image GUID: 0xb8599f03002f8318
      Port 1:
      State: Active
      Physical state: LinkUp
      Rate: 56
      Base lid: 2260
      LMC: 0
      SM lid: 158
      Capability mask: 0x2651e848
      Port GUID: 0xb8599f03002f8319
      Link layer: InfiniBand
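
The ibstat listing above can be condensed to one line per adapter. A sketch assuming `awk` and the field labels exactly as printed above; `ibstat_summary` is a hypothetical helper name:

```shell
# Read ibstat output on stdin and print "<CA name> <port state> <rate>"
# for every port that reports a rate.
ibstat_summary() {
    awk -v q="'" '
        $1 == "CA" && NF == 2 { ca = $2; gsub(q, "", ca) }  # CA header line, strip quotes
        $1 == "State:"        { state = $2 }                 # logical port state
        $1 == "Rate:"         { print ca, state, $2 }        # rate as printed by ibstat
    '
}

# Usage: ibstat | ibstat_summary
```

For the listing above this prints `mlx5_0 Active 100` and `mlx5_1 Active 56`, which confirms the links are up and points the investigation at the build rather than the fabric.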

      Any ideas how to get this to work?

      Thanks,
      Mike

      Attachments

        1. autogen.sh
          0.3 kB
        2. config.log
          208 kB
        3. lustre-version.m4
          1 kB

          Activity

            [LU-14297] Can't compile lustre client against MLNX OFED-5.2-1.0.4 on Centos 7.8
            yujian Jian Yu added a comment -

            And before running autogen.sh, the attached lustre-version.m4 also needs to be put into /usr/src/lustre-client-2.12.5/config.
            The following steps work for me from scratch:

            # rpm -ivh lustre-client-dkms-2.12.5-1.el7.noarch.rpm
            # cd /usr/src/lustre-client-2.12.5/
            # patch -p1 < /root/0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch 
            # patch -p1 < /root/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch
            # cp /root/autogen.sh .
            # cp /root/lustre-version.m4 config/
            # sh ./autogen.sh 
            # dkms install -k $(uname -r) lustre-client/2.12.5
            ...
            ...
             - Installation
               - Installing to /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/
            Adding any weak-modules
            
            depmod....
            
            DKMS: install completed.
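
The steps above can be collected into one function with a dry-run mode, so the command sequence can be reviewed before anything is changed. This is a sketch, not a tested procedure: it assumes the lustre-client-dkms rpm is already installed and that the two patches, autogen.sh and lustre-version.m4 sit in one directory. The helper names (`run`, `run_in`, `rebuild_client`) and the `DRY_RUN`, `SRC`, `FILES` and `KVER` variables are hypothetical. It uses `patch -d DIR -i FILE` in place of `cd` plus `patch < FILE`, which is equivalent.

```shell
# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same as run, but executed from inside directory $1.
run_in() {
    dir=$1; shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ (cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

# Patch the DKMS source tree, regenerate configure, then build and install.
# Stops at the first failing step.
rebuild_client() {
    src="${SRC:-/usr/src/lustre-client-2.12.5}"
    files="${FILES:-/root}"
    kver="${KVER:-$(uname -r)}"
    run patch -d "$src" -p1 -i "$files/0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch" &&
    run patch -d "$src" -p1 -i "$files/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch" &&
    run cp "$files/autogen.sh" "$src/" &&
    run cp "$files/lustre-version.m4" "$src/config/" &&
    run_in "$src" sh ./autogen.sh &&
    run dkms install -k "$kver" lustre-client/2.12.5
}

# Usage: DRY_RUN=1 rebuild_client    # print the six commands only
#        rebuild_client              # actually run them (as root)
```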
            
            yujian Jian Yu added a comment -

            Hi Mike,
            I can reproduce your issue. After applying the patches, could you please run the attached autogen.sh under /usr/src/lustre-client-2.12.5 before running dkms install ...?

            # pwd
            /usr/src/lustre-client-2.12.5
            # sh ./autogen.sh
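
Patches that modify .m4 inputs, such as lnet/autoconf/lustre-lnet.m4, have no effect on an already generated configure script until autogen.sh is rerun. A rough way to check for that situation, assuming the source tree has a generated `configure` at its top level; `needs_autogen` is a hypothetical helper, and comparing modification times is only a heuristic:

```shell
# Succeeds (exit 0) if configure is missing or any .m4 file under the tree
# is newer than it; fails (exit 1) if configure looks up to date.
# $1: source tree (defaults to the DKMS source directory used in this thread).
needs_autogen() {
    src="${1:-/usr/src/lustre-client-2.12.5}"
    [ -f "$src/configure" ] || return 0
    [ -n "$(find "$src" -name '*.m4' -newer "$src/configure" -print)" ]
}

# Usage: needs_autogen && echo "rerun autogen.sh before dkms install"
```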
            

            mre64 Michael Ethier (Inactive) added a comment -

            So I have an lnet router out of service that I was trying to get running with the latest MOFED and lustre 2.12.5. Should I just rebuild it back to its previous functioning setup? I don't want to leave it down for a long time.
            pjones Peter Jones added a comment -

            My suggestion is that we expedite landing https://review.whamcloud.com/#/c/41152/ to b2_12, and then the tip of b2_12 will be what is needed to build 2.12.6 for MOFED 5.2. We have not thought about 2.12.7 timing yet, but we will certainly want to include this fix.

            mre64 Michael Ethier (Inactive) added a comment -

            Hi Jian,
            Any luck in trying my method?
            Thanks,
            Mike
            yujian Jian Yu added a comment -

            You're welcome, Mike. I'm not sure when the next 2.12.x version will be released.
            I directly installed the lustre-client-dkms rpm generated by Jenkins build system https://build.whamcloud.com/job/lustre-reviews/78580/arch=x86_64,build_type=client,distro=el7.8,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/lustre-client-dkms-2.12.5_1_g726eed2-1.el7.noarch.rpm without problem.
            I will try your method to see how it goes.


            mre64 Michael Ethier (Inactive) added a comment -

            Hi Jian,
            I just applied the 2 patches you recommended to lustre 2.12.5 and it is still failing the same way. How exactly are you applying those 2 patches? This is what I did:

            [root@cannonlnet08 lustre-client-2.12.5]# pwd
            /usr/src/lustre-client-2.12.5
            [root@cannonlnet08 lustre-client-2.12.5]# patch -p1 < ~/14e02fb3.diff
            patching file lnet/autoconf/lustre-lnet.m4
            Hunk #3 succeeded at 567 with fuzz 2 (offset -23 lines).
            patching file lnet/klnds/o2iblnd/o2iblnd.c
            patching file lnet/klnds/o2iblnd/o2iblnd.h
            patching file lnet/klnds/o2iblnd/o2iblnd_cb.c
            [root@cannonlnet08 lustre-client-2.12.5]# patch -p1 < ~/ba702c79.diff
            patching file lnet/autoconf/lustre-lnet.m4
            Hunk #1 succeeded at 579 with fuzz 2 (offset 9 lines).
            patching file lnet/klnds/o2iblnd/o2iblnd_cb.c
            Hunk #1 succeeded at 2418 (offset 11 lines).

            Then I started the build:
            [root@cannonlnet08 lustre-client-2.12.5]# dkms install -k $(uname -r) lustre-client/2.12.5

            Kernel preparation unnecessary for this kernel. Skipping...

            Running the pre_build script:
            checking build system type... x86_64-unknown-linux-gnu
            ...
            ...

            mre64 Michael Ethier (Inactive) added a comment -

            BTW, do you know when this issue will be fixed in the general lustre release? 2.12.6 is already released.

            mre64 Michael Ethier (Inactive) added a comment -

            Hi Jian,
            Thanks, you have been very responsive with regard to my issue. I will see if I can make this work.
            Mike
            yujian Jian Yu added a comment -

            Hi Mike,
            There are some LNet fixups and improvements in Lustre 2.12.6, but I'm not sure if there are compatibility issues.
            I just verified that with the following two patches applied to Lustre 2.12.5, the client build also passed on CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64 and MLNX_OFED 5.2-1.0.4.0:

            https://review.whamcloud.com/41152 "LU-13783 o2iblnd: make FMR-pool support optional."
            https://review.whamcloud.com/39781 "LU-13761 o2ib: Fix compilation with MOFED 5.1"

            mre64 Michael Ethier (Inactive) added a comment -

            Hi Jian,

            Thanks for the feedback. However, we are running Lustre client 2.12.5 almost everywhere on our production infrastructure.

            I am currently working on updating our LNET routers from Centos 7.7, Lustre 2.12.4 and OFED-4.7-1.0.0 to Centos 7.8, and was hoping to keep the lustre version the same (i.e. 2.12.5).

            Based on your info, I have to use lustre 2.12.6 in order to get this to work with the latest MLNX OFED, and Mellanox recommends I use their latest OFED version. Do you know of any compatibility issues or other issues with updating our LNET routers to 2.12.6? Or should I just leave them alone, as they seem to be working fine?

            Thanks,
            Mike

            People

              yujian Jian Yu
              mre64 Michael Ethier (Inactive)
              Votes: 0
              Watchers: 6
