[LU-14297] Can't compile lustre client against MLNX OFED-5.2-1.0.4 on Centos 7.8

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.7
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: Dell and Lenovo hardware. MLNX OFED-5.2-1.0.4. Lustre 2.12.5. OS is CentOS 7.8. Kernel is 3.10.0-1127.19.1.el7.x86_64

    Description

      Hello, I am trying to install Lustre with DKMS on our LNet routers, which have ConnectX-5 cards installed. The OS is CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64. Mellanox released its latest driver version, OFED-5.2-1.0.4, yesterday (Jan 4, 2021). When DKMS tries to compile Lustre, it fails with the following at the end:

      configure: LNet kernel checks
      ==============================================================================
      checking whether to enable CPU affinity support... yes
      checking if Linux kernel has cpu affinity support... yes
      checking whether to enable tunable backoff TCP support... yes
      checking if Linux kernel has tunable backoff TCP support... no
      checking whether to use Compat RDMA... /bin/ofed_info
      no
      configure: error: no OFED nor kernel OpenIB gen2 headers present
      configure error, check /var/lib/dkms/lustre-client/2.12.5/build/config.log

      Building module:
      cleaning build area...(bad exit status: 2)
      make -j8 KERNELRELEASE=3.10.0-1127.19.1.el7.x86_64...(bad exit status: 2)
      Error! Bad return status for module build on kernel: 3.10.0-1127.19.1.el7.x86_64 (x86_64)
      Consult /var/lib/dkms/lustre-client/2.12.5/build/make.log for more information.
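
The log paths in the messages above are the place to look for the failing check. A minimal sketch for pulling the fatal configure lines out of either log, assuming the usual autoconf layout in which config.log records errors as `configure:<line>: error: ...`; the function name `show_configure_errors` is hypothetical:

```shell
# Print every fatal configure error in a log, with the two lines before it.
# Matches both "configure: error: ..." (console/make.log style) and
# "configure:1234: error: ..." (config.log style).
show_configure_errors() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    grep -n -B2 -E 'configure:([0-9]+:)? error:' "$log"
}

# Example, using the path from the DKMS message above:
#   show_configure_errors /var/lib/dkms/lustre-client/2.12.5/build/config.log
```

The lines just before the match usually show which header or compile test the check was attempting when it gave up.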

      I also verified that the MLNX RPMs that are supposed to be installed are installed.
      On the machine I am trying to install on, ibstat reports both cards as Active with a physical state of LinkUp:

      [root@lnet08 ~]# ibstat
      CA 'mlx5_0'
          CA type: MT4119
          Number of ports: 1
          Firmware version: 16.26.1040
          Hardware version: 0
          Node GUID: 0xb8599f03002f8318
          System image GUID: 0xb8599f03002f8318
          Port 1:
              State: Active
              Physical state: LinkUp
              Rate: 100
              Base lid: 1522
              LMC: 0
              SM lid: 1434
              Capability mask: 0x2651e848
              Port GUID: 0xb8599f03002f8318
              Link layer: InfiniBand
      CA 'mlx5_1'
          CA type: MT4119
          Number of ports: 1
          Firmware version: 16.26.1040
          Hardware version: 0
          Node GUID: 0xb8599f03002f8319
          System image GUID: 0xb8599f03002f8318
          Port 1:
              State: Active
              Physical state: LinkUp
              Rate: 56
              Base lid: 2260
              LMC: 0
              SM lid: 158
              Capability mask: 0x2651e848
              Port GUID: 0xb8599f03002f8319
              Link layer: InfiniBand
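
When the same check has to be repeated on several routers, the ibstat output above can be condensed to one line per port. This is only a sketch: the function name `ibstat_summary` is hypothetical, and it assumes each port block lists State, Physical state, and Rate in that order, as in the output above.

```shell
# Condense ibstat output to "<CA> <State> <Physical state> <Rate>" per port.
ibstat_summary() {
    # Strip the quotes around the CA name so awk can read it as a plain field.
    tr -d "'" | awk '
        $1 == "CA" && $2 != "type:"        { ca = $2 }      # "CA mlx5_0", not "CA type: ..."
        $1 == "State:"                     { state = $2 }
        $1 == "Physical" && $2 == "state:" { phys = $3 }
        $1 == "Rate:"                      { print ca, state, phys, $2 }
    '
}

# Usage on a host with the InfiniBand tools installed:
#   ibstat | ibstat_summary
```

For the output shown above this prints `mlx5_0 Active LinkUp 100` and `mlx5_1 Active LinkUp 56`. Both links are up, so the configure failure is not a link problem.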

      Any ideas on how to get this to work?

      Thanks,
      Mike

      Attachments

        1. autogen.sh
          0.3 kB
        2. config.log
          208 kB
        3. lustre-version.m4
          1 kB

          Activity

            pjones Peter Jones added a comment -

            Michael

            It's possible something new might come to light that quickly changes this but, as things stand today, my best guess is months.

            Peter


            mre64 Michael Ethier (Inactive) added a comment -

            Peter, our group is going to wait for 2.12.7 to be released before we update all our LNet routers. Do you think 2.12.7 will be released in weeks or months? Thanks.

            pjones Peter Jones added a comment -

            Mike

            The "official" release will be 2.12.7 but we don't have an exact timeline for it yet

            Peter


            mre64 Michael Ethier (Inactive) added a comment -

            Hi Jian,
            I followed your instructions and that seems to have worked; the LNet router is running. I need to rebuild 9 other LNet routers, and this is what I should do, correct? Or is there going to be an "official" release that will include this fix soon?
            It won't be an official version of 2.12.5, correct?
            Thanks,
            Mike

            yujian Jian Yu added a comment -

            Hi Mike,
            The same ones as those in #comment-288967


            mre64 Michael Ethier (Inactive) added a comment -

            Hi Jian,
            Are the patches I should apply the same ones or different ones? Can you give me pointers to them?
            Thanks,
            Mike

            yujian Jian Yu added a comment -

            And before running autogen.sh, the attached lustre-version.m4 also needs to be put into /usr/src/lustre-client-2.12.5/config.
            The following steps work for me from scratch:

            # rpm -ivh lustre-client-dkms-2.12.5-1.el7.noarch.rpm
            # cd /usr/src/lustre-client-2.12.5/
            # patch -p1 < /root/0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch 
            # patch -p1 < /root/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch
            # cp /root/autogen.sh .
            # cp /root/lustre-version.m4 config/
            # sh ./autogen.sh 
            # dkms install -k $(uname -r) lustre-client/2.12.5
            ...
            ...
             - Installation
               - Installing to /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/
            Adding any weak-modules
            
            depmod....
            
            DKMS: install completed.
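
For repeating the steps above on several routers, they can be wrapped in a small script. This is a sketch under stated assumptions, not a tested procedure: it assumes the RPM, both patches, and the two attachments were all saved in one directory (`FILES_DIR`, defaulting to /root as in the steps above), and the names `run` and `rebuild_lustre_client` are hypothetical. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: replays the manual steps above.
# DRY_RUN=1 (the default) prints each command instead of running it.
LUSTRE_VER=${LUSTRE_VER:-2.12.5}
SRC_DIR=${SRC_DIR:-/usr/src/lustre-client-$LUSTRE_VER}
FILES_DIR=${FILES_DIR:-/root}   # assumption: rpm, patches, and attachments live here

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@" || { echo "failed: $*" >&2; return 1; }
    fi
}

rebuild_lustre_client() {
    run rpm -ivh "$FILES_DIR/lustre-client-dkms-$LUSTRE_VER-1.el7.noarch.rpm" || return 1
    for p in 0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch \
             0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch; do
        # -d and -i are equivalent to "cd $SRC_DIR && patch -p1 < file"
        run patch -p1 -d "$SRC_DIR" -i "$FILES_DIR/$p" || return 1
    done
    run cp "$FILES_DIR/autogen.sh" "$SRC_DIR/" || return 1
    run cp "$FILES_DIR/lustre-version.m4" "$SRC_DIR/config/" || return 1
    run sh -c "cd '$SRC_DIR' && sh ./autogen.sh" || return 1
    run dkms install -k "$(uname -r)" "lustre-client/$LUSTRE_VER"
}

# Review the printed commands first:
#   rebuild_lustre_client
# Then run them for real:
#   DRY_RUN=0 rebuild_lustre_client
```

Stopping at the first failing command keeps a patch that does not apply cleanly from being followed by a DKMS build against half-patched sources.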
            
            yujian Jian Yu added a comment -

            Hi Mike,
            I can reproduce your issue. After applying the patches, could you please run the attached autogen.sh under /usr/src/lustre-client-2.12.5 before running dkms install ...?

            # pwd
            /usr/src/lustre-client-2.12.5
            # sh ./autogen.sh
            

            mre64 Michael Ethier (Inactive) added a comment -

            So I have an LNet router out of service that I was trying to get running with the latest MOFED and Lustre 2.12.5. Should I just rebuild it back to its previous functioning setup? I don't want to leave it down for a long time.

            pjones Peter Jones added a comment -

            My suggestion is that we expedite landing https://review.whamcloud.com/#/c/41152/ to b2_12; the tip of b2_12 will then be what is needed to build 2.12.6 for MOFED 5.2. We have not thought about 2.12.7 timing yet, but we will certainly want to include this fix.


            People

              yujian Jian Yu
              mre64 Michael Ethier (Inactive)