[LU-14297] Can't compile lustre client against MLNX OFED-5.2-1.0.4 on Centos 7.8 Created: 06/Jan/21  Updated: 23/Jan/24  Resolved: 23/Jan/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.12.7

Type: Bug Priority: Major
Reporter: Michael Ethier (Inactive) Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Dell and Lenovo hardware. MLNX OFED-5.2-1.0.4. Lustre 2.12.5. OS is Centos 7.8. Kernel is 3.10.0-1127.19.1.el7.x86_64


Attachments: File autogen.sh     Text File config.log     File lustre-version.m4    
Issue Links:
Related
is related to LU-13783 Support for linux kernel version 5.8 Resolved
is related to LU-13761 MOFED 5.1 support Resolved
Epic/Theme: MLNX, OFED-5.2-1.0.4, lustre-2.12.5
Severity: 3

 Description   

Hello, I am trying to install Lustre via dkms on our LNet routers, which have ConnectX-5 cards installed, on CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64. Mellanox released their latest driver version, OFED-5.2-1.0.4, yesterday (Jan 4, 2021). When dkms tries to compile Lustre, it fails with the following at the end:

configure: LNet kernel checks
==============================================================================
checking whether to enable CPU affinity support... yes
checking if Linux kernel has cpu affinity support... yes
checking whether to enable tunable backoff TCP support... yes
checking if Linux kernel has tunable backoff TCP support... no
checking whether to use Compat RDMA... /bin/ofed_info
no
configure: error: no OFED nor kernel OpenIB gen2 headers present
configure error, check /var/lib/dkms/lustre-client/2.12.5/build/config.log

Building module:
cleaning build area...(bad exit status: 2)
make -j8 KERNELRELEASE=3.10.0-1127.19.1.el7.x86_64...(bad exit status: 2)
Error! Bad return status for module build on kernel: 3.10.0-1127.19.1.el7.x86_64 (x86_64)
Consult /var/lib/dkms/lustre-client/2.12.5/build/make.log for more information.

I also verified that the MLNX rpms that are supposed to be installed are installed.
On the machine I am trying to install on, ibstat reports both cards as Active with physical state LinkUp:

[root@lnet08 ~]# ibstat
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.26.1040
    Hardware version: 0
    Node GUID: 0xb8599f03002f8318
    System image GUID: 0xb8599f03002f8318
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 1522
        LMC: 0
        SM lid: 1434
        Capability mask: 0x2651e848
        Port GUID: 0xb8599f03002f8318
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.26.1040
    Hardware version: 0
    Node GUID: 0xb8599f03002f8319
    System image GUID: 0xb8599f03002f8318
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 2260
        LMC: 0
        SM lid: 158
        Capability mask: 0x2651e848
        Port GUID: 0xb8599f03002f8319
        Link layer: InfiniBand

Any ideas how to get this to work ?

Thanks,
Mike



 Comments   
Comment by Peter Jones [ 06/Jan/21 ]

Jian

Could you please investigate?

Thanks

Peter

Comment by Jian Yu [ 06/Jan/21 ]

Hi Mike,

checking whether to use Compat RDMA... /bin/ofed_info
no
configure: error: no OFED nor kernel OpenIB gen2 headers present
configure error, check /var/lib/dkms/lustre-client/2.12.5/build/config.log

Could you please upload the config.log to this ticket for investigation?

Comment by Michael Ethier (Inactive) [ 06/Jan/21 ]

Hi, sure, it's attached.
Thanks,
Mike
config.log

Comment by Jian Yu [ 06/Jan/21 ]

Hi Mike,

        o2ib_found=false
        for O2IBPATH in $O2IBPATHS; do
                AS_IF([test \( -f ${O2IBPATH}/include/rdma/rdma_cm.h -a \
                           -f ${O2IBPATH}/include/rdma/ib_cm.h -a \
                           -f ${O2IBPATH}/include/rdma/ib_verbs.h -a \
                           -f ${O2IBPATH}/include/rdma/ib_fmr_pool.h \)], [
                        o2ib_found=true
                        break
                ])
        done

Could you please check if the above header files are located under /usr/src/ofa_kernel/default?
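
The same test that the configure loop above performs can be run by hand. The following is only a sketch: `check_o2ib_headers` is a hypothetical helper name (not part of Lustre or MLNX_OFED), and the default path is the one mentioned in this comment.

```shell
#!/bin/sh
# Sketch: report which of the four headers tested by the configure loop
# are missing from an OFED source tree.
# check_o2ib_headers is a hypothetical name chosen for this example.
check_o2ib_headers() {
    o2ibpath=${1:-/usr/src/ofa_kernel/default}
    missing=0
    for hdr in rdma_cm.h ib_cm.h ib_verbs.h ib_fmr_pool.h; do
        if [ ! -f "$o2ibpath/include/rdma/$hdr" ]; then
            echo "missing: $o2ibpath/include/rdma/$hdr"
            missing=1
        fi
    done
    # Returns 0 only when all four headers are present.
    return "$missing"
}
```

Calling `check_o2ib_headers` with no argument checks /usr/src/ofa_kernel/default; any other tree can be passed as the first argument. A non-zero return corresponds to the case where configure stops with "no OFED nor kernel OpenIB gen2 headers present".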

Comment by Michael Ethier (Inactive) [ 06/Jan/21 ]

Hi Jian,
This is what I found:
[root@cannonlnet08 ~]# cd /usr/src/ofa_kernel/default
[root@cannonlnet08 default]# find . -name rdma_cm.h -print
./include/rdma/rdma_cm.h
[root@cannonlnet08 default]# find . -name ib_verbs.h -print
./include/rdma/ib_verbs.h
[root@cannonlnet08 default]# find . -name ib_fmr_pool.h -print
[root@cannonlnet08 default]#

Looks like ib_fmr_pool.h is missing ?

[root@cannonlnet08 default]# cd ./include/rdma
[root@cannonlnet08 rdma]# ls
ib_addr.h ib_hdrs.h ib_sa.h ib_verbs.h lag.h opa_vnic.h rdma_netlink.h restrack.h uverbs_named_ioctl.h
iba.h ib_mad.h ib_smi.h ib_verbs_nvmf_def.h mr_pool.h peer_mem.h rdmavt_cq.h rw.h uverbs_std_types.h
ib_cache.h ib_marshall.h ibta_vol1_c12.h ib_verbs_nvmf.h opa_addr.h rdma_cm.h rdma_vt.h signature.h uverbs_types.h
ib_cm.h ib_pack.h ib_umem.h iw_cm.h opa_port_info.h rdma_cm_ib.h rdmavt_mr.h tid_rdma_defs.h
ib.h ib_pma.h ib_umem_odp.h iw_portmap.h opa_smi.h rdma_counter.h rdmavt_qp.h uverbs_ioctl.h

Thanks,
Mike

Comment by Jian Yu [ 06/Jan/21 ]

Yes, Mike. If one or more of those files is missing, then configure will return "error: no OFED nor kernel OpenIB gen2 headers present".
The files are usually included in mlnx-ofa_kernel-devel rpm. It seems ib_fmr_pool.h is missing from MLNX_OFED 5.2-1.0.4.0. Let me investigate further.

Comment by James A Simmons [ 06/Jan/21 ]

Can you try patch https://review.whamcloud.com/#/c/40287 ?

Comment by Michael Ethier (Inactive) [ 06/Jan/21 ]

Hi James, can you point me to the procedure to apply the patch ?
Thanks, Mike

Comment by Michael Ethier (Inactive) [ 06/Jan/21 ]

Is this the procedure? I would run without --dry-run to apply the changes.

[root@cannonlnet08 lustre-client-2.12.5]# pwd
/usr/src/lustre-client-2.12.5
[root@cannonlnet08 lustre-client-2.12.5]# patch -p1 --dry-run < ~/14b20ca6.diff
checking file lnet/autoconf/lustre-lnet.m4
Hunk #1 succeeded at 157 (offset 48 lines).
Hunk #2 succeeded at 234 (offset 48 lines).
Hunk #3 succeeded at 567 with fuzz 2 (offset 25 lines).
checking file lnet/klnds/o2iblnd/o2iblnd.c
Hunk #1 succeeded at 1469 (offset 56 lines).
Hunk #2 succeeded at 1528 (offset 56 lines).
Hunk #3 succeeded at 1557 (offset 56 lines).
Hunk #4 succeeded at 1566 (offset 56 lines).
Hunk #5 succeeded at 1675 (offset 56 lines).
Hunk #6 succeeded at 1767 (offset 57 lines).
Hunk #7 succeeded at 1789 (offset 57 lines).
Hunk #8 succeeded at 1799 (offset 57 lines).
Hunk #9 succeeded at 1813 (offset 57 lines).
Hunk #10 succeeded at 1855 (offset 57 lines).
Hunk #11 succeeded at 1871 (offset 57 lines).
Hunk #12 succeeded at 1894 (offset 57 lines).
Hunk #13 FAILED at 1885.
Hunk #14 succeeded at 1960 (offset 57 lines).
Hunk #15 succeeded at 1987 (offset 57 lines).
Hunk #16 succeeded at 2586 (offset -21 lines).
Hunk #17 succeeded at 2600 (offset -21 lines).
1 out of 17 hunks FAILED
checking file lnet/klnds/o2iblnd/o2iblnd.h
Hunk #1 succeeded at 71 (offset -12 lines).
Hunk #2 succeeded at 174 (offset -7 lines).
Hunk #3 succeeded at 337 with fuzz 2 (offset -13 lines).
Hunk #4 FAILED at 388.
1 out of 4 hunks FAILED
checking file lnet/klnds/o2iblnd/o2iblnd_cb.c
Hunk #3 succeeded at 625 (offset -1 lines).
Hunk #4 succeeded at 656 (offset -1 lines).
Hunk #5 succeeded at 687 (offset -1 lines).

Comment by Jian Yu [ 06/Jan/21 ]

Hi Mike,
I'm back-porting the patch to Lustre 2.12.5 and will share it with you.
BTW, it turns out ib_fmr_pool.h exists in kernel-devel-3.10.0-1127.19.1.el7:

# rpm -qlp kernel-devel-3.10.0-1127.19.1.el7.x86_64.rpm | grep ib_fmr_pool.h
/usr/src/kernels/3.10.0-1127.19.1.el7.x86_64/include/rdma/ib_fmr_pool.h

While installing MLNX_OFED 5.2-1.0.4.0 on the node, did you pass "--add-kernel-support" option to mlnxofedinstall or run mlnx_add_kernel_support.sh to generate an MLNX_OFED package with drivers for the kernel 3.10.0-1127.19.1.el7 on the node?

Comment by Michael Ethier (Inactive) [ 06/Jan/21 ]

Hi Jian,
Yes I ran mlnx_add_kernel_support.sh to generate an MLNX_OFED package with drivers for the kernel 3.10.0-1127.19.1.el7 on the node. Then I took the resulting .gz file and extracted the RPMS from it, and used them to install MLNX OFED on this node via the yum repo method.
Thanks,
Mike

Comment by Jian Yu [ 07/Jan/21 ]

Thank you Mike for the info.
Here is the back-ported patch for Lustre 2.12.5: https://review.whamcloud.com/41153. You can apply the patch and build manually, or wait for the Jenkins build to finish.

Comment by Michael Ethier (Inactive) [ 07/Jan/21 ]

Hi Jian,
I applied your patch and tried to build lustre 2.12.5 client and it failed the same way:
[root@cannonlnet08 lustre-client-2.12.5]# patch -p1 < ~/f4d9b03a.diff
patching file lnet/autoconf/lustre-lnet.m4
patching file lnet/klnds/o2iblnd/o2iblnd.c
patching file lnet/klnds/o2iblnd/o2iblnd.h
patching file lnet/klnds/o2iblnd/o2iblnd_cb.c

[root@cannonlnet08 ~]# dkms install -k $(uname -r) lustre-client/2.12.5

Kernel preparation unnecessary for this kernel. Skipping...

Running the pre_build script:
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
...
...
configure: LNet kernel checks
==============================================================================
checking whether to enable CPU affinity support... yes
checking if Linux kernel has cpu affinity support... yes
checking whether to enable tunable backoff TCP support... yes
checking if Linux kernel has tunable backoff TCP support... no
checking whether to use Compat RDMA... /bin/ofed_info
no
configure: error: no OFED nor kernel OpenIB gen2 headers present
configure error, check /var/lib/dkms/lustre-client/2.12.5/build/config.log

Building module:
cleaning build area...(bad exit status: 2)
make -j8 KERNELRELEASE=3.10.0-1127.19.1.el7.x86_64...(bad exit status: 2)
Error! Bad return status for module build on kernel: 3.10.0-1127.19.1.el7.x86_64 (x86_64)
Consult /var/lib/dkms/lustre-client/2.12.5/build/make.log for more information.

Comment by Jian Yu [ 07/Jan/21 ]

Hi Mike,
I'm setting up a test node to reproduce the issue and verify the patch.

Comment by Jian Yu [ 07/Jan/21 ]

Hi Mike,
I can reproduce the ib_fmr_pool.h missing issue on my test node. However, with patch https://review.whamcloud.com/41153 applied to Lustre 2.12.5, the issue was resolved. I installed the el7.8/x86_64 lustre-client-dkms rpm from Jenkins build https://build.whamcloud.com/job/lustre-reviews/78554/ :

# rpm -ivh lustre-client-dkms-2.12.5_1_gf4d9b03-1.el7.noarch.rpm 
Preparing...                          ################################# [100%]
Updating / installing...
   1:lustre-client-dkms-2.12.5_1_gf4d9################################# [100%]
Loading new lustre-client-2.12.5_1_gf4d9b03 DKMS files...
Building for 3.10.0-1127.19.1.el7.x86_64
Building initial module for 3.10.0-1127.19.1.el7.x86_64
<~snip~>

The config.log showed that:

configure:18577: checking whether to use Compat RDMA
configure:18673: result: yes
configure:18708: checking whether to use any OFED backport headers
configure:18716: result: no
configure:18725: checking whether to enable OpenIB gen2 support
<~snip~>
configure:18794: result: yes
configure:18817: adding /usr/src/ofa_kernel/default/Module.symvers to Symbol Path

configure passed.
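
To pull the same "checking"/"result" lines out of a long config.log, a filter along these lines may help. It is only a sketch: `config_checks` is a hypothetical helper name, and the default path is the DKMS build log named in the error message earlier in this ticket.

```shell
#!/bin/sh
# Sketch: show only the configure "checking ..." and "result: ..." lines
# from a config.log, dropping compiler output in between.
# config_checks is a hypothetical name chosen for this example.
config_checks() {
    grep -E '^configure:[0-9]+: (checking |result: )' \
        "${1:-/var/lib/dkms/lustre-client/2.12.5/build/config.log}"
}
```

Piping the result through `grep -A1 'Compat RDMA'` would narrow it down to the check that fails in the original report.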

Comment by Jian Yu [ 07/Jan/21 ]

After configure passed, building the code hit the following error:

/var/lib/dkms/lustre-client/2.12.5_1_gf4d9b03/build/lnet/klnds/o2iblnd/o2iblnd_cb.c: In function ‘kiblnd_reject’:
/var/lib/dkms/lustre-client/2.12.5_1_gf4d9b03/build/lnet/klnds/o2iblnd/o2iblnd_cb.c:2421:9: error: too few arguments to function ‘rdma_reject’
         rc = rdma_reject(cmid, rej, sizeof(*rej));
         ^

The error has been fixed in patch https://review.whamcloud.com/39781 and landed for Lustre 2.12.6.
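
The compiler message means the installed header declares rdma_reject with more parameters than the three that o2iblnd_cb.c passes. One way to see the declaration actually in use is to print it from the header. This is a sketch only: `show_prototype` is a hypothetical helper, and it matches on the text "name(" rather than parsing C, so it may also print call sites or similarly named functions.

```shell
#!/bin/sh
# Sketch: print the lines of a header from the first line mentioning
# "<function>(" up to the line containing the terminating ';'.
# show_prototype is a hypothetical name chosen for this example.
show_prototype() {
    awk -v fn="$1" '
        index($0, fn "(") { show = 1 }   # start at the line naming the function
        show              { print }
        show && /;/       { show = 0 }   # stop after the closing semicolon
    ' "$2"
}

# Usage, assuming the MOFED header location found earlier in this ticket:
#   show_prototype rdma_reject /usr/src/ofa_kernel/default/include/rdma/rdma_cm.h
```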

With patch https://review.whamcloud.com/41152 applied to Lustre 2.12.6, I can successfully build Lustre 2.12.6 client on CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64 and MLNX_OFED 5.2-1.0.4.0:

# rpm -ivh lustre-client-dkms-2.12.6_1_g14e02fb-1.el7.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:lustre-client-dkms-2.12.6_1_g14e0################################# [100%]
Loading new lustre-client-2.12.6_1_g14e02fb DKMS files...
Building for 3.10.0-1127.19.1.el7.x86_64
Building initial module for 3.10.0-1127.19.1.el7.x86_64
Done.
<~snip~>
ko2iblnd.ko.xz:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/
<~snip~>
Adding any weak-modules

depmod....

DKMS: install completed.

Comment by Michael Ethier (Inactive) [ 07/Jan/21 ]

Hi Jian,

Thanks for the feedback. However, we are running Lustre client 2.12.5 almost everywhere on our production infrastructure.

I am currently working on updating our LNET routers from Centos 7.7, Lustre 2.12.4 and OFED-4.7-1.0.0 to Centos 7.8, and was hoping to keep the lustre version the same (i.e. 2.12.5).

Based on your info I have to use lustre 2.12.6 in order to get this to work with the latest MLNX OFED, and Mellanox recommends I use their latest OFED version. Do you know of any compatibility issues or other issues with updating our LNET routers to 2.12.6? Or should I just leave them alone, as they seem to be working fine?

Thanks,
Mike

Comment by Jian Yu [ 07/Jan/21 ]

Hi Mike,
There are some LNet fixups and improvements in Lustre 2.12.6, but I'm not sure if there are compatibility issues.
I just verified that with the following two patches applied to Lustre 2.12.5, the client build also passed on CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64 and MLNX_OFED 5.2-1.0.4.0:

Comment by Michael Ethier (Inactive) [ 08/Jan/21 ]

Hi Jian,
Thanks, you have been very responsive with regard to my issue. I will see if I can make this work.
Mike

Comment by Michael Ethier (Inactive) [ 08/Jan/21 ]

BTW, do you know when this issue will be fixed in the general lustre release ? 2.12.6 is already released.

Comment by Michael Ethier (Inactive) [ 08/Jan/21 ]

Hi Jian,
I just applied the 2 patches you recommended to lustre 2.12.5 and it's still failing the same way. How exactly are you applying those 2 patches? This is what I did:

[root@cannonlnet08 lustre-client-2.12.5]# pwd
/usr/src/lustre-client-2.12.5
[root@cannonlnet08 lustre-client-2.12.5]# patch -p1 < ~/14e02fb3.diff
patching file lnet/autoconf/lustre-lnet.m4
Hunk #3 succeeded at 567 with fuzz 2 (offset -23 lines).
patching file lnet/klnds/o2iblnd/o2iblnd.c
patching file lnet/klnds/o2iblnd/o2iblnd.h
patching file lnet/klnds/o2iblnd/o2iblnd_cb.c
[root@cannonlnet08 lustre-client-2.12.5]# patch -p1 < ~/ba702c79.diff
patching file lnet/autoconf/lustre-lnet.m4
Hunk #1 succeeded at 579 with fuzz 2 (offset 9 lines).
patching file lnet/klnds/o2iblnd/o2iblnd_cb.c
Hunk #1 succeeded at 2418 (offset 11 lines).

Then I started the build:
[root@cannonlnet08 lustre-client-2.12.5]# dkms install -k $(uname -r) lustre-client/2.12.5

Kernel preparation unnecessary for this kernel. Skipping...

Running the pre_build script:
checking build system type... x86_64-unknown-linux-gnu
...
...

Comment by Jian Yu [ 08/Jan/21 ]

You're welcome, Mike. I'm not sure when the next 2.12.x version will be released.
I directly installed the lustre-client-dkms rpm generated by Jenkins build system https://build.whamcloud.com/job/lustre-reviews/78580/arch=x86_64,build_type=client,distro=el7.8,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/lustre-client-dkms-2.12.5_1_g726eed2-1.el7.noarch.rpm without problem.
I will try your method to see how it goes.

Comment by Michael Ethier (Inactive) [ 12/Jan/21 ]

Hi Jian,
Any luck in trying my method ?
Thanks,
Mike

Comment by Peter Jones [ 12/Jan/21 ]

My suggestion is that we expedite landing https://review.whamcloud.com/#/c/41152/ to b2_12 and then the tip of b2_12 will be what is needed to build 2.12.6 for MOFED 5.2. We have not thought about 2.12.7 timing yet, but we will certainly want to include this fix.

Comment by Michael Ethier (Inactive) [ 12/Jan/21 ]

So I have an lnet router out of service that I was trying to get running with the latest MOFED and lustre 2.12.5. Should I just rebuild it back to its previous functioning setup? I don't want to leave it down for a long time.

Comment by Jian Yu [ 12/Jan/21 ]

Hi Mike,
I can reproduce your issue. After applying the patches, could you please run the attached autogen.sh under /usr/src/lustre-client-2.12.5 before running dkms install ...?

# pwd
/usr/src/lustre-client-2.12.5
# sh ./autogen.sh

Comment by Jian Yu [ 12/Jan/21 ]

And before running autogen.sh, the attached lustre-version.m4 also needs to be put into /usr/src/lustre-client-2.12.5/config.
The following steps work for me from scratch:

# rpm -ivh lustre-client-dkms-2.12.5-1.el7.noarch.rpm
# cd /usr/src/lustre-client-2.12.5/
# patch -p1 < /root/0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch 
# patch -p1 < /root/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch
# cp /root/autogen.sh .
# cp /root/lustre-version.m4 config/
# sh ./autogen.sh 
# dkms install -k $(uname -r) lustre-client/2.12.5
...
...
 - Installation
   - Installing to /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/
Adding any weak-modules

depmod....

DKMS: install completed.
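
The autogen.sh step is presumably what the earlier attempts were missing: both patches touch lnet/autoconf/lustre-lnet.m4, and a configure script generated before the patch would still contain the old ib_fmr_pool.h test. A rough way to detect that situation is to compare timestamps. This is a sketch under stated assumptions: `configure_is_stale` is a hypothetical helper, and it assumes the generated configure script sits at the top of the source directory.

```shell
#!/bin/sh
# Sketch: succeed (return 0) when configure is missing or older than the
# patched lustre-lnet.m4, i.e. when autogen.sh probably needs to be re-run.
# configure_is_stale is a hypothetical name chosen for this example.
configure_is_stale() {
    srcdir=${1:-/usr/src/lustre-client-2.12.5}
    m4file=$srcdir/lnet/autoconf/lustre-lnet.m4
    # No configure script at all also counts as stale.
    [ -f "$srcdir/configure" ] || return 0
    # find prints the m4 file only if it is newer than configure.
    [ -n "$(find "$m4file" -newer "$srcdir/configure" 2>/dev/null)" ]
}

# Usage:
#   configure_is_stale && echo "re-run autogen.sh before dkms install"
```

Timestamps are only a heuristic here; re-running autogen.sh after every patch, as in the steps above, is the safer habit.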

Comment by Michael Ethier (Inactive) [ 12/Jan/21 ]

Hi Jian,
Are the patches I should apply the same ones or different ones? Can you give me pointers to them?
Thanks,
Mike

Comment by Jian Yu [ 12/Jan/21 ]

Hi Mike,
The same ones as those in #comment-288967

Comment by Michael Ethier (Inactive) [ 13/Jan/21 ]

Hi Jian,
I followed your instructions and that seems to have worked; the lnet router is running. I need to rebuild 9 other lnet routers. Is this the procedure I should use, or is there going to be an "official" release that will include this fix soon?
It won't be an official version of 2.12.5, correct?
Thanks,
Mike

Comment by Peter Jones [ 13/Jan/21 ]

Mike

The "official" release will be 2.12.7 but we don't have an exact timeline for it yet

Peter

Comment by Michael Ethier (Inactive) [ 13/Jan/21 ]

Peter, our group is going to wait for 2.12.7 to be released before we update all our lnet routers. Do you think 2.12.7 will be released in weeks or months? Thanks.

Comment by Peter Jones [ 13/Jan/21 ]

Michael

It's possible something new might come to light that quickly changes this but, as things stand today, my best guess is months.

Peter
