[LU-14297] Can't compile lustre client against MLNX OFED-5.2-1.0.4 on Centos 7.8 Created: 06/Jan/21 Updated: 23/Jan/24 Resolved: 23/Jan/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.12.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Michael Ethier (Inactive) | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Dell and Lenovo hardware. MLNX OFED-5.2-1.0.4. Lustre 2.12.5. OS is CentOS 7.8. Kernel is 3.10.0-1127.19.1.el7.x86_64 |
| Attachments: |
| Issue Links: |
| Epic/Theme: | MLNX, OFED-5.2-1.0.4, lustre-2.12.5 |
| Severity: | 3 |
| Description |
|
Hello,
I am trying to install Lustre using DKMS on our LNet routers, which have ConnectX-5 cards installed and run CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64. Mellanox also just released their latest driver version, OFED-5.2-1.0.4, yesterday (Jan 4, 2021).
When DKMS tries to compile Lustre, it fails with the following at the end:
configure: LNet kernel checks
Building module:
Also, I did verify that the MLNX RPMs that are supposed to be installed are installed.
[root@lnet08 ~]# ibstat
Any ideas how to get this to work?
Thanks, |
| Comments |
| Comment by Peter Jones [ 06/Jan/21 ] |
|
Jian, could you please investigate? Thanks, Peter |
| Comment by Jian Yu [ 06/Jan/21 ] |
|
Hi Mike, the output shows:
checking whether to use Compat RDMA... /bin/ofed_info no
configure: error: no OFED nor kernel OpenIB gen2 headers present
configure error, check /var/lib/dkms/lustre-client/2.12.5/build/config.log
Could you please upload the config.log to this ticket for investigation? |
| Comment by Michael Ethier (Inactive) [ 06/Jan/21 ] |
|
Hi, sure, it's attached. |
| Comment by Jian Yu [ 06/Jan/21 ] |
|
Hi Mike, the o2ib check in Lustre's configure code looks like this:
o2ib_found=false
for O2IBPATH in $O2IBPATHS; do
AS_IF([test \( -f ${O2IBPATH}/include/rdma/rdma_cm.h -a \
-f ${O2IBPATH}/include/rdma/ib_cm.h -a \
-f ${O2IBPATH}/include/rdma/ib_verbs.h -a \
-f ${O2IBPATH}/include/rdma/ib_fmr_pool.h \)], [
o2ib_found=true
break
])
done
Could you please check whether all of the above header files exist under /usr/src/ofa_kernel/default/include/rdma? |
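A rough shell equivalent of that configure loop can confirm which header is missing before rerunning DKMS. This is a sketch, not Lustre tooling: the function name `find_o2ib_path` is hypothetical, and the default search path /usr/src/ofa_kernel/default is the one discussed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): approximates the o2ib header
# check in Lustre's configure code. For each candidate O2IBPATH it tests
# for the same four headers, prints the first directory that has all of
# them, and reports on stderr which headers each rejected path lacks.

REQUIRED_O2IB_HEADERS="rdma_cm.h ib_cm.h ib_verbs.h ib_fmr_pool.h"

find_o2ib_path() {
    # Default to the MLNX_OFED source tree mentioned in this ticket.
    [ "$#" -gt 0 ] || set -- /usr/src/ofa_kernel/default
    for o2ib_path in "$@"; do
        missing=""
        for hdr in $REQUIRED_O2IB_HEADERS; do
            [ -f "$o2ib_path/include/rdma/$hdr" ] || missing="$missing $hdr"
        done
        if [ -z "$missing" ]; then
            echo "$o2ib_path"
            return 0
        fi
        echo "$o2ib_path: missing$missing" >&2
    done
    # Corresponds to configure's "no OFED nor kernel OpenIB gen2 headers present".
    return 1
}
```

Running `find_o2ib_path /usr/src/ofa_kernel/default` on the router in this ticket would presumably report `ib_fmr_pool.h` as missing, matching what Mike found.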
| Comment by Michael Ethier (Inactive) [ 06/Jan/21 ] |
|
Hi Jian, it looks like ib_fmr_pool.h is missing?
[root@cannonlnet08 default]# cd ./include/rdma
Thanks, |
| Comment by Jian Yu [ 06/Jan/21 ] |
|
Yes, Mike. If one or more of those files is missing, then configure will return "error: no OFED nor kernel OpenIB gen2 headers present". |
| Comment by James A Simmons [ 06/Jan/21 ] |
|
Could you try this patch: https://review.whamcloud.com/#/c/40287 |
| Comment by Michael Ethier (Inactive) [ 06/Jan/21 ] |
|
Hi James, can you point me to the procedure to apply the patch? |
| Comment by Michael Ethier (Inactive) [ 06/Jan/21 ] |
|
Is this the procedure? I would run it without --dry-run to apply the changes.
[root@cannonlnet08 lustre-client-2.12.5]# pwd |
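The exact commands are not preserved here, but the usual pattern with GNU patch is to run with `--dry-run` first (it reports whether every hunk applies without modifying any file) and only then apply for real. A minimal sketch, assuming GNU patch and a Gerrit change already saved locally as a `-p1` diff; the function name `apply_patch_checked` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: apply a -p1 patch to a source tree only if a dry
# run succeeds first. Assumes GNU patch (-d, --forward, --dry-run).
#   $1 = source tree, e.g. /usr/src/lustre-client-2.12.5
#   $2 = patch file
apply_patch_checked() {
    src_dir=$1
    patch_file=$2
    # --forward makes an already-applied patch fail instead of prompting
    # to reverse it; --dry-run leaves every file untouched.
    if ! patch -d "$src_dir" -p1 --forward --dry-run < "$patch_file" > /dev/null; then
        echo "$patch_file does not apply cleanly to $src_dir" >&2
        return 1
    fi
    # The dry run passed, so apply the same patch for real.
    patch -d "$src_dir" -p1 --forward < "$patch_file" > /dev/null
}
```

For example: `apply_patch_checked /usr/src/lustre-client-2.12.5 /root/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch`. The `-d` option is used instead of `cd` so a relative patch path still resolves from the caller's directory.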
| Comment by Jian Yu [ 06/Jan/21 ] |
|
Hi Mike, the kernel-devel package for this kernel does ship ib_fmr_pool.h:
# rpm -qlp kernel-devel-3.10.0-1127.19.1.el7.x86_64.rpm | grep ib_fmr_pool.h
/usr/src/kernels/3.10.0-1127.19.1.el7.x86_64/include/rdma/ib_fmr_pool.h
While installing MLNX_OFED 5.2-1.0.4.0 on the node, did you pass the "--add-kernel-support" option to mlnxofedinstall, or run mlnx_add_kernel_support.sh, to generate an MLNX_OFED package with drivers for the node's kernel, 3.10.0-1127.19.1.el7? |
| Comment by Michael Ethier (Inactive) [ 06/Jan/21 ] |
|
Hi Jian, |
| Comment by Jian Yu [ 07/Jan/21 ] |
|
Thank you Mike for the info. |
| Comment by Michael Ethier (Inactive) [ 07/Jan/21 ] |
|
Hi Jian,
[root@cannonlnet08 ~]# dkms install -k $(uname -r) lustre-client/2.12.5
Kernel preparation unnecessary for this kernel. Skipping...
Running the pre_build script:
Building module: |
| Comment by Jian Yu [ 07/Jan/21 ] |
|
Hi Mike, |
| Comment by Jian Yu [ 07/Jan/21 ] |
|
Hi Mike,
# rpm -ivh lustre-client-dkms-2.12.5_1_gf4d9b03-1.el7.noarch.rpm
Preparing... ################################# [100%]
Updating / installing...
1:lustre-client-dkms-2.12.5_1_gf4d9################################# [100%]
Loading new lustre-client-2.12.5_1_gf4d9b03 DKMS files...
Building for 3.10.0-1127.19.1.el7.x86_64
Building initial module for 3.10.0-1127.19.1.el7.x86_64
<~snip~>
The config.log showed that:
configure:18577: checking whether to use Compat RDMA
configure:18673: result: yes
configure:18708: checking whether to use any OFED backport headers
configure:18716: result: no
configure:18725: checking whether to enable OpenIB gen2 support
<~snip~>
configure:18794: result: yes
configure:18817: adding /usr/src/ofa_kernel/default/Module.symvers to Symbol Path
configure passed. |
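When reading a config.log like this one, it helps to pair each `checking ...` line with the `result: ...` line that answers it. A rough awk-based sketch; the function name is hypothetical, it assumes the `configure:NNNN: checking` / `configure:NNNN: result:` line format visible in the excerpt, and nested checks (such as the OpenIB gen2 one above, which runs sub-checks) may pair imperfectly:

```shell
#!/bin/sh
# Hypothetical helper: print "question -> answer" for each autoconf check
# recorded in a config.log (default: ./config.log).
summarize_config_checks() {
    awk '
        /^configure:[0-9]+: checking / {
            # Remember the most recent question.
            sub(/^configure:[0-9]+: checking /, "")
            question = $0
            next
        }
        /^configure:[0-9]+: result: / && question != "" {
            # Emit it together with its answer, then reset.
            sub(/^configure:[0-9]+: result: /, "")
            print question " -> " $0
            question = ""
        }
    ' "${1:-config.log}"
}
```

For example, `summarize_config_checks /var/lib/dkms/lustre-client/2.12.5_1_gf4d9b03/build/config.log | grep -i -e rdma -e ofed -e openib` narrows the output to the RDMA-related decisions.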
| Comment by Jian Yu [ 07/Jan/21 ] |
|
After configure passed, building the code hit the following error:
/var/lib/dkms/lustre-client/2.12.5_1_gf4d9b03/build/lnet/klnds/o2iblnd/o2iblnd_cb.c: In function ‘kiblnd_reject’:
/var/lib/dkms/lustre-client/2.12.5_1_gf4d9b03/build/lnet/klnds/o2iblnd/o2iblnd_cb.c:2421:9: error: too few arguments to function ‘rdma_reject’
rc = rdma_reject(cmid, rej, sizeof(*rej));
^
The error was fixed by patch https://review.whamcloud.com/39781, which landed for Lustre 2.12.6. With patch https://review.whamcloud.com/41152 applied to Lustre 2.12.6, I can successfully build the Lustre 2.12.6 client on CentOS 7.8 with kernel 3.10.0-1127.19.1.el7.x86_64 and MLNX_OFED 5.2-1.0.4.0:
# rpm -ivh lustre-client-dkms-2.12.6_1_g14e02fb-1.el7.noarch.rpm
Preparing... ################################# [100%]
Updating / installing...
1:lustre-client-dkms-2.12.6_1_g14e0################################# [100%]
Loading new lustre-client-2.12.6_1_g14e02fb DKMS files...
Building for 3.10.0-1127.19.1.el7.x86_64
Building initial module for 3.10.0-1127.19.1.el7.x86_64
Done.
<~snip~>
ko2iblnd.ko.xz:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/
<~snip~>
Adding any weak-modules
depmod....
DKMS: install completed. |
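The "too few arguments to function ‘rdma_reject’" error means the `rdma_reject()` prototype in the MLNX_OFED 5.2 headers takes more parameters than the three that Lustre 2.12.5 passes (newer RDMA stacks declare it with an extra reject `reason` parameter). A quick way to see which variant a header declares is to count the prototype's parameters. This is a heuristic text-matching sketch with a hypothetical function name, not the configure test the Lustre patches actually use:

```shell
#!/bin/sh
# Hypothetical helper: print the number of parameters in the rdma_reject()
# prototype of a given rdma_cm.h (3 = older API, 4 = newer API with a
# reject reason). Plain text matching, not a compile test.
rdma_reject_nargs() {
    header=$1
    # Join the header into one line so a prototype split across lines
    # still matches, capture the text between "rdma_reject(" and ")",
    # then count the comma-separated parameters.
    tr '\n' ' ' < "$header" |
        sed -n 's/.*[[:space:]*]rdma_reject[[:space:]]*(\([^)]*\)).*/\1/p' |
        awk -F',' '{ print NF }'
}
```

For example, `rdma_reject_nargs /usr/src/ofa_kernel/default/include/rdma/rdma_cm.h` should print a number greater than 3 on the router in this ticket, given the compiler error above.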
| Comment by Michael Ethier (Inactive) [ 07/Jan/21 ] |
|
Hi Jian, thanks for the feedback. However, we are running Lustre client 2.12.5 almost everywhere on our production infrastructure. I am currently working on updating our LNet routers from CentOS 7.7, Lustre 2.12.4 and OFED-4.7-1.0.0 to CentOS 7.8, and was hoping to keep the Lustre version the same (i.e. 2.12.5). Based on your info, I have to use Lustre 2.12.6 to get this working with the latest MLNX OFED, and Mellanox recommends that I use their latest OFED version. Do you know of any compatibility issues or other problems with updating our LNet routers to 2.12.6? Or should I just leave them alone, as they seem to be working fine? Thanks, |
| Comment by Jian Yu [ 07/Jan/21 ] |
|
Hi Mike,
|
| Comment by Michael Ethier (Inactive) [ 08/Jan/21 ] |
|
Hi Jian, |
| Comment by Michael Ethier (Inactive) [ 08/Jan/21 ] |
|
BTW, do you know when this fix will be included in a general Lustre release? 2.12.6 has already been released. |
| Comment by Michael Ethier (Inactive) [ 08/Jan/21 ] |
|
Hi Jian,
[root@cannonlnet08 lustre-client-2.12.5]# pwd
Then I started the build:
Kernel preparation unnecessary for this kernel. Skipping...
Running the pre_build script: |
| Comment by Jian Yu [ 08/Jan/21 ] |
|
You're welcome, Mike. I'm not sure when the next 2.12.x version will be released. |
| Comment by Michael Ethier (Inactive) [ 12/Jan/21 ] |
|
Hi Jian, |
| Comment by Peter Jones [ 12/Jan/21 ] |
|
My suggestion is that we expedite landing https://review.whamcloud.com/#/c/41152/ to b2_12; the tip of b2_12 will then be what is needed to build 2.12.6 for MOFED 5.2. We have not thought about 2.12.7 timing yet, but we will certainly want to include this fix. |
| Comment by Michael Ethier (Inactive) [ 12/Jan/21 ] |
|
So I have an LNet router out of service that I was trying to get running with the latest MOFED and Lustre 2.12.5. Should I just rebuild it back to its previous working setup? I don't want to leave it down for a long time. |
| Comment by Jian Yu [ 12/Jan/21 ] |
|
Hi Mike,
# pwd
/usr/src/lustre-client-2.12.5
# sh ./autogen.sh |
| Comment by Jian Yu [ 12/Jan/21 ] |
|
And before running autogen.sh, copy the attached lustre-version.m4 into the config/ directory. Here is the full sequence:
# rpm -ivh lustre-client-dkms-2.12.5-1.el7.noarch.rpm
# cd /usr/src/lustre-client-2.12.5/
# patch -p1 < /root/0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch
# patch -p1 < /root/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch
# cp /root/autogen.sh .
# cp /root/lustre-version.m4 config/
# sh ./autogen.sh
# dkms install -k $(uname -r) lustre-client/2.12.5
...
...
- Installation
- Installing to /lib/modules/3.10.0-1127.19.1.el7.x86_64/extra/
Adding any weak-modules
depmod....
DKMS: install completed. |
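The steps in this comment can be collected into a small re-runnable script, which makes the build easier to repeat on the other LNet routers. This is a sketch, not a supported procedure: the function names and the `DRY_RUN` switch are hypothetical additions, while the file locations (/root/..., /usr/src/lustre-client-2.12.5) are the ones used in this thread.

```shell
#!/bin/sh
# Sketch: the rebuild steps from this comment as sourceable functions.
# With DRY_RUN=1 the commands are only printed, not executed.

LUSTRE_SRC=/usr/src/lustre-client-2.12.5
PATCHES="/root/0001-LU-13761-o2ib-Fix-compilation-with-MOFED-5.1.patch
/root/0001-LU-13783-o2iblnd-make-FMR-pool-support-optional.patch"

# Run a command, or just echo it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The body runs in a subshell, so "set -e" and "cd" do not leak into the
# caller; any failing step aborts the rest of the sequence.
rebuild_lustre_client() (
    set -e
    run rpm -ivh lustre-client-dkms-2.12.5-1.el7.noarch.rpm
    run cd "$LUSTRE_SRC"
    for p in $PATCHES; do
        # Equivalent to "patch -p1 < $p", written without a redirection
        # so the dry-run mode can print it.
        run patch -p1 -i "$p"
    done
    run cp /root/autogen.sh .
    run cp /root/lustre-version.m4 config/
    run sh ./autogen.sh
    run dkms install -k "$(uname -r)" lustre-client/2.12.5
)
```

After sourcing the file, `DRY_RUN=1 rebuild_lustre_client` previews the eight commands, and `rebuild_lustre_client` runs them for real.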
| Comment by Michael Ethier (Inactive) [ 12/Jan/21 ] |
|
Hi Jian, |
| Comment by Jian Yu [ 12/Jan/21 ] |
|
Hi Mike, |
| Comment by Michael Ethier (Inactive) [ 13/Jan/21 ] |
|
Hi Jian, |
| Comment by Peter Jones [ 13/Jan/21 ] |
|
Mike, the "official" release will be 2.12.7, but we don't have an exact timeline for it yet. Peter |
| Comment by Michael Ethier (Inactive) [ 13/Jan/21 ] |
|
Peter, our group is going to wait for 2.12.7 to be released before we update all our LNet routers. Do you think 2.12.7 will be released in weeks or months? Thanks. |
| Comment by Peter Jones [ 13/Jan/21 ] |
|
Michael, it's possible that something new might come to light that quickly changes this, but as things stand today, my best guess is months. Peter |