[LU-11842] 2.12.0 lustre-zfs dkms install with MOFED - can't build dkms with ofa_kernel Created: 09/Jan/19  Updated: 24/Jan/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Question/Request Priority: Major
Reporter: Jeff Johnson (Inactive) Assignee: Nathaniel Clark
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6-1810, SPL/ZFS 0.7.9 (dkms) Lustre 2.12.0-ib. MOFED 4.5-1.0.1.0, kernel-3.10.0-957.el7_lustre.x86_64



 Description   

I am trying, so far unsuccessfully, to set up a dkms 2.12.0-ib configuration using the package set from the lustre.org download site. During the dkms build of lustre-zfs/2.12.0 for kernel-3.10.0-957.el7_lustre.x86_64, the configure step detects ofa_kernel, and the build and install complete. However, depmod warns about unknown symbols in ko2iblnd, and loading the Lustre modules fails on ko2iblnd. I have the ofa_kernel-devel package installed, so everything should work. What am I missing (or messing up)?

[10774.270196] ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
[10774.270202] ko2iblnd: Unknown symbol ib_fmr_pool_unmap (err -22)
[10774.270212] ko2iblnd: disagrees about version of symbol __ib_alloc_pd
[10774.270215] ko2iblnd: Unknown symbol __ib_alloc_pd (err -22)
[10774.270232] ko2iblnd: disagrees about version of symbol rdma_resolve_addr
[10774.270234] ko2iblnd: Unknown symbol rdma_resolve_addr (err -22)
[10774.270242] ko2iblnd: disagrees about version of symbol ib_create_fmr_pool

...lots more

 

depmod:

depmod: WARNING: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz needs unknown symbol ib_get_dma_mr
depmod: WARNING: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz needs unknown symbol __ib_create_cq
depmod: WARNING: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz needs unknown symbol __rdma_accept
depmod: WARNING: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz needs unknown symbol __rdma_create_id
depmod: WARNING: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz needs unknown symbol backport_dependency_symbol
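
One way to narrow down messages like these is to compare the symbol CRCs recorded in ko2iblnd with the CRCs in the Module.symvers it was meant to be built against. The following is a minimal sketch, not a verified procedure: the function name crc_mismatches is made up for illustration, and the MOFED Module.symvers path in the usage comment is an assumption that may differ on a given install.

```shell
#!/bin/sh
# Report symbols whose CRC differs between a module and a Module.symvers.
#   $1: text output of `modprobe --dump-modversions <module>`
#       (one "<crc> <symbol>" pair per line)
#   $2: a Module.symvers file
#       ("<crc> <symbol> <module> <export type>" per line)
crc_mismatches() {
    awk '
        # First file: remember the CRC the module expects for each symbol.
        NR == FNR { want[$2] = $1; next }
        # Second file: print symbols that are present but carry another CRC.
        ($2 in want) && want[$2] != $1 {
            printf "%s module=%s symvers=%s\n", $2, want[$2], $1
        }
    ' "$1" "$2" | sort
}

# Usage (paths are assumptions, adjust to the local install):
#   modprobe --dump-modversions \
#       /lib/modules/$(uname -r)/extra/ko2iblnd.ko.xz > /tmp/ko2iblnd.crc
#   crc_mismatches /tmp/ko2iblnd.crc /usr/src/ofa_kernel/default/Module.symvers
```

Symbols that do not appear in the Module.symvers at all are deliberately ignored here, since a MOFED Module.symvers would not list the core kernel symbols that ko2iblnd also uses.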

 

 



 Comments   
Comment by Andreas Dilger [ 09/Jan/19 ]

Nathaniel, can you please take a look.

Comment by Nathaniel Clark [ 09/Jan/19 ]

What is the output of the following commands?

rpm -qa \*lustre\*
dkms status 
Comment by Jeff Johnson (Inactive) [ 09/Jan/19 ]

Here you go...

 

rpm -qa \*lustre\*
lustre-2.12.0-1.el7.x86_64
lustre-zfs-dkms-2.12.0-1.el7.noarch
lustre-osd-ldiskfs-mount-2.12.0-1.el7.x86_64
lustre-resource-agents-2.12.0-1.el7.x86_64
lustre-osd-zfs-mount-2.12.0-1.el7.x86_64
lustre-iokit-2.12.0-1.el7.x86_64

dkms status
lustre-zfs, 2.12.0, 3.10.0-957.el7_lustre.x86_64, x86_64: installed
spl, 0.7.9, 3.10.0-957.el7_lustre.x86_64, x86_64: installed
zfs, 0.7.9, 3.10.0-957.el7_lustre.x86_64, x86_64: installed

uname -r
3.10.0-957.el7_lustre.x86_64 

 

Comment by James A Simmons [ 09/Jan/19 ]

I have seen similar issues and tracked them down to an odd test in lustre_lnet.m4. The problem is this section:

AC_MSG_CHECKING([whether to use any OFED backport headers])
if test -n "$BACKPORT_INCLUDES"; then
    AC_MSG_RESULT([yes])
    OFED_BACKPORT_PATH="$O2IBPATH/${BACKPORT_INCLUDES/*\/kernel_addons/kernel_addons}/"
    EXTRA_OFED_INCLUDE="-I$OFED_BACKPORT_PATH $EXTRA_OFED_INCLUDE"
else
    AC_MSG_RESULT([no])
fi

If OFED_BACKPORT_PATH is not defined, a Lustre build against an older MOFED stack can end up badly broken. Newer MOFED stacks don't have this anymore. As I understand it, the test is there to handle both MOFED and normal OFED: for old MOFED the backport path is "backport_includes", while for the compact package from OFED, and for newer MOFED versions, it is just "include".
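
To make the substitution in that test easier to follow, here is a small sketch of what it computes. The original uses the bash-only form ${VAR/pattern/replacement}; the function below restates the same logic in POSIX sh. The function name backport_path and the example paths are hypothetical and are not taken from any real MOFED or OFED install.

```shell
#!/bin/sh
# POSIX restatement of:
#   OFED_BACKPORT_PATH="$O2IBPATH/${BACKPORT_INCLUDES/*\/kernel_addons/kernel_addons}/"
#   $1: O2IBPATH
#   $2: BACKPORT_INCLUDES
backport_path() (
    o2ibpath=$1
    includes=$2
    case $includes in
        */kernel_addons*)
            # Drop everything up to and including the last "/kernel_addons",
            # then put "kernel_addons" back so the result is relative.
            rel="kernel_addons${includes##*/kernel_addons}"
            ;;
        *)
            # No "/kernel_addons" component: the value is used unchanged.
            rel=$includes
            ;;
    esac
    printf '%s\n' "$o2ibpath/$rel/"
)

# Hypothetical inputs, for illustration only.
backport_path /usr/src/ofa_kernel/default \
    "-I/build/ofed/kernel_addons/backport/2.6.32_rhel"
backport_path /usr/src/ofa_kernel/default "include"
```

With these made-up inputs the first call yields a path under kernel_addons/backport/ and the second yields the plain include/ directory. In both cases the path only reaches EXTRA_OFED_INCLUDE when BACKPORT_INCLUDES is non-empty.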

 

Comment by Jeff Johnson (Inactive) [ 09/Jan/19 ]

I hit this speed bump with a basic dkms install using the released Whamcloud package set, so anyone else downloading this package set is likely to be bitten the same way. I installed MOFED 4.5-1.0.1.0, so I am on the latest and greatest and still hitting this issue.

 

Comment by Nathaniel Clark [ 09/Jan/19 ]

Is this specific to MOFED 4.5?

Comment by Jeff Johnson (Inactive) [ 09/Jan/19 ]

For me it is specific to 4.5-1.0.1.0 only in the sense that I have not tried other versions. That happens to be the same version that the WC-distributed 2.12.0-ib package set is built against.

As I am typing this, it appears I have cleared the issue, not by using the WC package set but by building 2.12.0 from git into a lustre-dkms package and then building and installing that. Now I can load ko2iblnd without symbol errors and can bind o2ib0 to ib0 and ib1.

I won't be out of the woods totally until the LFS is completely stood up but it appears that building from git as opposed to the package set provided by WC works.

Comment by Nathaniel Clark [ 14/Jan/19 ]

aeonjeffj,

I'm not seeing this locally.
Can you do:

 modinfo ib_core ko2iblnd

And attach: /var/lib/dkms/lustre-zfs/2.12.0/3.10.0-957.el7_lustre.x86_64/x86_64/log/config.log

To be clear, the order you installed was the following?

  1. Install Lustre kernel
  2. Reboot
  3. Install spl/zfs (can be done before or after the next step)
  4. Install MOFED
  5. Install lustre-zfs-dkms
Comment by Jeff Johnson (Inactive) [ 14/Jan/19 ]

I will build up a sandbox to recreate this. I got around the issue by rebuilding from git source and abandoning the staged 2.12.0-ib packages on WC's download site.

The systems were PXE-installed with the WC Lustre kernel, so the steps were:

  1. Newly installed system running 3.10.0-957.el7_lustre.x86_64
  2. Install prerequisite packages for the dkms and MOFED build environments
  3. Install MOFED
  4. Build and install spl/zfs
  5. Install lustre-zfs-dkms

Build/install completes but ko2iblnd fails to load due to reported missing symbols.

I can recreate tonight, maybe sooner.

Comment by Nathaniel Clark [ 18/Jan/19 ]

aeonjeffj,

Have you been able to reproduce this issue?

Comment by Jeff Johnson (Inactive) [ 24/Jan/19 ]

Yes, just reproduced.

  1. Fresh install of CentOS 7.6-1810
  2. Install kernel-3.10.0-957.el7_lustre.x86_64 along with kernel-devel and kernel-headers of same version
  3. Boot into 3.10.0-957.el7_lustre.x86_64 kernel
  4. Install dkms and other development environment packages
  5. Install the packages from the Lustre-2.12.0-ib release on the Whamcloud download site:
e2fsprogs.x86_64                      1.44.3.wc1-0.el7                 @lustre
e2fsprogs-libs.x86_64                 1.44.3.wc1-0.el7                 @lustre
ibutils.x86_64                        1.5.7.1-0.12.gdcaeae2.45101      @lustre
kernel.x86_64                         3.10.0-957.el7_lustre            @lustre
kernel-devel.x86_64                   3.10.0-957.el7_lustre            @lustre
kernel-headers.x86_64                 3.10.0-957.el7_lustre            @lustre
kmod-mlnx-ofa_kernel.x86_64           4.5-OFED.4.5.1.0.1.1.gb4fdfac    @lustre
libcom_err.x86_64                     1.44.3.wc1-0.el7                 @lustre
                                                                       @lustre
libnvpair1.x86_64                     0.7.9-1.el7                      @lustre
libss.x86_64                          1.44.3.wc1-0.el7                 @lustre
libuutil1.x86_64                      0.7.9-1.el7                      @lustre
libzfs2.x86_64                        0.7.9-1.el7                      @lustre
libzpool2.x86_64                      0.7.9-1.el7                      @lustre
lustre.x86_64                         2.12.0-1.el7                     @lustre
lustre-iokit.x86_64                   2.12.0-1.el7                     @lustre
lustre-osd-ldiskfs-mount.x86_64       2.12.0-1.el7                     @lustre
lustre-osd-zfs-mount.x86_64           2.12.0-1.el7                     @lustre
lustre-resource-agents.x86_64         2.12.0-1.el7                     @lustre
lustre-zfs-dkms.noarch                2.12.0-1.el7                     @lustre
mlnx-ofa_kernel.x86_64                4.5-OFED.4.5.1.0.1.1.gb4fdfac    @lustre
mlnx-ofa_kernel-devel.x86_64          4.5-OFED.4.5.1.0.1.1.gb4fdfac    @lustre
                                                                       @lustre
spl.x86_64                            0.7.9-1.el7                      @lustre
spl-dkms.noarch                       0.7.9-1.el7                      @lustre
zfs.x86_64                            0.7.9-1.el7                      @lustre
zfs-dkms.noarch                       0.7.9-1.el7                      @lustre 
  6. dkms uninstall of lustre-zfs/2.12.0, zfs/0.7.9 and spl/0.7.9 for 3.10.0-957.el7_lustre.x86_64 kernel
  7. dkms build and install of spl/0.7.9 and zfs/0.7.9
  8. dkms build and install of lustre-zfs/2.12.0
  9. /etc/init.d/openibd start  (loads IB kernel modules)
  10. ibstatus shows healthy IB interface
  11. modprobe lnet -v
[root@ls15-mds-00 ~]# modprobe lnet -v
insmod /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/libcfs.ko.xz
insmod /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/lnet.ko.xz
  12. modprobe ko2iblnd -v
[root@ls15-mds-00 ~]# modprobe ko2iblnd -v
install /usr/sbin/ko2iblnd-probe
insmod /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz
modprobe: ERROR: could not insert 'ko2iblnd': Invalid argument
modprobe: ERROR: Error running install command for ko2iblnd
modprobe: ERROR: could not insert 'ko2iblnd': Operation not permitted
  13. dmesg
[root@ls15-mds-00 ~]# dmesg
[19669.470576] LNet: HW NUMA nodes: 2, HW CPU cores: 16, npartitions: 2
[19669.474158] alg: No test for adler32 (adler32-zlib)
[19674.029736] ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
[19674.029743] ko2iblnd: Unknown symbol ib_fmr_pool_unmap (err -22)
[19674.029785] ko2iblnd: Unknown symbol ib_create_cq (err 0)
[19674.029796] ko2iblnd: disagrees about version of symbol __ib_alloc_pd
[19674.029798] ko2iblnd: Unknown symbol __ib_alloc_pd (err -22)
[19674.029814] ko2iblnd: disagrees about version of symbol rdma_resolve_addr
[19674.029817] ko2iblnd: Unknown symbol rdma_resolve_addr (err -22)
[19674.029828] ko2iblnd: disagrees about version of symbol ib_create_fmr_pool
[19674.029831] ko2iblnd: Unknown symbol ib_create_fmr_pool (err -22)
[19674.029865] ko2iblnd: disagrees about version of symbol ib_dereg_mr
[19674.029867] ko2iblnd: Unknown symbol ib_dereg_mr (err -22)
[19674.029876] ko2iblnd: disagrees about version of symbol rdma_reject
[19674.029879] ko2iblnd: Unknown symbol rdma_reject (err -22)
[19674.029889] ko2iblnd: disagrees about version of symbol rdma_disconnect
[19674.029892] ko2iblnd: Unknown symbol rdma_disconnect (err -22)
[19674.029946] ko2iblnd: disagrees about version of symbol rdma_resolve_route
[19674.029949] ko2iblnd: Unknown symbol rdma_resolve_route (err -22)
[19674.029958] ko2iblnd: disagrees about version of symbol rdma_bind_addr
[19674.029961] ko2iblnd: Unknown symbol rdma_bind_addr (err -22)
[19674.029968] ko2iblnd: disagrees about version of symbol rdma_create_qp
[19674.029970] ko2iblnd: Unknown symbol rdma_create_qp (err -22)
[19674.029984] ko2iblnd: disagrees about version of symbol ib_map_mr_sg
[19674.029986] ko2iblnd: Unknown symbol ib_map_mr_sg (err -22)
[19674.029994] ko2iblnd: disagrees about version of symbol ib_destroy_cq
[19674.029997] ko2iblnd: Unknown symbol ib_destroy_cq (err -22)
[19674.030026] ko2iblnd: Unknown symbol rdma_create_id (err 0)
[19674.030043] ko2iblnd: disagrees about version of symbol rdma_notify
[19674.030046] ko2iblnd: Unknown symbol rdma_notify (err -22)
[19674.030057] ko2iblnd: disagrees about version of symbol rdma_listen
[19674.030059] ko2iblnd: Unknown symbol rdma_listen (err -22)
[19674.030064] ko2iblnd: disagrees about version of symbol rdma_destroy_qp
[19674.030067] ko2iblnd: Unknown symbol rdma_destroy_qp (err -22)
[19674.030085] ko2iblnd: disagrees about version of symbol ib_alloc_mr
[19674.030088] ko2iblnd: Unknown symbol ib_alloc_mr (err -22)
[19674.030102] ko2iblnd: disagrees about version of symbol rdma_set_reuseaddr
[19674.030105] ko2iblnd: Unknown symbol rdma_set_reuseaddr (err -22)
[19674.030112] ko2iblnd: disagrees about version of symbol rdma_connect
[19674.030114] ko2iblnd: Unknown symbol rdma_connect (err -22)
[19674.030123] ko2iblnd: disagrees about version of symbol ib_modify_qp
[19674.030126] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[19674.030153] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[19674.030155] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[19674.030191] ko2iblnd: Unknown symbol rdma_accept (err 0)
[19674.030222] ko2iblnd: disagrees about version of symbol ib_dealloc_pd
[19674.030224] ko2iblnd: Unknown symbol ib_dealloc_pd (err -22)
[19674.030233] ko2iblnd: disagrees about version of symbol ib_fmr_pool_map_phys
[19674.030235] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err -22) 

Note: The mlnx-ofa packages installed were the ones provided in the WC 2.12.0-ib package bundle.
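
The dmesg output above mixes two kinds of failure, and it can help to separate them. Below is a rough sketch that reads kernel log text on stdin and labels each symbol. The function name classify_symbol_errors and the two labels are invented for this example. The reading of the error codes is an assumption rather than something confirmed in this ticket: "disagrees about version" followed by err -22 is taken to mean a CRC mismatch, and "Unknown symbol ... (err 0)" with no preceding "disagrees" line is taken to mean that no loaded module exports the symbol.

```shell
#!/bin/sh
# Classify module symbol errors from kernel log text on stdin.
# Output: one "<label> <symbol>" line per symbol, sorted and de-duplicated.
classify_symbol_errors() {
    awk '
        /disagrees about version of symbol/ {
            # The symbol name is the last field on these lines.
            print "crc-mismatch", $NF
            seen[$NF] = 1
            next
        }
        /Unknown symbol/ {
            # The symbol name follows the word "symbol".
            name = ""
            for (i = 1; i < NF; i++)
                if ($i == "symbol") name = $(i + 1)
            # Skip symbols already reported as a CRC mismatch.
            if (name != "" && !(name in seen))
                print "unresolved", name
        }
    ' | sort -u
}

# Usage:
#   dmesg | grep ko2iblnd | classify_symbol_errors
```

Run against the log above, this would be expected to list ib_create_cq, rdma_create_id and rdma_accept separately from the symbols that only disagree on version.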

Comment by Jeff Johnson (Inactive) [ 24/Jan/19 ]

Here is the modinfo output from your earlier request (sorry, I missed it):

# modinfo ib_core ko2iblnd
filename:       /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko
alias:          rdma-netlink-subsys-4
license:        Dual BSD/GPL
description:    core kernel InfiniBand API
author:         Roland Dreier
alias:          net-pf-16-proto-20
alias:          rdma-netlink-subsys-5
retpoline:      Y
rhelversion:    7.6
srcversion:     06E7E72A62EE218127F8301
depends:        mlx_compat
vermagic:       3.10.0-957.el7_lustre.x86_64 SMP mod_unload modversions
parm:           send_queue_size:Size of send queue in number of work requests (int)
parm:           recv_queue_size:Size of receive queue in number of work requests (int)
parm:           roce_v1_noncompat_gid:Default GID auto configuration (Default: yes) (bool)
parm:           force_mr:Force usage of MRs for RDMA READ/WRITE operations (bool)

filename:       /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz
license:        GPL
version:        2.8.0
description:    OpenIB gen2 LNet Network Driver
author:         OpenSFS, Inc. <http://www.lustre.org/>
retpoline:      Y
rhelversion:    7.6
srcversion:     D2BDDC597FD6B8D570DF5B3
depends:        libcfs,lnet,ib_core,rdma_cm
vermagic:       3.10.0-957.el7_lustre.x86_64 SMP mod_unload modversions
parm:           service:service number (within RDMA_PS_TCP) (int)
parm:           cksum:set non-zero to enable message (not RDMA) checksums (int)
parm:           timeout:timeout (seconds) (int)
parm:           nscheds:number of threads in each scheduler pool (int)
parm:           conns_per_peer:number of connections per peer (uint)
parm:           ntx:# of message descriptors allocated for each pool (int)
parm:           credits:# concurrent sends (int)
parm:           peer_credits:# concurrent sends to 1 peer (int)
parm:           peer_credits_hiw:when eagerly to return credits (int)
parm:           peer_buffer_credits:# per-peer router buffer credits (int)
parm:           peer_timeout:Seconds without aliveness news to declare peer dead (<=0 to disable) (int)
parm:           ipif_name:IPoIB interface name (charp)
parm:           retry_count:Retransmissions when no ACK received (int)
parm:           rnr_retry_count:RNR retransmissions (int)
parm:           keepalive:Idle time in seconds before sending a keepalive (int)
parm:           ib_mtu:IB MTU 256/512/1024/2048/4096 (int)
parm:           concurrent_sends:send work-queue sizing (obsolete) (int)
parm:           use_fastreg_gaps:Enable discontiguous fastreg fragment support. Expect performance drop (int)
parm:           map_on_demand:map on demand (obsolete) (int)
parm:           fmr_pool_size:size of fmr pool on each CPT (>= ntx / 4) (int)
parm:           fmr_flush_trigger:# dirty FMRs that triggers pool flush (int)
parm:           fmr_cache:non-zero to enable FMR caching (int)
parm:           dev_failover:HCA failover for bonding (0 off, 1 on, other values reserved) (int)
parm:           require_privileged_port:require privileged port when accepting connection (int)
parm:           use_privileged_port:use privileged port when initiating connection (int)
parm:           wrq_sge:# scatter/gather element per work request (uint) 
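
The ib_core filename in the output above sits under extra/mlnx-ofa_kernel, which indicates the MOFED copy is the one installed for this kernel. A small helper along these lines could make that check scriptable. The function name ib_core_origin and its three labels are made up for illustration, and the in-kernel path pattern is an assumption based on the usual kernel module layout.

```shell
#!/bin/sh
# Classify a module path as MOFED-provided or in-kernel.
#   $1: module filename, e.g. the output of `modinfo -n ib_core`
ib_core_origin() {
    case $1 in
        */extra/mlnx-ofa_kernel/*)     echo mofed ;;
        */kernel/drivers/infiniband/*) echo in-kernel ;;
        *)                             echo unknown ;;
    esac
}

# Usage (modinfo -n prints only the filename field):
#   ib_core_origin "$(modinfo -n ib_core)"
```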

If you want remote access to determine the root cause, let me know. This was reproduced on a development machine in our lab, not a customer or production system.
