[LU-11842] 2.12.0 lustre-zfs dkms install with MOFED - can't build dkms with ofa_kernel Created: 09/Jan/19 Updated: 24/Jan/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Major |
| Reporter: | Jeff Johnson (Inactive) | Assignee: | Nathaniel Clark |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6-1810, SPL/ZFS 0.7.9 (dkms) Lustre 2.12.0-ib. MOFED 4.5-1.0.1.0, kernel-3.10.0-957.el7_lustre.x86_64 |
||
| Description |
|
Unsuccessfully trying to set up a dkms 2.12.0-ib config. Package set from lustre.org download site. During dkms build of lustre-zfs/2.12.0 for kernel-3.10.0-957.el7_lustre.x86_64 the config process finds the ofa_kernel presence. Build and install completes. depmod as well as loading of lustre modules barfs on ko2iblnd. I have the ofa_kernel-devel package installed. Everything should work. What am I missing (or messing up)?
[10774.270196] ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
...lots more
depmod:
depmod: WARNING: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz needs unknown symbol ib_get_dma_mr
|
| Comments |
| Comment by Andreas Dilger [ 09/Jan/19 ] |
|
Nathaniel, can you please take a look. |
| Comment by Nathaniel Clark [ 09/Jan/19 ] |
|
What does the output of the following commands provide:
rpm -qa \*lustre\*
dkms status |
| Comment by Jeff Johnson (Inactive) [ 09/Jan/19 ] |
|
Here you go...
lustre-2.12.0-1.el7.x86_64
lustre-zfs-dkms-2.12.0-1.el7.noarch
lustre-osd-ldiskfs-mount-2.12.0-1.el7.x86_64
lustre-resource-agents-2.12.0-1.el7.x86_64
lustre-osd-zfs-mount-2.12.0-1.el7.x86_64
lustre-iokit-2.12.0-1.el7.x86_64
dkms status
lustre-zfs, 2.12.0, 3.10.0-957.el7_lustre.x86_64, x86_64: installed
spl, 0.7.9, 3.10.0-957.el7_lustre.x86_64, x86_64: installed
zfs, 0.7.9, 3.10.0-957.el7_lustre.x86_64, x86_64: installed
uname -r
3.10.0-957.el7_lustre.x86_64
|
| Comment by James A Simmons [ 09/Jan/19 ] |
|
I have had similar issues and tracked them down to a weird test in lustre_lnet.m4. The problem is this section:
AC_MSG_CHECKING([whether to use any OFED backport headers])
if test -n "$BACKPORT_INCLUDES"; then
        AC_MSG_RESULT([yes])
        OFED_BACKPORT_PATH="$O2IBPATH/${BACKPORT_INCLUDES/*\/kernel_addons/kernel_addons}/"
        EXTRA_OFED_INCLUDE="-I$OFED_BACKPORT_PATH $EXTRA_OFED_INCLUDE"
else
        AC_MSG_RESULT([no])
fi
If you don't define OFED_BACKPORT_PATH, your lustre build with MOFED can end up very broken with older MOFED stacks. Newer ones don't have this anymore. I understand this is done to handle both MOFED and normal OFED. For old MOFED the backport path is "backport_includes"; for the compact package from OFED, as well as for newer MOFED versions, it would be just "include".
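To make the effect of that test easier to see, here is a minimal Python sketch that emulates the shell parameter substitution in the snippet above. It is not part of the Lustre build; the function names and the two example values for O2IBPATH and BACKPORT_INCLUDES are hypothetical and chosen only for illustration.

```python
import re


def strip_to_kernel_addons(backport_includes: str) -> str:
    # Emulates bash ${BACKPORT_INCLUDES/*\/kernel_addons/kernel_addons}:
    # the glob "*/kernel_addons" matches from the start of the value through
    # the last "/kernel_addons", and that single match is replaced by
    # "kernel_addons". A value without "/kernel_addons" is left unchanged.
    return re.sub(r"^.*/kernel_addons", "kernel_addons",
                  backport_includes, count=1, flags=re.S)


def ofed_backport_path(o2ibpath: str, backport_includes: str) -> str:
    # Mirrors OFED_BACKPORT_PATH="$O2IBPATH/${...}/" from the m4 test.
    return f"{o2ibpath}/{strip_to_kernel_addons(backport_includes)}/"


# Hypothetical values, for illustration only.
O2IBPATH = "/usr/src/ofa_kernel/default"
WITH_ADDONS = "-I/usr/src/ofa_kernel/default/kernel_addons/backport/3.10.0/include"
WITHOUT_ADDONS = "-I/usr/src/ofa_kernel/default/include"

for value in (WITH_ADDONS, WITHOUT_ADDONS):
    print(value, "->", ofed_backport_path(O2IBPATH, value))
```

With a value that contains "/kernel_addons" the result is a usable directory under O2IBPATH. With a value that does not, the substitution is a no-op and the whole "-I..." string is appended to O2IBPATH. Whether either case is what happens with MOFED 4.5 in this ticket is not established here; config.log would show the actual values.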
|
| Comment by Jeff Johnson (Inactive) [ 09/Jan/19 ] |
|
I reached this speedbump via a basic dkms install using the released Whamcloud package set, so anyone else downloading this package set is going to get bitten similarly. I installed MOFED 4.5-1.0.1.0, so I am dealing with the latest/greatest and still hitting this issue.
|
| Comment by Nathaniel Clark [ 09/Jan/19 ] |
|
This is specific to MOFED 4.5? |
| Comment by Jeff Johnson (Inactive) [ 09/Jan/19 ] |
|
In this case, for me, it is 4.5-1.0.1.0 specific as I have not tried other versions. That happens to be the same version that the WC distributed 2.12.0-ib package set is built against. As I am typing this it appears I have cleared the issue. Not by using the WC package set but by building 2.12.0 from git into a lustre-dkms package and build/installing that. Now I am able to load ko2iblnd without symbol errors and can bind o2ib0 to ib0 and ib1. I won't be out of the woods totally until the LFS is completely stood up but it appears that building from git as opposed to the package set provided by WC works. |
| Comment by Nathaniel Clark [ 14/Jan/19 ] |
|
I'm not seeing this locally. Can you provide the output of:
modinfo ib_core ko2iblnd
And attach:
/var/lib/dkms/lustre-zfs/2.12.0/3.10.0-957.el7_lustre.x86_64/x86_64/log/config.log
To be clear, the order you installed was the following?
|
| Comment by Jeff Johnson (Inactive) [ 14/Jan/19 ] |
|
I will build up a sandbox to recreate this. I got around the issue by rebuilding from git source and abandoning the staged 2.12.0-ib packages on WC's download site. The systems were PXE-installed with the WC Lustre kernel, so the steps were:
Build/install completes but ko2iblnd fails to load due to reported missing symbols. I can recreate tonight, maybe sooner. |
| Comment by Nathaniel Clark [ 18/Jan/19 ] |
|
Have you been able to reproduce this issue? |
| Comment by Jeff Johnson (Inactive) [ 24/Jan/19 ] |
|
Yes, just reproduced.
e2fsprogs.x86_64 1.44.3.wc1-0.el7 @lustre
e2fsprogs-libs.x86_64 1.44.3.wc1-0.el7 @lustre
ibutils.x86_64 1.5.7.1-0.12.gdcaeae2.45101 @lustre
kernel.x86_64 3.10.0-957.el7_lustre @lustre
kernel-devel.x86_64 3.10.0-957.el7_lustre @lustre
kernel-headers.x86_64 3.10.0-957.el7_lustre @lustre
kmod-mlnx-ofa_kernel.x86_64 4.5-OFED.4.5.1.0.1.1.gb4fdfac @lustre
libcom_err.x86_64 1.44.3.wc1-0.el7 @lustre
@lustre
libnvpair1.x86_64 0.7.9-1.el7 @lustre
libss.x86_64 1.44.3.wc1-0.el7 @lustre
libuutil1.x86_64 0.7.9-1.el7 @lustre
libzfs2.x86_64 0.7.9-1.el7 @lustre
libzpool2.x86_64 0.7.9-1.el7 @lustre
lustre.x86_64 2.12.0-1.el7 @lustre
lustre-iokit.x86_64 2.12.0-1.el7 @lustre
lustre-osd-ldiskfs-mount.x86_64 2.12.0-1.el7 @lustre
lustre-osd-zfs-mount.x86_64 2.12.0-1.el7 @lustre
lustre-resource-agents.x86_64 2.12.0-1.el7 @lustre
lustre-zfs-dkms.noarch 2.12.0-1.el7 @lustre
mlnx-ofa_kernel.x86_64 4.5-OFED.4.5.1.0.1.1.gb4fdfac @lustre
mlnx-ofa_kernel-devel.x86_64 4.5-OFED.4.5.1.0.1.1.gb4fdfac @lustre
@lustre
spl.x86_64 0.7.9-1.el7 @lustre
spl-dkms.noarch 0.7.9-1.el7 @lustre
zfs.x86_64 0.7.9-1.el7 @lustre
zfs-dkms.noarch 0.7.9-1.el7 @lustre
[root@ls15-mds-00 ~]# modprobe lnet -v
insmod /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/libcfs.ko.xz
insmod /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/lnet.ko.xz
[root@ls15-mds-00 ~]# modprobe ko2iblnd -v
install /usr/sbin/ko2iblnd-probe
insmod /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz
modprobe: ERROR: could not insert 'ko2iblnd': Invalid argument
modprobe: ERROR: Error running install command for ko2iblnd
modprobe: ERROR: could not insert 'ko2iblnd': Operation not permitted
[root@ls15-mds-00 ~]# dmesg
[19669.470576] LNet: HW NUMA nodes: 2, HW CPU cores: 16, npartitions: 2
[19669.474158] alg: No test for adler32 (adler32-zlib)
[19674.029736] ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
[19674.029743] ko2iblnd: Unknown symbol ib_fmr_pool_unmap (err -22)
[19674.029785] ko2iblnd: Unknown symbol ib_create_cq (err 0)
[19674.029796] ko2iblnd: disagrees about version of symbol __ib_alloc_pd
[19674.029798] ko2iblnd: Unknown symbol __ib_alloc_pd (err -22)
[19674.029814] ko2iblnd: disagrees about version of symbol rdma_resolve_addr
[19674.029817] ko2iblnd: Unknown symbol rdma_resolve_addr (err -22)
[19674.029828] ko2iblnd: disagrees about version of symbol ib_create_fmr_pool
[19674.029831] ko2iblnd: Unknown symbol ib_create_fmr_pool (err -22)
[19674.029865] ko2iblnd: disagrees about version of symbol ib_dereg_mr
[19674.029867] ko2iblnd: Unknown symbol ib_dereg_mr (err -22)
[19674.029876] ko2iblnd: disagrees about version of symbol rdma_reject
[19674.029879] ko2iblnd: Unknown symbol rdma_reject (err -22)
[19674.029889] ko2iblnd: disagrees about version of symbol rdma_disconnect
[19674.029892] ko2iblnd: Unknown symbol rdma_disconnect (err -22)
[19674.029946] ko2iblnd: disagrees about version of symbol rdma_resolve_route
[19674.029949] ko2iblnd: Unknown symbol rdma_resolve_route (err -22)
[19674.029958] ko2iblnd: disagrees about version of symbol rdma_bind_addr
[19674.029961] ko2iblnd: Unknown symbol rdma_bind_addr (err -22)
[19674.029968] ko2iblnd: disagrees about version of symbol rdma_create_qp
[19674.029970] ko2iblnd: Unknown symbol rdma_create_qp (err -22)
[19674.029984] ko2iblnd: disagrees about version of symbol ib_map_mr_sg
[19674.029986] ko2iblnd: Unknown symbol ib_map_mr_sg (err -22)
[19674.029994] ko2iblnd: disagrees about version of symbol ib_destroy_cq
[19674.029997] ko2iblnd: Unknown symbol ib_destroy_cq (err -22)
[19674.030026] ko2iblnd: Unknown symbol rdma_create_id (err 0)
[19674.030043] ko2iblnd: disagrees about version of symbol rdma_notify
[19674.030046] ko2iblnd: Unknown symbol rdma_notify (err -22)
[19674.030057] ko2iblnd: disagrees about version of symbol rdma_listen
[19674.030059] ko2iblnd: Unknown symbol rdma_listen (err -22)
[19674.030064] ko2iblnd: disagrees about version of symbol rdma_destroy_qp
[19674.030067] ko2iblnd: Unknown symbol rdma_destroy_qp (err -22)
[19674.030085] ko2iblnd: disagrees about version of symbol ib_alloc_mr
[19674.030088] ko2iblnd: Unknown symbol ib_alloc_mr (err -22)
[19674.030102] ko2iblnd: disagrees about version of symbol rdma_set_reuseaddr
[19674.030105] ko2iblnd: Unknown symbol rdma_set_reuseaddr (err -22)
[19674.030112] ko2iblnd: disagrees about version of symbol rdma_connect
[19674.030114] ko2iblnd: Unknown symbol rdma_connect (err -22)
[19674.030123] ko2iblnd: disagrees about version of symbol ib_modify_qp
[19674.030126] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[19674.030153] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[19674.030155] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[19674.030191] ko2iblnd: Unknown symbol rdma_accept (err 0)
[19674.030222] ko2iblnd: disagrees about version of symbol ib_dealloc_pd
[19674.030224] ko2iblnd: Unknown symbol ib_dealloc_pd (err -22)
[19674.030233] ko2iblnd: disagrees about version of symbol ib_fmr_pool_map_phys
[19674.030235] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err -22)
Note: The mlnx-ofa packages installed were the ones provided in the WC 2.12.0-ib package bundle. |
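The dmesg output above mixes two kinds of failure: symbols whose version (CRC) differs from the one ko2iblnd was built against, reported as "disagrees about version" followed by err -22, and symbols reported only as "Unknown symbol ... (err 0)". As a rough aid for triage, the following Python sketch separates the two groups. The function name and the short SAMPLE excerpt (copied from the log above) are illustrative, and the reading of err 0 as "not exported by any loaded module" is my understanding of the module loader, not something confirmed in this ticket.

```python
import re
from collections import defaultdict

# "\s+" tolerates log lines that were wrapped mid-message.
MISMATCH = re.compile(r"(\w+): disagrees about\s+version of symbol\s+(\w+)")
UNKNOWN = re.compile(r"(\w+): Unknown symbol (\w+) \(err (-?\d+)\)")


def classify_symbol_errors(dmesg_text: str) -> dict:
    """Group symbol errors per module into CRC mismatches and plain unknowns."""
    result = defaultdict(lambda: {"crc_mismatch": set(), "unknown_only": set()})
    for module, symbol in MISMATCH.findall(dmesg_text):
        result[module]["crc_mismatch"].add(symbol)
    for module, symbol, _err in UNKNOWN.findall(dmesg_text):
        # Every mismatch is also followed by an "Unknown symbol" line;
        # keep only the symbols that never had a mismatch reported.
        if symbol not in result[module]["crc_mismatch"]:
            result[module]["unknown_only"].add(symbol)
    return dict(result)


SAMPLE = """\
[19674.029736] ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
[19674.029743] ko2iblnd: Unknown symbol ib_fmr_pool_unmap (err -22)
[19674.029785] ko2iblnd: Unknown symbol ib_create_cq (err 0)
[19674.029796] ko2iblnd: disagrees about version of symbol __ib_alloc_pd
[19674.029798] ko2iblnd: Unknown symbol __ib_alloc_pd (err -22)
"""

for module, groups in classify_symbol_errors(SAMPLE).items():
    print(module, "crc mismatch:", sorted(groups["crc_mismatch"]))
    print(module, "unknown only:", sorted(groups["unknown_only"]))
```

Run against the full log, the "unknown only" group would contain ib_create_cq, rdma_create_id and rdma_accept, which may be worth checking separately from the version mismatches.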
| Comment by Jeff Johnson (Inactive) [ 24/Jan/19 ] |
|
modinfo from your earlier request (sorry, missed it)
# modinfo ib_core ko2iblnd
filename: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko
alias: rdma-netlink-subsys-4
license: Dual BSD/GPL
description: core kernel InfiniBand API
author: Roland Dreier
alias: net-pf-16-proto-20
alias: rdma-netlink-subsys-5
retpoline: Y
rhelversion: 7.6
srcversion: 06E7E72A62EE218127F8301
depends: mlx_compat
vermagic: 3.10.0-957.el7_lustre.x86_64 SMP mod_unload modversions
parm: send_queue_size:Size of send queue in number of work requests (int)
parm: recv_queue_size:Size of receive queue in number of work requests (int)
parm: roce_v1_noncompat_gid:Default GID auto configuration (Default: yes) (bool)
parm: force_mr:Force usage of MRs for RDMA READ/WRITE operations (bool)
filename: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz
license: GPL
version: 2.8.0
description: OpenIB gen2 LNet Network Driver
author: OpenSFS, Inc. <http://www.lustre.org/>
retpoline: Y
rhelversion: 7.6
srcversion: D2BDDC597FD6B8D570DF5B3
depends: libcfs,lnet,ib_core,rdma_cm
vermagic: 3.10.0-957.el7_lustre.x86_64 SMP mod_unload modversions
parm: service:service number (within RDMA_PS_TCP) (int)
parm: cksum:set non-zero to enable message (not RDMA) checksums (int)
parm: timeout:timeout (seconds) (int)
parm: nscheds:number of threads in each scheduler pool (int)
parm: conns_per_peer:number of connections per peer (uint)
parm: ntx:# of message descriptors allocated for each pool (int)
parm: credits:# concurrent sends (int)
parm: peer_credits:# concurrent sends to 1 peer (int)
parm: peer_credits_hiw:when eagerly to return credits (int)
parm: peer_buffer_credits:# per-peer router buffer credits (int)
parm: peer_timeout:Seconds without aliveness news to declare peer dead (<=0 to disable) (int)
parm: ipif_name:IPoIB interface name (charp)
parm: retry_count:Retransmissions when no ACK received (int)
parm: rnr_retry_count:RNR retransmissions (int)
parm: keepalive:Idle time in seconds before sending a keepalive (int)
parm: ib_mtu:IB MTU 256/512/1024/2048/4096 (int)
parm: concurrent_sends:send work-queue sizing (obsolete) (int)
parm: use_fastreg_gaps:Enable discontiguous fastreg fragment support. Expect performance drop (int)
parm: map_on_demand:map on demand (obsolete) (int)
parm: fmr_pool_size:size of fmr pool on each CPT (>= ntx / 4) (int)
parm: fmr_flush_trigger:# dirty FMRs that triggers pool flush (int)
parm: fmr_cache:non-zero to enable FMR caching (int)
parm: dev_failover:HCA failover for bonding (0 off, 1 on, other values reserved) (int)
parm: require_privileged_port:require privileged port when accepting connection (int)
parm: use_privileged_port:use privileged port when initiating connection (int)
parm: wrq_sge:# scatter/gather element per work request (uint)
If you want remote access to determine root cause let me know. This is reproduced on a development machine in our lab, not a customer or production system. |
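For comparing the two modules in the modinfo output above, a small Python sketch like the following pulls out the fields that matter here (filename, depends, vermagic). The parser and the trimmed SAMPLE (a subset of the output above) are illustrative only, and the interpretation in the comments is a working assumption rather than a confirmed root cause.

```python
def parse_modinfo(text: str) -> list:
    """Split concatenated `modinfo` output into one dict per module."""
    modules = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "filename":
            # Each module's record starts with its filename line.
            modules.append({"parm": [], "alias": []})
        if not modules:
            continue
        if key in ("parm", "alias"):
            modules[-1][key].append(value)  # repeated keys are collected
        else:
            modules[-1][key] = value
    return modules


SAMPLE = """\
filename: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko
srcversion: 06E7E72A62EE218127F8301
depends: mlx_compat
vermagic: 3.10.0-957.el7_lustre.x86_64 SMP mod_unload modversions
filename: /lib/modules/3.10.0-957.el7_lustre.x86_64/extra/ko2iblnd.ko.xz
version: 2.8.0
depends: libcfs,lnet,ib_core,rdma_cm
vermagic: 3.10.0-957.el7_lustre.x86_64 SMP mod_unload modversions
"""

ib_core, ko2iblnd = parse_modinfo(SAMPLE)
# Same vermagic: both modules target the same kernel build.
print("vermagic matches:", ib_core["vermagic"] == ko2iblnd["vermagic"])
# ib_core is the MOFED copy under extra/mlnx-ofa_kernel, not an in-tree one.
print("ib_core from MOFED tree:", "/extra/mlnx-ofa_kernel/" in ib_core["filename"])
print("ko2iblnd depends on:", ko2iblnd["depends"].split(","))
```

Since vermagic is identical and includes modversions, the load failure is not a kernel-version mismatch; the disagreement is in the per-symbol versions that ko2iblnd recorded at build time versus those exported by the installed MOFED ib_core and rdma_cm.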