[LU-2907] Infiniband HW kernel modules of OFA builds not started at system boot Created: 05/Mar/13  Updated: 23/Apr/13  Resolved: 07/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Story Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Frank Heckes (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Duplicate
is duplicated by LU-2972 Execution conflict of OFED initialisa... Closed
Related
Rank (Obsolete): 6997

 Description   

Symptom:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When provisioning test nodes with ofa builds (i.e. 'external' build of the kernel-ib based on Openfabrics OFED tarballs) based on rhel6 and compiled against kernel version 2.6.32-279, the initialization of the Infiniband interfaces (ib0, ib1,...) fails because the low-level kernel Infiniband HW modules mlx4_core and mlx4_en are not loaded.

When loading the kernel-ib HW modules manually (modprobe mlx4_core,...) the interfaces are created and operational (i.e. connected to fabric, IP over IB works,...).

The kernel-ib RPM is normally built with a set of startup scripts (/etc/init.d/openibd and links in /etc/rc.d/*, chkconfig execution,...) to ensure that the Infiniband HW kernel modules are loaded during system start. These files/scripts are missing in the kernel-ib RPM.

Due to an installation conflict of kernel-ib with the openib RPM for the canonical distribution 'rhel5', the scripts/files were removed from the OFED kernel-ib SPEC file before the RPMs are created (rpmbuild) with the help of the lbuild script. (See LU-388 for further details.)

This conflict no longer exists since openib-<version>.rpm isn't part of rhel6 anymore. As a consequence the functionality of initializing the Infiniband HW is gone, too, because the openib RPM contain(ed) the necessary startup scripts:

rpm -qil --scripts -p openib-1.4.1-5.el5.noarch.rpm
warning: openib-1.4.1-5.el5.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID 192a7d7d: NOKEY
Name : openib Relocations: (not relocatable)
Version : 1.4.1 Vendor: Scientific Linux
Release : 5.el5 Build Date: Wed 31 Mar 2010 12:39:27 AM PDT
Install Date: (not installed) Build Host: norob.fnal.gov
Group : System Environment/Base Source RPM: openib-1.4.1-5.el5.src.rpm
Size : 27021 License: GPL/BSD
Signature : DSA/SHA1, Wed 31 Mar 2010 12:52:50 PM PDT, Key ID b0b4183f192a7d7d
URL : http://www.openfabrics.org/
Summary : OpenIB Infiniband Driver Stack
Description :
User space initialization scripts for the kernel InfiniBand drivers
postinstall scriptlet (using /bin/sh):
if [ $1 = 1 ]; then
/sbin/chkconfig --add openibd
fi
preuninstall scriptlet (using /bin/sh):
if [ $1 = 0 ]; then
/sbin/chkconfig --del openibd
fi
/etc/ofed
/etc/ofed/fixup-mtrr.awk
/etc/ofed/openib.conf
/etc/rc.d/init.d/openibd
/etc/sysconfig/network-scripts/ifup-ib
/etc/udev/rules.d/90-ib.rules

The script (openibd) has been 'moved' to the kernel-ib package for OFED version 1.5.*.

To overcome the situation, the following code change in lustre-reviews/build/lbuild (inside the loop beginning at line 1216; `for file in $(ls ${TOPDIR}/lustre/build/patches/ofed/*.patch); do`)

if [[ $file =~ ${CANONICAL_TARGET} ]]; then
    ed_fragment3="$ed_fragment3
$(cat $file)"
    let n=$n+1
fi

and a rename of the ed script (to remove packaging of openibd files and scripts) from

01-play-nice-with-RHEL5.ed
to
01-play-nice-with-rhel5.ed

is necessary. This will ensure that kernel-ib ofa-builds for rhel5 are created without openibd scripts, but make them available for rhel6 RPMs.
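
A minimal sketch of the selection logic described above, assuming the ed fragments sit in one directory and carry the canonical target in their file name. The function name collect_ed_fragments is hypothetical and this is not the real lbuild code; it uses a substring `case` match as a stand-in for the `=~` test. The point is only that the match is case-sensitive, which is why the rename from RHEL5 to rhel5 matters.

```shell
# Hypothetical helper, not the real lbuild code: concatenate every ed
# fragment whose file name contains the canonical target (e.g. "rhel5").
collect_ed_fragments() (
    dir=$1
    target=$2
    for file in "$dir"/*; do
        [ -f "$file" ] || continue      # skip if the glob matched nothing
        # The match is case-sensitive: "RHEL5" does not match "rhel5".
        case ${file##*/} in
            *"$target"*) cat "$file" ;;
        esac
    done
)
```

Called as `collect_ed_fragments "${TOPDIR}/lustre/build/patches/ofed" "$CANONICAL_TARGET"`, a fragment named 01-play-nice-with-rhel5.ed would be picked up for rhel5 and skipped for rhel6.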



 Comments   
Comment by Brian Murrell (Inactive) [ 05/Mar/13 ]

Frank,

I don't understand. We talked about this at quite some length (must have been several hours over a few conversations) and I thought we came to the same conclusion. I thought we had agreed that the patching (01-play-nice-with-RHEL5.ed) in lbuild should stay as it is for both EL5 and EL6 and the solution to the problem of initializing drivers on EL6 was the job of the rdma initscript from the rdma RPM. i.e. simply "yum install rdma" on EL6 nodes to get initscripts to load the I/B drivers.

Has something changed since those conversations?

Comment by Frank Heckes (Inactive) [ 06/Mar/13 ]

Hi Brian,

well, the problem is that the rdma RPM (script) was there from the beginning, i.e. it was installed during the node provisioning:
...
rng-tools-2-13.el6_2.x86_64 Mon 04 Mar 2013 08:39:51 AM PST
readahead-1.5.6-1.el6.x86_64 Mon 04 Mar 2013 08:39:51 AM PST
rdma-3.3-4.el6_3.noarch Mon 04 Mar 2013 08:39:51 AM PST
quota-3.17-16.el6.x86_64 Mon 04 Mar 2013 08:39:51 AM PST
microcode_ctl-1.17-11.el6.x86_64 Mon 04 Mar 2013 08:39:51 AM PST
...
...

and it failed.

I really looked at the rdma (/etc/init.d/rdma) script again and reconsidered it, but it will only initialize the Infiniband interface with an IP address
if the card has been recognized by the OS. This is only the case if the modules mlx4_core, mlx4_en and mlx4_ib are loaded. This is what the rdma
script doesn't provide.

It fails during system boot:

Bringing up interface ib0: Device ib0 does not seem to be present, delaying initialization.
[FAILED]

No hardware was detected:
[root@client-7 ~]# /etc/init.d/rdma status
Low level hardware support loaded:
none found

Upper layer protocol modules:
ib_ipoib

User space access modules:
rdma_ucm ib_ucm ib_uverbs ib_umad

Connection management modules:
rdma_cm ib_cm iw_cm

Configured IPoIB interfaces: none
Currently active IPoIB interfaces: none
[root@client-7 ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:30:48:f7:72:4e brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether 00:30:48:f7:72:4f brd ff:ff:ff:ff:ff:ff

rdma script is active:
[root@client-7 ~]# chkconfig --list rdma
rdma 0:off 1:off 2:on 3:on 4:on 5:on 6:off

Starting the IB HW modules manually (mlx4_core, mlx4_en, mlx4_ib) 'fixes' the problem.
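
As a rough sketch of that manual check (the function name missing_hw_modules is an assumption for illustration), the following reports which of the three low-level modules are absent from a /proc/modules-style listing, so that only those need a modprobe:

```shell
# Print each required low-level module that is absent from a
# /proc/modules-style listing (module name in the first column).
missing_hw_modules() (
    listing=${1:-/proc/modules}
    for mod in mlx4_core mlx4_en mlx4_ib; do
        # awk exits 0 only if the module name appears in column one.
        if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"; then
            echo "$mod"
        fi
    done
)
```

On an affected node, running `for m in $(missing_hw_modules); do modprobe "$m"; done` as root would load whatever is missing.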

Further I found that the openib RPM (part of the rhel5 distro) contained the /etc/init.d/openibd script, which starts the HW modules.
This is only the case for rhel5 and OFED-1.4.*

For rhel6 the distro no longer contains the openib RPM. Therefore there's no conflict.

At first glance there's the strange fact that the 'inkernel' build initializes the Infiniband card correctly. But the reason is
that the modules are part of the initial ramdisk (extracted from inkernel build of #180@lustre-b2_1):

./lib/modules/2.6.32-279.14.1.el6_lustre.g044a3a2.x86_64:
total 4548
-rw-r--r-- 1 root root 23712 Mar 6 03:31 acpi-cpufreq.ko
-rw-r--r-- 1 root root 85080 Mar 6 03:31 ahci.ko
-rw-r--r-- 1 root root 13672 Mar 6 03:31 ata_generic.ko
...
...
...
-rw-r--r-- 1 root root 36240 Mar 6 03:31 microcode.ko
-rw-r--r-- 1 root root 300952 Mar 6 03:31 mlx4_core.ko
-rw-r--r-- 1 root root 126960 Mar 6 03:31 mlx4_en.ko
-rw-r--r-- 1 root root 99544 Mar 6 03:31 mlx4_ib.ko
-rw-r--r-- 1 root root 21055 Mar 6 03:31 modules.alias

This is also visible from the system boot messages:

mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)
mlx4_core: Initializing 0000:02:00.0
mlx4_core 0000:02:00.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.0 (Dec 2011)
mlx4_en 0000:02:00.0: UDP RSS is not supported on this device.
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)

Setting hostname client-7.lab.whamcloud.com: [ OK ]
Setting up Logical Volume Management: No volume groups found
[ OK ]

i.e. modules are loaded before init executes the run-level scripts.

This could be a workaround for the OFA builds, too, i.e. add mlx4_{core,en,ib} to /etc/sysconfig/kernel to ensure they are started at system boot.

Comment by Brian Murrell (Inactive) [ 06/Mar/13 ]

Frank,

To be clear, the issue of conflict is not one of simply RPM naming, but one of having multiple initscripts trying to do the same things. If we install an initscript in kernel-ib that fiddles with I/B and the user decides to also install the rdma RPM which also fiddles with I/B there is a conflict there as both should not be trying to do the same thing. Ultimately we need our added kernel-ib to integrate with the base O/S as closely as we can.

So the question becomes, why are the mlx4_* modules from the stock kernel loaded during boot but when they are supplied by kernel-ib they are not loaded?

Perhaps you need to compare the operation of the rdma initscript with and without kernel-ib. You could insert the following right after the first line (i.e. after the #! line) of the rdma initscript:

exec 2>/tmp/rdma.debug
set -x

This will log the xtrace of that initscript to /tmp/rdma.debug. Do that and boot with both kernel-ib installed and without it installed and compare them and see if they operate differently, and if they do, why they do.

It might also be worth while taking an inventory of the installed modules (i.e. lsmod) before and after the rdma initscript runs during boot. You could add an "lsmod > /tmp/before" to the initscript before it calls start() and an "lsmod >/tmp/after" after it exits from start() and again, run that with and without kernel-ib to see what difference in behaviour there is.
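
A small sketch of how the two snapshots could be compared, assuming /tmp/before and /tmp/after hold plain lsmod output (the function name modules_added is made up for this example):

```shell
# List modules that appear in the "after" lsmod snapshot but not in "before".
modules_added() (
    before=$1
    after=$2
    tmp_b=$(mktemp) && tmp_a=$(mktemp) || exit 1
    # Skip the lsmod header line and keep only the module name column.
    awk 'NR > 1 { print $1 }' "$before" | sort > "$tmp_b"
    awk 'NR > 1 { print $1 }' "$after"  | sort > "$tmp_a"
    comm -13 "$tmp_b" "$tmp_a"          # lines unique to the second file
    rm -f "$tmp_b" "$tmp_a"
)
```

Running `modules_added /tmp/before /tmp/after` once with kernel-ib installed and once without would show directly which modules the rdma initscript pulls in for each case.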

Ultimately what you have here is a case where something ought to work but doesn't. In such cases it's usually better to understand why it doesn't work and approach from there. The problem is that I don't think we yet know why that something that ought to work doesn't actually work, so any attempt to band-aid it has a likelihood of causing some other unexpected problem, and it might not happen until it's out in the field where it becomes a customer support problem (i.e. much more expensive to deal with) and a mea culpa.

Comment by Frank Heckes (Inactive) [ 06/Mar/13 ]

Hi Brian,

the reason why the stock ('inkernel') build starts the mlx4_{core,en,ib} modules is that they're included in the initial ramdisk of the Lustre kernel (--> see above in my previous comment). I think that is well understood.

We could use the same idea for the external (OFA) builds to circumvent the risk for any clashes of whatever scripts available in the distro with the kernel-ib scripts.
This could be done by adding the modules to /etc/sysconfig/kernel and re-creating the Lustre-kernel init-ramdisk, as said above.
Indeed applying the ed-script inside the lbuild script could be left as it is.

For rhel5 there used to be a dedicated RPM (openib; listed in my first comment) that contained the init script '/etc/init.d/openibd' which is (was) supplied by OFED-1.4.* kernel-ib RPM, too. That was the conflict resolved in LU-388.
For rhel6 the openib-RPM doesn't exist anymore, i.e. the packaging has changed.

But I agree that rdma and openibd (of the OFED-1.5.4 kernel-ib) can modprobe the same modules (besides the HW core modules) and set the IP address twice, but that won't do any harm, I guess. (I'll try that on Toro: client-7)

Comment by Brian Murrell (Inactive) [ 06/Mar/13 ]

If the mlx4_* modules really are only being installed by virtue of them being in the ramdisk, why do they not get included in the ramdisk when kernel-ib is installed? i.e. Why do we have to modify /etc/sysconfig/kernel for the kernel-ib case and not for the stock kernel case?

Comment by Chris Gearing (Inactive) [ 06/Mar/13 ]

Brian: I have little insight into the detail on this. But I am surprised that the standard OFED build would not be the best outcome, why do we need to modify the standard build? Or more correctly why would the standard build be of a form that is not providing the best functionality?

Frank: Is it the case that the standard OFED build, without the spec file change, builds, installs and runs properly - or have I missed something?

Comment by Brian Murrell (Inactive) [ 06/Mar/13 ]

chris: Because the standard OFED build assumes a "vanilla" Linux installation and does not really take into account vendor "Value Add" such as RedHat has done with their "rdma" package. Ideally, their packaging process should try to figure out if they need to interoperate with the vendor's "Value Add" but I don't believe it does.

Comment by Frank Heckes (Inactive) [ 07/Mar/13 ]

Created change for lbuild to alter the kernel-ib SPEC file based on the canonical target name of the distribution (will preserve changes for rhel5).

In parallel, I'll continue investigating the option of adding mlx4_{core,en,ib} to the initrd and why this isn't done for ofa builds.

Comment by Frank Heckes (Inactive) [ 15/Mar/13 ]

For the inkernel build, mlx4_core and mlx4_en are not part of the initramfs. I checked the initrd.kdump file by mistake. Anyway, the important finding is that the modules are started before the execution of the /etc/init.d/rdma script.

For the inkernel build the following sequence relevant to the infiniband initialization is performed:

init runs /etc/rc.d/rc.sysinit
/etc/rc.sysinit runs /sbin/start_udev
/sbin/start_udev runs udevd
udevd receives an event from the kernel that the HCA interface is available
udevd triggers the load of mlx4_core and mlx4_en
/etc/rc.sysinit executes the active run-level scripts
rdma is executed
if mlx4_core is started, mlx4_ib is started ---> which will create the interface (ib0, ib...)
if the interface (ib0) is available, IP configuration is done
rdma finishes with success

The 'critical' part for script rdma is whether mlx4_core is loaded or not. If the module is not present the
initialization of the infiniband interface fails.

The behaviour (for the inkernel build) can be reproduced at run-time by running udevadm monitor --environment and by executing
/etc/init.d/rdma stop
echo 1 > /sys/devices/pci0000\:00/0000\:00\:03.0/0000\:02\:00.0/remove

--> this will remove all mlx4_* modules and the HCA (infiniband) card from the OS

Executing:
echo 1 > /sys/bus/pci/rescan

adds the hardware and udevd starts the mlx4_en, mlx4_core driver (see client-7-)

If the hardware isn't removed, but all mlx4_* modules are unloaded the udevd reloads the mlx4_core, mlx4_en
when starting the ib-interface via /etc/init.d/rdma.
The startup is handled by the entry:
alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core

For ofa builds only the HCA is detected, but the drivers are not loaded. The reason is a duplicate entry in
modules.alias for the ofa build:

client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_en
client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core

Removing the entry for mlx4_en fixes the problem and rdma scripts works for ofa, too.
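
To look for such entries more systematically, a sketch along these lines could be used (the function name duplicate_aliases is hypothetical). It lists every alias pattern in a modules.alias file that is claimed by more than one module. Note that several modules sharing an alias is not necessarily an error in itself, so the output is only a starting point for comparison between the inkernel and ofa builds.

```shell
# Report every alias pattern in a modules.alias file that is claimed by
# more than one module, together with the competing module names.
duplicate_aliases() (
    alias_file=$1
    awk '$1 == "alias" {
             count[$2]++
             mods[$2] = (mods[$2] == "" ? "" : mods[$2] " ") $3
         }
         END {
             for (a in count)
                 if (count[a] > 1) print a ": " mods[a]
         }' "$alias_file" | sort
)
```

For example, `duplicate_aliases /lib/modules/$(uname -r)/modules.alias | grep mlx4` narrows the report to the Mellanox drivers.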

Comment by Brian Murrell (Inactive) [ 15/Mar/13 ]

Frank,

Was this discovery:

For ofa builds only the HCA is detected, but the drivers are not loaded. The reason is a duplicate entry in
modules.alias for the ofa build:

client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_en
client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core

Removing the entry for mlx4_en fixes the problem and rdma scripts works for ofa, too.

made after we spoke on Friday? i.e. is that the smoking gun and if we figure out why that duplicate entry (which is only there when using the OFA I/B, is that right?) is being created it will resolve the issue and the rdma initscript will be fully-functional?

Comment by Frank Heckes (Inactive) [ 16/Mar/13 ]

Yes, that's right, so we have two potential solutions for the problem. I didn't find out yet why the entries for mlx4_en are created. I'll check that on Monday.

Comment by Frank Heckes (Inactive) [ 18/Mar/13 ]

For both client and server ofa builds the modules mlx4_core, mlx4_en won't be loaded by udevd (started from /etc/rc.d/rc.sysinit) if the configuration file '/etc/modprobe.d/mlx4_en.conf' is present. If the file is removed (or moved to other directory or file name) startup of the mlx4_core, mlx4_en works and
therefore the interface ib0 is configured correctly by the '/etc/init.d/rdma' script.

Content of the file reads as:
[root@client-7 ~]# cat /etc/modprobe.d/mlx4_en.conf
install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then /sbin/sysctl_perf_tuning load; fi; fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en

The file is owned by the external OFED kernel-ib RPM:
[root@client-7 ~]# rpm -q --whatprovides /etc/modprobe.d/mlx4_en.conf
kernel-ib-1.5.4-2.6.32_279.14.1.el6_lustre.g1f5b9fe.x86_64.x86_64
(Same for the client kernel-ib RPM; only the version string differs)

The failed startup of the modules in the case 'mlx4_en.conf' is present can be reproduced by:
1 Removing the HCA (echo 1 > /sys/devices/pci0000\:00/0000\:00\:03.0/0000\:02\:00.0/remove)
2 Rescan of PCI bus (echo 1 > /sys/bus/pci/rescan)
The output of 'udevadm monitor --environment', run simultaneously, shows only the initialization, but no startup of the modules. The same test sequence with 'mlx4_en.conf' removed shows that the modules are loaded correctly according to the modules.{alias,dep} mapping.

Easiest fix for the problem will be to remove the file '/etc/modprobe.d/mlx4_en.conf' from the 'packaging list' of the rpmbuild spec file for the OFED kernel-ib modules RPM.
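
Before removing anything, it can help to see which modprobe configuration files override the load of a module at all. The following is only a sketch (install_overrides is a made-up name); it relies on the documented modprobe.d behaviour that an "install" directive makes modprobe run the given command instead of loading the module directly.

```shell
# List modprobe.d files that carry an "install" directive for a module.
# Such a directive makes modprobe run the given command instead of
# loading the module directly.
install_overrides() (
    module=$1
    conf_dir=${2:-/etc/modprobe.d}
    for conf in "$conf_dir"/*.conf; do
        [ -f "$conf" ] || continue
        if grep -Eq "^[[:space:]]*install[[:space:]]+$module([[:space:]]|\$)" "$conf"; then
            echo "$conf"
        fi
    done
)
```

On the affected node, `install_overrides mlx4_core` would be expected to print /etc/modprobe.d/mlx4_en.conf, and `rpm -q --whatprovides` on each reported file shows which package owns it.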

Comment by Brian Murrell (Inactive) [ 19/Mar/13 ]

Easiest fix for the problem will be to remove the file '/etc/modprobe.d/mlx4_en.conf' from the 'packaging list' of the rpmbuild spec file for the OFED kernel-ib modules RPM.

Ahhh. Nice detective work Frank!

This /etc/modprobe.d/mlx4_en.conf is marginally interesting. Reformatting it to fix its lack of whitespace, for ease of reading:

install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") &&
    echo '--allow-unsupported-modules') mlx4_core &&
    if [ -e /etc/infiniband/openib.conf ]; then
        if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
            modprobe mlx4_en
        fi
    else
        modprobe mlx4_en
    fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") &&
    echo '--allow-unsupported-modules') mlx4_en &&
    if [ -e /etc/infiniband/openib.conf ]; then
        if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
            /sbin/sysctl_perf_tuning load
        fi
    fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en

It's an interesting little bit of code. One thing about it worth noting is the reference to /etc/infiniband/openib.conf. Is that file used for anything other than this module installation configuration? If not, might as well remove it from the kernel-ib package as well.

Comment by Frank Heckes (Inactive) [ 20/Mar/13 ]

No, the file (/etc/infiniband/openib.conf) is there, but it lacks the entries the grep commands search for,
since they are removed with the help of the 01-play-nice.....ed-script. But even if I add them, the install directives
prevent both mlx4_core and mlx4_en from being started.

I'm sorry I forgot to append the line:
g/mlx4_en.conf/d

to 01-play-nice-with-rhel5-rhel6.ed. Push it to git.
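
For readers unfamiliar with ed: g/mlx4_en.conf/d deletes every line that matches the pattern. The effect can be previewed without touching the file using the equivalent sed expression; the sketch below is a stand-in for the ed command, and the function name strip_mlx4_en_conf is hypothetical.

```shell
# Stand-in for the ed command g/mlx4_en.conf/d: print the given file
# with every line matching the pattern dropped. The file itself is
# left unmodified.
strip_mlx4_en_conf() {
    sed '/mlx4_en.conf/d' "$1"
}
```

Running it against a copy of the kernel-ib spec file and diffing the result against the original shows exactly which packaging lines the ed fragment would remove.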

Comment by Peter Jones [ 07/Apr/13 ]

Landed for 2.4
