Details
Story
Resolution: Fixed
Blocker
Lustre 2.4.0
6997
Description
Symptom:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When provisioning test nodes with ofa builds (i.e. an 'external' build of kernel-ib based on the OpenFabrics OFED tarballs) for rhel6, compiled against kernel version 2.6.32-279, initialization of the InfiniBand interfaces (ib0, ib1, ...) fails because the low-level InfiniBand HW kernel modules mlx4_core and mlx4_en are not loaded.
When the kernel-ib HW modules are loaded manually (modprobe mlx4_core, ...), the interfaces are created and operational (i.e. connected to the fabric, IP over IB works, ...).
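A quick way to confirm the symptom is to check which of the expected modules are absent from the kernel's module list. The following is a minimal sketch, not part of the original report; the function name missing_modules is hypothetical, and it assumes a listing in /proc/modules format (module name in the first column).

```shell
#!/bin/bash
# Print every requested module that does not appear in a
# /proc/modules-style listing (module name is the first field).
# Usage: missing_modules LISTING MODULE...
missing_modules() {
    local listing=$1; shift
    local mod
    for mod in "$@"; do
        # awk exits 0 only if some line's first field equals the module name
        if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"; then
            echo "$mod"
        fi
    done
}
```

On an affected node, `missing_modules /proc/modules mlx4_core mlx4_en` would be expected to print both names before the manual modprobe and nothing afterwards.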
The kernel-ib RPM is normally built with a set of startup scripts (/etc/init.d/openibd, links in /etc/rc.d/*, chkconfig execution, ...) to ensure that the InfiniBand HW kernel modules are loaded during system start. These files and scripts are missing from the kernel-ib RPM.
Because of an installation conflict between kernel-ib and the openibd RPM on the canonical distribution 'rhel5', the lbuild script removes the scripts and files from the OFED kernel-ib SPEC file before the RPMs are created (rpmbuild). See LU-388 for further details.
This conflict no longer exists, since openib-<version>.rpm is not part of rhel6 anymore. As a consequence, the functionality of initializing the InfiniBand HW is gone too, because the openib RPM contain(ed) the necessary startup scripts:
rpm -qil --scripts -p openib-1.4.1-5.el5.noarch.rpm
warning: openib-1.4.1-5.el5.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID 192a7d7d: NOKEY
Name : openib Relocations: (not relocatable)
Version : 1.4.1 Vendor: Scientific Linux
Release : 5.el5 Build Date: Wed 31 Mar 2010 12:39:27 AM PDT
Install Date: (not installed) Build Host: norob.fnal.gov
Group : System Environment/Base Source RPM: openib-1.4.1-5.el5.src.rpm
Size : 27021 License: GPL/BSD
Signature : DSA/SHA1, Wed 31 Mar 2010 12:52:50 PM PDT, Key ID b0b4183f192a7d7d
URL : http://www.openfabrics.org/
Summary : OpenIB Infiniband Driver Stack
Description :
User space initialization scripts for the kernel InfiniBand drivers
postinstall scriptlet (using /bin/sh):
if [ $1 = 1 ]; then
/sbin/chkconfig --add openibd
fi
preuninstall scriptlet (using /bin/sh):
if [ $1 = 0 ]; then
/sbin/chkconfig --del openibd
fi
/etc/ofed
/etc/ofed/fixup-mtrr.awk
/etc/ofed/openib.conf
/etc/rc.d/init.d/openibd
/etc/sysconfig/network-scripts/ifup-ib
/etc/udev/rules.d/90-ib.rules
The script (openibd) has been 'moved' to the kernel-ib package for OFED version 1.5.*.
To overcome the situation, the following code change in lustre-reviews/build/lbuild (inside the loop beginning at line 1216; `for file in $(ls ${TOPDIR}/lustre/build/patches/ofed/*.patch); do`)
if [[ $file =~ "${CANONICAL_TARGET}" ]]; then
ed_fragment3="$ed_fragment3
$(cat "$file")"
let n=$n+1
fi
and a rename of the ed script (which removes the packaging of the openibd files and scripts) from
01-play-nice-with-RHEL5.ed
to
01-play-nice-with-rhel5.ed
are necessary. The rename is needed because the match is case-sensitive and the canonical target is spelled 'rhel5'. Together, these changes ensure that kernel-ib ofa builds for rhel5 are created without the openibd scripts, while the scripts remain available in rhel6 RPMs.
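The effect of the file-name match can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the actual lbuild code: the function name select_fragments and its directory argument are hypothetical, and only the `[[ ... =~ ... ]]` test mirrors the proposed change.

```shell
#!/bin/bash
# Print the files in a directory whose base name contains the given
# canonical target string (e.g. "rhel5"). The quoted right-hand side of
# =~ is matched as a literal, case-sensitive substring.
# Usage: select_fragments DIR TARGET
select_fragments() {
    local dir=$1 target=$2 file
    for file in "$dir"/*; do
        [ -e "$file" ] || continue   # the glob matched nothing
        if [[ $(basename "$file") =~ "$target" ]]; then
            echo "$file"
        fi
    done
}
```

With a target of rhel5, a file named 01-play-nice-with-RHEL5.ed is skipped while 01-play-nice-with-rhel5.ed is selected; with a target of rhel6, neither is selected, so the openibd packaging stays in place.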
Issue Links
- is duplicated by LU-2972 Execution conflict of OFED initialisation script (Closed)
For both client and server ofa builds, the modules mlx4_core and mlx4_en are not loaded by udevd (started from /etc/rc.d/rc.sysinit) if the configuration file '/etc/modprobe.d/mlx4_en.conf' is present. If the file is removed (or moved to another directory or renamed), loading of mlx4_core and mlx4_en works, and the interface ib0 is therefore configured correctly by the '/etc/init.d/rdma' script.
The content of the file reads:
[root@client-7 ~]# cat /etc/modprobe.d/mlx4_en.conf
install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '-allow-unsupported-modules') mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '-allow-unsupported-modules') mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then /sbin/sysctl_perf_tuning load; fi; fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en
The file is owned by the external OFED kernel-ib RPM:
[root@client-7 ~]# rpm -q --whatprovides /etc/modprobe.d/mlx4_en.conf
kernel-ib-1.5.4-2.6.32_279.14.1.el6_lustre.g1f5b9fe.x86_64.x86_64
(The same holds for the client kernel-ib RPM; only the version string differs.)
The failed loading of the modules when 'mlx4_en.conf' is present can be reproduced by:
1. Removing the HCA (echo 1 > /sys/devices/pci0000\:00/0000\:00\:03.0/0000\:02\:00.0/remove)
2. Rescanning the PCI bus (echo 1 > /sys/bus/pci/rescan)
The output of 'udevadm monitor --environment', run simultaneously, shows only the initialization of the device, but no loading of the modules. The same test sequence with 'mlx4_en.conf' removed shows that the modules are loaded correctly according to the modules.{alias,dep} mapping.
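The two reproduction steps can be wrapped in a small script. This is a minimal sketch, assuming the HCA sits at the PCI path quoted above (it will differ on other nodes); the names sysfs_write, reproduce, HCA_PATH and DRY_RUN are hypothetical. By default it only prints what it would do; a real run needs root and should be accompanied by 'udevadm monitor --environment' in a second terminal.

```shell
#!/bin/bash
# PCI device path of the HCA as given in this report; adjust per node.
HCA_PATH='/sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0'

# Write a value to a sysfs file, or only print the action when
# DRY_RUN is 1 (the default in this sketch).
sysfs_write() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would write '$1' to $2"
    else
        echo "$1" > "$2"
    fi
}

# Step 1: remove the HCA from the bus. Step 2: rescan the PCI bus.
reproduce() {
    sysfs_write 1 "$HCA_PATH/remove"
    sysfs_write 1 /sys/bus/pci/rescan
}
```

Calling `reproduce` prints the two planned writes; `DRY_RUN=0 reproduce` performs them.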
The easiest fix for the problem is to remove the file '/etc/modprobe.d/mlx4_en.conf' from the 'packaging list' of the rpmbuild spec file for the OFED kernel-ib modules RPM.
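Until a rebuilt RPM is available, the workaround described above (moving the file out of the way) can be applied on an installed node. The following is a minimal sketch; the function name move_conf_aside and the backup directory are hypothetical. It moves the file to a different directory rather than renaming it in place, since the report only confirms that the modules load once the file is no longer at its original location.

```shell
#!/bin/bash
# Move a modprobe configuration file into a backup directory outside
# /etc/modprobe.d so that modprobe no longer picks it up.
# Usage: move_conf_aside CONF_FILE BACKUP_DIR
move_conf_aside() {
    local conf=$1 backup_dir=$2
    if [ ! -e "$conf" ]; then
        echo "not present: $conf" >&2
        return 1
    fi
    mkdir -p "$backup_dir" && mv "$conf" "$backup_dir"/
}
```

For example, `move_conf_aside /etc/modprobe.d/mlx4_en.conf /root/modprobe.d.disabled` (run as root), followed by the remove/rescan sequence or a reboot to verify that mlx4_core and mlx4_en are loaded.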