<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:29:16 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2907] Infiniband HW kernel modules of OFA builds not started at system boot</title>
                <link>https://jira.whamcloud.com/browse/LU-2907</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Symptom:&lt;br/&gt;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^&lt;br/&gt;
When test nodes are provisioned with OFA builds (i.e. &apos;external&apos; builds of kernel-ib based on OpenFabrics OFED tarballs) based on rhel6 and compiled against kernel version 2.6.32-279, the initialization of the Infiniband interfaces (ib0, ib1,...) fails because the low-level Infiniband HW kernel modules mlx4_core and mlx4_en are not loaded.&lt;/p&gt;

&lt;p&gt;When the kernel-ib HW modules are loaded manually (modprobe mlx4_core,...), the interfaces are created and operational (i.e. connected to the fabric, IP over IB works,...)&lt;/p&gt;

&lt;p&gt;The kernel-ib RPM is normally built with a set of startup scripts (/etc/init.d/openibd and links in /etc/rc.d/*, chkconfig execution,...) to ensure that the Infiniband HW kernel modules are loaded during system start. These files/scripts are missing in the kernel-ib RPM.&lt;/p&gt;


&lt;p&gt;Due to an installation conflict between kernel-ib and the openib RPM of the canonical distribution &apos;rhel5&apos;, the scripts/files are removed from the OFED kernel-ib SPEC file by the lbuild script before the RPMs are built (rpmbuild). (See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-388&quot; title=&quot;Lustre&amp;#39;s kernel-ib RPM conflicts with EL5&amp;#39;s openib RPM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-388&quot;&gt;&lt;del&gt;LU-388&lt;/del&gt;&lt;/a&gt; for further details.)&lt;/p&gt;

&lt;p&gt;This conflict no longer exists, since openib-&amp;lt;version&amp;gt;.rpm isn&apos;t part of rhel6 anymore. However, the functionality of initializing the Infiniband HW is gone, too, because the openib RPM contain(ed) the necessary startup scripts:&lt;/p&gt;

&lt;p&gt;rpm -qil --scripts -p openib-1.4.1-5.el5.noarch.rpm&lt;br/&gt;
warning: openib-1.4.1-5.el5.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID 192a7d7d: NOKEY&lt;br/&gt;
Name        : openib                       Relocations: (not relocatable)&lt;br/&gt;
Version     : 1.4.1                             Vendor: Scientific Linux&lt;br/&gt;
Release     : 5.el5                         Build Date: Wed 31 Mar 2010 12:39:27 AM PDT&lt;br/&gt;
Install Date: (not installed)               Build Host: norob.fnal.gov&lt;br/&gt;
Group       : System Environment/Base       Source RPM: openib-1.4.1-5.el5.src.rpm&lt;br/&gt;
Size        : 27021                            License: GPL/BSD&lt;br/&gt;
Signature   : DSA/SHA1, Wed 31 Mar 2010 12:52:50 PM PDT, Key ID b0b4183f192a7d7d&lt;br/&gt;
URL         : &lt;a href=&quot;http://www.openfabrics.org/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://www.openfabrics.org/&lt;/a&gt;&lt;br/&gt;
Summary     : OpenIB Infiniband Driver Stack&lt;br/&gt;
Description :&lt;br/&gt;
User space initialization scripts for the kernel InfiniBand drivers&lt;br/&gt;
postinstall scriptlet (using /bin/sh):&lt;br/&gt;
if [ $1 = 1 ]; then&lt;br/&gt;
    /sbin/chkconfig --add openibd &lt;br/&gt;
fi&lt;br/&gt;
preuninstall scriptlet (using /bin/sh):&lt;br/&gt;
if [ $1 = 0 ]; then&lt;br/&gt;
    /sbin/chkconfig --del openibd &lt;br/&gt;
fi&lt;br/&gt;
/etc/ofed&lt;br/&gt;
/etc/ofed/fixup-mtrr.awk&lt;br/&gt;
/etc/ofed/openib.conf&lt;br/&gt;
/etc/rc.d/init.d/openibd&lt;br/&gt;
/etc/sysconfig/network-scripts/ifup-ib&lt;br/&gt;
/etc/udev/rules.d/90-ib.rules&lt;/p&gt;


&lt;p&gt;The script (openibd) has been &apos;moved&apos; to the kernel-ib package for OFED version 1.5.*.&lt;/p&gt;

&lt;p&gt;To overcome the situation, the following code change in lustre-reviews/build/lbuild (inside the loop beginning at line 1216; `for file in $(ls ${TOPDIR}/lustre/build/patches/ofed/*.patch); do`)&lt;/p&gt;

&lt;p&gt;if [[ &quot;$file&quot; =~ &quot;${CANONICAL_TARGET}&quot; ]]; then&lt;br/&gt;
   ed_fragment3=&quot;$ed_fragment3&lt;br/&gt;
   $(cat $file)&quot;&lt;br/&gt;
   let n=$n+1&lt;br/&gt;
fi&lt;/p&gt;

&lt;p&gt;and a rename of the ed script (which removes the packaging of the openibd files and scripts) from&lt;/p&gt;

&lt;p&gt;     01-play-nice-with-RHEL5.ed &lt;br/&gt;
to&lt;br/&gt;
     01-play-nice-with-rhel5.ed&lt;/p&gt;

&lt;p&gt;is necessary. This ensures that kernel-ib OFA builds for rhel5 are created without the openibd scripts, while the scripts remain available in the rhel6 RPMs.&lt;/p&gt;</description>
                <environment></environment>
        <key id="17761">LU-2907</key>
            <summary>Infiniband HW kernel modules of OFA builds not started at system boot</summary>
                <type id="6" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11315&amp;avatarType=issuetype">Story</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="heckes">Frank Heckes</assignee>
                                    <reporter username="heckes">Frank Heckes</reporter>
                        <labels>
                            <label>HB</label>
                    </labels>
                <created>Tue, 5 Mar 2013 12:15:31 +0000</created>
                <updated>Tue, 23 Apr 2013 17:14:50 +0000</updated>
                            <resolved>Sun, 7 Apr 2013 17:33:31 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                    <fixVersion>Lustre 2.4.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="53350" author="brian" created="Tue, 5 Mar 2013 13:44:24 +0000"  >&lt;p&gt;Frank,&lt;/p&gt;

&lt;p&gt;I don&apos;t understand.  We talked about this at quite some length (must have been several hours over a few conversations) and I thought we came to the same conclusion.  I thought we had agreed that the patching (01-play-nice-with-RHEL5.ed) in lbuild should stay as it is for both EL5 and EL6 and the solution to the problem of initializing drivers on EL6 was the job of the rdma initscript from the rdma RPM.  i.e. simply &quot;yum install rdma&quot; on EL6 nodes to get initscripts to load the I/B drivers.&lt;/p&gt;

&lt;p&gt;Has something changed since those conversations?&lt;/p&gt;</comment>
                            <comment id="53425" author="heckes" created="Wed, 6 Mar 2013 06:48:27 +0000"  >&lt;p&gt;Hi Brian,&lt;/p&gt;

&lt;p&gt;well, the problem is that the rdma RPM (script) was there from the beginning, i.e. it was installed during the node provisioning:&lt;br/&gt;
...&lt;br/&gt;
rng-tools-2-13.el6_2.x86_64                   Mon 04 Mar 2013 08:39:51 AM PST&lt;br/&gt;
readahead-1.5.6-1.el6.x86_64                  Mon 04 Mar 2013 08:39:51 AM PST&lt;br/&gt;
rdma-3.3-4.el6_3.noarch                       Mon 04 Mar 2013 08:39:51 AM PST&lt;br/&gt;
quota-3.17-16.el6.x86_64                      Mon 04 Mar 2013 08:39:51 AM PST&lt;br/&gt;
microcode_ctl-1.17-11.el6.x86_64              Mon 04 Mar 2013 08:39:51 AM PST&lt;br/&gt;
...&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;and it failed. &lt;/p&gt;

&lt;p&gt;I looked at the rdma script (/etc/init.d/rdma) again and reconsidered it, but it will only initialize the Infiniband interface with an IP address&lt;br/&gt;
if the card has been recognized by the OS. This is only the case if the modules mlx4_core, mlx4_en and mlx4_ib are loaded, and loading them is what the rdma script&lt;br/&gt;
doesn&apos;t provide. &lt;/p&gt;

&lt;p&gt;It fails during system boot:&lt;/p&gt;

&lt;p&gt;Bringing up interface ib0:  Device ib0 does not seem to be present, delaying initialization.&lt;br/&gt;
[FAILED]&lt;/p&gt;

&lt;p&gt;No hardware was detected:&lt;br/&gt;
[root@client-7 ~]# /etc/init.d/rdma status&lt;br/&gt;
Low level hardware support loaded:&lt;br/&gt;
        none found&lt;/p&gt;

&lt;p&gt;Upper layer protocol modules:&lt;br/&gt;
        ib_ipoib &lt;/p&gt;

&lt;p&gt;User space access modules:&lt;br/&gt;
        rdma_ucm ib_ucm ib_uverbs ib_umad &lt;/p&gt;

&lt;p&gt;Connection management modules:&lt;br/&gt;
        rdma_cm ib_cm iw_cm &lt;/p&gt;

&lt;p&gt;Configured IPoIB interfaces: none&lt;br/&gt;
Currently active IPoIB interfaces: none&lt;br/&gt;
[root@client-7 ~]# ip link&lt;br/&gt;
1: lo: &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt; mtu 16436 qdisc noqueue state UNKNOWN &lt;br/&gt;
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00&lt;br/&gt;
2: eth0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt; mtu 1500 qdisc mq state UP qlen 1000&lt;br/&gt;
    link/ether 00:30:48:f7:72:4e brd ff:ff:ff:ff:ff:ff&lt;br/&gt;
3: eth1: &amp;lt;NO-CARRIER,BROADCAST,MULTICAST,UP&amp;gt; mtu 1500 qdisc mq state DOWN qlen 1000&lt;br/&gt;
    link/ether 00:30:48:f7:72:4f brd ff:ff:ff:ff:ff:ff&lt;/p&gt;


&lt;p&gt;rdma script is active:&lt;br/&gt;
[root@client-7 ~]# chkconfig --list rdma&lt;br/&gt;
rdma            0:off   1:off   2:on    3:on    4:on    5:on    6:off&lt;/p&gt;

&lt;p&gt;Loading the IB HW modules manually (mlx4_core, mlx4_en, mlx4_ib) &apos;fixes&apos; the problem.&lt;/p&gt;

&lt;p&gt;Furthermore, I found that the openib RPM (part of the rhel5 distro) contained the /etc/init.d/openibd script, which starts the HW modules. &lt;br/&gt;
This is only the case for rhel5 and OFED-1.4.* &lt;/p&gt;

&lt;p&gt;For rhel6 the distro no longer contains the openib RPM. Therefore there&apos;s no conflict.&lt;/p&gt;


&lt;p&gt;At first glance there&apos;s the strange fact that the &apos;inkernel&apos; build initializes the Infiniband card correctly. But the reason is&lt;br/&gt;
that the modules are part of the initial ramdisk (extracted from inkernel build of #180@lustre-b2_1): &lt;/p&gt;

&lt;p&gt;./lib/modules/2.6.32-279.14.1.el6_lustre.g044a3a2.x86_64:&lt;br/&gt;
total 4548&lt;br/&gt;
-rw-r--r-- 1 root root  23712 Mar  6 03:31 acpi-cpufreq.ko&lt;br/&gt;
-rw-r--r-- 1 root root  85080 Mar  6 03:31 ahci.ko&lt;br/&gt;
-rw-r--r-- 1 root root  13672 Mar  6 03:31 ata_generic.ko&lt;br/&gt;
...&lt;br/&gt;
...&lt;br/&gt;
...&lt;br/&gt;
-rw-r--r-- 1 root root  36240 Mar  6 03:31 microcode.ko&lt;br/&gt;
-rw-r--r-- 1 root root 300952 Mar  6 03:31 mlx4_core.ko&lt;br/&gt;
-rw-r--r-- 1 root root 126960 Mar  6 03:31 mlx4_en.ko&lt;br/&gt;
-rw-r--r-- 1 root root  99544 Mar  6 03:31 mlx4_ib.ko&lt;br/&gt;
-rw-r--r-- 1 root root  21055 Mar  6 03:31 modules.alias&lt;/p&gt;


&lt;p&gt;This is also visible from the system boot messages:&lt;/p&gt;

&lt;p&gt;mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)&lt;br/&gt;
mlx4_core: Initializing 0000:02:00.0&lt;br/&gt;
mlx4_core 0000:02:00.0: PCI INT A -&amp;gt; GSI 24 (level, low) -&amp;gt; IRQ 24&lt;br/&gt;
mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.0 (Dec 2011)&lt;br/&gt;
mlx4_en 0000:02:00.0: UDP RSS is not supported on this device.&lt;br/&gt;
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)&lt;/p&gt;

&lt;p&gt;Setting hostname client-7.lab.whamcloud.com:  [  OK  ]&lt;br/&gt;
Setting up Logical Volume Management:   No volume groups found&lt;br/&gt;
[  OK  ]&lt;/p&gt;


&lt;p&gt;i.e. the modules are loaded before init executes the run-level scripts.&lt;/p&gt;

&lt;p&gt;This could be a workaround for the OFA builds, too, i.e. add mlx4_{core,en,ib} to /etc/sysconfig/kernel &lt;br/&gt;
to ensure they are loaded at system boot.&lt;/p&gt;</comment>
                            <comment id="53427" author="brian" created="Wed, 6 Mar 2013 07:11:21 +0000"  >&lt;p&gt;Frank,&lt;/p&gt;

&lt;p&gt;To be clear, the issue of conflict is not simply one of RPM naming, but one of having multiple initscripts trying to do the same things.  If we install an initscript in kernel-ib that fiddles with I/B and the user decides to also install the rdma RPM, which also fiddles with I/B, there is a conflict, as both should not be trying to do the same thing.  Ultimately we need our added kernel-ib to integrate with the base O/S as closely as we can.&lt;/p&gt;

&lt;p&gt;So the question becomes, why are the mlx4_* modules from the stock kernel loaded during boot but when they are supplied by kernel-ib they are not loaded?&lt;/p&gt;

&lt;p&gt;Perhaps you need to compare the operation of the rdma initscript with and without kernel-ib.  You could insert the following right after the first line (i.e. after the #! line) of the rdma initscript:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;exec 2&amp;gt;/tmp/rdma.debug
set -x
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will log the xtrace of that initscript to /tmp/rdma.debug.  Do that and boot with both kernel-ib installed and without it installed and compare them and see if they operate differently, and if they do, why they do.&lt;/p&gt;

&lt;p&gt;It might also be worth while taking an inventory of the installed modules (i.e. lsmod) before and after the rdma initscript runs during boot.  You could add an &quot;lsmod &amp;gt; /tmp/before&quot; to the initscript before it calls start() and an &quot;lsmod &amp;gt;/tmp/after&quot; after it exits from start() and again, run that with and without kernel-ib to see what difference in behaviour there is.&lt;/p&gt;
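&lt;p&gt;A minimal sketch of that before/after comparison, assuming the two lsmod dumps have already been captured as described. The function names and the sample data are hypothetical and invented for illustration; only the set difference between the two module lists is the point:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
```python
def loaded_modules(lsmod_text):
    """Return the set of module names from lsmod output (header line skipped)."""
    names = set()
    for line in lsmod_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def newly_loaded(before_text, after_text):
    """Modules present after the initscript ran but not before, sorted by name."""
    return sorted(loaded_modules(after_text) - loaded_modules(before_text))


# Invented sample data for illustration only (not captured from a real node).
BEFORE = """Module                  Size  Used by
mlx4_core             199494  0
"""
AFTER = """Module                  Size  Used by
ib_ipoib               80839  0
mlx4_ib                50232  0
mlx4_core             199494  1 mlx4_ib
"""

if __name__ == "__main__":
    # In practice the two strings would be read from /tmp/before and /tmp/after.
    print(newly_loaded(BEFORE, AFTER))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;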

&lt;p&gt;Ultimately what you have here is a case were something ought to work but doesn&apos;t.  In such cases it&apos;s usually better to understand why something that ought to work but doesn&apos;t doesn&apos;t and approach from there.  The problem is that I don&apos;t think we yet know why that something that ought to work doesn&apos;t actually work so any attempt to band-aid it has a likelihood of causing some other unexpected problem and it might not happen until it&apos;s out in the field where it becomes a customer support problem (i.e. much more expensive to deal with) and a mea culpa.&lt;/p&gt;</comment>
                            <comment id="53448" author="heckes" created="Wed, 6 Mar 2013 11:09:22 +0000"  >&lt;p&gt;Hi Brian,&lt;/p&gt;

&lt;p&gt;the reason why the stock (&apos;inkernel&apos;) build loads the mlx4_{core,en,ib} modules is that they&apos;re included in the initial ramdisk of the Lustre kernel (--&amp;gt; see my previous comment above). I think that is well understood.&lt;/p&gt;

&lt;p&gt;We could use the same idea for the external (OFA) builds to avoid the risk of clashes between whatever scripts are available in the distro and the kernel-ib scripts.&lt;br/&gt;
This could be done by adding the modules to /etc/sysconfig/kernel and re-creating the Lustre kernel init ramdisk, as said above.&lt;br/&gt;
In that case, the application of the ed script inside the lbuild script could indeed be left as it is.&lt;/p&gt;

&lt;p&gt;For rhel5 there used to be a dedicated RPM (openib; listed in my first comment) that contained the init script &apos;/etc/init.d/openibd&apos; which is (was) supplied by OFED-1.4.* kernel-ib RPM, too. That was the conflict resolved in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-388&quot; title=&quot;Lustre&amp;#39;s kernel-ib RPM conflicts with EL5&amp;#39;s openib RPM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-388&quot;&gt;&lt;del&gt;LU-388&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
For rhel6 the openib-RPM doesn&apos;t exist anymore, i.e. the packaging has changed. &lt;/p&gt;

&lt;p&gt;But I agree rdma and openibd (of the OFED-1.5.4 kernel-ib) can modprobe the same modules (besides the HW core modules &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;) and set the IP address twice, but that won&apos;t do any harm, I guess. (I&apos;ll try that on Toro: client-7)&lt;/p&gt;</comment>
                            <comment id="53449" author="brian" created="Wed, 6 Mar 2013 11:15:32 +0000"  >&lt;p&gt;If the mlx4_* modules really are only being installed by virtue of them being in the ramdisk, why do they not get included in the ramdisk when kernel-ib is installed?  i.e. Why do we have to modify /etc/sysconfig/kernel for the kernel-ib case and not for the stock kernel case?&lt;/p&gt;</comment>
                            <comment id="53463" author="chris" created="Wed, 6 Mar 2013 13:40:07 +0000"  >&lt;p&gt;Brian: I have little insight into the detail on this. But I am surprised that the standard OFED build would not be the best outcome, why do we need to modify the standard build? Or more correctly why would the standard build be of a form that is not providing the best functionality?&lt;/p&gt;

&lt;p&gt;Frank: Is it the case that the standard OFED build, without the spec file change, builds, installs and runs properly - or have I missed something? &lt;/p&gt;</comment>
                            <comment id="53467" author="brian" created="Wed, 6 Mar 2013 14:00:09 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=chris&quot; class=&quot;user-hover&quot; rel=&quot;chris&quot;&gt;chris&lt;/a&gt;: Because the standard OFED build assumes a &quot;vanilla&quot; Linux installation and does not really take into account vendor &quot;Value Add&quot; such as RedHat has done with their &quot;rdma&quot; package.  Ideally, their packaging process should try to figure out if they need to interoperate with the vendor&apos;s &quot;Value Add&quot;, but I don&apos;t believe it does.&lt;/p&gt;</comment>
                            <comment id="53523" author="heckes" created="Thu, 7 Mar 2013 06:43:10 +0000"  >&lt;p&gt;Created change for lbuild to alter the kernel-ib SPEC file based on the canonical target name of the distribution (will preserve changes for rhel5).&lt;/p&gt;

&lt;p&gt;In parallel, I will continue investigating the option of adding mlx4_{core,en,ib} to the initrd and why this isn&apos;t done for OFA builds.&lt;/p&gt;</comment>
                            <comment id="54135" author="heckes" created="Fri, 15 Mar 2013 15:59:19 +0000"  >&lt;p&gt;For the inkernel build, mlx4_core and mlx4_en are not part of the initramfs; I checked the initrd.kdump file by mistake. Anyway, the important finding is that the modules are loaded before the /etc/init.d/rdma script is executed.&lt;/p&gt;

&lt;p&gt;For the inkernel build the following sequence relevant to the infiniband initialization is performed:&lt;/p&gt;

&lt;p&gt;init runs /etc/rc.d/rc.sysinit&lt;br/&gt;
/etc/rc.sysinit runs /sbin/start_udev&lt;br/&gt;
/sbin/start_udev runs udevd&lt;br/&gt;
udevd receives an event from the kernel that the HCA interface is available&lt;br/&gt;
udevd triggers the load of mlx4_core and mlx4_en &lt;br/&gt;
/etc/rc.sysinit executes the active run-level scripts&lt;br/&gt;
rdma is executed&lt;br/&gt;
   if mlx4_core is loaded, mlx4_ib is loaded   ---&amp;gt; which will create the interface (ib0, ib...)&lt;br/&gt;
   if the interface (ib0) is available, the IP configuration is done&lt;br/&gt;
   rdma finishes with success&lt;/p&gt;

&lt;p&gt;The &apos;critical&apos; part for script rdma is whether mlx4_core is loaded or not. If the module is not present the&lt;br/&gt;
initialization of the infiniband interface fails.&lt;/p&gt;

&lt;p&gt;The behaviour (for the inkernel build) can be reproduced at run-time by running udevadm monitor --environment and executing &lt;br/&gt;
    /etc/init.d/rdma stop&lt;br/&gt;
    echo 1 &amp;gt; /sys/devices/pci0000\:00/0000\:00\:03.0/0000\:02\:00.0/remove&lt;/p&gt;

&lt;p&gt; --&amp;gt;  this will remove all mlx4_* modules and the HCA (infiniband) card from the OS&lt;/p&gt;

&lt;p&gt;Executing: &lt;br/&gt;
    echo 1 &amp;gt; /sys/bus/pci/rescan&lt;/p&gt;

&lt;p&gt;adds the hardware and udevd starts the mlx4_en, mlx4_core driver (see client-7-)&lt;/p&gt;

&lt;p&gt;If the hardware isn&apos;t removed, but all mlx4_* modules are unloaded the udevd reloads the mlx4_core, mlx4_en &lt;br/&gt;
when starting the ib-interface via /etc/init.d/rdma.&lt;br/&gt;
The startup is handled by the entry:&lt;br/&gt;
alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core&lt;/p&gt;


&lt;p&gt;For OFA builds only the HCA is detected, but the drivers are not loaded. The reason is a duplicate entry in&lt;br/&gt;
modules.alias for the ofa build:&lt;/p&gt;

&lt;p&gt;client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_en&lt;br/&gt;
client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core&lt;/p&gt;
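&lt;p&gt;A minimal sketch for listing alias patterns that are claimed by more than one module in a modules.alias file. The function name is hypothetical, and the sample reuses the two alias lines quoted above without the file-name prefix; whether such a duplicate is actually harmful on a given system is a separate question that the sketch does not answer:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
```python
from collections import defaultdict


def duplicate_aliases(alias_text):
    """Map each alias pattern claimed by more than one module to its modules."""
    claims = defaultdict(list)
    for line in alias_text.splitlines():
        fields = line.split()
        # modules.alias entries have the form: alias PATTERN MODULE
        if len(fields) == 3 and fields[0] == "alias":
            claims[fields[1]].append(fields[2])
    # Keep only patterns that map to something other than exactly one module.
    return {pat: mods for pat, mods in claims.items() if len(set(mods)) != 1}


# The two alias lines from the OFA build, as quoted in this comment.
SAMPLE = """# Aliases extracted from modules themselves.
alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_en
alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core
"""

if __name__ == "__main__":
    for pattern, modules in duplicate_aliases(SAMPLE).items():
        print(pattern, modules)
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;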


&lt;p&gt;Removing the entry for mlx4_en fixes the problem, and the rdma script works for ofa, too.&lt;/p&gt;</comment>
                            <comment id="54139" author="brian" created="Fri, 15 Mar 2013 17:04:42 +0000"  >&lt;p&gt;Frank,&lt;/p&gt;

&lt;p&gt;Was this discovery:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For OFA builds only the HCA is detected, but the drivers are not loaded. The reason is a duplicate entry in&lt;br/&gt;
modules.alias for the ofa build:&lt;/p&gt;

&lt;p&gt;client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_en&lt;br/&gt;
client-7-modules.alias-ofa:alias pci:v000015B3d0000673Csv*sd*bc*sc*i* mlx4_core&lt;/p&gt;

&lt;p&gt;Removing the entry for mlx4_en fixes the problem, and the rdma script works for ofa, too.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;made after we spoke on Friday?  i.e. is that the smoking gun and if we figure out why that duplicate entry (which is only there when using the OFA I/B, is that right?) is being created it will resolve the issue and the rdma initscript will be fully-functional?&lt;/p&gt;</comment>
                            <comment id="54208" author="heckes" created="Sat, 16 Mar 2013 17:57:25 +0000"  >&lt;p&gt;Yes, that&apos;s right, so we have two potential solutions for the problem. I haven&apos;t found out yet why the entries for mlx4_en are created. I&apos;ll check that on Monday.&lt;/p&gt;</comment>
                            <comment id="54305" author="heckes" created="Mon, 18 Mar 2013 20:57:13 +0000"  >&lt;p&gt;For both client and server ofa builds the modules mlx4_core, mlx4_en won&apos;t be loaded by udevd (started from /etc/rc.d/rc.sysinit) if the configuration file &apos;/etc/modprobe.d/mlx4_en.conf&apos; is present. If the file is removed (or moved to other directory or file name) startup of the mlx4_core, mlx4_en works and&lt;br/&gt;
therefore the interface ib0 is configured correctly by the &apos;/etc/init.d/rdma&apos; script.  &lt;/p&gt;


&lt;p&gt;Content of the file reads as:&lt;br/&gt;
[root@client-7 ~]# cat /etc/modprobe.d/mlx4_en.conf &lt;br/&gt;
install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq &quot;^allow_unsupported_modules&quot;) &amp;amp;&amp;amp; echo &apos;--allow-unsupported-modules&apos;) mlx4_core &amp;amp;&amp;amp; if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q &quot;^MLX4_EN_LOAD=yes&quot; /etc/infiniband/openib.conf &amp;gt; /dev/null 2&amp;gt;&amp;amp;1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi&lt;br/&gt;
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq &quot;^allow_unsupported_modules&quot;) &amp;amp;&amp;amp; echo &apos;--allow-unsupported-modules&apos;) mlx4_en &amp;amp;&amp;amp; if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q &quot;^RUN_SYSCTL=yes&quot; /etc/infiniband/openib.conf &amp;gt; /dev/null 2&amp;gt;&amp;amp;1); then /sbin/sysctl_perf_tuning load; fi; fi&lt;br/&gt;
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en&lt;/p&gt;
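&lt;p&gt;A minimal sketch for spotting which modules a modprobe.d file overrides with install or remove directives, which is the mechanism this file uses. The function name is hypothetical, and the commands in the sample are abbreviated stand-ins for the much longer ones shown above:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
```python
def overridden_modules(conf_text):
    """Return {module: [directives]} for install/remove lines of a modprobe.d file."""
    overrides = {}
    for line in conf_text.splitlines():
        # Each directive has the form: install|remove MODULE COMMAND...
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0] in ("install", "remove"):
            overrides.setdefault(fields[1], []).append(fields[0])
    return overrides


# Abbreviated stand-in for /etc/modprobe.d/mlx4_en.conf (commands shortened).
SAMPLE = """install mlx4_core modprobe --ignore-install mlx4_core
install mlx4_en modprobe --ignore-install mlx4_en
remove mlx4_en modprobe -r --ignore-remove mlx4_en
"""

if __name__ == "__main__":
    for module, directives in sorted(overridden_modules(SAMPLE).items()):
        print(module, directives)
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;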

&lt;p&gt;The file is owned by the external OFED kernel-ib RPM:&lt;br/&gt;
[root@client-7 ~]# rpm -q --whatprovides /etc/modprobe.d/mlx4_en.conf &lt;br/&gt;
kernel-ib-1.5.4-2.6.32_279.14.1.el6_lustre.g1f5b9fe.x86_64.x86_64&lt;br/&gt;
(Same for client kernel-ib RPM; version string is only different)&lt;/p&gt;


&lt;p&gt;The failed startup of the modules when &apos;mlx4_en.conf&apos; is present can be reproduced by:&lt;br/&gt;
1. Removing the HCA (echo 1 &amp;gt; /sys/devices/pci0000\:00/0000\:00\:03.0/0000\:02\:00.0/remove)&lt;br/&gt;
2. Rescanning the PCI bus (echo 1 &amp;gt; /sys/bus/pci/rescan)&lt;br/&gt;
The output of &apos;udevadm monitor --environment&apos;, run simultaneously, shows only the initialization, but no startup of the modules. The same test sequence with &apos;mlx4_en.conf&apos; removed shows that the modules are loaded correctly according to the modules.{alias,dep} mapping.&lt;/p&gt;

&lt;p&gt;Easiest fix for the problem will be to remove the file &apos;/etc/modprobe.d/mlx4_en.conf&apos; from the &apos;packaging list&apos; of the rpmbuild spec file for the OFED kernel-ib modules RPM.&lt;/p&gt;</comment>
                            <comment id="54343" author="brian" created="Tue, 19 Mar 2013 05:15:21 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Easiest fix for the problem will be to remove the file &apos;/etc/modprobe.d/mlx4_en.conf&apos; from the &apos;packaging list&apos; of the rpmbuild spec file for the OFED kernel-ib modules RPM.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ahhh.  Nice detective work Frank!&lt;/p&gt;

&lt;p&gt;This &lt;tt&gt;/etc/modprobe.d/mlx4_en.conf&lt;/tt&gt; is marginally interesting.  Reformatting it with added whitespace for ease of reading:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq &lt;span class=&quot;code-quote&quot;&gt;&quot;^allow_unsupported_modules&quot;&lt;/span&gt;) &amp;amp;&amp;amp;
    echo &lt;span class=&quot;code-quote&quot;&gt;&apos;--allow-unsupported-modules&apos;&lt;/span&gt;) mlx4_core &amp;amp;&amp;amp;
    &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; [ -e /etc/infiniband/openib.conf ]; then
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; ( grep -q &lt;span class=&quot;code-quote&quot;&gt;&quot;^MLX4_EN_LOAD=yes&quot;&lt;/span&gt; /etc/infiniband/openib.conf &amp;gt; /dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; 2&amp;gt;&amp;amp;1); then
            modprobe mlx4_en
        fi
    &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
        modprobe mlx4_en
    fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq &lt;span class=&quot;code-quote&quot;&gt;&quot;^allow_unsupported_modules&quot;&lt;/span&gt;) &amp;amp;&amp;amp;
    echo &lt;span class=&quot;code-quote&quot;&gt;&apos;--allow-unsupported-modules&apos;&lt;/span&gt;) mlx4_en &amp;amp;&amp;amp;
    &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; [ -e /etc/infiniband/openib.conf ]; then
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; ( grep -q &lt;span class=&quot;code-quote&quot;&gt;&quot;^RUN_SYSCTL=yes&quot;&lt;/span&gt; /etc/infiniband/openib.conf &amp;gt; /dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; 2&amp;gt;&amp;amp;1); then
            /sbin/sysctl_perf_tuning load
        fi
    fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It&apos;s an interesting little bit of code.  One thing about it worth noting is the reference to /etc/infiniband/openib.conf.  Is that file used for anything other than this module installation configuration?  If not, might as well remove it from the kernel-ib package as well.&lt;/p&gt;</comment>
                            <comment id="54471" author="heckes" created="Wed, 20 Mar 2013 15:01:36 +0000"  >&lt;p&gt;No, the file (/etc/infiniband/openib.conf) is there, but not the entries that the grep commands search for,&lt;br/&gt;
since they are removed with the help of the 01-play-nice.....ed-script. But even if I add them, the install directives&lt;br/&gt;
prevent both mlx4_core and mlx4_en from being loaded.&lt;/p&gt;

&lt;p&gt;I&apos;m sorry I forgot to append the line:&lt;br/&gt;
g/mlx4_en.conf/d&lt;/p&gt;

&lt;p&gt;to 01-play-nice-with-rhel5-rhel6.ed. Push it to git.&lt;/p&gt;</comment>
                            <comment id="55682" author="pjones" created="Sun, 7 Apr 2013 17:33:31 +0000"  >&lt;p&gt;Landed for 2.4&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="17911">LU-2972</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvk6v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6997</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>