[LU-1211] Issues building and installing Lustre 1.8.7-wc1 with MYRINET support Created: 13/Mar/12  Updated: 29/Apr/12  Resolved: 29/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Carlos Thomaz Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: paj, server
Environment:

Lustre 1.8.7-WC1
OS server version: Red Hat Enterprise Linux Server release 5.5 (Tikanga) - DDN EXASCALER 1.5.0 OS Image
2 x MDS Intel E5645 @ 2.4Ghz (6 core / Westmere), 48GB RAM
8 x OSS Intel E5645 @ 2.4Ghz (6 core / Westmere), 48GB RAM
Myrinet Myri10G dual protocol NIC (rev 01)


Attachments: File mx-lustre-build.script.gz    
Severity: 1
Epic: server
Rank (Obsolete): 6099

 Description   

We are having problems installing Lustre with Myrinet support at a customer site.
The build process seems fine, and the MX drivers work standalone (we can load the drivers, bring up interfaces, set IP addresses and communicate with other servers). We also manage to build Lustre with no warning or error messages.
However, when installing the RPMs, a series of kmxlnd.ko warnings appears about unknown mx_* symbols.

This is the process we are following:

1) Files we are using:
kernel-headers-2.6.18-274.3.1.el5_lustre.g9500ebf.x86_64.rpm
kernel-2.6.18-274.3.1.el5_lustre.g9500ebf.x86_64.rpm
lustre-source-1.8.7-wc1_2.6.18_274.3.1.el5_lustre.g9500ebf.x86_64.rpm
kernel-debuginfo-common-2.6.18-274.3.1.el5_lustre.g9500ebf.x86_64.rpm
kernel-devel-2.6.18-274.3.1.el5_lustre.g9500ebf.x86_64.rpm
mx_1.2.12.tar.gz

2) Install the kernel and lustre source, and reboot
rpm -Uvh --nodeps kernel* lustre-source-*
reboot

3) build the MX driver
./configure --enable-kernel-lib --enable-10g --enable-ether-mode
make rpm
rpm -Uvh mx-1.2.12-1.x86_64.rpm

4) Build Lustre
./configure --enable-quota --with-server --disable-lru-resize --enable-ext4 --disable-health-write --with-mx=/root/mx/mx-1.2.12

make rpms

cd /usr/src/redhat/RPMS/x86_64/
rpm -Uvh lustre-1.8.7* lustre-ldiskfs* lustre-modules*

All Lustre packages get installed, but these warning messages pop up:
... <snip>
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_get_endpoint_addr
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_open_endpoint
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_finalize
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_iconnect
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_strerror
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_set_endpoint_addr_context
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_kirecv
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_get_endpoint_addr_context
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_wait_any
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_nic_id_to_board_number
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_close_endpoint
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx__init_api
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_register_unexp_handler
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_kisend
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_test_any
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_decompose_endpoint_addr
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_strstatus
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_cancel
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_set_request_timeout
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_wakeup
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_connect
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_decompose_endpoint_addr2
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_disconnect
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_get_endpoint_addr
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_open_endpoint
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_finalize
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_iconnect
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_strerror
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_set_endpoint_addr_context
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_kirecv
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_get_endpoint_addr_context
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_wait_any
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_nic_id_to_board_number
WARNING: /lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko needs unknown symbol mx_close_endpoint
... </snip>
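
Since the warnings repeat for the same symbols, it can help to reduce them to a unique list before comparing against what the MX driver provides. A minimal sketch, assuming the install output was captured to a file (the name rpm-install.log is hypothetical):

```shell
# List each unresolved symbol once, given depmod-style warnings on stdin.
# The symbol name is the last whitespace-separated field of each warning.
missing_symbols() {
    awk '/needs unknown symbol/ { print $NF }' | sort -u
}

# Hypothetical usage: rpm-install.log holds the captured "rpm -Uvh" output.
# missing_symbols < rpm-install.log
```

If every listed name starts with mx_, the problem is confined to the link between kmxlnd.ko and the MX kernel library rather than to Lustre itself.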

5) Bring up the MX driver:
/opt/mx/sbin/mx_start_stop start
Loading mx driver
Creating mx devices

6) Bringing up kmxlnd
modprobe kmxlnd
FATAL: Error inserting kmxlnd (/lib/modules/2.6.18-274.3.1.el5_lustre.g9500ebf/updates/kernel/net/lustre/kmxlnd.ko): Unknown symbol in module, or unknown parameter (see dmesg)

The same kind of messages are also logged in dmesg:
...<snip>
kmxlnd: Unknown symbol mx_get_endpoint_addr
kmxlnd: Unknown symbol mx_open_endpoint
kmxlnd: Unknown symbol mx_finalize
kmxlnd: Unknown symbol mx_iconnect
kmxlnd: Unknown symbol mx_strerror
kmxlnd: Unknown symbol mx_set_endpoint_addr_context
kmxlnd: Unknown symbol mx_kirecv
...</snip>

Could anyone at WC help us figure out what's wrong here and how we can make this configuration work?
A sanity checklist or install guide, maybe?

Thank you.



 Comments   
Comment by Carlos Thomaz [ 13/Mar/12 ]

Just a remark about MX driver building process:
1) we followed lustre manuals and created a link between common and include
cd <path>/mx
ln -s common include

2) we tried different MX builds:
2.1) ./configure --with-kernel-lib
2.2) ./configure --with-kernel-lib --enable-10g --enable-ether-mode

3) when building Lustre with MX support we also tried pointing --with-mx at /opt/mx, but that fails.
./configure --enable-quota --with-server --disable-lru-resize --enable-ext4 --disable-health-write --with-mx=/opt/mx
FAILED!
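
One way to see why a given --with-mx path is rejected is to check it for the headers that configure looks for. The sketch below is only an illustration: the three header names are an assumption recalled from the LNet autoconf macros and should be confirmed against the Lustre source tree in use. The function name check_mx_tree is hypothetical.

```shell
# Check that a directory passed to --with-mx contains the expected headers.
# ASSUMPTION: the header names below; confirm them in the Lustre source
# (LNet autoconf macros) before relying on this check.
check_mx_tree() {
    dir=$1
    rc=0
    for h in myriexpress.h mx_kernel_api.h mx_pin.h; do
        if [ ! -f "$dir/include/$h" ]; then
            echo "missing $dir/include/$h"
            rc=1
        fi
    done
    return $rc
}

# Hypothetical usage against the two paths tried in this ticket:
# check_mx_tree /root/mx/mx-1.2.12
# check_mx_tree /opt/mx
```

This also shows why the "ln -s common include" step matters: the check is made under <path>/include, not <path>/common.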

Comment by Peter Jones [ 13/Mar/12 ]

Cliff will help with this one

Comment by Cliff White (Inactive) [ 13/Mar/12 ]

Okay, it's been a while since we've had customers using Myrinet; I am researching which versions we support.
Can you redo the build, capture all the build output and attach it to this bug?

Comment by Carlos Thomaz [ 13/Mar/12 ]

Cliff, please find attached the compressed script file with the full output from the entire process described here.
Thank you
Carlos

Comment by Johann Lombardi (Inactive) [ 13/Mar/12 ]

Could you please check if the symbols which are reported as missing actually exist in /proc/kallsyms once the mx module is loaded?
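
A minimal sketch of that check, assuming the usual kallsyms layout of "address type name [module]" per line. The function name check_syms is hypothetical, and a symbol being present in the table does not by itself prove that it is exported to other modules.

```shell
# Report which of the given symbol names appear in a kallsyms-format file.
# Usage: check_syms <kallsyms-file> <symbol>...
check_syms() {
    table=$1
    shift
    for sym in "$@"; do
        # The symbol name is the third field; exit 0 from awk means found.
        if awk -v s="$sym" '$3 == s { found = 1 } END { exit !found }' "$table"; then
            echo "present $sym"
        else
            echo "missing $sym"
        fi
    done
}

# Hypothetical usage on a live server, after the MX driver is loaded:
# check_syms /proc/kallsyms mx_open_endpoint mx_finalize mx_kirecv
```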

Comment by Carlos Thomaz [ 13/Mar/12 ]

Hi Johann. We are rebuilding the server at the moment. We found something... The DDN RHEL ISO bundles the Lustre kernel 2.6.18-238.12.1.el5_lustre.gce5e033. The following RPMs are installed by default:
kernel-2.6.18-238.12.1.el5_lustre.gce5e033
ddn-lustre-tools-0.5-2012.01.18.130500
kernel-headers-2.6.18-238.12.1.el5_lustre.gce5e033
lustre-modules-1.8.6-wc1_2.6.18_238.12.1.el5_lustre.gce5e033
lustre-1.8.6-wc1_2.6.18_238.12.1.el5_lustre.gce5e033
kernel-devel-2.6.18-238.12.1.el5_lustre.gce5e033
lustre-ldiskfs-3.1.50-wc1_2.6.18_238.12.1.el5_lustre.gce5e033
ddn-ofed-1.5.3.1-2.6.18_238.12.1.el5_lustre.gce5e033

When installing the new RPMs from Whamcloud (in order to rebuild Lustre with MX support), if we upgrade the old RPMs (rpm -Uhv) instead of deleting them (rpm -e), we manage to build Lustre successfully.

So, these are the differences in the packages: OSS5 is a default installation, OSS7 is the installation with MX modules.

[root@oss07 ~]# rpm -qa |grep lustre | sort
ddn-lustre-tools-0.5-2012.01.18.130500
ddn-ofed-1.5.3.1-2.6.18_238.12.1.el5_lustre.gce5e033
kernel-2.6.18-274.3.1.el5_lustre.g9500ebf
kernel-devel-2.6.18-274.3.1.el5_lustre.g9500ebf
kernel-headers-2.6.18-274.3.1.el5_lustre.g9500ebf
lustre-1.8.7-wc1_2.6.18_274.3.1.el5_lustre.g9500ebf
lustre-ldiskfs-3.1.51-wc1_2.6.18_274.3.1.el5_lustre.g9500ebf
lustre-modules-1.8.7-wc1_2.6.18_274.3.1.el5_lustre.g9500ebf
lustre-source-1.8.7-wc1_2.6.18_274.3.1.el5_lustre.g9500ebf
[root@oss07 ~]# ssh oss05 rpm -qa |grep lustre| sort
ddn-lustre-tools-0.5-2012.01.18.130500
ddn-ofed-1.5.3.1-2.6.18_238.12.1.el5_lustre.gce5e033
kernel-2.6.18-238.12.1.el5_lustre.gce5e033
kernel-devel-2.6.18-238.12.1.el5_lustre.gce5e033
kernel-headers-2.6.18-238.12.1.el5_lustre.gce5e033
lustre-1.8.6-wc1_2.6.18_238.12.1.el5_lustre.gce5e033
lustre-ldiskfs-3.1.50-wc1_2.6.18_238.12.1.el5_lustre.gce5e033
lustre-modules-1.8.6-wc1_2.6.18_238.12.1.el5_lustre.gce5e033
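
The two listings above can be compared mechanically. A minimal sketch, assuming each listing was saved to a file (the file names and the pkg_diff helper are hypothetical):

```shell
# Print packages unique to each list. Column 1: only in the first file;
# column 2 (tab-indented): only in the second file. Common lines are hidden.
pkg_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Hypothetical usage (file names are placeholders):
# rpm -qa | grep lustre > oss07.txt
# ssh oss05 rpm -qa | grep lustre > oss05.txt
# pkg_diff oss05.txt oss07.txt
```

Packages that appear in neither column, such as ddn-ofed here, are identical on both nodes, which makes it easy to spot components still built against the older kernel.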

Our guess is that if you try to install a server from scratch using the Lustre g9500ebf packages, it will not work.

The MX driver we are using is 1.2.12

We'll continue working on this and will update this ticket as soon as we have news.

Thanks.

Comment by Liang Zhen (Inactive) [ 14/Mar/12 ]

hi, did you depmod after installing the modules?

Liang

Comment by Carlos Thomaz [ 14/Mar/12 ]

Hi Liang. Yes, we did, and it looks all right.

We managed to get things working and finished the build process. As explained before, there seem to be some file conflicts when upgrading some RPMs or when adding the Lustre source RPM. We haven't yet figured out exactly what is causing the problem, since we are running late on this deployment. However, our plan is to continue investigating and understand why it happens.

Another question for now is about the MX compatibility mode. The servers (OSS and MDS) have MX cards and may run natively, but the clients are Gbit Ethernet. As far as I understand, we should run Lustre in tcp mode, since that is the only network the clients can use. However, when building the MX driver with Ethernet support (--enable-ether-mode --enable-10g-mode), the network seems to stop responding. We know this is not a Lustre issue, but we are wondering if anyone has suggestions on how to build the driver.

It's also interesting that even when building the driver without Ethernet mode or 10g mode support, we are still able to bring up the interface and assign an IP address. This is the output from mx_info when built with Ethernet mode and 10g mode support:

[root@oss05 to-install]# /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@oss08:/root/mx_ether/mx-1.2.12 Wed Mar 14 11:39:25 CDT 2012
2 Myrinet boards installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM
Status: Running, P0: Wrong Network
Network: Ethernet 10G

MAC Address: 00:60:dd:45:1a:20
Product code: 10G-PCIE2-8B2L-2QP
Part number: 09-04247
Serial number: 427870
Mapper: 00:60:dd:45:1a:21, version = 0x00000000, configured
Mapped hosts: 1

ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:45:1a:20 oss05:0 1,0
1) 00:60:dd:45:1a:21 oss05:1 D 0,0
===================================================================
Instance #1: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM
Status: Running, P0: Wrong Network
Network: Ethernet 10G

MAC Address: 00:60:dd:45:1a:21
Product code: 10G-PCIE2-8B2L-2QP
Part number: 09-04247
Serial number: 427870
Mapper: 00:60:dd:45:1a:21, version = 0x00000000, configured
Mapped hosts: 1

ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:45:1a:20 oss05:0 D 0,0
1) 00:60:dd:45:1a:21 oss05:1 1,0
[root@oss05 to-install]#

We can't see the other hosts' indexes, MAC addresses and host names, and there is also a clear message on the status line saying "Wrong Network".
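
To check this across all servers without reading the full report each time, the Status and Network lines can be pulled out per board. A minimal sketch, assuming mx_info prints Instance, Status and Network lines in the order shown above (the helper name mx_link_status is hypothetical):

```shell
# Summarise mx_info output (stdin) as "instance|status|network", one line
# per board. Assumes Status precedes Network within each Instance section.
mx_link_status() {
    awk '
        /^[ \t]*Instance #/ { inst = $2; sub(/:$/, "", inst) }
        /^[ \t]*Status:/    { st = $0
                              sub(/^[ \t]*Status:[ \t]*/, "", st)
                              sub(/[ \t]+$/, "", st) }
        /^[ \t]*Network:/   { net = $0
                              sub(/^[ \t]*Network:[ \t]*/, "", net)
                              sub(/[ \t]+$/, "", net)
                              print inst "|" st "|" net }
    '
}

# Hypothetical usage on a server:
# /opt/mx/bin/mx_info | mx_link_status | grep 'Wrong Network'
```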

This is how my modprobe.conf line looks:
options lnet networks=mx0(myri0)

PS: We are still able to lctl ping.

Thanks
Carlos

Comment by Liang Zhen (Inactive) [ 15/Mar/12 ]

Hi Carlos, if it says "Wrong Network", does that mean the network adapter is connected to the wrong port on a switch? I think we see this kind of error only if MX-10G was configured in "Ethernet mode" but the network adapter is connected to a 10G Myrinet switch port instead of a 10G Ethernet switch port.

This is a piece of information I just found:

MX-10G supports both MXoM (MX over Myrinet) and MXoE (MX over Ethernet). Refer to What is MX/Ethernet? for further explanation.

MXoM assumes that the Myri-10G Network Adapters are being used in "Myrinet mode" and are connected to 10G Myrinet switch ports on a Myri-10G switch.

MXoE assumes that the Myri-10G Network Adapters are being used in "Ethernet mode" and are connected to either a 10GbE switch or to the 10GbE switch ports on a Myri-10G switch.

The output of the MX tool mx_info reports the status of the firmware on the NIC as well as the status of the link connectivity. If you see a message similar to:
 
        Status:         Running, P0: Wrong Network
        Network:        Ethernet 10G 

then the Network: output indicates that MX-10G was configured with --enable-ether-mode for MXoE support. However, since Status reports Wrong Network, this indicates that the network adapters are erroneously connected to a 10G Myrinet switch port instead of a 10G Ethernet switch port.
To correct this error, you would then need to either reconfigure MX or connect the cables to a different switch.

Comment by Peter Jones [ 17/Apr/12 ]

Hi there

Is there any further action needed on this ticket?

Please advise

Peter

Comment by Cliff White (Inactive) [ 29/Apr/12 ]

I am going to close this issue, please reopen if you have more information
