[LU-1211] Issues building and installing Lustre 1.8.7-wc1 with MYRINET support Created: 13/Mar/12 Updated: 29/Apr/12 Resolved: 29/Apr/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Carlos Thomaz | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | paj, server | ||
| Environment: |
Lustre 1.8.7-WC1 |
||
| Attachments: |
|
| Severity: | 1 |
| Epic: | server |
| Rank (Obsolete): | 6099 |
| Description |
|
We are having problems installing Lustre with Myrinet support at a customer site. This is the process we are following:
1) Files we are using:
2) Install the kernel and Lustre source, and reboot
3) Build the MX driver
4) Build Lustre: make rpms; cd /usr/src/redhat/RPMS/x86_64/ All Lustre packages get installed, but these warning messages pop up:
5) Bring up the MX driver:
6) Bring up kmxlnd. The same kind of messages are also logged in dmesg:
So, could anyone at WC help us figure out what is wrong here and how we can make this configuration work? Thank you. |
| Comments |
| Comment by Carlos Thomaz [ 13/Mar/12 ] |
|
Just a remark about the MX driver build process: 2) We tried different MX builds: 3) When building Lustre with MX support, we also tried pointing the build to /opt/mx, but that fails. |
| Comment by Peter Jones [ 13/Mar/12 ] |
|
Cliff will help with this one |
| Comment by Cliff White (Inactive) [ 13/Mar/12 ] |
|
Okay, it's been a while since we've had customers using Myrinet; I am researching which versions we support. |
| Comment by Carlos Thomaz [ 13/Mar/12 ] |
|
Cliff, please find attached the compressed script file with the full output from the entire process described here. |
| Comment by Johann Lombardi (Inactive) [ 13/Mar/12 ] |
|
Could you please check if the symbols which are reported as missing actually exist in /proc/kallsyms once the mx module is loaded? |
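The check suggested above can be sketched as follows. This is a minimal illustration, not something posted in the ticket: the function name and the sample symbol names are hypothetical placeholders (the actual missing symbols are in the warning output, which is not reproduced here). It assumes the usual /proc/kallsyms line layout of address, type, name, and an optional module in brackets.

```python
def missing_symbols(kallsyms_text, wanted):
    """Return the names in `wanted` that do not appear in kallsyms-style text."""
    present = set()
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Each line looks like: "<address> <type> <name> [<module>]"
        if len(fields) >= 3:
            present.add(fields[2])
    return sorted(set(wanted) - present)


# Hypothetical sample; on a real server, read the text from /proc/kallsyms
# after loading the mx module, e.g. open("/proc/kallsyms").read()
sample = (
    "ffffffff81000000 T start_kernel\n"
    "ffffffffa0001000 t hypothetical_mx_sym\t[mx]\n"
)
print(missing_symbols(sample, ["hypothetical_mx_sym", "absent_sym"]))
```

Any name still listed after the mx module is loaded would point to the module not exporting that symbol, rather than to a stale module dependency map.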
| Comment by Carlos Thomaz [ 13/Mar/12 ] |
|
Hi Johann. We are rebuilding the server at the moment. We found something: the DDN RHEL ISO bundles the Lustre kernel 2.6.18-238.12.1.el5_lustre.gce5e033. The following RPMs are installed by default:
When installing the new RPMs from Whamcloud (in order to rebuild Lustre with MX support), if we upgrade the old RPMs (rpm -Uhv) instead of deleting them (rpm -e), we manage to build Lustre successfully. So, these are the differences between the packages: OSS5 is a default installation; OSS7 is the installation with MX modules.
[root@oss07 ~]# rpm -qa |grep lustre | sort
Our guess is that if you try to install a server from scratch using the g9500ebc Lustre packages, it will not work. The MX driver we are using is 1.2.12. We'll continue working on this and will update this ticket as soon as we have news. Thanks. |
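A comparison like the one described above (default installation versus the rebuilt one) can be sketched as a set difference over the two `rpm -qa` listings. This is only an illustration: the function name is hypothetical and the package strings are placeholders, since the real listings are not reproduced in this ticket.

```python
def diff_packages(default_pkgs, rebuilt_pkgs):
    """Report packages present on only one of the two hosts."""
    default_set, rebuilt_set = set(default_pkgs), set(rebuilt_pkgs)
    return {
        "only_default": sorted(default_set - rebuilt_set),
        "only_rebuilt": sorted(rebuilt_set - default_set),
    }


# Placeholder package names, one per line as `rpm -qa | sort` would print them
oss05 = ["pkg-a-1.0", "pkg-b-1.0"]
oss07 = ["pkg-a-1.0", "pkg-b-1.1", "pkg-c-1.0"]
print(diff_packages(oss05, oss07))
```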
| Comment by Liang Zhen (Inactive) [ 14/Mar/12 ] |
|
Hi, did you run depmod after installing the modules? Liang |
| Comment by Carlos Thomaz [ 14/Mar/12 ] |
|
Hi Liang. Yes, we did, and it looks all right. We managed to get things working and finished the build process. As explained before, there seem to be file conflicts when upgrading some RPMs or when adding the Lustre source RPM. We have not yet figured out exactly what is causing the problem, since we are running late on this deployment. However, our plan is to continue investigating and understand why it happens.
Another question for now is about the MX compatibility mode. The servers (OSS and MDS) have MX cards and could run natively, but the clients are Gbit Ethernet. As far as I understand, we should run Lustre in tcp mode, since that is the only network the clients can communicate over. However, when building the MX driver with Ethernet support (--enable-ether-mode --enable-10g-mode), the network seems to stop responding. We know this is not a Lustre issue, but we are wondering if anyone has suggestions on how to build the driver. It is also interesting that even when building the driver without Ethernet mode or 10g mode support, we are still able to bring up the interface and assign an IP address.
This is the output from mx_info when built with Ethernet mode and 10g mode support:
[root@oss05 to-install]# /opt/mx/bin/mx_info MAC Address: 00:60:dd:45:1a:20 ROUTE COUNT MAC Address: 00:60:dd:45:1a:21 ROUTE COUNT
We can't see the other host indexes, MAC addresses, and host names, and there is also a clear message on the status line saying "Wrong message". This is what my modprobe.conf line looks like:
PS: We are still able to lctl ping. Thanks |
| Comment by Liang Zhen (Inactive) [ 15/Mar/12 ] |
|
Hi Carlos, if it said "Wrong network", does that mean the network adapter is connected to the wrong port on the switch? I think we see this kind of error only if MX-10G was configured with "Ethernet mode" but the network adapter is connected to a 10G Myrinet switch port instead of a 10G Ethernet switch port. This is a piece of information I just found: "MX-10G supports both MXoM (MX over Myrinet) and MXoE (MX over Ethernet). Refer to 'What is MX/Ethernet?' for further explanation.
MXoM assumes that the Myri-10G Network Adapters are being used in "Myrinet mode" and are connected to 10G Myrinet switch ports on a Myri-10G switch.
MXoE assumes that the Myri-10G Network Adapters are being used in "Ethernet mode" and are connected to either a 10GbE switch or to the 10GbE switch ports on a Myri-10G switch."
The output of the MX tool mx_info reports the status of the firmware on the NIC as well as the status of the link connectivity. If you see a message similar to:
Status: Running, P0: Wrong Network
Network: Ethernet 10G
then the Network: output indicates that MX-10G was configured with --enable-ether-mode for MXoE support. However, since Status reports Wrong Network, this indicates that the network adapters are erroneously connected to a 10G Myrinet switch port instead of a 10G Ethernet switch port.
To correct this error, you would then need to either reconfigure MX or connect the cables to a different switch.
|
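The diagnosis quoted above can be sketched as a small parser over mx_info output. This is a minimal sketch under stated assumptions: the function name is hypothetical, and it relies only on the two line shapes quoted in the comment ("Status: ..." and "Network: ..."); real mx_info output contains more fields, which are ignored here.

```python
def diagnose_mx_info(output):
    """Scan mx_info-style text for a 'Wrong Network' status and the network mode."""
    wrong_network = False
    network_mode = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Status:") and "wrong network" in line.lower():
            wrong_network = True
        elif line.startswith("Network:"):
            # Keep everything after the first colon, e.g. "Ethernet 10G"
            network_mode = line.split(":", 1)[1].strip()
    return wrong_network, network_mode


# The two lines quoted in the comment above
quoted = "Status: Running, P0: Wrong Network\nNetwork: Ethernet 10G\n"
wrong, mode = diagnose_mx_info(quoted)
if wrong and mode and mode.startswith("Ethernet"):
    print("MX built for Ethernet mode, but the link reports Wrong Network:")
    print("reconfigure MX or move the cables to an Ethernet switch port.")
```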
| Comment by Peter Jones [ 17/Apr/12 ] |
|
Hi there, is there any further action needed on this ticket? Please advise. Peter |
| Comment by Cliff White (Inactive) [ 29/Apr/12 ] |
|
I am going to close this issue; please reopen it if you have more information. |