Details
- Bug
- Resolution: Won't Fix
- Minor
- None
- Lustre 2.0.0
- None
- RHEL 6.0 GA, ofed1.5.2, Lustre 2.0.0.1, Mellanox QDR Ib cards
- 3
- 8540
Description
Clients cannot connect to the server interfaces when the servers are configured with two IB cards (and two lnets). Our workaround is to run "lctl ping" from the servers to both lnets on every client; after that, the clients are able to connect to the servers.
Once the clients have mounted the file system, the problem appears when we run "df -h /lustre" on a client, which is expected because that command requires the client to contact the OSSs.
First we try to ping each server interface from a client:
client> lctl ping 10.50.0.7@o2ib0 => No response
client> lctl ping 10.50.1.7@o2ib1 => No response
client> dmesg
00000400:00000100:3.0F:1297255885.268873:0:2998:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-10.50.0.7@o2ib1: peer not alive
00000400:00020000:3.0:1297255885.279758:0:2998:0:(lib-move.c:2628:LNetGet()) error sending GET to 12345-10.50.0.7@o2ib1: -113
00000800:00000100:0.0F:1297255885.284181:0:2435:0:(o2iblnd_cb.c:462:kiblnd_rx_complete()) Rx from 10.50.0.7@o2ib1 failed: 5
Then, from a server, we ping the client's interface (the client has only one) on both lnets:
server> lctl ping 10.50.0.50@o2ib0 => OK
server> lctl ping 10.50.0.50@o2ib1 => OK
After that the problem is solved: "df -h /lustre" runs correctly, and every "lctl ping" from the client to the server's interfaces works.
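The workaround could be scripted on each server roughly as follows. This is a minimal sketch, not something we have deployed: the `CLIENTS` list, the `gen_ping_cmds` helper and the `RUN` switch are names made up for illustration, and `10.50.0.51` is a placeholder address. Only `lctl ping <nid>` itself comes from the steps above.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: from a server, "lctl ping" every client on both lnets.

# Hypothetical client list; 10.50.0.50 is the client used above,
# 10.50.0.51 is a placeholder. Replace with the real client addresses.
CLIENTS=(10.50.0.50 10.50.0.51)

# The two lnets configured on the servers.
NETS=(o2ib0 o2ib1)

# Print one "lctl ping" command per client address and lnet.
gen_ping_cmds() {
    local ip net
    for ip in "$@"; do
        for net in "${NETS[@]}"; do
            echo "lctl ping ${ip}@${net}"
        done
    done
}

# Dry run by default: only print the commands.
# Set RUN=1 (hypothetical switch) to execute them on a server.
if [[ ${RUN:-0} == 1 ]]; then
    gen_ping_cmds "${CLIENTS[@]}" | sh
else
    gen_ping_cmds "${CLIENTS[@]}"
fi
```

Run without arguments it only prints the commands, so the list can be checked before anything is sent on the fabric.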
An IPoIB ping works fine, we do not have DDR InfiniBand drivers running on our machines, and we have already tried a network configuration using ip2nets.
Here is our ip2nets configuration. All machines in the [7-10] range are servers with two IB cards and one lnet on each card; all other machines are clients with a single IB interface that carries both lnets:
[root@berlin5 ~]# cat /sys/module/lnet/parameters/ip2nets
o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.*
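To make the intended result of these rules explicit, here is a rough bash sketch of how they should resolve on a server and on a client. It is an approximation written for this report, not LNet's actual matcher: it assumes rules are evaluated in order and that the first matching rule wins for each network, it handles only literal octets, `*` and `[a-b]` ranges, and it assumes the catch-all client patterns are the wildcard form `10.50.*.*`. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Approximate ip2nets evaluation (assumption: first matching rule wins per net).

RULES='o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.*'

# Succeed if one octet VALUE matches PATTERN (literal, "*", or "[a-b]").
octet_matches() {
    local pat=$1 val=$2
    if [[ $pat == "*" ]]; then
        return 0
    fi
    if [[ $pat =~ ^\[([0-9]+)-([0-9]+)\]$ ]]; then
        (( val >= BASH_REMATCH[1] && val <= BASH_REMATCH[2] ))
        return
    fi
    [[ $pat == "$val" ]]
}

# Succeed if dotted-quad IP matches the dotted PATTERN octet by octet.
ip_matches() {
    local -a p v
    local i
    IFS=. read -r -a p <<< "$1"
    IFS=. read -r -a v <<< "$2"
    for i in 0 1 2 3; do
        octet_matches "${p[i]}" "${v[i]}" || return 1
    done
}

# resolve_nets RULES IP...  prints the net(interface) entries selected
# for a host that owns the given local IP addresses.
resolve_nets() {
    local rules=$1; shift
    local seen=" " rule net pat ip name
    local -a parts
    IFS=';' read -r -a parts <<< "$rules"
    for rule in "${parts[@]}"; do
        read -r net pat <<< "$rule"
        name=${net%%\(*}                              # "o2ib0(ib0)" -> "o2ib0"
        [[ $seen == *" ${name} "* ]] && continue      # net already assigned
        for ip in "$@"; do
            if ip_matches "$pat" "$ip"; then
                echo "$net"
                seen+="${name} "
                break
            fi
        done
    done
}

echo "server 10.50.0.7 / 10.50.1.7:"
resolve_nets "$RULES" 10.50.0.7 10.50.1.7
echo "client 10.50.0.50:"
resolve_nets "$RULES" 10.50.0.50
```

Under those assumptions the server ends up with o2ib0 on ib0 and o2ib1 on ib1, while the client gets both lnets on ib0, which matches the layout described above. Comparing this against `lctl list_nids` on each host would show whether the running configuration agrees.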
So it seems that clients cannot choose the correct server interface on their own, but once a server has pinged a client, that client is able to choose the right interface.
Do you think this could be an OFED bug? Maybe an lnet bug?
Issue Links
- is related to
- LU-12132 Lustre causes ARP request for Lnet Routers to be sent on wrong PKey IPoIB interface.
- Resolved