[LU-12132] Lustre causes ARP request for Lnet Routers to be sent on wrong PKey IPoIB interface. - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.11.0
Labels:
None
Environment:
Ubuntu 16.04 with HWE kernel (4.15.0-46), MOFED 4.4, Lustre 2.11.56-32

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

On some of our systems we see ARP requests being sent over the wrong PKey IPoIB interface when mounting a Lustre file system.

The ARP requests are for the LNet routers.

Setup (historical reasons...):

Two PKeys on the server side, both on the same physical IB fabric (two separate cards):

- ib0: PKey 8001 -> IP net 172.27.1.21/24 (one of the MDSes)

- ib1: PKey 8002 -> IP net 172.27.2.21/24 (same MDS)

MDS01:

options lnet routes="o2ib240 172.27.1.[201-214]@o2ib1; o2ib244 172.27.2.[201-214]@o2ib2; o2ib4 172.27.1.[101-108]@o2ib1; o2ib8 172.27.2.[101-108]@o2ib2; tcp1 172.27.1.[91-92]@o2ib1; tcp2 172.27.2.[91-92]@o2ib2"

options lnet networks="o2ib1(ib0.8001), o2ib2(ib1.8002)"

OSS01:

options lnet routes="o2ib240 172.27.1.[201-214]@o2ib1; o2ib244 172.27.2.[201-214]@o2ib2; o2ib4 172.27.1.[101-108]@o2ib1; o2ib8 172.27.2.[101-108]@o2ib2; tcp1 172.27.1.[91-92]@o2ib1; tcp2 172.27.2.[91-92]@o2ib2"

options lnet networks="o2ib1(ib1.8001), o2ib2(ib3.8002)"
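To make the server-side routing table easier to read, the routes option above can be expanded into explicit gateway NIDs. The following is only a minimal sketch: `expand_routes` is a hypothetical helper, and it assumes the simple `remote_net gateway.[low-high]@local_net` form used in this report rather than the full LNet routes syntax.

```python
import re

# Routes string copied from the MDS01/OSS01 configuration in this report.
ROUTES = (
    "o2ib240 172.27.1.[201-214]@o2ib1; o2ib244 172.27.2.[201-214]@o2ib2; "
    "o2ib4 172.27.1.[101-108]@o2ib1; o2ib8 172.27.2.[101-108]@o2ib2; "
    "tcp1 172.27.1.[91-92]@o2ib1; tcp2 172.27.2.[91-92]@o2ib2"
)


def expand_routes(routes):
    """Return {remote_net: [gateway NIDs]} for a simple routes string.

    Hypothetical helper: handles only one [low-high] range per gateway,
    which is all that appears in this report.
    """
    table = {}
    for entry in routes.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        remote_net, gateway = entry.split()
        match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", gateway)
        if match:
            prefix, low, high, suffix = match.groups()
            nids = [f"{prefix}{n}{suffix}" for n in range(int(low), int(high) + 1)]
        else:
            nids = [gateway]
        table.setdefault(remote_net, []).extend(nids)
    return table


if __name__ == "__main__":
    for net, gateways in expand_routes(ROUTES).items():
        print(net, len(gateways), gateways[0], "..", gateways[-1])
```

Each remote network is reached only through gateways on one local network (o2ib1 or o2ib2), which mirrors the two-PKey split described above.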

LNet routers, Lustre server side:

- ib1: PKey 8001 -> IP net 172.27.1.101/24

- ib1: PKey 8002 -> IP net 172.27.2.101/24

Same LNet routers, client side:

- ib0: PKey 8004 -> IP net 172.27.4.11/23

- ib0: PKey 8008 -> IP net 172.27.8.11/23

options lnet networks="o2ib1(ib1.8001), o2ib2(ib1.8002),o2ib4(ib0.8004), o2ib8(ib0.8008)"

Clients:

- ib0: PKey 8004 -> IP net 172.27.5.198/23

- ib0: PKey 8008 -> IP net 172.27.9.198/23

options lnet routes="o2ib1 172.27.4.[11-18]@o2ib4; o2ib2 172.27.8.[11-18]@o2ib8"
options lnet networks="o2ib4(ib0.8004), o2ib8(ib0.8008)"
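As a sanity check on the client addressing, the sketch below works out which client interface is on the same IP subnet as each LNet router gateway. It uses the example client addresses listed above; `CLIENT_INTERFACES` and `on_link_interfaces` are hypothetical names introduced for illustration, and the check reflects plain IP subnet membership, not what LNet or the kernel actually does when choosing an interface.

```python
import ipaddress

# Example client addresses taken from this report (both are /23).
CLIENT_INTERFACES = {
    "ib0.8004": ipaddress.ip_interface("172.27.5.198/23"),
    "ib0.8008": ipaddress.ip_interface("172.27.9.198/23"),
}


def on_link_interfaces(gateway_ip, interfaces=CLIENT_INTERFACES):
    """Return the interface names whose subnet contains gateway_ip."""
    addr = ipaddress.ip_address(gateway_ip)
    return [name for name, iface in interfaces.items() if addr in iface.network]


if __name__ == "__main__":
    for name, iface in CLIENT_INTERFACES.items():
        print(name, "is in", iface.network)
    for gateway in ("172.27.4.11", "172.27.8.11"):
        print(gateway, "is on-link for", on_link_interfaces(gateway))
```

By this check, the 172.27.4.x gateways are on-link only for ib0.8004 and the 172.27.8.x gateways only for ib0.8008, so an ARP request for 172.27.4.x has no matching subnet on the 8008 interface.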

We have two sets of client nodes that are (as far as we can see) identical in configuration.

When running "modprobe lustre" everything is OK: the ARP table is created with the correct mappings. (An ibdump of this shows everything working as it should; the file is available if needed.)

But when mounting the file systems there are some problems.

One set works perfectly; the other generates ARP requests for 172.27.4.x with source IP 172.27.8.x (or 172.27.9.x, depending on the host's IP).

The bad ARPs are generated over the 8008 PKey interface, never over the 8004 interface.

We have an ibdump of this problem available if it helps.

The end result (since we currently have arp_ignore=1 on the ib0, ib0.8004, and ib0.8008 interfaces on the LNet routers) is that we get the following ARP table on the problematic clients:

172.27.4.11 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE
172.27.4.11 dev ib0.8008 FAILED
172.27.8.11 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE
172.27.4.12 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:11 STALE
172.27.4.12 dev ib0.8008 FAILED
172.27.8.12 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:11 STALE
172.27.4.13 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:a1 STALE
172.27.4.13 dev ib0.8008 FAILED
172.27.8.13 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:a1 STALE
...
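The misplaced entries in a table like the one above can be picked out mechanically. The following is a rough sketch that assumes the listing has the layout printed by `ip neigh show` (address, `dev`, interface, optional `lladdr`, state) and that the interface subnets are the /23 networks implied by the client addresses in this report. `IFACE_NETWORKS` and `off_subnet_entries` are hypothetical names.

```python
import ipaddress

# First six entries of the neighbour table quoted in this report.
NEIGH_OUTPUT = """\
172.27.4.11 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE
172.27.4.11 dev ib0.8008 FAILED
172.27.8.11 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE
172.27.4.12 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:11 STALE
172.27.4.12 dev ib0.8008 FAILED
172.27.8.12 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:11 STALE
"""

# Assumption: subnets derived from the /23 client addresses in this report.
IFACE_NETWORKS = {
    "ib0.8004": ipaddress.ip_network("172.27.4.0/23"),
    "ib0.8008": ipaddress.ip_network("172.27.8.0/23"),
}


def off_subnet_entries(text, networks=IFACE_NETWORKS):
    """Return (ip, dev, state) for entries whose IP is outside the dev's subnet."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or "dev" not in fields:
            continue  # skip blank or unrecognised lines
        ip = fields[0]
        dev = fields[fields.index("dev") + 1]
        state = fields[-1]
        network = networks.get(dev)
        if network is not None and ipaddress.ip_address(ip) not in network:
            flagged.append((ip, dev, state))
    return flagged


if __name__ == "__main__":
    for ip, dev, state in off_subnet_entries(NEIGH_OUTPUT):
        print(f"{ip} resolved on {dev} ({state}): outside that interface's subnet")
```

On the sample above this flags only the 172.27.4.x entries on ib0.8008, which are exactly the FAILED ones.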

This results in severe problems with Lustre file system access.

The bad ARPs seem to be generated when trying to read a file, not when actually mounting the file system.

Attachments

Issue Links

is related to

LU-153 Clients cannot connect to servers with 2 IB cards until "lctl ping" is done from server to clients

Resolved
