[LU-11840] Multi rail dynamic discovery prevent mounting filesystem when some NIC is unreachable - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

In recent Lustre releases, some specific filesystem could not be mounted due to a communication error between clients and servers, depending on the LNET configuration.

If we have a filesystem running on a host with 2 interfaces, let say tcp0 and tcp1 and the devices are setup to reply on both interfaces (formatted with --servicenode IP1@tcp0,IP2@tcp1).

If a client is connected only to tcp0 and try to mount this filesystem, it fails with an I/O error because it is trying to connect using tcp1 interface.

Mount failed:

# mount -t lustre x.y.z.a@tcp:/lustre /mnt/lustre
mount.lustre: mount x.y.z.a@tcp:/lustre at /mnt/client failed: Input/output error
Is the MGS running?

dmesg shows that communication fails using the wrong IP

[422880.743179] LNetError: 19787:0:(lib-move.c:1714:lnet_select_pathway()) no route to a.b.c.d@tcp1

# lnetctl peer show
peer:
 - primary nid: a.b.c.d@tcp1
 Multi-Rail: False
 peer ni:
 - nid: x.y.z.a@tcp
 state: NA
 - nid: 0@<0:0>
 state:

Ping is OK though:

# lctl ping x.y.z.a@tcp
12345-0@lo
12345-a.b.c.d@tcp1
12345-x.y.z.a@tcp

This was tested with 2.10.5 and 2.12 as server versions and 2.10, 2.11 and 2.12 as client.

Only 2.10 client is able to mount the filesystem properly with this configuration

I git-bisected the regression down to 0f1aaad LU-9480 lnet: implement Peer Discovery

Looking at debug log, the client:

setups the peer with the proper NI
the pings the peer
updates the local peer info with the wrong NI coming from the ping reply

data in the reply seems to announce the tcp1 IP as the primary nid.

The client will then use this NI to contact the server even if it has no direct connection to it (tcp1) and has a correct one for the same peer (tcp0).

Attachments

Issue Links

is related to

LU-13548 LNet: b2_12 discovery of non-MR peers may yield unreachable peer NIs

Open

Activity

People

Assignee:: Amir Shehata (Inactive)

Reporter:: Aurelien Degremont (Inactive)

Votes:: 1 Vote for this issue

Watchers:: 17 Start watching this issue

Dates

Created:: 08/Jan/19 1:50 PM

Updated:: 25/Nov/20 3:03 PM