Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.12.4
None
3
9223372036854775807
Description
If a non-MR peer (2.10.8) is discovered by a 2.12 MR peer, the following problem may occur: if the non-MR peer has LNets that are not defined on the MR peer, a NID on one of the undefined LNets may be listed as the primary NID. This later causes communication problems when mounting.
Here's an example of the buggy discovery:
lnetctl discover 192.168.1.123@o2ib4
discover:
    - primary nid: 192.168.1.123@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
        - nid: 192.168.1.123@o2ib
lnetctl peer show
peer:
    - primary nid: 192.168.1.123@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
          state: NA
        - nid: 192.168.1.123@o2ib
          state: NA
In the example above, the peer running the discovery has only one NID, and it is on o2ib4. Designating a primary NID on o2ib for the discovered peer is therefore a problem, because the discovering peer has no local NI on o2ib.
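The condition described above can be checked mechanically. The following is a minimal Python sketch, not LNet code: the helper names nid_net and primary_on_local_net are hypothetical, and the NIDs are copied from the example output in this ticket. It only tests whether a peer's primary NID is on a network for which the local node has an NI.

```python
from typing import Iterable


def nid_net(nid: str) -> str:
    """Return the LNet name of a NID such as '192.168.1.123@o2ib4'."""
    addr, sep, net = nid.rpartition("@")
    if not sep or not addr or not net:
        raise ValueError(f"malformed NID: {nid!r}")
    return net


def primary_on_local_net(primary_nid: str, local_nids: Iterable[str]) -> bool:
    """True if the primary NID is on a network the local node is configured on."""
    local_nets = {nid_net(n) for n in local_nids}
    return nid_net(primary_nid) in local_nets


# Local NIDs of the MR peer (the peer running discovery), from this ticket.
local_nids = ["0@lo", "192.168.1.105@o2ib4"]

# Primary NID reported by the buggy discovery: o2ib is not defined locally.
print(primary_on_local_net("192.168.1.123@o2ib", local_nids))   # False

# The NID on o2ib4 would be on a locally defined network.
print(primary_on_local_net("192.168.1.123@o2ib4", local_nids))  # True
```

A False result for the primary NID corresponds to the situation reported here.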
Here's the lnet config on the MR peer (the peer running discovery):
lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.105@o2ib4
          status: up
          interfaces:
              0: ib0
Here's the lnet config on the non-MR peer (the peer being discovered):
lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 192.168.1.123@o2ib
          status: up
          interfaces:
              0: ib0
    - net type: o2ib4
      local NI(s):
        - nid: 192.168.1.123@o2ib4
          status: up
          interfaces:
              0: ib0
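Comparing the two configurations shows which of the discovered peer's NIDs are on networks the discovering peer does not define. The Python sketch below is illustrative only and is not how LNet selects a primary NID. The function name split_peer_nids is hypothetical, the NID lists are typed in by hand from the two net show outputs in this ticket, and loopback NIDs are skipped on the assumption that they are not useful for reaching a remote peer.

```python
from typing import Iterable, List, Tuple


def nid_net(nid: str) -> str:
    """Return the LNet name of a NID such as '192.168.1.123@o2ib4'."""
    addr, sep, net = nid.rpartition("@")
    if not sep or not addr or not net:
        raise ValueError(f"malformed NID: {nid!r}")
    return net


def split_peer_nids(
    peer_nids: Iterable[str], local_nids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Partition peer NIDs into (on a shared net, on a locally undefined net)."""
    # Networks configured on the local node, ignoring loopback.
    local_nets = {nid_net(n) for n in local_nids} - {"lo"}
    shared: List[str] = []
    undefined: List[str] = []
    for nid in peer_nids:
        net = nid_net(nid)
        if net == "lo":
            continue  # assumption of this sketch: loopback is never a candidate
        (shared if net in local_nets else undefined).append(nid)
    return shared, undefined


# MR peer (running discovery) and non-MR peer (being discovered), from this ticket.
mr_local_nids = ["0@lo", "192.168.1.105@o2ib4"]
non_mr_nids = ["0@lo", "192.168.1.123@o2ib", "192.168.1.123@o2ib4"]

shared, undefined = split_peer_nids(non_mr_nids, mr_local_nids)
print("shared nets:   ", shared)     # ['192.168.1.123@o2ib4']
print("undefined nets:", undefined)  # ['192.168.1.123@o2ib']
```

With the configurations above, the NID that the buggy discovery reports as primary lands in the second list.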