[LU-15862] How to set prefer server NID Created: 17/May/22  Updated: 10/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: None

Type: Question/Request Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None


 Description   

We have a critical Mellanox firmware issue causing HCA soft lockups. As a possible workaround, we are considering using tcp NIDs until we have a fix from Mellanox.

We want to configure the servers with both tcp and o2ib NIDs; doing so will avoid future downtime from switching NIDs on the servers.

Is there a way to select/prefer NIDs on the client per filesystem?

Here is an example config:

client_a: [o2ib(ib0) tcp(ib0)] (mount fs1 using tcp0) (mount fs2 using o2ib) (mount fs3 using o2ib)

client_b: [o2ib414(ib0)] (mount fs1 using tcp0) (mount fs2 using o2ib) (mount fs3 using o2ib)

lrouter:  o2ib41(ib1) :: o2ib(ib0) tcp(ib0)

 

fs1-srv1:   o2ib(ib0) tcp(ib0)

fs2-srv1:  o2ib(ib0) tcp(ib0)

fs3-srv1: o2ib(ib0)



 Comments   
Comment by Andreas Dilger [ 17/May/22 ]

Serguei, can you please comment and/or sub-assign.

Comment by Colin Faber [ 17/May/22 ]

Hi mhanafi,

We're actively looking into this. Would you mind telling us which critical Mellanox firmware issue you are dealing with?

-cf

 

Comment by Amir Shehata (Inactive) [ 10/Jun/22 ]

Hi Mahmoud,

We implemented the UDSP (User Defined Selection Policy) feature, which was added in Lustre 2.15. It allows adding rules to do what you're looking for.

Here is how it would work in your case:

lnetctl udsp add o2ib0 --priority 0 
lnetctl udsp add tcp0 --priority 1

0 is the highest priority. This will always prefer o2ib0 unless o2ib0 becomes unreachable, in which case you'd start using tcp0.

If you add both of these rules on all nodes, then o2ib0 will always be preferred. Through the health feature, LNet will detect if this network is not reachable or peers on this network are not reachable and start using tcp0.

This feature is built on the Multi-Rail discovery feature, so discovery should be enabled to allow the nodes to associate both interfaces with the same peer.
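
The steps described in this comment could be sketched roughly as follows, assuming a node running Lustre 2.15 or later with `lnetctl` available. The `--src` option used to select the local network is an assumption based on the `lnetctl udsp add` syntax in the Lustre manual; it does not appear in the commands quoted above, so it should be checked against the installed version before use.

```shell
# Sketch only: run as root on each node that should apply the preference.

# Enable LNet peer discovery so that a peer's o2ib and tcp NIDs
# are associated with the same peer entry.
lnetctl set discovery 1

# Prefer o2ib0 (priority 0 is the highest) and fall back to tcp0.
# Assumption: --src selects the local network the rule applies to.
lnetctl udsp add --src o2ib0 --priority 0
lnetctl udsp add --src tcp0 --priority 1

# Review the configured rules and the NIDs discovered for each peer.
lnetctl udsp show
lnetctl peer show -v
```

Checking the `peer show` output is a quick way to confirm that discovery has grouped both NIDs of a server under one peer, which the priority rules depend on.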
