[LU-15862] How to set prefer server NID Created: 17/May/22 Updated: 10/Jun/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
We have a critical Mellanox firmware issue causing HCA soft lockups. As a possible workaround, we are considering using tcp NIDs until we have a fix from Mellanox. We want to configure the servers with both tcp and o2ib NIDs; doing so will avoid future downtime from switching NIDs on the servers. Is there a way to select or prefer NIDs on the client per filesystem? Here is an example config:
client_a: [o2ib(ib0) tcp(ib0)]
    (mount fs1 using tcp0)
    (mount fs2 using o2ib)
    (mount fs3 using o2ib)
client_b: [o2ib414(ib0)]
    (mount fs1 using tcp0)
    (mount fs2 using o2ib)
    (mount fs3 using o2ib)
lrouter: o2ib41(ib1) :: o2ib(ib0) tcp(ib0)
fs1-srv1: o2ib(ib0) tcp(ib0)
fs2-srv1: o2ib(ib0) tcp(ib0)
fs3-srv1: o2ib(ib0) |
| Comments |
| Comment by Andreas Dilger [ 17/May/22 ] |
|
Serguei, can you please comment and/or sub-assign. |
| Comment by Colin Faber [ 17/May/22 ] |
|
Hi mhanafi, We're actively looking into this. Would you mind if I ask which critical Mellanox firmware issue you are dealing with? -cf
|
| Comment by Amir Shehata (Inactive) [ 10/Jun/22 ] |
|
Hi Mahmoud,
We implemented the UDSP feature. This allows adding rules to do what you're looking for. This feature was added in Lustre 2.15. Here is how it would work in your case:
lnetctl udsp add o2ib0 --priority 0
lnetctl udsp add tcp0 --priority 1
0 is the highest priority. This will always prefer o2ib0 unless o2ib0 becomes unreachable, in which case you'd start using tcp0. If you add both of these rules on all nodes, then o2ib0 will always be preferred. Through the health feature, LNet will detect if this network, or peers on this network, are not reachable and start using tcp0.
This feature is built on the Multi-Rail discovery features, so discovery should be on to allow the nodes to associate both interfaces with the same peer. |
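The fallback behaviour described in the last comment (lowest priority value wins, and an unreachable network is skipped) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Network` class, its `healthy` flag, and `select_network` are hypothetical names invented for illustration, and the logic is not LNet's actual selection algorithm or health mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Network:
    """Hypothetical stand-in for an LNet network carrying a UDSP priority."""
    name: str
    priority: int          # 0 is the highest priority, as stated in the comment
    healthy: bool = True   # assumed flag standing in for LNet health detection


def select_network(networks: List[Network]) -> Optional[Network]:
    """Return the healthy network with the lowest priority value, or None."""
    candidates = [n for n in networks if n.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.priority)


# Mirror the two rules from the comment: o2ib0 at priority 0, tcp0 at priority 1.
nets = [Network("o2ib0", priority=0), Network("tcp0", priority=1)]
preferred = select_network(nets)

# Simulate o2ib0 becoming unreachable; the model then falls back to tcp0.
nets[0].healthy = False
fallback = select_network(nets)
```

In this model `preferred` refers to o2ib0 and `fallback` to tcp0, which matches the behaviour the comment describes for a node that has both rules installed.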