Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.8
Labels: None
Severity: 3
Description
NodeA <-- o2ib0 --> GWn <-- o2ib1 --> NodeB
In the diagram above, each LNet router (GWn) has one NID on o2ib0 and another on o2ib1.
With the current b2_12 code, if a router's o2ib0 NID goes down (e.g. the link is disconnected), NodeB finds out about it only the next time it pings the router. In the meantime, NodeB can still select that router for sending.
The router does send a "push" to NodeB when its NI on o2ib0 goes down. So instead of waiting for the next ping, NodeB should ping the router for its updated state as soon as the push is received.
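
The difference between the current and the proposed behaviour can be illustrated with a small toy model. This is only a sketch, not LNet code: the `Router` and `Node` classes, the `ping_on_push` flag, and the rule that a route is usable only when the router's NI on the remote network is up are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical gateway (GWn) with one NI per network."""
    ni_up: dict = field(default_factory=lambda: {"o2ib0": True, "o2ib1": True})

    def ping_reply(self):
        # Assumption: a ping reply reports the current status of every NI.
        return dict(self.ni_up)


@dataclass
class Node:
    """Hypothetical NodeB holding a cached view of the router's NIs."""
    router: Router
    ping_on_push: bool  # True models the proposed behaviour
    cached: dict = field(default_factory=dict)

    def __post_init__(self):
        self.ping_router()  # initial state learned via ping

    def ping_router(self):
        self.cached = self.router.ping_reply()

    def on_push(self):
        # Proposed: treat the push as a hint and refresh immediately.
        # Current (modelled): do nothing until the next regular ping.
        if self.ping_on_push:
            self.ping_router()

    def route_usable(self, remote_net):
        # Assumption: the route is usable only if the router's NI
        # on the remote network is believed to be up.
        return self.cached.get(remote_net, False)


def simulate(ping_on_push):
    gw = Router()
    node_b = Node(gw, ping_on_push)
    gw.ni_up["o2ib0"] = False      # link on o2ib0 is disconnected
    node_b.on_push()               # router pushes to NodeB
    before_ping = node_b.route_usable("o2ib0")
    node_b.ping_router()           # the next regular ping
    after_ping = node_b.route_usable("o2ib0")
    return before_ping, after_ping


if __name__ == "__main__":
    print("current: ", simulate(ping_on_push=False))
    print("proposed:", simulate(ping_on_push=True))
```

In this model, with `ping_on_push=False` NodeB keeps treating the route to o2ib0 as usable until the next regular ping, while with `ping_on_push=True` the stale window closes as soon as the push arrives.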