Details
- New Feature
- Resolution: Fixed
- Minor
Description
LNet Multi-Rail added the ability to use multiple interfaces on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages over a different interface when an interface or network failure is detected. This lets LNet mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics, such as MLX and OPA.
LNet Health monitors three types of failures:
- local interface failures as reported by the underlying fabric
- remote interface failures as reported by the remote fabric
- network timeouts.
Each of these failure classes is handled separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to retransmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can retransmit the message on the other available interface.
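The failover behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the LNet implementation: the names `Failure`, `Interface`, `send_with_failover`, `max_attempts`, and the interface labels `mlx0` and `opa0` are all hypothetical, and marking an interface unhealthy after a local failure is a simplification chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional


class Failure(Enum):
    """The three failure classes named in the description."""
    LOCAL_INTERFACE = auto()   # reported by the local fabric
    REMOTE_INTERFACE = auto()  # reported by the remote fabric
    NETWORK_TIMEOUT = auto()   # no response within the timeout


@dataclass
class Interface:
    name: str             # hypothetical label, e.g. "mlx0" or "opa0"
    healthy: bool = True  # simplified health state (assumption)


# A send function returns None on success or a Failure on error.
SendFn = Callable[[Interface, bytes], Optional[Failure]]


def send_with_failover(msg: bytes,
                       interfaces: List[Interface],
                       send: SendFn,
                       max_attempts: int = 3) -> Optional[Interface]:
    """Try healthy interfaces in order; return the one that succeeded.

    Returns None if every attempt failed, which is the point where the
    error would be passed to the upper layers.
    """
    attempts = 0
    for iface in interfaces:
        if attempts >= max_attempts:
            break
        if not iface.healthy:
            continue
        attempts += 1
        failure = send(iface, msg)
        if failure is None:
            return iface
        if failure is Failure.LOCAL_INTERFACE:
            # Assumption: a local failure takes this interface out of
            # rotation so later messages avoid it.
            iface.healthy = False
        # Remote failures and timeouts simply fall through to the next
        # available interface in this sketch.
    return None


# Usage: the MLX-like interface reports a local failure, so the message
# is retransmitted on the OPA-like interface.
def flaky_send(iface: Interface, msg: bytes) -> Optional[Failure]:
    return Failure.LOCAL_INTERFACE if iface.name == "mlx0" else None


ifaces = [Interface("mlx0"), Interface("opa0")]
used = send_with_failover(b"ping", ifaces, flaky_send)
```

In this model the caller only sees an error when `send_with_failover` returns `None`, which mirrors the idea of mitigating failures at the LNet layer before upper layers get involved.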
Issue Links
- is blocked by
  - LUDOC-396 Add documentation for the LNet Health feature (Resolved)
- is related to
  - LU-11271 LNet Health: o2iblnd, conditionally set health status (Resolved)
  - LU-11272 LNet Health: handle routing special case (Resolved)
  - LU-11422 Make LNet Selftest post Health backward compatible (Resolved)
  - LU-11273 LNet Health: update logging (Resolved)
  - LU-7734 LNet Multi-Rail Project (Resolved)
  - LU-10756 Send Uevents for interesting Lustre changes (Open)
  - LU-13510 Allow control over LND timeouts independent of lnet_transaction_timeout and retry_count (Resolved)