Details
-
Improvement
-
Resolution: Unresolved
-
Medium
-
None
-
None
-
None
-
3
-
9223372036854775807
Description
The current HA script (healthLNET) monitors LNet health based on individual network devices. However, MultiRail uses multiple network devices for a single network, so the failure of one device does not necessarily mean the entire LNet network is lost.
We can therefore create a new script to monitor a MultiRail LNet network.
Here are the key requirements I’ve identified for this script:
- The script should take as input the LNet network to monitor (e.g. o2ib1). The network configuration and devices can be obtained via "lnetctl net show --net <net>".
- The script should generate one attribute per network device to store the monitor score for that interface, so that the admin can monitor a specific network device and trigger actions for it.
- The script should generate one HA attribute to store the monitor score value of the LNet network. This score should represent the average score across all LNet network interfaces.
- The script should monitor the LNet interfaces based on interface state and LNet health.
- Like the old script, the score should be computed based on the number of hosts that can be reached. This time, however, it should use "lnetctl ping --source" to ping a peer from a specific LNet interface.
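
The scoring requirements above could be sketched roughly as follows. This is only an illustration in Python, not the proposed script: the function names, the `multiplier` parameter, the `is_up` flag, and the rule that a down interface scores 0 are all assumptions, as is treating a zero exit status from "lnetctl ping" as success. Only the "lnetctl ping --source" invocation and the averaging across interfaces come from the description.

```python
import subprocess
from statistics import mean


def ping_command(source_nid, peer_nid):
    """Build the argv that pings one peer from one local LNet interface."""
    return ["lnetctl", "ping", "--source", source_nid, peer_nid]


def can_reach(source_nid, peer_nid, runner=subprocess.run):
    """Return True if the ping exits with status 0 (assumed success criterion)."""
    result = runner(ping_command(source_nid, peer_nid),
                    capture_output=True, text=True)
    return result.returncode == 0


def interface_score(source_nid, peers, is_up=True, multiplier=1,
                    runner=subprocess.run):
    """Score one interface by the number of peers it can reach.

    A down interface scores 0 without pinging (assumption, not from the ticket).
    """
    if not is_up:
        return 0
    reachable = sum(1 for peer in peers
                    if can_reach(source_nid, peer, runner))
    return reachable * multiplier


def network_score(scores):
    """Average the per-interface scores; an empty mapping scores 0."""
    return mean(scores.values()) if scores else 0
```

Each per-interface value would be stored in its own HA attribute, and the result of `network_score` in the network-level attribute. The `runner` argument exists only so that the ping can be replaced by a stub when testing without a live LNet configuration.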