  1. Lustre
  2. LU-19662

HA script to monitor Multi-Rail networks


Details

    • Improvement
    • Resolution: Unresolved
    • Medium

    Description

      The current HA script (healthLNET) monitors LNet health based on individual network devices. However, Multi-Rail uses multiple network devices for a single network, so the failure of one device does not necessarily mean that the entire LNet network is lost.

      We could therefore create a new script to monitor a Multi-Rail LNet network.
      Here are the key requirements I’ve identified for this script:

      • The script should take as input the LNet network to monitor (e.g. o2ib1). The network configuration and devices can be obtained via "lnetctl net show --net <net>".
      • The script should generate one HA attribute per network device to store the monitor score for that interface, so that the admin can monitor a specific network device and trigger actions for it.
      • The script should generate one HA attribute to store the monitor score of the LNet network. This score should be the average of the scores of all interfaces in that LNet network.
      • The script should monitor the LNet interfaces based on interface state and LNet health.
      • As in the old script, the score should be computed from the number of hosts that can be reached, but this time the script should use "lnetctl ping --source" to ping each peer from a specific LNet interface.
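
      The scoring requirements above could be sketched roughly as follows. This is only an illustration under several assumptions, not the proposed implementation: the function names (parse_local_nis, interface_score, network_scores) are hypothetical, the parser assumes the "nid:" and "status:" keys of the YAML-like "lnetctl net show" output, a zero exit status from "lnetctl ping" is assumed to mean the peer was reached, and the LNet health value is not taken into account. Publishing the scores as HA attributes is left out.

```python
import re
import subprocess


def parse_local_nis(net_show_output):
    """Return {nid: status} parsed from 'lnetctl net show --net <net>' text.

    Assumes each local NI appears as a '- nid: <nid>' line followed by a
    'status: <up|down>' line (assumption about the output layout).
    """
    nis = {}
    current = None
    for line in net_show_output.splitlines():
        m = re.match(r"\s*-\s*nid:\s*(\S+)", line)
        if m:
            current = m.group(1)
            nis[current] = "unknown"
            continue
        m = re.match(r"\s*status:\s*(\S+)", line)
        if m and current is not None:
            nis[current] = m.group(1)
    return nis


def lnetctl_ping(source_nid, peer_nid):
    """Ping a peer from one local interface; True if the command succeeds.

    Assumption: a zero exit status means the peer answered.
    """
    result = subprocess.run(
        ["lnetctl", "ping", "--source", source_nid, peer_nid],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def interface_score(source_nid, status, peers, ping=lnetctl_ping):
    """Score one interface: number of reachable peers, or 0 if it is not up."""
    if status != "up":
        return 0
    return sum(1 for peer in peers if ping(source_nid, peer))


def network_scores(nis, peers, ping=lnetctl_ping):
    """Return (per-interface scores, network score as their average)."""
    per_ni = {
        nid: interface_score(nid, status, peers, ping)
        for nid, status in nis.items()
    }
    net_score = sum(per_ni.values()) / len(per_ni) if per_ni else 0.0
    return per_ni, net_score
```

      Passing the ping function as a parameter keeps the scoring logic testable on a node without LNet configured; a real script would feed parse_local_nis with the output of "lnetctl net show --net <net>" and store each score in its own HA attribute.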

          People

            eaujames Etienne Aujames
            eaujames Etienne Aujames