Lustre / LU-4473

Disable LNET routes without disrupting ongoing filesystem operations

Details

    • Type: New Feature
    • Resolution: Won't Fix
    • Priority: Minor
    • None
    • None
    • 12253

    Description

      It is desirable to be able to take an LNET router out of service gracefully, without disrupting ongoing filesystem operations. Since not all RPCs are re-sent, we need a way to prevent routes from being used for new traffic while existing buffered messages continue to drain. I have a patch implementing one approach to achieving this behavior.

      The patch creates a pair of lctl commands, down_interfaces and up_interfaces. The down_interfaces command, when executed on an LNET router, sets the ni->ni_status->ns_status of each lnet_ni_t in the global LND instance list (except for LOLND) to a new status introduced by this patch, LNET_NI_STATUS_ADMINDOWN. An admin would use this command to remove an LNET router node from service in the following way:

      • Admin executes 'lctl down_interfaces' on the router node being removed.
      • After a short waiting period (on the order of router_ping_timeout + max(dead_router_check_interval, live_router_check_interval)), all clients and servers should have pinged this router and received a response.
      • The response payload should show that all of this router's NIs are down (lnet_parse_rc_info() is modified so LNET_NI_STATUS_ADMINDOWN is treated the same as LNET_NI_STATUS_DOWN).
      • Now, when a client or server attempts to send a new message to a remote network and this router's routes are considered for the next hop, the routes are discarded, since the clients and servers know that the router's NIs for the remote networks are down (see lnet_send()->lnet_find_route_locked()).
      • At this point the router should not be receiving any new incoming traffic other than router_checker pings.
      • The administrator can watch for any queued messages on the router node to drain via the appropriate /proc interface.
      • Once the router no longer has any messages to send, LNET can be stopped and unloaded.
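
      The peer-side half of the steps above can be illustrated with a small Python model. It is only a sketch of the behavior described in this ticket, not the LNET code: the NiStatus enum, the Route class, and the helper names are hypothetical, the timing values in the usage lines are example numbers rather than recommended settings, and each route is simplified to carry a single status for the router NI on the remote network.

```python
from dataclasses import dataclass
from enum import Enum


class NiStatus(Enum):
    """Hypothetical stand-ins for the LNET_NI_STATUS_* values."""
    UP = "up"
    DOWN = "down"
    ADMINDOWN = "admindown"  # the new status proposed in this ticket


def settle_time(router_ping_timeout, dead_router_check_interval,
                live_router_check_interval):
    """Approximate wait (seconds) before every peer has re-pinged the router."""
    return router_ping_timeout + max(dead_router_check_interval,
                                     live_router_check_interval)


def ni_is_down(status):
    """Peers treat ADMINDOWN exactly like DOWN when reading a ping reply."""
    return status in (NiStatus.DOWN, NiStatus.ADMINDOWN)


@dataclass
class Route:
    gateway: str                 # hypothetical label for the router
    remote_net: str              # network reached through this router
    gateway_ni_status: NiStatus  # router NI status seen in the last ping reply


def usable_routes(routes, remote_net):
    """Keep only routes to remote_net whose router NI is not down."""
    return [r for r in routes
            if r.remote_net == remote_net
            and not ni_is_down(r.gateway_ni_status)]


# Usage with example values: rtr1 has been administratively downed.
routes = [
    Route("rtr1", "o2ib1", NiStatus.ADMINDOWN),
    Route("rtr2", "o2ib1", NiStatus.UP),
]
print(settle_time(50, 60, 60))                               # 110
print([r.gateway for r in usable_routes(routes, "o2ib1")])   # ['rtr2']
```

      In this model, once the settle time has passed and peers have seen the new status, only rtr2 remains a candidate next hop, so rtr1 receives no new traffic and can drain.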

      The up_interfaces command simply sets the ni->ni_status->ns_status of each lnet_ni_t in the global LND instance list (except for LOLND) to LNET_NI_STATUS_UP.
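
      The router-side effect of the two commands can be sketched in the same spirit. This is a minimal Python model assuming the behavior described above; the Ni class, the status strings, and the function bodies are hypothetical and are not taken from the patch.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the LNET_NI_STATUS_* values.
UP = "up"
DOWN = "down"
ADMINDOWN = "admindown"  # the new status proposed in this ticket


@dataclass
class Ni:
    """Simplified model of one network interface on the router."""
    name: str
    status: str = UP
    is_loopback: bool = False  # models the LOLND exception


def set_all_statuses(nis, status):
    """Set every non-loopback NI to the given status."""
    for ni in nis:
        if ni.is_loopback:
            continue  # the loopback NI is never touched
        ni.status = status


def down_interfaces(nis):
    """Model of the proposed 'lctl down_interfaces'."""
    set_all_statuses(nis, ADMINDOWN)


def up_interfaces(nis):
    """Model of the proposed 'lctl up_interfaces'."""
    set_all_statuses(nis, UP)


# Usage: one loopback NI and two real NIs (names are illustrative).
nis = [Ni("lo", is_loopback=True), Ni("o2ib0"), Ni("tcp0")]
down_interfaces(nis)
print([ni.status for ni in nis])  # ['up', 'admindown', 'admindown']
up_interfaces(nis)
print([ni.status for ni in nis])  # ['up', 'up', 'up']
```

      One consequence visible in the model: because up_interfaces sets the status unconditionally, an NI that was DOWN for some other reason would also be reported as UP afterwards.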

          People

            Assignee: wc-triage WC Triage
            Reporter: hornc Chris Horn