Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.2.0
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • None
    • 18,767
    • 10457

    Description

      Imperative recovery means the clients are notified explicitly when and where a failed target has
      restarted. Ideally the notification should occur after the target has mounted and is waiting for
      connections. A fully automated version of this will require a new mechanism for distributing node
      health information efficiently. A simple, initial version can be done as a /proc file for each
      import that can be used to initiate recovery.
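
      To make the "simple, initial version" concrete, a per-import /proc trigger could look roughly like the minimal kernel-module sketch below. Only the procfs calls are the standard Linux API (proc_ops, so a reasonably recent kernel is assumed); the import type, the recovery entry point import_initiate_recovery(), the static the_import pointer, and the file name are all placeholders for illustration, not the actual Lustre symbols.

        /* Minimal sketch (not Lustre code): one /proc file per import whose
         * write handler kicks off recovery for that import. */
        #include <linux/module.h>
        #include <linux/proc_fs.h>
        #include <linux/uaccess.h>

        struct obd_import;                      /* opaque here; the real type lives in Lustre */
        static struct obd_import *the_import;   /* would be set when the import is created */

        /* Placeholder for whatever entry point actually re-drives the connection. */
        static void import_initiate_recovery(struct obd_import *imp)
        {
                pr_info("imperative recovery: reconnect requested for import %p\n", imp);
        }

        static ssize_t ir_trigger_write(struct file *file, const char __user *ubuf,
                                        size_t count, loff_t *ppos)
        {
                char kbuf[4] = { 0 };

                if (count == 0 || count >= sizeof(kbuf))
                        return -EINVAL;
                if (copy_from_user(kbuf, ubuf, count))
                        return -EFAULT;
                if (kbuf[0] != '1')
                        return -EINVAL;

                /* Reconnect now instead of waiting for the next ping to notice
                 * that the target has restarted. */
                import_initiate_recovery(the_import);
                return count;
        }

        static const struct proc_ops ir_trigger_ops = {
                .proc_write = ir_trigger_write,
        };

        static struct proc_dir_entry *ir_entry;

        static int __init ir_trigger_init(void)
        {
                /* File name and location are illustrative only. */
                ir_entry = proc_create("lustre_force_recovery", 0200, NULL, &ir_trigger_ops);
                return ir_entry ? 0 : -ENOMEM;
        }

        static void __exit ir_trigger_exit(void)
        {
                proc_remove(ir_entry);
        }

        module_init(ir_trigger_init);
        module_exit(ir_trigger_exit);
        MODULE_LICENSE("GPL");

      With something like this, a monitoring script could write "1" to the file as soon as the target is known to be back, forcing the client to reconnect immediately rather than waiting for the next ping interval.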

    Activity

      [LU-19] imperative recovery

      simmonsja James A Simmons added a comment:

      I think we can close this ticket now.

      ian Ian Colle (Inactive) added a comment:

      James - yes - that's a great idea.

      simmonsja James A Simmons added a comment:

      Should ORNL post its future IR testing results here?

      jay Jinshan Xiong (Inactive) added a comment:

      Summary of scalability tests:

      1. We have 125 client nodes and mount 500 mountpoints on each node, to simulate the case of having ~60k client nodes (125 × 500 = 62,500 client mounts).
      2. The ping interval is set to 200 seconds.
      3. After an OST is restarted, it recovers in ~50 seconds (measured from the OST mount completing to df returning on the clients).

      adilger Andreas Dilger added a comment:

      Found during IR testing.

      jay Jinshan Xiong (Inactive) added a comment:

      I thought this over a bit during the holiday. Now that we can't rely on the MGS to notify the clients, we may have to build a "healthy server ring" to help spread the event of a target restarting, and then use the osc-ost and mdc-mdt connections to notify clients. Note that we can always rely on the MGS to detect failed targets, since even if the MGS is itself restarting, the failed target has to wait until the MGS revives. However, this means a target-to-target (or server-to-server) connection has to be introduced, and I'm not in favor of doing that.

      Let's start over. IMHO imperative recovery is a best-effort service: its purpose is to shorten the recovery window, and it's acceptable to fall back to `normal' recovery sometimes. Separating the MGS and MDS and using the MGS to notify clients addresses the problem in a very simple way, and it mostly works since the number of OSTs in a cluster is much greater than the number of MGSes and MDSes, which means the chance of an OST failing is greater. WRT implementation, whether we change the attributes of the config lock or introduce a new lock is not very important.

      What do you think, Robert?

      jay Jinshan Xiong (Inactive) added a comment:

      Yes, this scheme addresses the problem of OST failure. If the MGS, which is set up on the same server as the MDS, fails, imperative recovery doesn't help, since the MGS itself will spend a lot of time being reconnected to by all the clients.

      I'm thinking it over again and will come up with a better solution.

      rread Robert Read added a comment:

      The design needs to support notifications for both OSTs and MDTs from the beginning, so we most likely cannot rely on the config lock or modify the MGS protocol.

      jay Jinshan Xiong (Inactive) added a comment:

      In the first phase, only the OST restart problem will be addressed. I assume the MGS is always alive, so the config lock can be used to notify clients whenever OST targets have failed. This also suggests always separating the MGS and MDS, so that the MDS target case can be addressed as well.

      The basic idea would be as follows:
      1. Whenever a target starts, it sends a notification to the MGS with server_register_target(), so the MGS knows a target has come up. The MGS has enough information to distinguish whether this is a new target or a restarting one.
      2. A new GLIMPSE operation on the config lock will be worked out so that the MGS can use it to notify clients of this event.
      3. A unique identification of the target will be worked out and transferred to the MGS and then to the clients; this is used to prevent dual recovery from happening on the client side. A random OSS ID, unique to each running instance, may work.

      Wire protocol changes:
      1. Target registration with the MGS: an OSS ID will be included in the message.
      2. A GLIMPSE op on the config lock.
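
      To illustrate the dual-recovery guard in point 3, here is a rough sketch in plain C. The struct ir_notify payload, its fields, and ir_should_recover() are invented names for illustration, not the wire format or code that was eventually implemented; the only property relied on is that the instance value changes on every restart of the target.

        /* Sketch of a restart notification carrying a per-instance ID and the
         * client-side check that ignores repeated events for the same
         * incarnation of a target. Names and layout are illustrative only. */
        #include <stdint.h>
        #include <string.h>
        #include <stdbool.h>

        struct ir_notify {
                char     in_target[64];    /* e.g. "lustre-OST0001" */
                uint64_t in_nid;           /* NID the target restarted on */
                uint64_t in_instance;      /* random ID, new for every target start */
        };

        struct client_import {
                char     ci_target[64];
                uint64_t ci_last_instance; /* last instance we recovered against */
        };

        /* Returns true if the client should start recovery for this notification.
         * Seeing the same instance twice means we already reconnected to this
         * incarnation of the target (e.g. the event was resent), so skip it. */
        static bool ir_should_recover(struct client_import *imp,
                                      const struct ir_notify *n)
        {
                if (strcmp(imp->ci_target, n->in_target) != 0)
                        return false;      /* not our target */
                if (imp->ci_last_instance == n->in_instance)
                        return false;      /* duplicate event: avoid dual recovery */
                imp->ci_last_instance = n->in_instance;
                return true;
        }

      A stale or resent notification for an incarnation the client has already reconnected to is dropped, while a genuinely new restart (new instance value) always triggers recovery.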

      rread Robert Read added a comment:

      That's probably the original bug's description (which I wrote), but a fully automated version of this does not require the health network - that's just to help with scaling. The main thing needed will be a mechanism for a restarted target to request another target (such as the MGT) to notify all connected clients to reconnect to the restarted target using its current NID. The client will obviously need to support this notification and reconnect to the target immediately. We're not planning to build the /proc file version.
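
      A rough sketch of that flow, with invented helper names (an illustration of the mechanism described in the comment above, not the interface that was actually implemented): the restarted target, once it is mounted and serving, asks the MGS to relay a "reconnect to me at this NID" event, and the client acts on it immediately.

        /* Illustrative only: names, structures, and stubs are placeholders. */
        #include <stdint.h>
        #include <stdio.h>

        struct target_restart_msg {
                char     trm_target[64];  /* which target came back, e.g. "lustre-OST0001" */
                uint64_t trm_nid;         /* NID it is currently serving on */
        };

        /* Stub standing in for an RPC to the MGS, which would relay the event
         * to every client that has this target configured. */
        static int mgs_send_restart_event(const struct target_restart_msg *msg)
        {
                (void)msg;
                return 0;
        }

        /* Target side: called once the target is mounted and accepting requests. */
        static int target_announce_restart(const char *name, uint64_t nid)
        {
                struct target_restart_msg msg;

                snprintf(msg.trm_target, sizeof(msg.trm_target), "%s", name);
                msg.trm_nid = nid;
                return mgs_send_restart_event(&msg);
        }

        /* Client side: runs when the relayed notification arrives. */
        struct client_import {
                uint64_t ci_nid;
                int      ci_connected;
        };

        static void client_handle_restart(struct client_import *imp,
                                          const struct target_restart_msg *msg)
        {
                /* Point the import at the NID the target restarted on and
                 * reconnect right away instead of waiting for a ping timeout. */
                imp->ci_nid = msg->trm_nid;
                imp->ci_connected = 0;    /* a real client would issue the connect RPC here */
        }

      The point of routing the event through another server (the MGS here) is that the restarted target does not need to know which clients exist; the relay already has connections to all of them.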

    People

      Assignee: jay Jinshan Xiong (Inactive)
      Reporter: jay Jinshan Xiong (Inactive)
      Votes: 0
      Watchers: 7
