
use Imperative Recovery logs for client->MDT/OST connections

Details

    • New Feature
    • Resolution: Unresolved
    • Minor

    Description

      The Imperative Recovery (IR) feature, landed in LU-19, created a dynamic list of active server NIDs on the MGS to speed up client recovery when a target fails over to another server node. A server failure triggers a notification from the MGS to the client to update its target NIDs, so that it reconnects to the recovered server more quickly.

      It would be possible to extend this mechanism to also use the MGS IR log for the initial client mount. The MGS would then not need to store the OST/MDT NIDs statically in the config log; clients would instead get the current NIDs directly from the dynamic MGS log. This would facilitate running Lustre in configurations where the server NIDs are not static (e.g. cloud, DHCP, etc.). The initial connection to the MGS node(s) can already be done using the MGS hostname, since mount.lustre will do DNS name resolution.

      Some care would be needed when OSTs are being registered with the MGS, especially in testing environments where OSTs are reformatted regularly and often use the same fsname, since this may allow OSTs to register with the MGS that do not actually belong to the same filesystem.

          Activity

            [LU-10360] use Imperative Recovery logs for client->MDT/OST connections

            "Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50000
            Subject: LU-10360 ldlm: remove client_import_find_conn()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 12cbeaf1fb7bc83d7a842b71d6e8a33601e085ce


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/39911/
            Subject: LU-10360 tests: test dynamic NIDs feature
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2553f2fc8630061a8b6dbc5504d3f5277cb1cecf


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45905
            Subject: LU-10360 mgc: Use IR for client->MDS/OST connections
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: b1c09656513f3198adf849182617e6eafef76954


            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40736
            Subject: LU-10360 mgs: Mount to dynamically added networks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4c68340088f2f56d16f6b1392de5ad7f7d139ff4


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39613/
            Subject: LU-10360 mgc: Use IR for client->MDS/OST connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 37be05eca3f4aee15c946656a77f56967c15253d


            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39911
            Subject: LU-10360 tests: test dynamic NIDs feature
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d0bfbcb3bb643ce6dc33590bd937cb3c935ac88a


            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39709
            Subject: LU-10360 mgs: Dynamic network updates
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 09cc831918a3d661055ccfbf8f12ee8f13d91ac2


            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39613
            Subject: LU-10360 mgc: Use IR for client->MDS/OST connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 07b3c5e527ba6fe86d164d921acec8caafa5d757


            What happens today if a server restarts on an unregistered NID? Does IR still work, and clients add the new NID to the failover list? If not true, it seems that could be a useful first step.

            Currently, the client drops any NID it receives that is not in the list of configured NIDs in the import. This happens (AFAIK) in mgc_apply_recover_logs(), which checks whether any existing NID on the import matches one of the NIDs in the IR entry:

                            /* iterate all nids to find one */
                            /* find uuid by nid */
                            rc = -ENOENT;
                            for (i = 0; i < entry->mne_nid_count; i++) {
                                    rc = client_import_find_conn(obd->u.cli.cl_import,
                                                                 entry->u.nids[i],
                                                                 (struct obd_uuid *)uuid);
                                    if (rc == 0)
                                            break;
                            }
            

            It was done this way to prevent misconfigured/rogue OSTs from connecting to the MGS and advertising "lustre-OSTxxxx" as a target when they actually belong to a different filesystem that is also named lustre. This has happened in the test environment because of many concurrent lustre filesystems and re-use of IP addresses for different test runs, and it causes problems that are very hard to diagnose. At a minimum, there should be a tunable parameter that enables/disables the ability to connect an import to "unknown" NIDs. A more complete solution, restricting connections to a specific filesystem UUID stored on MDT0000, would be very desirable, but would be for a separate ticket.
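
            As a rough illustration of the tunable suggested above, the following userspace sketch models the check in simplified form. It is not the Lustre implementation: import_model, allow_unknown_nids, import_find_nid() and ir_entry_check() are hypothetical names, and nid64_t merely stands in for lnet_nid_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nid64_t;	/* stand-in for lnet_nid_t (assumption) */

/* Hypothetical, simplified import: only the NIDs it was configured with. */
struct import_model {
	const nid64_t *configured;
	size_t         count;
};

/* Hypothetical tunable; no Lustre parameter of this name is implied. */
static bool allow_unknown_nids = false;

/* Return 0 if nid is already configured on the import, -ENOENT otherwise. */
static int import_find_nid(const struct import_model *imp, nid64_t nid)
{
	for (size_t i = 0; i < imp->count; i++)
		if (imp->configured[i] == nid)
			return 0;
	return -ENOENT;
}

/*
 * Decide whether an IR entry may be applied to the import.
 * Returns 0 if at least one advertised NID is already known, or if the
 * tunable permits unknown NIDs; returns -ENOENT otherwise.
 */
static int ir_entry_check(const struct import_model *imp,
			  const nid64_t *nids, size_t nid_count)
{
	assert(imp != NULL && (nids != NULL || nid_count == 0));

	for (size_t i = 0; i < nid_count; i++)
		if (import_find_nid(imp, nids[i]) == 0)
			return 0;

	return allow_unknown_nids ? 0 : -ENOENT;
}
```

            With the tunable off, this keeps the current behaviour (unknown NIDs are rejected); turning it on accepts any advertised NID, which is why a filesystem UUID check would still be wanted.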

            In fact, if we can count on IR, then we can actually replace the entire NID list with whatever the IR NID is - a failover list of 1, which is the latest place the server started. If we can't count on IR (e.g. MGS is unavailable / unreachable), then a client would continue to use last known location, so maybe IR should include a list of failover NIDs provided by the newly restarting server.

            The MGS is required for initial mount and is desirable for normal operation, but it is not strictly required, since the client stores its own failover list. The mgs_nidtbl_entry allows space for multiple NIDs, but these are intended to be the current NID(s) of the target (i.e. if there are multiple interfaces for different LNets), not the failover NIDs. In a dynamic environment, it isn't necessarily even possible to know in advance what the failover NID is going to be, so it isn't clear whether it is worthwhile to add the ability to specify failover NIDs via the IR NID table.

            It would make sense for the client to still parse the MGS config log (if NIDs are present) for any failover NIDs, to handle the case of MGS failure. It could also store any previously sent dynamic target NIDs for that target in its import list, so that when the MGS is not working it can try them as it does today. However, it would be more desirable to have a real UUID for the filesystem beyond just "$fsname-OSTxxxx", to avoid errors during testing if that IP has been reassigned to another filesystem of the same name.
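
            One possible way to combine the two NID sources described above is sketched below: IR-reported NIDs go first, then config-log failover NIDs, with duplicates skipped. This is only an illustration under assumed names (nid64_t, nid_in(), build_candidates() are hypothetical), not how the client import code is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nid64_t;	/* stand-in for lnet_nid_t (assumption) */

/* Return 1 if nid is among the first n elements of list, else 0. */
static int nid_in(const nid64_t *list, size_t n, nid64_t nid)
{
	for (size_t i = 0; i < n; i++)
		if (list[i] == nid)
			return 1;
	return 0;
}

/*
 * Fill out[] with the IR-reported NIDs first (most recent location),
 * followed by the static config-log NIDs, skipping duplicates.
 * Returns the number of NIDs written, never more than cap.
 */
static size_t build_candidates(const nid64_t *ir, size_t n_ir,
			       const nid64_t *cfg, size_t n_cfg,
			       nid64_t *out, size_t cap)
{
	size_t n = 0;

	assert(out != NULL || cap == 0);

	for (size_t i = 0; i < n_ir && n < cap; i++)
		if (!nid_in(out, n, ir[i]))
			out[n++] = ir[i];

	for (size_t i = 0; i < n_cfg && n < cap; i++)
		if (!nid_in(out, n, cfg[i]))
			out[n++] = cfg[i];

	return n;
}
```

            Ordering the dynamic NIDs first reflects the idea that the IR entry is the latest known location, while the static NIDs remain as a fallback when the MGS cannot be reached.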

            adilger Andreas Dilger added a comment

            The IPv6 page discusses the use of IR for peer NID configuration. The mgs_nidtbl_entry already contains a list of all current NIDs of a target, for use by the client:

            struct mgs_nidtbl_entry {
                    __u64           mne_version;    /* table version of this entry */
                    __u32           mne_instance;   /* target instance # */
                    __u32           mne_index;      /* target index */
                    __u32           mne_length;     /* length of this entry - by bytes */
                    __u8            mne_type;       /* target type LDD_F_SV_TYPE_OST/MDT */
                    __u8            mne_nid_type;   /* type of nid(mbz). for ipv6. */
                    __u8            mne_nid_size;   /* size of each NID, by bytes */
                    __u8            mne_nid_count;  /* # of NIDs in buffer */
                    union {
                            lnet_nid_t nids[0];     /* variable size buffer for NIDs. */
                    } u;
            };
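
            To illustrate how the mne_length, mne_nid_size and mne_nid_count fields fit together, here is a minimal userspace sketch that walks a buffer of back-to-back entries and counts the NIDs. It assumes host byte order and a fixed header that mirrors the struct quoted above; nidtbl_hdr and nidtbl_count_nids() are hypothetical names, and real code would also need to handle byte swapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size header mirroring mgs_nidtbl_entry; the NIDs follow it. */
struct nidtbl_hdr {
	uint64_t mne_version;	/* table version of this entry */
	uint32_t mne_instance;	/* target instance # */
	uint32_t mne_index;	/* target index */
	uint32_t mne_length;	/* whole entry (header + NIDs), in bytes */
	uint8_t  mne_type;	/* target type */
	uint8_t  mne_nid_type;	/* type of NID */
	uint8_t  mne_nid_size;	/* size of each NID, in bytes */
	uint8_t  mne_nid_count;	/* # of NIDs in buffer */
};

/*
 * Count the NIDs in a buffer of consecutive entries.
 * Returns the total, or -1 if an entry is truncated or inconsistent.
 */
static long nidtbl_count_nids(const unsigned char *buf, size_t len)
{
	long total = 0;
	size_t off = 0;

	assert(buf != NULL || len == 0);

	while (off < len) {
		struct nidtbl_hdr hdr;
		size_t need;

		if (len - off < sizeof(hdr))
			return -1;
		/* memcpy avoids unaligned access into the raw buffer */
		memcpy(&hdr, buf + off, sizeof(hdr));

		need = sizeof(hdr) +
		       (size_t)hdr.mne_nid_size * hdr.mne_nid_count;
		if (hdr.mne_length < need || hdr.mne_length > len - off)
			return -1;

		total += hdr.mne_nid_count;
		off += hdr.mne_length;	/* always > 0, so the loop advances */
	}
	return total;
}
```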
            

            Since the MGS is already needed at initial client mount time to fetch the config logs, also requiring access to the MGS IR service at mount would not reduce functionality.

            Using MGS IR to announce server NIDs to clients would also remove the complexity of changing NIDs in the configuration logs, which currently requires a full filesystem shutdown (stop all clients and unmount servers) and rewriting the config logs.

            One improvement that would be needed is for the servers to re-announce their NIDs if they are changed while the OST is mounted (e.g. expired DHCP lease, as opposed to the OST starting up on a new OSS). That wouldn't be much different than handling a target failover to another server, but would be noticeable on the clients.
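
            A server-side check for the re-announce case could be as simple as comparing the NID set last reported to the MGS with the NIDs currently configured. The sketch below is an assumption about how such a check might look, not existing Lustre code; nid64_t, nid_in() and nids_changed() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nid64_t;	/* stand-in for lnet_nid_t (assumption) */

/* Return true if nid is among the first n elements of list. */
static bool nid_in(const nid64_t *list, size_t n, nid64_t nid)
{
	for (size_t i = 0; i < n; i++)
		if (list[i] == nid)
			return true;
	return false;
}

/*
 * Compare the previously announced NIDs with the current ones as sets
 * (order is ignored). Returns true if they differ, i.e. if the target
 * would need to re-announce itself to the MGS.
 */
static bool nids_changed(const nid64_t *prev, size_t n_prev,
			 const nid64_t *cur, size_t n_cur)
{
	assert((prev != NULL || n_prev == 0) && (cur != NULL || n_cur == 0));

	for (size_t i = 0; i < n_cur; i++)
		if (!nid_in(prev, n_prev, cur[i]))
			return true;	/* a new NID appeared */

	for (size_t i = 0; i < n_prev; i++)
		if (!nid_in(cur, n_cur, prev[i]))
			return true;	/* an old NID went away */

	return false;
}
```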

            adilger Andreas Dilger added a comment - edited

            What happens today if a server restarts on an unregistered NID? Does IR still work, and clients add the new NID to the failover list? If not true, it seems that could be a useful first step. In fact, if we can count on IR, then we can actually replace the entire NID list with whatever the IR NID is - a failover list of 1, which is the latest place the server started.

            If we can't count on IR (e.g. MGS is unavailable / unreachable), then a client would continue to use last known location, so maybe IR should include a list of failover NIDs provided by the newly restarting server. NIDs (and failovers) just become dynamic (last reported by server startup) rather than statically defined by the first registration. IIRC back in the day we decided not to add new NIDs to the config log (statically), but I think the dynamic path with IR makes much more sense.  

            nrutman Nathan Rutman added a comment

            People

              tappro Mikhail Pershin
              adilger Andreas Dilger