[LU-10360] use Imperative Recovery logs for client->MDT/OST connections Created: 08/Dec/17  Updated: 08/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Related
is related to LU-10391 LNET: Support IPv6 Reopened
is related to LU-19 imperative recovery Resolved
is related to LU-5881 Allow hostnames in NID Resolved
is related to LU-14090 lctl replace_nids and starting target... Resolved
is related to LU-10359 remove NIDs from config llogs Open
is related to LU-11077 Client-specific tunable parameter con... Open
is related to LU-16086 add generic LNet network number support Open
is related to LU-16722 MGS config log restructuring and redu... Open
is related to LU-13306 allow clients to accept mgs_nidtbl_en... Resolved
is related to LU-13340 add LCFG_ADD_UUIDv6 and related commands Resolved
is related to LU-14668 LNet: do discovery in the background Resolved
is related to LU-14608 Adding second network to filesystem Open

 Description   

The Imperative Recovery (IR) feature that landed in LU-19 created a dynamic list of active server NIDs on the MGS, for the purpose of speeding up client recovery when a target failed over to another server node. A server failure triggered a notification from the MGS to the client to update its target NIDs, so that it could reconnect to the recovered server more quickly.

It would be possible to extend this mechanism to also use the MGS IR log for the initial client mount, so that the MGS would not need to store the OST/MDT NIDs statically in the config log, and clients would instead get the current NIDs directly from the dynamic MGS log. This would facilitate Lustre running in configurations where the server NIDs are not static (e.g. cloud, DHCP, etc). The initial connection to the MGS node(s) can already be done using the MGS hostname, since mount.lustre will do DNS name resolution.

Some care would be needed when OSTs are being registered with the MGS, especially in testing environments where OSTs are reformatted regularly and often use the same fsname, since this may allow OSTs to register with the MGS that do not actually belong to the same filesystem.



 Comments   
Comment by Nathan Rutman [ 14/Nov/18 ]

What happens today if a server restarts on an unregistered NID? Does IR still work, and clients add the new NID to the failover list? If not true, it seems that could be a useful first step. In fact, if we can count on IR, then we can actually replace the entire NID list with whatever the IR NID is - a failover list of 1, which is the latest place the server started.

If we can't count on IR (e.g. MGS is unavailable / unreachable), then a client would continue to use last known location, so maybe IR should include a list of failover NIDs provided by the newly restarting server. NIDs (and failovers) just become dynamic (last reported by server startup) rather than statically defined by the first registration. IIRC back in the day we decided not to add new NIDs to the config log (statically), but I think the dynamic path with IR makes much more sense.  

Comment by Andreas Dilger [ 28/Feb/20 ]

The IPv6 page discusses the use of IR for peer NID configuration. The mgs_nidtbl_entry already contains a list of all current NIDs for a target:

struct mgs_nidtbl_entry {
        __u64           mne_version;    /* table version of this entry */
        __u32           mne_instance;   /* target instance # */
        __u32           mne_index;      /* target index */
        __u32           mne_length;     /* length of this entry - by bytes */
        __u8            mne_type;       /* target type LDD_F_SV_TYPE_OST/MDT */
        __u8            mne_nid_type;   /* type of nid(mbz). for ipv6. */
        __u8            mne_nid_size;   /* size of each NID, by bytes */
        __u8            mne_nid_count;  /* # of NIDs in buffer */
        union {
                lnet_nid_t nids[0];     /* variable size buffer for NIDs. */
        } u;
};
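
As a rough illustration of how a client might walk a buffer of such entries, here is a minimal sketch. It uses a local stand-in struct (sketch_nidtbl_entry) rather than the real Lustre headers, assumes host byte order and 8-byte NIDs, and assumes mne_length covers the header plus the NID buffer. The function name sketch_walk_nidtbl() is hypothetical and is not part of Lustre.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for lnet_nid_t: 64-bit NIDs only. */
typedef uint64_t sketch_nid_t;

/* Local copy of the entry layout shown above, for illustration only. */
struct sketch_nidtbl_entry {
        uint64_t mne_version;    /* table version of this entry */
        uint32_t mne_instance;   /* target instance # */
        uint32_t mne_index;      /* target index */
        uint32_t mne_length;     /* length of this entry, in bytes */
        uint8_t  mne_type;       /* target type (OST/MDT) */
        uint8_t  mne_nid_type;   /* type of NID */
        uint8_t  mne_nid_size;   /* size of each NID, in bytes */
        uint8_t  mne_nid_count;  /* # of NIDs in buffer */
        sketch_nid_t nids[];     /* variable size buffer for NIDs */
};

/*
 * Walk a packed buffer of entries.  Returns the number of entries seen
 * and stores the total NID count in *total_nids, or returns -1 if an
 * entry is truncated or its length fields are inconsistent.
 */
static int sketch_walk_nidtbl(const void *buf, size_t len,
                              unsigned int *total_nids)
{
        const unsigned char *p = buf;
        const size_t hdr_size = offsetof(struct sketch_nidtbl_entry, nids);
        size_t off = 0;
        int count = 0;

        assert(buf != NULL && total_nids != NULL);
        *total_nids = 0;

        while (off < len) {
                struct sketch_nidtbl_entry hdr;
                size_t need;

                if (len - off < hdr_size)
                        return -1;      /* truncated header */
                /* copy the fixed-size header to avoid unaligned access */
                memcpy(&hdr, p + off, hdr_size);

                need = hdr_size +
                       (size_t)hdr.mne_nid_count * hdr.mne_nid_size;
                if (hdr.mne_nid_size != sizeof(sketch_nid_t) ||
                    hdr.mne_length < need ||
                    hdr.mne_length > len - off)
                        return -1;      /* inconsistent entry */

                *total_nids += hdr.mne_nid_count;
                off += hdr.mne_length;  /* step to the next entry */
                count++;
        }
        return count;
}
```

A caller would pass the received buffer and its length, and treat a return value of -1 as a reason to ignore the whole update.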

Since the MGS is already needed at initial client mount time to fetch the config logs, also depending on the MGS IR service at mount would not be a reduction in functionality compared to today.

Using MGS IR to announce server NIDs to clients would also remove the complexity of changing NIDs in the configuration logs, which currently requires a full filesystem shutdown (stop all clients and unmount servers) and rewriting the config logs.

One improvement that would be needed is for the servers to re-announce their NIDs if they are changed while the OST is mounted (e.g. expired DHCP lease, as opposed to the OST starting up on a new OSS). That wouldn't be much different than handling a target failover to another server, but would be noticeable on the clients.

Comment by Andreas Dilger [ 20/Jul/20 ]

What happens today if a server restarts on an unregistered NID? Does IR still work, and clients add the new NID to the failover list? If not true, it seems that could be a useful first step.

The current case is that the client will drop any NID that it receives that is not in the list of configured NIDs in the import. This (AFAIK) is in mgc_apply_recover_logs() where it checks for any existing NID on the import matching the NIDs in the IR entry:

                /* iterate all nids to find one */
                /* find uuid by nid */
                rc = -ENOENT;
                for (i = 0; i < entry->mne_nid_count; i++) {
                        rc = client_import_find_conn(obd->u.cli.cl_import,
                                                     entry->u.nids[i],
                                                     (struct obd_uuid *)uuid);
                        if (rc == 0)
                                break;
                }

It was done this way to prevent misconfigured/rogue OSTs from connecting to the MGS and advertising "lustre-OSTxxxx" as a target when they actually belong to a different filesystem that is also named lustre. This has happened in the test environment because many lustre filesystems run concurrently and IP addresses are re-used for different test runs, and it causes problems that are very hard to diagnose. At a minimum, there should be a tunable parameter that enables/disables the ability to connect an import to "unknown" NIDs. A more complete solution that restricts connections to a specific filesystem UUID stored on MDT0000 would be very desirable, but would be for a separate ticket.
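
The kind of tunable described above could be sketched roughly as follows. This is only an illustration of the decision logic, not the code from any patch: NIDs are modelled as plain 64-bit integers, and sketch_accept_ir_entry() and its allow_unknown flag are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lnet_nid_t. */
typedef uint64_t sketch_nid_t;

/* Return true if 'nid' is already configured on the import. */
static bool sketch_import_has_nid(const sketch_nid_t *known, size_t nknown,
                                  sketch_nid_t nid)
{
        for (size_t i = 0; i < nknown; i++)
                if (known[i] == nid)
                        return true;
        return false;
}

/*
 * Decide whether an IR entry may be applied to an import.
 * - If any advertised NID is already known, accept (current behaviour
 *   as described above).
 * - Otherwise accept only if the tunable allows unknown NIDs and the
 *   entry advertises at least one NID.
 */
static bool sketch_accept_ir_entry(const sketch_nid_t *known, size_t nknown,
                                   const sketch_nid_t *ir, size_t nir,
                                   bool allow_unknown)
{
        assert(nir == 0 || ir != NULL);

        for (size_t i = 0; i < nir; i++)
                if (sketch_import_has_nid(known, nknown, ir[i]))
                        return true;

        return allow_unknown && nir > 0;
}
```

With allow_unknown left off, an entry from a target whose NIDs match nothing on the import is dropped, which keeps the protection against rogue OSTs. Turning it on trades that protection for the ability to follow servers onto new addresses.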

In fact, if we can count on IR, then we can actually replace the entire NID list with whatever the IR NID is - a failover list of 1, which is the latest place the server started. If we can't count on IR (e.g. MGS is unavailable / unreachable), then a client would continue to use last known location, so maybe IR should include a list of failover NIDs provided by the newly restarting server.

The MGS is required for initial mount, and is desirable for normal operation, but not strictly required since the client stores its own failover list. The mgs_nidtbl_entry allows space for multiple NIDs, but these are intended to be the current NID(s) of the target (i.e. if there are multiple interfaces for different LNets), not the failover NIDs. In a dynamic environment, it isn't necessarily even possible to know what the failover NID is going to be in advance, so it isn't clear whether it is worthwhile to add the ability to specify failover NIDs via the IR NID table.

It would make sense for the client to still parse the MGS config log (if NIDs are present) for any failover NIDs, to handle the case of MGS failure. It could also store any previously sent dynamic target NIDs for that target in its import list, so that if the MGS is not working it can try them as it does today. However, it would be more desirable to have a real UUID for the filesystem beyond just "$fsname-OSTxxxx", to avoid errors during testing if that IP has been reassigned to another filesystem of the same name.
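
One way to picture the combined connection list is the sketch below. It merges three sources into one de-duplicated list of candidates: the current IR NIDs, previously seen dynamic NIDs, and NIDs from the config llog. The ordering is an assumption made for illustration, and sketch_build_conn_list() is a hypothetical helper, not an existing Lustre function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lnet_nid_t. */
typedef uint64_t sketch_nid_t;

/* Append 'nid' to out[] unless it is already present or out[] is full. */
static size_t sketch_add_unique(sketch_nid_t *out, size_t n, size_t cap,
                                sketch_nid_t nid)
{
        for (size_t i = 0; i < n; i++)
                if (out[i] == nid)
                        return n;
        if (n < cap)
                out[n++] = nid;
        return n;
}

/*
 * Build the list of NIDs to try for one target, in assumed priority
 * order: current IR NIDs, then previously seen dynamic NIDs, then
 * NIDs from the config llog.  Returns the number of NIDs stored.
 */
static size_t sketch_build_conn_list(const sketch_nid_t *ir, size_t nir,
                                     const sketch_nid_t *cached, size_t ncached,
                                     const sketch_nid_t *llog, size_t nllog,
                                     sketch_nid_t *out, size_t cap)
{
        size_t n = 0;

        assert(out != NULL);
        for (size_t i = 0; i < nir; i++)
                n = sketch_add_unique(out, n, cap, ir[i]);
        for (size_t i = 0; i < ncached; i++)
                n = sketch_add_unique(out, n, cap, cached[i]);
        for (size_t i = 0; i < nllog; i++)
                n = sketch_add_unique(out, n, cap, llog[i]);
        return n;
}
```

When the MGS is unreachable the IR list is simply empty, and the client falls back to the cached and llog NIDs. This does nothing about the filesystem identity problem, which would still need a real filesystem UUID.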

Comment by Gerrit Updater [ 11/Aug/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39613
Subject: LU-10360 mgc: Use IR for client->MDS/OST connections
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07b3c5e527ba6fe86d164d921acec8caafa5d757

Comment by Gerrit Updater [ 22/Aug/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39709
Subject: LU-10360 mgs: Dynamic network updates
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 09cc831918a3d661055ccfbf8f12ee8f13d91ac2

Comment by Gerrit Updater [ 15/Sep/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39911
Subject: LU-10360 tests: test dynamic NIDs feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d0bfbcb3bb643ce6dc33590bd937cb3c935ac88a

Comment by Gerrit Updater [ 19/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39613/
Subject: LU-10360 mgc: Use IR for client->MDS/OST connections
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 37be05eca3f4aee15c946656a77f56967c15253d

Comment by Gerrit Updater [ 23/Nov/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40736
Subject: LU-10360 mgs: Mount to dynamically added networks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4c68340088f2f56d16f6b1392de5ad7f7d139ff4

Comment by Gerrit Updater [ 21/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45905
Subject: LU-10360 mgc: Use IR for client->MDS/OST connections
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b1c09656513f3198adf849182617e6eafef76954

Comment by Gerrit Updater [ 04/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/39911/
Subject: LU-10360 tests: test dynamic NIDs feature
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2553f2fc8630061a8b6dbc5504d3f5277cb1cecf

Comment by Gerrit Updater [ 15/Feb/23 ]

"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50000
Subject: LU-10360 ldlm: remove client_import_find_conn()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 12cbeaf1fb7bc83d7a842b71d6e8a33601e085ce

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50000/
Subject: LU-10360 ldlm: remove client_import_find_conn()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 14544bdca5cc42a3ea80fe665e332fe4c88b081a

Comment by Andreas Dilger [ 08/Jan/24 ]

I think in conjunction with LU-10359 it should be possible to test a configuration that has no server NIDs in the config llog at all, or has totally incorrect NIDs there, to confirm that this is working properly.

A conf-sanity test case should be added to cover this.

Comment by Andreas Dilger [ 08/Jan/24 ]

It seems possible to transition systems to using dynamic NIDs by putting "lctl set_param mgc.*.dynamic_nids=1" as one of the first records in the config llog, so that clients will allow IR to determine where the MDTs and OSTs are located. After that is done, it would just be a matter of deciding how long to keep creating config llog records with the NIDs in them for backward compatibility. The original patch landed as v2_13_55-106-g37be05eca3, so it is in all 2.14.0 and later releases.
