[LU-10360] use Imperative Recovery logs for client->MDT/OST connections Created: 08/Dec/17 Updated: 08/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
The Imperative Recovery (IR) feature maintains a dynamic NID table on the MGS so that clients can quickly reconnect to restarted targets. It would be possible to extend this mechanism to also use the MGS IR log for the initial client mount, so that the MGS does not need to store the OST/MDT NIDs statically in the config log, but instead provides the current NIDs directly from the dynamic MGS log. This would facilitate running Lustre in configurations where the server NIDs are not static (e.g. cloud, DHCP, etc.). The initial connection to the MGS node(s) can already be done using the MGS hostname, since mount.lustre will do DNS name resolution. Some care would be needed when OSTs are being registered with the MGS, especially in testing environments where OSTs are reformatted regularly and often use the same fsname, since this may allow OSTs that do not actually belong to the same filesystem to register with the MGS. |
| Comments |
| Comment by Nathan Rutman [ 14/Nov/18 ] |
|
What happens today if a server restarts on an unregistered NID? Does IR still work, and clients add the new NID to the failover list? If not true, it seems that could be a useful first step. In fact, if we can count on IR, then we can actually replace the entire NID list with whatever the IR NID is - a failover list of 1, which is the latest place the server started. If we can't count on IR (e.g. MGS is unavailable / unreachable), then a client would continue to use last known location, so maybe IR should include a list of failover NIDs provided by the newly restarting server. NIDs (and failovers) just become dynamic (last reported by server startup) rather than statically defined by the first registration. IIRC back in the day we decided not to add new NIDs to the config log (statically), but I think the dynamic path with IR makes much more sense. |
| Comment by Andreas Dilger [ 28/Feb/20 ] |
|
The IPv6 page discusses the use of IR for peer NID configuration. The mgs_nidtbl_entry already contains a list of all NIDs for a target:
struct mgs_nidtbl_entry {
__u64 mne_version; /* table version of this entry */
__u32 mne_instance; /* target instance # */
__u32 mne_index; /* target index */
__u32 mne_length; /* length of this entry, in bytes */
__u8 mne_type; /* target type LDD_F_SV_TYPE_OST/MDT */
__u8 mne_nid_type; /* type of NID (must be zero; reserved for IPv6) */
__u8 mne_nid_size; /* size of each NID, in bytes */
__u8 mne_nid_count; /* # of NIDs in buffer */
union {
lnet_nid_t nids[0]; /* variable size buffer for NIDs. */
} u;
};
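The variable-size u.nids[] buffer at the end of the entry is walked using mne_nid_count. As a minimal standalone sketch (the helper name and the lnet_nid_t stand-in are illustrative, not actual Lustre code), matching a target NID against the entry's NID list looks like:

```c
#include <stdint.h>

/* Stand-in for LNet's lnet_nid_t (illustration only). */
typedef uint64_t lnet_nid_t;

/* Return the index of 'want' among the entry's mne_nid_count NIDs,
 * or -1 if it is not present.  This mirrors how a client would scan
 * the u.nids[] buffer of an mgs_nidtbl_entry. */
int nidtbl_find_nid(const lnet_nid_t *nids, int nid_count, lnet_nid_t want)
{
	for (int i = 0; i < nid_count; i++)
		if (nids[i] == want)
			return i;
	return -1;
}
```

In the real structure the NID buffer is mne_nid_size bytes per entry to allow for larger IPv6 NIDs; the fixed-width lnet_nid_t above is a simplification.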
Since the MGS is already needed at initial client mount time, not being able to access the MGS IR service at mount time would not be a reduction in functionality compared to needing the MGS to fetch the config logs. Using MGS IR to announce server NIDs to clients would also remove the complexity of changing NIDs in the configuration logs, which currently requires a full filesystem shutdown (stopping all clients and unmounting the servers) and rewriting the config logs. One improvement that would be needed is for the servers to re-announce their NIDs if they change while the OST is mounted (e.g. an expired DHCP lease, as opposed to the OST starting up on a new OSS). That would not be much different from handling a target failover to another server, but would be noticeable on the clients. |
| Comment by Andreas Dilger [ 20/Jul/20 ] |
The current case is that the client will drop any NID that it receives that is not in the list of configured NIDs in the import. This (AFAIK) is in mgc_apply_recover_logs() where it checks for any existing NID on the import matching the NIDs in the IR entry:
/* iterate over all NIDs in the entry to find one already
 * configured on the import */
rc = -ENOENT;
for (i = 0; i < entry->mne_nid_count; i++) {
rc = client_import_find_conn(obd->u.cli.cl_import,
entry->u.nids[i],
(struct obd_uuid *)uuid);
if (rc == 0)
break;
}
It was done this way to prevent misconfigured/rogue OSTs from connecting to the MGS and advertising "lustre-OSTxxxx" as a target when they actually belong to a different filesystem that is also named lustre. This has happened in the test environment, where many concurrent Lustre filesystems re-use IP addresses across test runs, and it causes problems that are very hard to diagnose. At a minimum, there should be a tunable parameter that enables/disables the ability to connect an import to "unknown" NIDs. A more complete solution that restricts connections to a specific filesystem UUID stored on MDT0000 would be very desirable, but would be for a separate ticket.
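The proposed tunable could be sketched as follows (the dynamic_nids name and the helper are illustrative; this is not the actual mgc code path):

```c
#include <errno.h>

/* Sketch: decide whether to accept a NID from an IR entry.  If the NID
 * already matches a connection on the import it is always accepted;
 * otherwise it is accepted only when the (hypothetical) dynamic_nids
 * tunable is enabled, instead of unconditionally returning -ENOENT. */
int ir_nid_allowed(int found_on_import, int dynamic_nids_enabled)
{
	if (found_on_import)
		return 0;		/* known NID: accept as today */
	if (dynamic_nids_enabled)
		return 0;		/* tunable permits unknown NIDs */
	return -ENOENT;			/* reject, preserving current behaviour */
}
```

With the tunable disabled this reduces to the existing behaviour, so it is safe as a default; enabling it trades the rogue-OST protection described above for dynamic-NID support.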
The MGS is required for the initial mount, and is desirable for normal operation, but not strictly required afterward since the client stores its own failover list. The mgs_nidtbl_entry allows space for multiple NIDs, but these are intended to be the current NID(s) of the target (i.e. if there are multiple interfaces on different LNets), not the failover NIDs. In a dynamic environment it isn't necessarily even possible to know the failover NID in advance, so it isn't clear whether it is worthwhile to add the ability to specify failover NIDs via the IR NID table. It would make sense for the client to still parse the MGS config log (if NIDs are present) for any failover NIDs, to handle the case of MGS failure. The client could also store any previously sent dynamic target NIDs for each target in its import list, so that when the MGS is not working it can try them as it does today. Even so, it would be more desirable to have a real UUID for the filesystem beyond just "$fsname-OSTxxxx", to avoid errors during testing if an IP address has been reassigned to another filesystem of the same name. |
| Comment by Gerrit Updater [ 11/Aug/20 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39613 |
| Comment by Gerrit Updater [ 22/Aug/20 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39709 |
| Comment by Gerrit Updater [ 15/Sep/20 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39911 |
| Comment by Gerrit Updater [ 19/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39613/ |
| Comment by Gerrit Updater [ 23/Nov/20 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40736 |
| Comment by Gerrit Updater [ 21/Dec/21 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45905 |
| Comment by Gerrit Updater [ 04/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/39911/ |
| Comment by Gerrit Updater [ 15/Feb/23 ] |
|
"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50000 |
| Comment by Gerrit Updater [ 08/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50000/ |
| Comment by Andreas Dilger [ 08/Jan/24 ] |
|
I think in conjunction with LU-10359 it should be possible to test a configuration that has no server NIDs in the config llog at all, or totally incorrect NIDs, to confirm that this is working properly. A conf-sanity test case should be added for this. |
| Comment by Andreas Dilger [ 08/Jan/24 ] |
|
It seems possible to transition systems to using dynamic NIDs by putting "lctl set_param mgc.*.dynamic_nids=1" as one of the first records in the config llog, so that clients will allow IR to determine where the MDTs and OSTs are located. After that is done, it would just be a matter of how long to continue creating config llog records with the NIDs in them for backward compatibility. The original patch landed as v2_13_55-106-g37be05eca3, so it is in all 2.14.0 and later releases. |