[LU-15177]  cs_update live batch update hung waiting for MDT recovery to complete Created: 29/Oct/21  Updated: 22/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Problem: The MDT-MDT interop check prevents a (single MDT) rolling update from 2.12.0.5-based to 2.12.4.1-based Lustre. The patch levels (0 and 4) differ by more than the 3 the check allows, so the connection is refused and inter-op recovery cannot complete.

Two similar code snippets are found in ptlrpc_connect_interpret() and
target_handle_connect():

                        /*
                         * We do not support the MDT-MDT interoperations with
                         * different version MDT because of protocol changes.
                         */
                        if (unlikely(major != LUSTRE_MAJOR ||
                                     minor != LUSTRE_MINOR ||
                                     abs(patch - LUSTRE_PATCH) > 3)) {
                                LCONSOLE_WARN("%s (%u.%u.%u.%u) refused the connection from different version MDT (%d.%d.%d.%d) %s %s\n",
                                              target->obd_name, LUSTRE_MAJOR,
                                              LUSTRE_MINOR, LUSTRE_PATCH,
                                              LUSTRE_FIX, major, minor, patch,
                                              OBD_OCD_VERSION_FIX(data->ocd_version),
                                              libcfs_nid2str(req->rq_peer.nid),
                                              str);
                                GOTO(out, rc = -EPROTO);
                        }

It looks like the constant "3" was chosen for some specific protocol changes in the past, but is it still needed?
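
To make the rejection concrete, here is a minimal standalone sketch of the same comparison applied to the two versions from this report. The VER_* macros, MAX_PATCH_SKEW and mdt_version_compatible() are hypothetical names for illustration only; the one-byte-per-field packing is an assumption about how ocd_version is laid out, not a copy of the Lustre headers.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-ins for the packed-version helpers. Assumption:
 * the version is stored as major.minor.patch.fix, one byte per field.
 */
#define VER_PACK(major, minor, patch, fix)                          \
	(((uint32_t)(major) << 24) | ((uint32_t)(minor) << 16) |    \
	 ((uint32_t)(patch) << 8) | (uint32_t)(fix))
#define VER_MAJOR(v) ((int)(((v) >> 24) & 0xff))
#define VER_MINOR(v) ((int)(((v) >> 16) & 0xff))
#define VER_PATCH(v) ((int)(((v) >> 8) & 0xff))

/* The constant "3" from the snippet above, named for readability. */
#define MAX_PATCH_SKEW 3

/*
 * Return 1 if the peer MDT would pass the check quoted above, 0 if it
 * would be refused: major and minor must match exactly, and the patch
 * levels may differ by at most MAX_PATCH_SKEW. The fix field is ignored.
 */
static int mdt_version_compatible(uint32_t local, uint32_t peer)
{
	return VER_MAJOR(peer) == VER_MAJOR(local) &&
	       VER_MINOR(peer) == VER_MINOR(local) &&
	       abs(VER_PATCH(peer) - VER_PATCH(local)) <= MAX_PATCH_SKEW;
}
```

Under these assumptions, mdt_version_compatible(VER_PACK(2, 12, 0, 5), VER_PACK(2, 12, 4, 1)) evaluates to 0 because the patch levels differ by 4, whereas a 2.12.3.x peer would still be accepted by a 2.12.0.5 MDT.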



 Comments   
Comment by Gerrit Updater [ 29/Oct/21 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45408
Subject: LU-15177 ldlm: do not check patch version
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cae10c33a8672d72a4b2c07cc8af14e51b7a362c

Comment by Andreas Dilger [ 30/Oct/21 ]

The version delta was chosen somewhat arbitrarily. There really should not be a significant version skew between MDS versions, as this is never tested. Even with rolling upgrades, it is typical to fail over half of the MDTs to their backup, upgrade half of the RPMs, then fail over all MDTs to their peer and upgrade the other half of the peers. Since there is also currently no good support for DNE to work with offline MDTs (i.e. the system will wait for all MDTs to recover before it is usable), doing something like upgrading only 1/4 of the MDS nodes doesn't reduce downtime at all, but rather lengthens it.

I'm curious why you would want to do this?

Generated at Sat Feb 10 03:16:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.