Hi Bobi,
I found that the following code in lustre_start_mgc(), introduced by patch https://review.whamcloud.com/7509 for LU-3829, might have an issue:
    /*
     * LU-3829.
     * Here we only take the first mgsnid as its primary
     * serving mgs node, the rest mgsnid will be taken as
     * failover mgs node, otherwise they would be taken
     * as multiple nids of a single mgs node.
     */
    while (class_parse_nid(ptr, &nid, &ptr) == 0) {
        rc = do_lcfg(mgcname, nid, LCFG_ADD_UUID,
                     niduuid, NULL, NULL, NULL);
        if (rc == 0) {
            i = 1;
            break;
        }
    }
For multiple NIDs separated by commas on one MGS node, the code above adds only the first NID and then breaks out of the while loop. After lustre_start_simple() is performed, the remaining comma-separated MGS NID(s) are not added, because the following loop condition is never true:
    i = 1;
    while (ptr && ((*ptr == ':' ||
           class_find_param(ptr, PARAM_MGSNODE, &ptr) == 0))) {
ptr was previously assigned with ptr = lsi->lsi_lmd->lmd_mgs, and for comma-separated MGS NIDs it now points to a ','. The string also contains no mgsnode=, so the condition above is false and the loop body never runs. None of the NIDs after the first comma are added.
This seems to be the root cause of the latest issue reported by Darby in LU-8311. There, ptr has the value 192.52.98.30@tcp,10.148.0.30@o2ib:192.52.98.31@tcp,10.148.0.31@o2ib, but in the debug log debug.log.ib_and_eth I can see only the first NID, 192.52.98.30@tcp, being added.
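To make the control flow easier to follow, here is a small user-space sketch of the two loops. It is only a model, not the kernel code: model_parse_nid() and model_count_added_nids() are hypothetical stand-ins I made up for illustration, the body of the second loop is my assumption of what it does, and a plain strstr() for "mgsnode=" stands in for class_find_param(). The only behaviour taken from the analysis above is that the parser leaves ptr on the delimiter following the NID, and that the second loop is entered only at a ':' or when mgsnode= is present.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for class_parse_nid(): skips leading ',' and ':'
 * separators, copies one NID into nid, and leaves *endp on the delimiter
 * (or terminating NUL) that follows it.
 * Returns 0 on success, 1 if no NID is left.
 */
static int model_parse_nid(const char *buf, char *nid, size_t len,
                           const char **endp)
{
    size_t span, n;

    while (*buf == ',' || *buf == ':')
        buf++;
    if (*buf == '\0')
        return 1;

    span = strcspn(buf, ",:");
    n = span < len ? span : len - 1;
    memcpy(nid, buf, n);
    nid[n] = '\0';
    *endp = buf + span;
    return 0;
}

/* Returns how many NIDs the modelled control flow would register. */
int model_count_added_nids(const char *mgs)
{
    const char *ptr = mgs;
    char nid[64];
    int added = 0;

    /* First loop: register the first NID, then break (as in LU-3829). */
    while (model_parse_nid(ptr, nid, sizeof(nid), &ptr) == 0) {
        added++;
        break;
    }

    /* Second loop: only entered at a ':' or when "mgsnode=" is found. */
    for (;;) {
        const char *param = strstr(ptr, "mgsnode=");
        int j = 0;

        if (*ptr != ':' && param == NULL)
            break;              /* ptr sits on ',': nothing more is added */
        if (*ptr != ':')
            ptr = param + strlen("mgsnode=");

        /* Assumed body: add every NID up to the next ':' separator. */
        while (model_parse_nid(ptr, nid, sizeof(nid), &ptr) == 0) {
            added++;
            j++;
            if (*ptr == ':')
                break;
        }
        if (j == 0)
            break;
    }
    return added;
}
```

With the string from LU-8311, model_count_added_nids() returns 1 in this model, because ptr is left on the ',' after 192.52.98.30@tcp and the second loop is never entered. With a colon-only string such as 192.52.98.30@tcp:192.52.98.31@tcp it returns 2.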
The patch landed on the master branch for Lustre 2.10.0.