[LU-4243] multiple servicenodes or failnids: wrong client llog registration Created: 12/Nov/13  Updated: 13/Mar/14  Resolved: 20/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Erich Focht Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

failover MDS/MGS, failover OSTs


Issue Links:
Duplicate
is duplicated by LU-4043 clients unable to reconnect after OST... Resolved
Severity: 3
Rank (Obsolete): 11555

 Description   

Since Lustre 2.4.0 we have had problems with clients that could not connect after, e.g., the MGS failed over. In most of our experiments we ran a client on the standby MGS server; the symptom was that the client only worked from the active MGS/MDS node, not from the passive one!

The reason for the problem seems to be commit d9d27cad and the following hunk of the patch:
@@ -1447,13 +1481,11 @@ static int mgs_write_log_failnids(const struct lu_env *env,
                                failnodeuuid, cliname);
                        rc = record_add_uuid(env, llh, nid, failnodeuuid);
                 }
-                if (failnodeuuid) {
+               if (failnodeuuid)
                        rc = record_add_conn(env, llh, cliname, failnodeuuid);
-                        name_destroy(&failnodeuuid);
-                        failnodeuuid = NULL;
-                }
         }

+       name_destroy(&failnodeuuid);
        return rc;
 }

This leads to a wrong Lustre client llog when the Lustre block devices are formatted with multiple --servicenode options! Here is an example:

#09 (224)marker 6 (flags=0x01, v2.5.0.0) lnec-MDT0000 'add mdc' Mon Nov 11 17:03:21 2013-
#10 (080)add_uuid nid=10.3.0.34@o2ib(0x500000a030022) 0: 1:10.3.0.34@o2ib
#11 (128)attach 0:lnec-MDT0000-mdc 1:mdc 2:lnec-clilmv_UUID
#12 (136)setup 0:lnec-MDT0000-mdc 1:lnec-MDT0000_UUID 2:10.3.0.34@o2ib
#13 (080)add_uuid nid=10.3.0.34@o2ib(0x500000a030022) 0: 1:10.3.0.34@o2ib
#14 (104)add_conn 0:lnec-MDT0000-mdc 1:10.3.0.34@o2ib
#15 (080)add_uuid nid=10.3.0.35@o2ib(0x500000a030023) 0: 1:10.3.0.34@o2ib
#16 (104)add_conn 0:lnec-MDT0000-mdc 1:10.3.0.34@o2ib
#17 (160)modify_mdc_tgts add 0:lnec-clilmv 1:lnec-MDT0000_UUID 2:0 3:1 4:lnec-MDT0000-mdc_UUID
#18 (224)marker 6 (flags=0x02, v2.5.0.0) lnec-MDT0000 'add mdc' Mon Nov 11 17:03:21 2013-

The last add_uuid should have 1:10.3.0.35@o2ib instead of 1:10.3.0.34@o2ib.

And the reason is that only the first NID of the first --servicenode (AKA --failnode) entry is considered.

Please revert that little patch of mgs_llog.c.

Regards,
Erich



 Comments   
Comment by Erich Focht [ 12/Nov/13 ]

The patch again, with hopefully proper formatting:

@@ -1447,13 +1481,11 @@ static int mgs_write_log_failnids(const struct lu_env *env,
                                failnodeuuid, cliname);
                        rc = record_add_uuid(env, llh, nid, failnodeuuid);
                 }
-                if (failnodeuuid) {
+               if (failnodeuuid)
                        rc = record_add_conn(env, llh, cliname, failnodeuuid);
-                        name_destroy(&failnodeuuid);
-                        failnodeuuid = NULL;
-                }
         }
 
+       name_destroy(&failnodeuuid);
         return rc;
 }
 
Comment by Peter Jones [ 12/Nov/13 ]

Hongchao

Could you please help with this one?

Thanks

Peter

Comment by Andreas Dilger [ 15/Nov/13 ]

Erich, to clarify, this problem only happens when you are trying to mount a client from the backup MGS node?

Comment by Oleg Drokin [ 15/Nov/13 ]

I wonder if it's also related somehow to LU-3829

Comment by Erich Focht [ 16/Nov/13 ]

Andreas, the problem also occurs for normal clients, IIRC. The peculiarity of the setup is that we formatted the Lustre devices with two --servicenode arguments (which translate into failover.node options). The first mount of each device was on the first of the service nodes. This used to work fine under 2.1.X. The other (maybe most widely used) way of formatting is with just one --failnode option, say --failnode B, and the first mount of the device on node A.

The patch mentioned in my first comment causes failnodeuuid to be set only once in the entire registration process, i.e. it is set to the first failover.node argument (or --servicenode) that was used for formatting. Instead, failnodeuuid should be set anew for each failover.node option that appears. We verified that reverting that patch fixes the problem in 2.5.0.

Oleg, we've seen the issue with two --mgsnode options, too, but that's different. Seems fixed in 2.5.0.

Comment by Hongchao Zhang [ 22/Nov/13 ]

The problem here is that LNet doesn't use the other NIDs with the same "distance" and "order" contained in the same UUID, which is "10.3.0.34@o2ib" in this case (see "ptlrpc_uuid_to_connection" and "ptlrpc_uuid_to_peer" for details).

This issue will still exist even if the mentioned patch is reverted: if the MDT is formatted with "--servicenode 10.3.0.34,10.3.0.35", these two NIDs will use the same UUID "10.3.0.34@o2ib".

Comment by Hongchao Zhang [ 22/Nov/13 ]

the patch is tracked at http://review.whamcloud.com/#/c/8372/

Comment by Zhenyu Xu [ 27/Nov/13 ]

Hi Hongchao,

Why would the two NIDs in "--servicenode 10.3.0.34,10.3.0.35" use the same UUID "10.3.0.34@o2ib"? I think these two service nodes will be parsed into different failnodeuuid strings.

Comment by Hongchao Zhang [ 29/Nov/13 ]

in mkfs_lustre.c

int parse_opts(int argc, char *const argv[], struct mkfs_opts *mop,
               char **mountopts)
{
     ...
     case 's': {
         ...
         nids = convert_hostnames(optarg);
         if (!nids)
             return 1;
         rc = add_param(mop->mo_ldd.ldd_params, PARAM_FAILNODE,
                        nids);
         free(nids);
         ...
     }
     ...
}

"convert_hostnames" does little to "10.3.0.34,10.3.0.35", and "add_param" will add a single "failover.node" param.

Comment by Zhenyu Xu [ 29/Nov/13 ]

add_param() would separate the string into two parameters, as shown on my VM machine:

# mkfs.lustre --mgs --mdt --fsname=lustre --index=0 --servicenode 10.3.0.34,10.3.0.35 --reformat /dev/sdb
   Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x1065
              (MDT MGS first_time update no_primnode )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: failover.node=10.3.0.34@tcp failover.node=10.3.0.35@tcp
Comment by Hongchao Zhang [ 20/Dec/13 ]

Oh, yes, it has been fixed in LU-3445 (http://review.whamcloud.com/#/c/6686/), which landed on b2_4_1 and master; b2_4_0 still has this problem.

Comment by Peter Jones [ 20/Dec/13 ]

Landed for 2.4.2 and 2.6. Will be landed for 2.5.1 shortly.

Comment by John Fuchs-Chesney (Inactive) [ 20/Feb/14 ]

Hello Erich – Do you have what you need on this issue? If so, can I go ahead and mark it as resolved? Thanks, ~ jfc.

Generated at Sat Feb 10 01:40:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.