[LU-16204] Connections from MGC to a Combined MGS/MDT on failover node not working Created: 04/Oct/22  Updated: 04/Oct/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.9, Lustre 2.15.0, Lustre 2.15.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Etienne Aujames Assignee: Etienne Aujames
Resolution: Unresolved Votes: 0
Labels: None
Environment:

master branch + VMs
2.12.9 + patches on a production cluster
(more than 1 MDT per node)


Severity: 3

 Description   

After migrating the MGS/MDT0000 resources from node 1 to node 2, the MGC still running on node 1 is unable to connect to the combined MGS/MDT now on node 2.

The issue is that a combined MGS/MDT ignores the failover NIDs declared for the MGS. When this target is mounted first, it creates an MGC without any failover NIDs.
The other targets then reuse this MGC device (same name) without adding new failover NIDs.
So, when the MGS/MDT target is later mounted on node 2, the MGC on node 1 cannot connect because the failover NID is missing.
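
The sketch below models this behaviour in plain C. It is illustrative only, not the actual Lustre code path; the struct and function names are made up. It shows how the first mounted target fixes the NID list of the shared MGC, so the mount order alone decides whether the failover NID gets recorded.

/*
 * Minimal sketch (not the actual Lustre source) of the behaviour described
 * above: every target on a node shares a single MGC device, keyed by its
 * name, and only the first target mounted decides which NIDs end up on the
 * MGC import. A combined MGS/MDT mounted first connects locally and never
 * records the mgsnode failover NID, so the list stays [ 0@lo ].
 */
#include <stdio.h>
#include <string.h>

#define MAX_NIDS 4

struct mgc_dev {
	char nids[MAX_NIDS][32];	/* NIDs known to the MGC import */
	int  nid_count;
	int  started;			/* set once the first target created it */
};

/* One MGC per MGS ("MGC10.0.2.4@tcp" here), reused by every later target. */
static struct mgc_dev shared_mgc;

static void mount_target(int target_is_mgs, const char *primary, const char *failover)
{
	if (shared_mgc.started)
		return;			/* later mounts reuse the MGC as-is:
					 * their failover NIDs are ignored */
	shared_mgc.started = 1;
	if (target_is_mgs) {
		/* combined MGS/MDT mounted first: local connection only */
		strcpy(shared_mgc.nids[shared_mgc.nid_count++], "0@lo");
	} else {
		/* plain MDT mounted first: primary + failover mgsnode NIDs */
		strcpy(shared_mgc.nids[shared_mgc.nid_count++], primary);
		strcpy(shared_mgc.nids[shared_mgc.nid_count++], failover);
	}
}

int main(void)
{
	/* first scenario below: mds1_flakey (MGS/MDT0000), then mds2_flakey */
	mount_target(1, "10.0.2.4@tcp", "10.0.2.7@tcp");
	mount_target(0, "10.0.2.4@tcp", "10.0.2.7@tcp");

	printf("failover_nids: [ ");
	for (int i = 0; i < shared_mgc.nid_count; i++)
		printf("%s ", shared_mgc.nids[i]);
	printf("]\n");	/* "[ 0@lo ]": 10.0.2.7@tcp is missing, so node 1 cannot
			 * reach the MGS after it fails over to node 2 */
	return 0;
}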

For example, both targets are formatted with a primary and a failover MGS NID:

[root@mds1 ~]# tunefs.lustre --dryrun /dev/mapper/mds1_flakey
checking for existing Lustre data: found

   Read previous values:
Target:     lustrefs-MDT0000
Index:      0
Lustre FS:  lustrefs
Mount type: ldiskfs
Flags:      0x5
              (MDT MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=10.0.2.4@tcp:10.0.2.7@tcp

[root@mds1 ~]# tunefs.lustre --dryrun /dev/mapper/mds2_flakey       
checking for existing Lustre data: found

   Read previous values:
Target:     lustrefs-MDT0001
Index:      1
Lustre FS:  lustrefs
Mount type: ldiskfs
Flags:      0x1
              (MDT )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=10.0.2.4@tcp:10.0.2.7@tcp

Mount mds1_flakey and then mds2_flakey:

[root@mds1 ~]# mount -tlustre /dev/mapper/mds1_flakey /media/lustrefs/mds1
[root@mds1 ~]# mount -tlustre /dev/mapper/mds2_flakey /media/lustrefs/mds2
[root@mds1 ~]# lctl dl
  0 UP osd-ldiskfs lustrefs-MDT0000-osd lustrefs-MDT0000-osd_UUID 8
  1 UP mgs MGS MGS 4
  2 UP mgc MGC10.0.2.4@tcp 9b8dda76-560c-449d-ad56-a81a673cd1aa 4
  3 UP mds MDS MDS_uuid 2
...
[root@mds1 ~]# lctl get_param mgc.MGC10.0.2.4@tcp.import
mgc.MGC10.0.2.4@tcp.import=
import:
    name: MGC10.0.2.4@tcp
    target: MGS
    state: FULL
    connect_flags: [ version, barrier, adaptive_timeouts, full20, imp_recov, bulk_mbits, second_flags, reply_mbits ]
    connect_data:
       flags: 0xa000011001002020
       instance: 0
       target_version: 2.15.51.0
    import_flags: [ pingable, connect_tried ]
    connection:
       failover_nids: [ 0@lo ]                                        <----------------
       current_connection: 0@lo
       connection_attempts: 1
       generation: 1
       in-progress_invalidations: 0
       idle: 78545 sec

Mount mds2_flakey and then mds1_flakey:

[root@mds1 ~]# lctl get_param mgc.MGC10.0.2.4@tcp.import                                                           
mgc.MGC10.0.2.4@tcp.import=
import:
    name: MGC10.0.2.4@tcp
    target: MGS
    state: FULL
    connect_flags: [ version, barrier, adaptive_timeouts, full20, imp_recov, bulk_mbits, second_flags, reply_mbits ]
    connect_data:
       flags: 0xa000011001002020
       instance: 0
       target_version: 2.15.51.0
    import_flags: [ pingable, connect_tried ]
    connection:
       failover_nids: [ 0@lo, 10.0.2.7@tcp ]              <----------------
       current_connection: 0@lo
       connection_attempts: 10
       generation: 1
       in-progress_invalidations: 0
       idle: 60 sec
