[LU-15539] clients report mds_mds_connection in connect_flags after lustre update on servers Created: 09/Feb/22  Updated: 01/Apr/22  Resolved: 21/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.12.8_6.llnl
3.10.0-1160.53.1.1chaos.ch6.x86_64
RHEL7.9
zfs-0.7.11-9.8llnl


Issue Links:
Related
is related to LU-13356 lctl conf_param hung on the MGS node Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. 

The node on which the MGS runs, copper1, began reporting "new MDS connections" from NIDs that are assigned to client nodes:

Lustre: MGS: Received new MDS connection from 192.168.128.68@o2ib38, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.128.8@o2ib42, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.131.78@o2ib39, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.132.204@o2ib39, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.134.127@o2ib27, keep former export from same NID

The clients' connect_flags include "mds_mds_connection":

[root@quartz7:lustre]# head */*/connect_flags
==> mgc/MGC172.19.3.1@o2ib600/connect_flags <==
flags=0x2000011005002020
flags2=0x0
version
barrier
adaptive_timeouts
mds_mds_connection
full20
imp_recov
bulk_mbits

The clients are running lustre-2.12.7_2.llnl, which does not have "LU-13356 client: don't use OBD_CONNECT_MNE_SWAB".
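
For reference, a minimal sketch of how the flag word reported above decodes, assuming the 2.12-era value OBD_CONNECT_MDS_MDS = 0x4000000 from lustre_idl.h (which unpatched clients also set under its alias OBD_CONNECT_MNE_SWAB); the debugfs path is the one shown later in this ticket:

flags=$(awk -F= '/^flags=/ {print $2}' \
    /sys/kernel/debug/lustre/mgc/MGC172.19.3.1@o2ib600/connect_flags)
# Bit 0x4000000: an unpatched 2.12.7 client sets it to mean "MNE swab
# supported", the connect-flag name table prints it as mds_mds_connection,
# and the updated servers appear to read it as an MDS-MDS connection,
# which matches the MGS messages above.
if (( flags & 0x4000000 )); then
    echo "MGC import advertises 0x4000000 (mne_swab / mds_mds_connection)"
fi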

Shutting down the servers and restoring them to lustre-2.12.7_2.llnl did not change the symptoms.

Patch stacks are:
https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl

This was seen during the same Lustre server update as LU-15541, but it appears to be a separate issue.



 Comments   
Comment by Olaf Faaland [ 09/Feb/22 ]

For my records, my local ticket is TOSS5543

Comment by Olaf Faaland [ 09/Feb/22 ]

When we attempted to revert to lustre-2.12.7_2.llnl after observing the issue, we did not restore any server-side data; we just powered the servers down, booted them into the image with lustre-2.12.7_2.llnl, imported pools, and mounted the Lustre targets.

After doing this, we continued to see the same symptoms.

The clients that were still mounted (about 4,000) still had the incorrect connect flag from the earlier connection; it was not reset when they reconnected (or attempted to reconnect). I believe this flag not being reset may be why the symptoms did not change after the reboot.

Comment by Olaf Faaland [ 09/Feb/22 ]

Perhaps https://jira.whamcloud.com/browse/LU-15453 needs to be rolled out to the clients before it is rolled out to the servers?

Comment by Olaf Faaland [ 09/Feb/22 ]

I am going to force-umount all the clients before I try bringing the file system up on lustre-2.12.7_2.llnl, because I don't see another way to force the connect flags on the clients to change.

I don't see evidence that this flag is recorded in the config logs, so I believe that a force umount on the client, followed by a mount of the file system running lustre-2.12.7_2.llnl, will succeed in removing the mds_mds_connection flag. But if I'm wrong about that, please let me know ASAP.

Thanks
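
A rough sketch of the client-side cleanup proposed above, in the same pdsh/dshbak style used elsewhere in this ticket (the -a host selection and the /p/lustre1 mount point are the ones discussed; this only force-umounts and then confirms nothing still has lustre1 mounted):

pdsh -a 'umount -f /p/lustre1'
pdsh -a 'df -h -t lustre' | dshbak -c    # confirm /p/lustre1 no longer appears anywhere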

Comment by Olaf Faaland [ 09/Feb/22 ]

Most of these clients mount one or more other Lustre file systems (whose servers are running lustre-2.12.7_2.llnl and have not yet been upgraded to 2.12.8_6.llnl). On the example node I'm looking at, "lustre1" is the file system we tried to update, and "lustre2" is the other one which stayed at 2.12.7_2.llnl the whole time.

On this client node, the connect_flags for the other file system, "lustre2", also shows "mds_mds_connection".

Will I need to umount lustre2 and remount it to get the client to forget that connection flag?

(There are two separate MGSs - one in the cluster that hosts "lustre1", and one in the cluster that hosts "lustre2", and each MGS knows only about the file system in the cluster it lives in. So I didn't expect this.)

Comment by Patrick Farrell [ 09/Feb/22 ]

Olaf,

Yes, I believe that should work.  (Re: the unmount plan)

Comment by Olaf Faaland [ 09/Feb/22 ]

Thanks, Patrick.

Comment by Olaf Faaland [ 09/Feb/22 ]

Patrick, I'd like to clarify which actions I've mentioned are still uncertain/need to be looked into, and which I should proceed with. So please confirm:

(1) I should "umount -f /p/lustre1" (the file system that was briefly at 2.12.8_6.llnl) from every client that has it mounted?
(2) Do I need to umount /p/lustre2 at the same time?
(3) Since /p/lustre2 has been up the whole time, has jobs running against it, and was always at the "good" version, is there a less disruptive way you can think of?

thanks

Comment by Lai Siyao [ 09/Feb/22 ]
> Perhaps https://jira.whamcloud.com/browse/LU-15453 needs to be rolled out to the clients before it is rolled out to the servers?

Do you mean https://review.whamcloud.com/37880/ from LU-13356? If so, yes, it should be rolled out to clients.
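
A possible pre-flight check before touching the servers, sketched in the pdsh style used elsewhere in this ticket: clients whose MGC import still advertises the overlapping bit (printed as mds_mds_connection) should be the ones still running a build without change 37880, so listing them across the cluster gives a rough picture of the rollout (the debugfs path is the one shown in the transcripts below):

pdsh -a 'grep -l mds_mds_connection /sys/kernel/debug/lustre/mgc/*/connect_flags' | dshbak -c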

Comment by Olaf Faaland [ 09/Feb/22 ]

Thank you, Lai

Comment by Patrick Farrell [ 09/Feb/22 ]

Olaf,

I would think (1) should be sufficient - the issue is an overlapping connect flag so servers without the new one should be unaffected.

Comment by Olaf Faaland [ 09/Feb/22 ]

OK, thanks Patrick

Comment by Olaf Faaland [ 09/Feb/22 ]

The mds_mds_connection flag is appearing in connect_flags on a client running lustre-2.12.7_2.llnl, which does not have patch 37880. On that client I umounted all Lustre file systems, stopped LNet, verified that libcfs is not loaded (nor any other dependent modules), started LNet, and mounted lustre2, whose servers are also all running lustre-2.12.7_2.llnl.

This is incorrect, right?

[root@zinci:~]# pdsh -a rpm -q lustre | dshbak -c
----------------
ezinc[1-52]
----------------
lustre-2.12.7_2.llnl-2.ch6.x86_64

[root@zinci:~]# pdsh -w e1 lctl list_nids
e1: 172.19.3.1@o2ib600
[root@zinci:~]# pdsh -w e1 df -h -t lustre
e1: Filesystem      Size  Used Avail Use% Mounted on
e1: zinc1/mgs       2.7T   21M  2.7T   1% /mnt/lustre/MGS
e1: zinc1/mdt1      2.8T  183G  2.7T   7% /mnt/lustre/lsh-MDT0000

and

[root@quartz187:~]# pdsh -w e10 umount -a -t lustre
[root@quartz187:~]# pdsh -w e10 systemctl stop lnet
[root@quartz187:~]# pdsh -w e10 lsmod | grep libcfs
[root@quartz187:~]# pdsh -w e10 systemctl start lnet
[root@quartz187:~]# pdsh -w e10 mount -T /etc/fstab.d/ /p/lustre2
[root@quartz187:~]# pdsh -w e10 cat /sys/kernel/debug/lustre/mgc/MGC172.19.3.1@o2ib600/connect_flags
e10: flags=0x2000011005002020
e10: flags2=0x0
e10: version
e10: barrier
e10: adaptive_timeouts
e10: mds_mds_connection
e10: full20
e10: imp_recov
e10: bulk_mbits
Comment by Lai Siyao [ 09/Feb/22 ]

It looks correct to see this in 2.12.7; the flag means MNE_SWAB here. After you land 37880, it should be gone.

Comment by Olaf Faaland [ 09/Feb/22 ]

Lai, it seems as if the soft lockups in the description above are an LNet issue unrelated to the flag.  Do you agree?  If so, I'll create a separate Jira issue for it.

Comment by Lai Siyao [ 10/Feb/22 ]

Yes, it looks so.

Comment by Olaf Faaland [ 10/Feb/22 ]

We umounted all the clients while the server cluster was down, then brought the server cluster back up in 2.12.7_2.llnl.  We did not see the "Received new MDS connection" messages on bringup.

We are proceeding with client cluster updates, and will update server clusters in about 2 weeks.

Comment by Olaf Faaland [ 21/Mar/22 ]

In retrospect, the issue that kept the Lustre file system from coming up was the LNet issue documented in https://jira.whamcloud.com/browse/LU-15541, so I reduced this issue's priority to "Minor".

Comment by Olaf Faaland [ 21/Mar/22 ]

Updated all the clients to 2.12.8_6.llnl (or later) with patch "LU-13356 client: don't use OBD_CONNECT_MNE_SWAB".
Then updated the servers to 2.12.8_6.llnl (or later) with that patch.
No longer seeing inappropriate "Received new MDS connection" messages on bringup.
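
A quick way to confirm this on the MGS node after bringup (copper1 is the MGS host named in the description, and the message text is the one quoted there; expect no matches once both sides carry the LU-13356 change):

pdsh -w copper1 'dmesg | grep "Received new MDS connection"' | dshbak -c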
