[LU-15539] clients report mds_mds_connection in connect_flags after lustre update on servers Created: 09/Feb/22 Updated: 01/Apr/22 Resolved: 21/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | lustre-2.12.8_6.llnl |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We upgraded a Lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. The node on which the MGS runs, copper1, began reporting "new MDS connections" from NIDs that are assigned to client nodes:
Lustre: MGS: Received new MDS connection from 192.168.128.68@o2ib38, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.128.8@o2ib42, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.131.78@o2ib39, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.132.204@o2ib39, keep former export from same NID
Lustre: MGS: Received new MDS connection from 192.168.134.127@o2ib27, keep former export from same NID
The clients' connect_flags include "mds_mds_connection":
[root@quartz7:lustre]# head */*/connect_flags
==> mgc/MGC172.19.3.1@o2ib600/connect_flags <==
flags=0x2000011005002020
flags2=0x0
version
barrier
adaptive_timeouts
mds_mds_connection
full20
imp_recov
bulk_mbits
The clients are running lustre-2.12.7_2.llnl, which does not have "
Shutting down the servers and restoring them to lustre-2.12.7_2.llnl did not change the symptoms.
Patch stacks are:
Seen during the same Lustre server update where we saw LU-15541, but this appears to be a separate issue. |
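A minimal sketch, assuming pdsh access to the clients and the MGC debugfs path shown above, of how one might survey which clients report the flag (hostlists and paths are site-specific):
# Sketch only: list, per client, the MGC connect_flags files that contain mds_mds_connection.
pdsh -a 'grep -l mds_mds_connection /sys/kernel/debug/lustre/mgc/*/connect_flags' 2>/dev/null | dshbak -c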
| Comments |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
For my records, my local ticket is TOSS5543 |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
When we attempted to revert to lustre-2.12.7_2.llnl after observing the issue, we did not restore any server-side data; we just powered the servers down, booted them into the image with lustre-2.12.7_2.llnl, imported pools, and mounted the Lustre targets. After doing this, we continued to see the same symptoms. The clients which were still mounted (about 4,000) still had the incorrect connect flag from the earlier connection. This was not reset when they reconnected (or attempted to reconnect). I believe this connect flag not being reset may be why the symptoms did not change after the reboot. |
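For reference, a minimal sketch of that bring-up sequence, assuming ZFS-backed targets; the pool, dataset, and mount-point names below are hypothetical placeholders, not the ones actually used:
# Sketch under assumptions: import the backing pool, then start the targets
# by mounting them with -t lustre on the servers.
zpool import lustre1-pool                                        # hypothetical pool name
mount -t lustre lustre1-pool/mgs /mnt/lustre/MGS                 # start the MGS target
mount -t lustre lustre1-pool/mdt0 /mnt/lustre/lustre1-MDT0000    # start an MDT target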
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
Perhaps, does https://jira.whamcloud.com/browse/LU-15453 need to be rolled out to the clients before it is rolled out to the servers? |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
I am going to force-umount all the clients before I try bringing the file system up on lustre-2.12.7_2.llnl, because I don't see another way to force the connect flags on the clients to change. I don't see evidence that this flag is recorded in the config logs, so I believe that a force umount on each client followed by a remount of the file system running lustre-2.12.7_2.llnl will remove the mds_mds_connection flag. But if I'm wrong about that, please let me know ASAP. Thanks |
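A minimal sketch of the cluster-wide force-umount and later remount, assuming pdsh access to every client; /p/lustre1 is the mount point referred to later in this ticket, and the remount invocation is site-specific:
# Sketch only: force-umount the affected file system on every client.
pdsh -a 'umount -f /p/lustre1'
# ...once the servers are back up on lustre-2.12.7_2.llnl...
pdsh -a 'mount /p/lustre1'    # assumes an fstab entry; adjust to the site's mount -T /etc/fstab.d/ form if that is what is used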
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
Most of these clients mount one or more other Lustre file systems (whose servers are running lustre-2.12.7_2.llnl and have not yet been upgraded to 2.12.8_6.llnl). On the example node I'm looking at, "lustre1" is the file system we tried to update, and "lustre2" is the other one which stayed at 2.12.7_2.llnl the whole time. On this client node, the connect_flags for the other file system, "lustre2", also shows "mds_mds_connection". Will I need to umount lustre2 and remount it to get the client to forget that connection flag? (There are two separate MGSs - one in the cluster that hosts "lustre1", and one in the cluster that hosts "lustre2", and each MGS knows only about the file system in the cluster it lives in. So I didn't expect this.) |
| Comment by Patrick Farrell [ 09/Feb/22 ] |
|
Olaf, Yes, I believe that should work. (Re: the unmount plan) |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
Thanks, Patrick. |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
Patrick, I'd like to clarify which of the actions I've mentioned are still uncertain or need further investigation, and which I should proceed with. Please confirm: (1) should I "umount -f /p/lustre1" (the file system that was briefly at 2.12.8_6.llnl) on every client that has it mounted? Thanks |
| Comment by Lai Siyao [ 09/Feb/22 ] |
|
Quoting the question above: "Perhaps, does https://jira.whamcloud.com/browse/LU-15453 need to be rolled out to the clients before it is rolled out to the servers?"
Do you mean https://review.whamcloud.com/37880/ from |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
Thank you, Lai |
| Comment by Patrick Farrell [ 09/Feb/22 ] |
|
Olaf, I would think (1) should be sufficient - the issue is an overlapping connect flag so servers without the new one should be unaffected. |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
OK, thanks Patrick |
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
The mds_mds_connection flag is appearing in connect_flags on a client running lustre-2.12.7_2.llnl, which does not have patch 37880. On that client I have umounted all Lustre file systems, stopped lnet, verified that libcfs is not loaded (nor any other dependent modules), started lnet again, and mounted lustre2, whose servers are also all running lustre-2.12.7_2.llnl. This is incorrect, right?
[root@zinci:~]# pdsh -a rpm -q lustre | dshbak -c
----------------
ezinc[1-52]
----------------
lustre-2.12.7_2.llnl-2.ch6.x86_64
[root@zinci:~]# pdsh -w e1 lctl list_nids
e1: 172.19.3.1@o2ib600
[root@zinci:~]# pdsh -w e1 df -h -t lustre
e1: Filesystem  Size  Used  Avail  Use%  Mounted on
e1: zinc1/mgs   2.7T  21M   2.7T   1%    /mnt/lustre/MGS
e1: zinc1/mdt1  2.8T  183G  2.7T   7%    /mnt/lustre/lsh-MDT0000
and
[root@quartz187:~]# pdsh -w e10 umount -a -t lustre
[root@quartz187:~]# pdsh -w e10 systemctl stop lnet
[root@quartz187:~]# pdsh -w e10 lsmod | grep libcfs
[root@quartz187:~]# pdsh -w e10 systemctl start lnet
[root@quartz187:~]# pdsh -w e10 mount -T /etc/fstab.d/ /p/lustre2
[root@quartz187:~]# pdsh -w e10 cat /sys/kernel/debug/lustre/mgc/MGC172.19.3.1@o2ib600/connect_flags
e10: flags=0x2000011005002020
e10: flags2=0x0
e10: version
e10: barrier
e10: adaptive_timeouts
e10: mds_mds_connection
e10: full20
e10: imp_recov
e10: bulk_mbits |
| Comment by Lai Siyao [ 09/Feb/22 ] |
|
It looks correct to see this in 2.12.7, though it means MNE_SWAB here. After you land 37880, it should be gone. |
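A small sketch of why the client prints that name, assuming the 2.12 connect-flag bit 0x4000000 is shared by OBD_CONNECT_MDS_MDS and OBD_CONNECT_MNE_SWAB and the flag printer only knows the mds_mds_connection name (worth verifying against the source tree):
flags=0x2000011005002020                  # value reported by the client above
echo $(( (flags & 0x4000000) != 0 ))      # prints 1: the shared bit is set
# On an MGC import this bit indicates MNE_SWAB support, not an actual MDS-MDS connection.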
| Comment by Olaf Faaland [ 09/Feb/22 ] |
|
Lai, it seems as if the soft lockups in the description above are an LNet issue unrelated to the flag. Do you agree? If so, I'll create a separate Jira issue for it. |
| Comment by Lai Siyao [ 10/Feb/22 ] |
|
Yes, it looks so. |
| Comment by Olaf Faaland [ 10/Feb/22 ] |
|
We umounted all the clients while the server cluster was down, then brought the server cluster back up in 2.12.7_2.llnl. We did not see the "Received new MDS connection" messages on bringup. We are proceeding with client cluster updates, and will update server clusters in about 2 weeks. |
| Comment by Olaf Faaland [ 21/Mar/22 ] |
|
In retrospect, the issue that kept the Lustre file system from coming up was the LNet issue documented in https://jira.whamcloud.com/browse/LU-15541, so I reduced this issue's priority to "Minor". |
| Comment by Olaf Faaland [ 21/Mar/22 ] |
|
Updated all the clients to 2.12.8_6.llnl (or later) with patch " |