Lustre / LU-14108

Mounting targets created with mkfs "network" option should disable discovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.16.0
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      The --network= option to mkfs.lustre allows restricting a target (OST/MDT) to a given LNet network, so that the target registers with the MGS using NIDs on the specified network only. However, dynamic discovery is unaware of this restriction, which can create problems.
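
      For example (a sketch; the device, fsname, index, and MGS NID are illustrative), an OST formatted as follows registers with the MGS using its o2ib1 NIDs only:

          # format an OST restricted to the o2ib1 LNet network
          mkfs.lustre --ost --fsname=testfs --index=0 \
              --mgsnode=<mgs NID> --network=o2ib1 /dev/sdb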

      If this scenario is recognised, it is possible to deal with it in at least two ways:

      1) Prevent the mount with an error.

      2) Disable dynamic discovery with a warning, and prevent discovery from being enabled in the future (the manual discovery knob is shown in the sketch after this list).
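
      For reference, the discovery setting that option 2 would pin can be inspected and changed manually with lnetctl (a sketch, independent of how the fix itself is implemented):

          # disable LNet dynamic peer discovery on this node
          lnetctl set discovery 0
          # verify: the "discovery" field in the global settings should now report 0
          lnetctl global show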

      Attachments

        Issue Links

          Activity

            [LU-14108] Mounting targets created with mkfs "network" option should disable discovery

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46632/
            Subject: LU-14108 mount: prevent if --network and discovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e82836f56ee7a9337a86ad0a32f19751024c7ec6


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47137
            Subject: LU-14108 tests: mount with discovery and network
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 038db7e1872e6ef743921744fb09f0f0c201124a


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46632
            Subject: LU-14108 mount: prevent if --network and discovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f5f0dba9c253a25ba9b6d06a0b4e361aa48690d5


            Serguei Smirnov added a comment:
            To reproduce, use the --network option when formatting targets with mkfs.lustre. The following is an example from the customer setup:

            "

            The error occurs between the MDS (md1) and the OSSs (all OSSs show similar errors). For example, rcf2-OST000d is mounted on os2, but md1 repeatedly tries to connect to it via os1. Normally rcf2-OST000d is mounted on os2 (the primary OSS for this OST), so the error should not happen.

            tunefs.lustre shows network=o2ib1; the primary is os2 (172.20.2.26@o2ib1) and the secondary is os1 (172.20.2.25@o2ib1).

            "

            [root@os2 ~]# tunefs.lustre --dryrun /dev/mapper/OST0d
            checking for existing Lustre data: found
            
               Read previous values:
            Target:     rcf2-OST000d
            Index:      13
            Lustre FS:  rcf2
            Mount type: ldiskfs
            Flags:      0x2
                          (OST )
            Persistent mount opts: errors=remount-ro
            Parameters:  mgsnode=172.20.2.11@o2ib:172.20.2.12@o2ib failover.node=172.20.2.25@o2ib1 network=o2ib1
            
            
               Permanent disk data:
            Target:     rcf2-OST000d
            Index:      13
            Lustre FS:  rcf2
            Mount type: ldiskfs
            Flags:      0x2
                          (OST )
            Persistent mount opts: errors=remount-ro
            Parameters:  mgsnode=172.20.2.11@o2ib:172.20.2.12@o2ib failover.node=172.20.2.25@o2ib1 network=o2ib1
            
            exiting before disk write.

            It should be possible to reproduce this with a simpler setup.
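
            For example, a minimal two-network sketch (interface names, devices, fsname, and the MGS NID are hypothetical):

            # configure two LNet networks on the OSS and leave discovery enabled (the default)
            lnetctl lnet configure
            lnetctl net add --net tcp --if eth0
            lnetctl net add --net tcp1 --if eth1
            lnetctl set discovery 1

            # format the OST restricted to tcp1, then mount it
            mkfs.lustre --ost --fsname=testfs --index=0 \
                --mgsnode=<mgs NID> --network=tcp1 /dev/sdb
            mount -t lustre /dev/sdb /mnt/ost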

            The symptom is an error message similar to the following:

            Oct 22 11:58:27 os1-s kernel: Lustre: rcf2-OST0001: Received new MDS connection from 172.20.2.11@o2ib, removing former export from same NID
            Oct 22 11:58:27 os1-s kernel: Lustre: Skipped 47 previous similar messages
             ...
            Oct 22 12:02:13 os1-s kernel: LustreError: 137-5: rcf2-OST000d_UUID: not available for connect from 172.20.2.11@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            Oct 22 12:02:13 os1-s kernel: LustreError: Skipped 47 previous similar messages

            To see how the targets are registered with the MGS, run a command similar to the following on the MGS:

            lctl --device MGS llog_print rcf2-MDT0000

            This should show that only NIDs on the network specified with the mkfs --network option were used.
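
            If the configuration log is long, one illustrative way to list just the recorded NIDs is to filter the output:

            lctl --device MGS llog_print rcf2-MDT0000 | grep -oE '[0-9.]+@[a-z0-9]+' | sort -u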


            People

              Assignee: Cyril Bordage (cbordage)
              Reporter: Serguei Smirnov (ssmirnov)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: