[LU-14108] Mounting targets created with mkfs "network" option should disable discovery Created: 02/Nov/20 Updated: 30/May/23 Resolved: 30/May/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Cyril Bordage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lnet |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The --network= option to mkfs.lustre restricts a target (OST/MDT) to a given LNet network, so that the target registers with the MGS using NIDs on the specified network only. However, dynamic discovery is unaware of this restriction, which can create problems. If this scenario is recognised, it can be handled in at least two ways: 1) prevent the mount with an error; 2) disable dynamic discovery with a warning, and prevent discovery from being re-enabled later. |
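Until mount handles this automatically, a manual workaround along the lines of option 2 is to disable dynamic discovery on the nodes involved. The following is a minimal sketch; the file system name, device, index, and network below are illustrative values, not taken from this ticket:
# Format an OST restricted to a single LNet network (illustrative values):
mkfs.lustre --ost --fsname=testfs --index=0 \
    --mgsnode=192.168.1.10@tcp --network=tcp1 /dev/sdb
# Disable LNet dynamic discovery before mounting, so peers are only
# contacted on their configured NIDs:
lnetctl set discovery 0
# To make the setting persistent across reboots, the lnet module
# parameter can be set, e.g. in /etc/modprobe.d/lustre.conf:
#   options lnet lnet_peer_discovery_disabled=1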
| Comments |
| Comment by Serguei Smirnov [ 02/Nov/20 ] |
|
To reproduce, use the --network option when formatting targets with mkfs.lustre. The following is an example from the customer setup:
"The error occurs between the MDS (md1) and the OSSs (all OSSs show similar errors). For example, rcf2-OST000d is mounted on os2, but md1 repeatedly tries to connect to it on os1. rcf2-OST000d is normally mounted on os2 (the primary OSS for this OST), so the error should not happen. tunefs.lustre shows network=o2ib1; the primary is os2 (172.20.2.26@o2ib1) and the secondary is os1 (172.20.2.25@o2ib1)."
[root@os2 ~]# tunefs.lustre --dryrun /dev/mapper/OST0d
checking for existing Lustre data: found
Read previous values:
Target: rcf2-OST000d
Index: 13
Lustre FS: rcf2
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=172.20.2.11@o2ib:172.20.2.12@o2ib failover.node=172.20.2.25@o2ib1 network=o2ib1
Permanent disk data:
Target: rcf2-OST000d
Index: 13
Lustre FS: rcf2
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=172.20.2.11@o2ib:172.20.2.12@o2ib failover.node=172.20.2.25@o2ib1 network=o2ib1
exiting before disk write.
It should be possible to reproduce this with a simpler setup. The symptom is an error message similar to the following:
Oct 22 11:58:27 os1-s kernel: Lustre: rcf2-OST0001: Received new MDS connection from 172.20.2.11@o2ib, removing former export from same NID
Oct 22 11:58:27 os1-s kernel: Lustre: Skipped 47 previous similar messages
...
Oct 22 12:02:13 os1-s kernel: LustreError: 137-5: rcf2-OST000d_UUID: not available for connect from 172.20.2.11@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 22 12:02:13 os1-s kernel: LustreError: Skipped 47 previous similar messages
To see how the targets are registered with the MGS, run a command similar to the following on the MGS:
lctl --device MGS llog_print rcf2-MDT0000
This should show that only NIDs on the network specified with the mkfs option were used. |
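For a quick check of both sides, something like the following can be used; the grep pattern is only an illustration, and rcf2 is the file system name from this ticket:
# On the MGS: confirm the target registered only with NIDs on the
# network given to --network (here o2ib1):
lctl --device MGS llog_print rcf2-MDT0000 | grep -i o2ib
# On the servers: check whether dynamic discovery is enabled (1) or
# disabled (0):
lnetctl global show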
| Comment by Gerrit Updater [ 26/Feb/22 ] |
|
"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46632 |
| Comment by Gerrit Updater [ 25/Apr/22 ] |
|
"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47137 |
| Comment by Gerrit Updater [ 04/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46632/ |