[LU-14668] LNet: do discovery in the background Created: 04/May/21  Updated: 07/Feb/24  Resolved: 13/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Improvement Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10360 use Imperative Recovery logs for clie... Open
is related to LU-15541 Soft lockups in LNetPrimaryNID() and ... Open
is related to LU-15169 Regression in "024f9303bc LU-14668 ln... Resolved
is related to LU-14566 Skip discovery in LNetPrimaryNID when... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

When the file system is being mounted, the llog is traversed and a local peer representation is created at the ptlrpc layer. As part of this process the ptlrpc_connection_get() -> LNetPrimaryNID() path is executed. As a result, LNet performs the discovery protocol to update its local representation of the peer, which involves communicating with the NID provided by the ptlrpc_connection_get() call. Prior to the introduction of LNetPrimaryNID(), no communication with the remote peer was performed at this point. Consequently, when the llog contains references to old NIDs, or NIDs for bad interfaces, the connection attempt to such a NID can take up to the LND timeout (in the 50s range) to expire. This can extend the mount time considerably.

To avoid this issue we can change the concept of the Primary NID. The Primary NID is currently a global concept derived from the first interface configured on the node. However, there doesn't seem to be a need for it to be global. Each node can have a different view of the primary NID of the peers it communicates with, as long as it keeps the Primary NID consistent throughout the life of the peer.

Since Lustre is the one which requests the initial connection to the peer, it already provides LNet with the NID which it prefers to use (likely the one configured). LNet can lock that NID as the primary NID of the node, even if it is not the first interface configured on the node.

This actually clarifies some confusion encountered on some sites, where the first interface configured on the system is not on the same network as the peer's interface.

For example, a TCP client can mount a server over the TCP network while the server has its o2ib interface configured first. On the TCP client the peer then shows the o2ib NID as the primary NID. This can be confusing when viewing the configuration.

If the primary NID of the peer is locked to the tcp NID, viewing the peer configuration from the tcp client makes more sense.

This way the primary NID becomes a node-local concept. It is the NID by which a Lustre node references a peer. Different Lustre nodes can reference the same peer by different NIDs.

Practically speaking, the FS is usually configured with the first NID which is reachable. From a TCP client it would be the first tcp interface configured, and likewise for other networks. However, the solution doesn't demand that.

The solution will be spread across the following patches:

  1. Introduce a LOCK_PRIMARY state to the peer. This is set when LNetPrimaryNID() is called on a new peer or a peer is explicitly added by Lustre.
  2. When a peer is in LOCK_PRIMARY state, the primary NID provided by Lustre will not change. The peer can be populated with other interfaces' NIDs; however, the primary NID will not change.
  3. Get Lustre to pre-define the Primary NID and the constituent NIDs, such that a call to LNetPrimaryNID() on a constituent NID returns a consistent result and is not dependent on the completion of the discovery protocol.
  4. If a peer was manually discovered and Lustre then explicitly adds it using a different primary NID, the Lustre configuration path will take precedence. The peer will be deleted and recreated with the primary NID Lustre uses.
  5. When Lustre deletes the UUID, the lock on the LNet peer should be removed.
  6. TBD: Should we be removing the lock from an LNet Peer when Lustre evicts a node or when Lustre is unmounted?
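
The items above can be illustrated with a toy model. This is a minimal Python sketch, not LNet code: the Peer and PeerTable classes, their method names and the lock_primary flag are hypothetical stand-ins for the kernel data structures, and item 6 is left out because it is still TBD. Each PeerTable instance represents one node's local view, so two nodes can hold different primary NIDs for the same peer.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Toy stand-in for an LNet peer (hypothetical, not the kernel struct)."""
    primary_nid: str
    nids: list = field(default_factory=list)
    lock_primary: bool = False


class PeerTable:
    """One node's local view of its peers."""

    def __init__(self):
        self.peers = []

    def find(self, nid):
        for peer in self.peers:
            if nid in peer.nids:
                return peer
        return None

    def _create(self, primary, locked):
        peer = Peer(primary_nid=primary, nids=[primary], lock_primary=locked)
        self.peers.append(peer)
        return peer

    def primary_nid(self, nid):
        """Item 1: a call on an unknown NID creates a peer locked to that NID."""
        peer = self.find(nid) or self._create(nid, locked=True)
        return peer.primary_nid

    def discovery_update(self, nid, discovered_primary, discovered_nids):
        """Item 2: discovery may add NIDs but never moves a locked primary."""
        peer = self.find(nid) or self._create(discovered_primary, locked=False)
        for n in [discovered_primary, *discovered_nids]:
            if n not in peer.nids:
                peer.nids.append(n)
        if not peer.lock_primary:
            peer.primary_nid = discovered_primary
        return peer

    def lustre_add_peer(self, primary, constituents=()):
        """Items 3 and 4: Lustre pre-defines the peer and its view wins."""
        peer = self.find(primary)
        if peer and not peer.lock_primary and peer.primary_nid != primary:
            self.peers.remove(peer)  # drop the manually discovered peer
            peer = None
        if peer is None:
            peer = self._create(primary, locked=True)
        peer.lock_primary = True
        for n in constituents:
            if n not in peer.nids:
                peer.nids.append(n)
        return peer

    def lustre_del_uuid(self, nid):
        """Item 5: deleting the UUID releases the lock."""
        peer = self.find(nid)
        if peer:
            peer.lock_primary = False
```

With a tcp client that calls primary_nid() on the server's tcp NID first, a later discovery_update() reporting the o2ib NID as the server's first interface leaves the tcp NID as the primary, and a lookup on either NID returns the tcp NID.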

This solution should avoid long mount delays. However, it will not help in the case when the Primary NID used by Lustre is not reachable or LNet encounters network delays reaching that NID.

On mount, Lustre needs to reach the MGS to retrieve the server NID information in the llog.

The connect path is obd_connect() -> lmv_connect() -> lmv_connect_mdc() -> client_connect_import() -> ptlrpc_connect_import().

The client then does a sync OBD_STATFS to MDT0000 to test its aliveness (maybe to wait for the MDT0000 connection to complete), then checks some connection features on the MDT to verify it is not too old, then gets the root directory FID from MDT0000 for the mount. After that, it follows a similar process to connect to the OSTs, but it doesn't wait for them to finish.

The purpose of this solution is to avoid delaying the mount on servers which might not be reachable at mount time. By pushing discovery into the background, the discovery can complete in its own time. Any messages to the node under discovery will be sent only after discovery is complete. Therefore, the NIDs provided by the Lustre client for servers necessary for mount will, by definition, need to be reachable for the mount to complete. Other nodes which are not needed at mount time will not block the mount.
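
The queueing behaviour described above can be sketched as follows. This is a toy Python model under stated assumptions, not the LNet implementation: the BackgroundDiscovery class and its methods are hypothetical, discovery completion is signalled by an explicit discovery_done() call instead of a real background thread, and failed discovery is not modelled.

```python
from collections import defaultdict, deque


class BackgroundDiscovery:
    """Toy model: discovery never blocks the caller; sends wait for it."""

    def __init__(self):
        self.discovered = set()            # NIDs whose discovery has finished
        self.pending = defaultdict(deque)  # NID -> messages held for discovery
        self.sent = []                     # (nid, msg) pairs actually sent

    def primary_nid(self, nid):
        """Return at once; discovery of nid is scheduled, not awaited."""
        self.pending[nid]  # touching the defaultdict registers the NID
        return nid

    def send(self, nid, msg):
        """Send now if the peer is discovered, otherwise queue the message."""
        if nid in self.discovered:
            self.sent.append((nid, msg))
        else:
            self.pending[nid].append(msg)

    def discovery_done(self, nid):
        """Mark the peer discovered and flush its queue in arrival order."""
        self.discovered.add(nid)
        queue = self.pending.pop(nid, deque())
        while queue:
            self.sent.append((nid, queue.popleft()))
```

In this model a stale OSS NID that never completes discovery only holds back messages addressed to that NID; traffic to the servers needed for mount proceeds as soon as their own discovery finishes.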



Comments
Comment by Gerrit Updater [ 06/May/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43562
Subject: LU-14668 lnet: peer state to lock primary nid
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d8cf8ecaf7f4337d74347a1f67e6ac2f3de6647

Comment by Gerrit Updater [ 06/May/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43563
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 61ef5157b6d63a7a63b814d53954902517240678

Comment by Gerrit Updater [ 06/May/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43564
Subject: LU-14668 lnet: override manually discovered peers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0499b18bcba40164ca83693d2abb20f82dbb7303

Comment by Gerrit Updater [ 06/May/21 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43565
Subject: LU-14668 lnet: don't delete peer created by Lustre
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6bb4a91360fbb1ca6c0cf952f13a9eaad8f85943

Comment by Gerrit Updater [ 25/May/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43788
Subject: LU-14668 lnet: Peers added via kernel API should be permanent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c942067894767777e76c4114d6e7d09ab354fd90

Comment by Chris Horn [ 20/Jul/21 ]

This solution should avoid long mount delays. However, it will not help in the case when the Primary NID used by Lustre is not reachable or LNet encounters network delays reaching that NID.

The patches associated with this ticket are built on top of LU-14661. With LU-14661, Lustre can provide complete peer information for multi-rail peers to LNet before any message is sent to the peer. This means that we do not need to rely on the primary NID being operable for initial discovery. Granted, it may be likely that the primary NID is tried first, and that will cause some delay, but with LNet health and resends we can quickly try alternative peer NIs if they are available.
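
The fallback to alternative peer NIs can be sketched roughly as follows. This is a toy Python model, not LNet's selection code: send_with_health, the health dictionary and the constants 1000 and 100 are assumptions chosen for illustration only.

```python
MAX_HEALTH = 1000   # illustrative starting health per peer NI (assumption)
SENSITIVITY = 100   # illustrative penalty per failed send (assumption)


def send_with_health(health, reachable, retries=3):
    """Pick the healthiest peer NI, penalise it on failure, and resend.

    health:    dict mapping NID -> health value; insertion order is the
               preference order, so the primary NID is tried first on ties.
    reachable: set of NIDs that currently accept messages.
    Returns the NID that accepted the message, or None if all tries failed.
    """
    for _ in range(retries):
        nid = max(health, key=health.get)  # ties go to the earliest entry
        if nid in reachable:
            return nid
        health[nid] = max(0, health[nid] - SENSITIVITY)
    return None
```

If the primary NID is down but a second NI is up, the first attempt fails, the primary's health drops, and the resend goes to the second NI.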

One question I have about this feature is how we deal with cases where servers get new IPs, or where some OSS is decommissioned and a new one is brought up with different IPs or re-using one or more old IPs, etc. Is the capability provided by this ticket robust enough to handle that, or are the administrative procedures for doing these things such that it is a non-issue for LNet?

Comment by Amir Shehata (Inactive) [ 21/Jul/21 ]

The intent is to have Lustre dictate the primary NID of the node. All other interfaces will be discovered in the standard way. If new NIDs are added to an existing node, the addition of the extra NIDs will trigger a discovery round to enable LNet to use the new NIDs. However, if the admin changes the primary NID of the node, i.e. the NID which Lustre was configured with, this will result in communication problems. That said, I believe this behaviour doesn't introduce any extra regression: currently, if the NIDs which Lustre was initially configured with are changed, tunefs needs to be re-run to update the configuration.

There is an existing patch, which needs to be updated, which brings in the functionality to handle new NIDs being added: https://review.whamcloud.com/#/c/39709/

This patch is also intended to handle the case where OSSes are decommissioned and the file system is brought down and then brought up again. The llog will have the NIDs of the decommissioned OSSes; currently we attempt to discover these, and we've seen that this can result in long mount times. The o2iblnd changes do not completely resolve this issue. With this feature the OSSes will be discovered in the background and will not cause the mount to wait for their discovery. Only on the first attempt to communicate with a node via real traffic will the traffic be queued until discovery is complete.

As for the other cases which you mentioned, i.e. IPs being re-used, this patch doesn't change the behaviour of LNet for these scenarios.

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43562/
Subject: LU-14668 lnet: peer state to lock primary nid
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 684943e2d0c2ad095e3521586d61d007b4f49abd

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43563/
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 024f9303bc6f32a3113357c864765c4f9c93ed03

Comment by Chris Horn [ 26/Oct/21 ]

It seems this has caused a serious regression on master where clients are unable to mount a filesystem under routed LNet configurations:

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43563/
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 024f9303bc6f32a3113357c864765c4f9c93ed03

Comment by Chris Horn [ 26/Oct/21 ]

ashehata green should this commit be reverted ^ ?

Comment by Chris Horn [ 26/Oct/21 ]

I think the aforementioned commit will break any routed configuration where the clients mount the filesystem using non-primary NIDs. For example:

MGS

10.16.100.52@o2ib
10.16.100.53@o2ib
10.16.100.52@o2ib10
10.16.100.53@o2ib10

Clients have routes to the o2ib10 network, so they mount using something like:

mount -t lustre 10.16.100.52@o2ib10,10.16.100.53@o2ib10:/lustre ...

LNetPrimaryNID() on the client returns 10.16.100.52@o2ib10 as the primary NID (because of https://review.whamcloud.com/43563/ ), so client sets up ptlrpc connection using this NID. But incoming messages from the MGS have the actual primary NID, 10.16.100.52@o2ib. So they do not match and the incoming messages get dropped. This prevents the client from being able to mount.
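
The mismatch can be reduced to a few lines. This is a deliberately simplified Python sketch of the explanation above, not the actual ptlrpc or LNet matching code: connect() and deliver() are hypothetical helpers, and the real lookup involves more than a single dictionary key. The NIDs are the ones from this example.

```python
MOUNT_NID = "10.16.100.52@o2ib10"     # NID the client mounts with
ACTUAL_PRIMARY = "10.16.100.52@o2ib"  # primary NID carried by MGS messages


def connect(mount_nid, actual_primary, lock_primary):
    """Return the client's connection table, keyed by the peer's primary NID.

    With lock_primary set, the NID passed in by Lustre is kept as the
    primary. Without it, the key is the primary NID learned via discovery.
    """
    key = mount_nid if lock_primary else actual_primary
    return {key: "mgc"}


def deliver(connections, msg_src_nid):
    """Return the matching connection, or None if the message is dropped."""
    return connections.get(msg_src_nid)
```

With lock_primary=True the table is keyed by the o2ib10 NID, so a message arriving from the o2ib NID finds no connection and is dropped. With lock_primary=False the keys agree and the message is delivered.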

walleye-p5:~ # !grep
grep lustre /etc/fstab
10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 /lus/kjcf05 lustre rw,flock,lazystatfs,noauto 0 0
walleye-p5:~ # mount /lus/kjcf05
mount.lustre: mount 10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 at /lus/kjcf05 failed: Input/output error
Is the MGS running?
walleye-p5:~ #

If I revert https://review.whamcloud.com/43563 then I'm able to mount:

walleye-p5:~ # mount /lus/kjcf05
walleye-p5:~ # lfs check servers
kjcf05-OST0000-osc-ffff8888361cd000 active.
kjcf05-OST0001-osc-ffff8888361cd000 active.
kjcf05-OST0002-osc-ffff8888361cd000 active.
kjcf05-OST0003-osc-ffff8888361cd000 active.
kjcf05-MDT0000-mdc-ffff8888361cd000 active.
kjcf05-MDT0001-mdc-ffff8888361cd000 active.
MGC10.16.100.52@o2ib10 active.
walleye-p5:~ #

Comment by Chris Horn [ 26/Oct/21 ]

I think the regression doesn't strictly apply to routed configurations only, but to any client mount where the client's initial connection attempt goes to a non-primary NID. This would be typical for routed clients. Not so much with direct connect, but it is possible there too (like with multi-homed servers).

Comment by Chris Horn [ 27/Oct/21 ]

I opened https://jira.whamcloud.com/browse/LU-15169 for the regression

Comment by Gerrit Updater [ 22/Feb/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50106
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4a73f60230216639001b6a217bd90324fc2af8a

Comment by Gerrit Updater [ 27/Feb/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50149
Subject: LU-14668 lnet: add 'force' option to lnetctl peer del
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a16359493570865d662bf9df494dc28cc8d0527e

Comment by Gerrit Updater [ 28/Feb/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50159
Subject: LU-14668 lnet: add 'lock_prim_nid" lnet module parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f7f554dd75d2721d87719adfa5874c76c6444b95

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50106/
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: aacb16191a72bc6db1155030849efb0d6971a572

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/43788/
Subject: LU-14668 lnet: Peers added via kernel API should be permanent
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 41733dadd8ad0e87e44dd19e25e576e90484cb9b

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/43565/
Subject: LU-14668 lnet: don't delete peer created by Lustre
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7cc5b4329fc2eecbf09dbda85efe58f4ad5a32b9

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50149/
Subject: LU-14668 lnet: add 'force' option to lnetctl peer del
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f1b2d8d60c593a670b36006bcf9b040549d8c13a

Comment by Gerrit Updater [ 09/Mar/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50249
Subject: LU-14668 tests: verify state of peer added with '--lock_prim'
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 070704850b80f6260db0e59b79c08aedcfb8b993

Comment by Gerrit Updater [ 28/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50159/
Subject: LU-14668 lnet: add 'lock_prim_nid" lnet module parameter
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fc7a0d6013b46ebc17cdfdccc04a5d1d92c6af24

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50249/
Subject: LU-14668 tests: verify state of peer added with '--lock_prim'
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9b6fcfa334b153e52caec16d4cfd180306826a3a

Comment by Gerrit Updater [ 25/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51130
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 316b8af5bdc27da6792656f6a6ff0b2e320aae79

Comment by Gerrit Updater [ 25/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51131
Subject: LU-14668 lnet: Peers added via kernel API should be permanent
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: fc7d1dc4c7db65d71934305b8bdbe9156c7c0ee9

Comment by Gerrit Updater [ 25/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51132
Subject: LU-14668 lnet: don't delete peer created by Lustre
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 7e30665cf972248713b9b9cfbc4f436d3f95247e

Comment by Gerrit Updater [ 25/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51133
Subject: LU-14668 lnet: add 'force' option to lnetctl peer del
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: d6bb43d5e8598af1828de31b42c2f8f1cb17b023

Comment by Gerrit Updater [ 25/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51134
Subject: LU-14668 lnet: add 'lock_prim_nid" lnet module parameter
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: c01c95917ff9dd46a382d9c3a81660820eb89080

Comment by Gerrit Updater [ 25/May/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51135
Subject: LU-14668 tests: verify state of peer added with '--lock_prim'
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 7e46cd092120ea8807fb78598df15de042e8bae5

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51130/
Subject: LU-14668 lnet: Lock primary NID logic
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: b341288179d9b3ad594b461586d826d6811db5a1

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51131/
Subject: LU-14668 lnet: Peers added via kernel API should be permanent
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: f63e87f0a88a856d5cc38039afef704676ff5521

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51132/
Subject: LU-14668 lnet: don't delete peer created by Lustre
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 26d11f254795a2869ae30a7e5d6ebf2bee59f879

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51133/
Subject: LU-14668 lnet: add 'force' option to lnetctl peer del
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 8c4df87ec21bf5d61dab4b6580fc7f7ecfa91e37

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51134/
Subject: LU-14668 lnet: add 'lock_prim_nid" lnet module parameter
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 6cfc8e55a2e77c9c91b81a8842e2cbd886025298

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51135/
Subject: LU-14668 tests: verify state of peer added with '--lock_prim'
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 7ee579d25a614946ba22a5a08fdc4373c41ef8f1

Comment by Peter Jones [ 13/Nov/23 ]

AFAICT this is merged for 2.15.4 and 2.16 (there is just one outstanding patch that should be abandoned)

Generated at Sat Feb 10 03:11:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.