Details
- Improvement
- Resolution: Fixed
- Minor
Description
When the file system is being mounted, the llog is traversed and a local peer representation is created at the ptlrpc layer. As part of this process the ptlrpc_connection_get() -> LNetPrimaryNID() path is executed. As a result, LNet runs the discovery protocol to update its local representation of the peer, which involves communicating with the NID provided by the ptlrpc_connection_get() call. Prior to the introduction of LNetPrimaryNID(), no communication with the remote peer was performed at this point. Consequently, when the llog contains references to old NIDs, or NIDs for bad interfaces, the connection to that NID can take up to the LND timeout (in the 50s range) to expire. This can extend the mount time considerably.
To avoid this issue we can change the concept of the Primary NID. The Primary NID is currently a global concept derived from the first interface configured on the node. However, there doesn't seem to be a need for it to be global. Each node can have a different view of the Primary NID of the peers it communicates with, as long as it keeps the Primary NID consistent throughout the life of the peer.
Since Lustre is the one which requests the initial connection to the peer, it already provides LNet with the NID which it prefers to use (likely the one configured). LNet can lock that NID as the primary NID of the node, even if it is not the first interface configured on the node.
This actually clarifies some confusion encountered on some sites, where the first interface configured on the system is not on the same network as the peer's interface.
For example, a tcp client can mount a server on the TCP network. However, the server has the o2ib interface configured first. On the TCP client, the peer shows the o2ib NID as the primary NID. This can be confusing when viewing the configuration.
If the primary NID of the peer is locked to the tcp NID, viewing the peer configuration from the tcp client will make more sense.
This way the primary NID becomes a node-local concept. It is the NID by which a Lustre node references a peer. Different Lustre nodes can reference the same peer by different NIDs.
Practically speaking, the FS is usually configured with the first NID which is reachable. From a TCP client it would be the first tcp interface configured, and likewise for other networks. However, the solution doesn't demand that.
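The node-local view described above can be illustrated with a small model. This is only a sketch under stated assumptions: `NodeView`, `connect()` and `primary_nid()` are hypothetical names invented for the illustration, not LNet or Lustre APIs, and the NID values are made up. It shows two clients referencing the same server by different primary NIDs, each keeping its own choice stable.

```python
# Illustrative model only: NodeView and its methods are hypothetical,
# not LNet or Lustre symbols. NID values are made up.

class NodeView:
    """One node's private view of the peers it communicates with."""

    def __init__(self, name):
        self.name = name
        # Maps every known NID of a peer to the primary NID this node uses.
        self.primary_by_nid = {}

    def connect(self, requested_nid, all_peer_nids):
        """Record the NID Lustre asked for as this node's primary NID."""
        # If the peer is already known through any of its NIDs, keep the
        # earlier choice so the primary NID stays stable for the peer's life.
        for nid in all_peer_nids:
            if nid in self.primary_by_nid:
                primary = self.primary_by_nid[nid]
                break
        else:
            primary = requested_nid
        for nid in all_peer_nids:
            self.primary_by_nid[nid] = primary
        return primary

    def primary_nid(self, nid):
        return self.primary_by_nid[nid]


# The server configured its o2ib interface first, then its tcp interface.
server_nids = ["10.0.0.1@o2ib", "192.168.0.1@tcp"]

tcp_client = NodeView("tcp-client")
ib_client = NodeView("ib-client")

# Each client is configured with the NID it can reach.
tcp_client.connect("192.168.0.1@tcp", server_nids)
ib_client.connect("10.0.0.1@o2ib", server_nids)

print(tcp_client.primary_nid("10.0.0.1@o2ib"))
print(ib_client.primary_nid("192.168.0.1@tcp"))
```

In this model the tcp client reports the tcp NID as the server's primary NID even when it looks the peer up by its o2ib NID, while the o2ib client sees the o2ib NID.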
The solution will be spread across the following patches:
- Introduce a LOCK_PRIMARY state to the peer. This is set when LNetPrimaryNID() is called on a new peer or a peer is explicitly added by Lustre.
- When a peer is in the LOCK_PRIMARY state, the primary NID provided by Lustre will not change, even as the peer is populated with the NIDs of other interfaces.
- Get Lustre to pre-define the Primary NID and the constituent NIDs, such that a call to LNetPrimaryNID() on a constituent NID returns a consistent result and is not dependent on the completion of the discovery protocol.
- If a peer was manually discovered and Lustre then explicitly adds it using a different primary NID, the Lustre configuration path will take precedence. The peer will be deleted and recreated with the primary NID Lustre uses.
- When Lustre deletes the UUID, the lock on the LNet peer should be removed.
- TBD: Should we be removing the lock from an LNet Peer when Lustre evicts a node or when Lustre is unmounted?
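The behaviour listed above can be modelled in a few lines. This is a minimal sketch, not the LNet implementation: `Peer`, `PeerTable`, `lustre_add()`, `discovered()` and `lustre_delete()` are hypothetical names, the NIDs are made up, and only the idea of a LOCK_PRIMARY flag is taken from the description. How an already locked peer should react to a second, different primary NID is not specified here, so the sketch simply leaves it unchanged.

```python
# Illustrative model only. All class and method names are hypothetical;
# only the LOCK_PRIMARY idea comes from the description above.

class Peer:
    def __init__(self, primary_nid):
        self.primary_nid = primary_nid
        self.nids = [primary_nid]
        self.lock_primary = False


class PeerTable:
    def __init__(self):
        self.peers = []

    def find(self, nid):
        """Return the peer that owns this NID, or None."""
        for peer in self.peers:
            if nid in peer.nids:
                return peer
        return None

    def primary_nid(self, nid):
        peer = self.find(nid)
        return peer.primary_nid if peer else None

    def lustre_add(self, primary_nid, constituent_nids=()):
        """Lustre defines the primary NID (and optionally constituent NIDs)."""
        peer = self.find(primary_nid)
        if peer and not peer.lock_primary and peer.primary_nid != primary_nid:
            # Manually discovered peer with another primary NID: the Lustre
            # configuration wins, so delete the peer and recreate it.
            self.peers.remove(peer)
            peer = None
        if peer is None:
            peer = Peer(primary_nid)
            self.peers.append(peer)
        peer.lock_primary = True
        for nid in constituent_nids:
            if nid not in peer.nids:
                peer.nids.append(nid)
        return peer

    def discovered(self, reported_nids):
        """Discovery result; reported_nids[0] is the remote node's first NID."""
        peer = None
        for nid in reported_nids:
            peer = self.find(nid)
            if peer:
                break
        if peer is None:
            peer = Peer(reported_nids[0])
            self.peers.append(peer)
        elif not peer.lock_primary:
            peer.primary_nid = reported_nids[0]
        # Other interfaces are always added; a locked primary NID never moves.
        for nid in reported_nids:
            if nid not in peer.nids:
                peer.nids.append(nid)
        return peer

    def lustre_delete(self, primary_nid):
        """Lustre deleted the UUID: drop the lock on the peer."""
        peer = self.find(primary_nid)
        if peer:
            peer.lock_primary = False


table = PeerTable()
table.lustre_add("192.168.0.1@tcp")
table.discovered(["10.0.0.1@o2ib", "192.168.0.1@tcp"])
print(table.primary_nid("10.0.0.1@o2ib"))
```

Because the peer was locked before discovery ran, the lookup by the o2ib NID still resolves to the tcp NID that Lustre supplied.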
This solution should avoid long mount delays. However, it will not help in the case when the Primary NID used by Lustre is not reachable or LNet encounters network delays reaching that NID.
On mount, Lustre needs to reach the MGS to retrieve the server NID information in the llog.
The connection path is obd_connect() -> lmv_connect() -> lmv_connect_mdc() -> client_connect_import() -> ptlrpc_connect_import().
It then does a sync OBD_STATFS to MDT0000 to test its aliveness (maybe to wait for the MDT0000 connection to complete), checks some connection features on the MDT to verify it is not too old, and gets the root directory FID from MDT0000 for the mount. After that, it follows a similar process to connect to the OSTs, but it doesn't wait for them to finish.
The purpose of this solution is to avoid delaying mount because of servers which might not be reachable at mount time. By pushing discovery into the background, discovery can complete in its own time. Any messages to a node under discovery will be sent only after discovery is complete. Therefore, the NIDs provided by the Lustre client for servers necessary for mount will, by definition, need to be reachable for the mount to complete. Other nodes which are not needed at mount time will not block the mount.
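The effect of background discovery on mount can be sketched as a queue per peer. This is a simplified model under assumptions: `Peer`, `send()`, `discovery_complete()` and `mount_can_complete()` are hypothetical names, the NIDs are made up, and real LNet message handling is considerably more involved. Messages to a peer under discovery are held until discovery finishes, and only the peers needed for mount gate its completion.

```python
from collections import deque

# Illustrative model only: names are hypothetical, not LNet symbols.

class Peer:
    def __init__(self, nid):
        self.nid = nid
        self.discovered = False
        self.pending = deque()  # messages held while discovery is running
        self.sent = []          # messages actually put on the wire

    def send(self, msg):
        """Queue the message if discovery has not completed yet."""
        if self.discovered:
            self.sent.append(msg)
        else:
            self.pending.append(msg)

    def discovery_complete(self):
        """Background discovery finished: flush held messages in order."""
        self.discovered = True
        while self.pending:
            self.sent.append(self.pending.popleft())


def mount_can_complete(peers, needed_for_mount):
    """Mount only waits for the peers it actually needs."""
    return all(p.discovered and not p.pending
               for p in peers if p.nid in needed_for_mount)


mgs = Peer("192.168.0.1@tcp")
mdt0 = Peer("192.168.0.2@tcp")
stale_ost = Peer("192.168.0.99@tcp")  # old NID left in the llog
peers = [mgs, mdt0, stale_ost]
needed = {mgs.nid, mdt0.nid}

for peer in peers:
    peer.send("connect")

# Discovery finishes for the reachable servers only.
mgs.discovery_complete()
mdt0.discovery_complete()

print(mount_can_complete(peers, needed))
```

In this model the stale OST NID keeps its connect request queued indefinitely, yet the mount check passes because that peer is not in the set required at mount time.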
Issue Links
- is related to
  - LU-18572 Regression in 2.15.4 backport of b341288179 LU-14668 lnet: Lock primary NID logic (Open)
  - LU-17544 with lock_prim_nid=1 it seems to be possible that an unreachable nid gets primary nid (Open)
  - LU-15169 Regression in "024f9303bc LU-14668 lnet: Lock primary NID logic" breaks client mounts (Resolved)
  - LU-15541 Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked() (Resolved)
  - LU-17664 Regression in 2.15.4 backport of LU-14668 lnet: add 'lock_prim_nid" lnet module parameter (Resolved)
  - LU-14566 Skip discovery in LNetPrimaryNID when lnet_peer_discovery_disabled is set (Resolved)
- is related to
  - LU-10360 use Imperative Recovery logs for client->MDT/OST connections (Open)