[LU-14668] LNet: do discovery in the background Created: 04/May/21 Updated: 07/Feb/24 Resolved: 13/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When the file system is being mounted, the llog is traversed and a local peer representation is created at the ptlrpc layer. As part of this process the ptlrpc_connection_get() -> LNetPrimaryNID() path gets executed, and LNet performs the discovery protocol to update its local representation of the peer. This involves communicating with the NID provided in the ptlrpc_connection_get() call. Prior to the introduction of LNetPrimaryNID(), no communication with the remote peer was performed at this point. As a result, when the llog contains references to old NIDs, or NIDs for bad interfaces, the connection attempt to such a NID can take up to the LND timeout (in the 50s range) to expire. This could extend the mount time considerably.

To avoid this issue we can change the concept of Primary NID. The Primary NID is currently a global concept derived from the first interface configured on the node. However, there doesn't seem to be a need for it to be global. Each node can have a different view of the primary NID of the peers it communicates with, as long as it keeps that Primary NID consistent throughout the life of the peer. Since Lustre is the one which requests the initial connection to the peer, it already provides LNet with the NID it prefers to use (likely the one configured). LNet can lock that NID in as the primary NID of the peer, even if it is not the first interface configured on the peer node.

This actually clarifies some confusion encountered at some sites where the first interface configured on the system is not on the same network as the peer's interface. For example, a tcp client can mount a server over the TCP network while the server has its o2ib interface configured first. On the TCP client the peer then shows the o2ib NID as the primary NID, which can be confusing when viewing the configuration. By locking the primary NID of the peer to the tcp NID, the peer configuration viewed from the tcp client will make more sense.

This way the Primary NID becomes a node-local concept: it is the NID by which a Lustre node references a peer, and different Lustre nodes can reference the same peer by different NIDs. Practically speaking, the FS is usually configured with the first NID which is reachable: from a TCP client that would be the first tcp interface configured, and similarly for other networks. However, the solution doesn't demand that. The solution is spread across the patches linked in the comments below.
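To make the node-local view in the tcp client example above concrete, here is a hedged sketch of how the locked primary NID would appear from the client using `lnetctl peer show` (the NIDs are hypothetical, and the exact YAML output varies by version):

```
# On the tcp-only client, after connecting to the multi-homed server:
lnetctl peer show --nid 192.168.1.10@tcp
# Expected shape of the output with the primary NID locked to the NID
# Lustre was configured with, rather than the server's first-configured
# o2ib interface:
#
# peer:
#     - primary nid: 192.168.1.10@tcp
#       Multi-Rail: True
#       peer ni:
#         - nid: 192.168.1.10@tcp
#         - nid: 10.10.0.10@o2ib
```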
This solution should avoid long mount delays. However, it will not help when the Primary NID used by Lustre is not reachable, or when LNet encounters network delays reaching that NID.

On mount, Lustre needs to reach the MGS to retrieve the server NID information in the llog. In obd_connect() it then does a synchronous OBD_STATFS to MDT0000 to test its aliveness (perhaps to wait for the MDT0000 connection to complete), checks some connection features on the MDT to verify it is not too old, and then gets the root directory FID from MDT0000 for the mount. After that it follows a similar process to connect to the OSTs, but it doesn't wait for them to finish.

The purpose of this solution is to avoid delaying the mount on servers which might not be reachable at mount time. By pushing discovery into the background, discovery can complete in its own time. Any messages to a node under discovery will be sent only after discovery is complete. Therefore, the NIDs provided by the Lustre client for the servers necessary for mount will by definition need to be reachable for the mount to complete. Other nodes which are not needed at mount time will not block the mount.
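For completeness, discovery can also be driven and inspected from user space with `lnetctl`; a minimal sketch (the NID is hypothetical):

```
# Explicitly trigger the discovery protocol for a peer; with this
# change the same protocol runs in the background on first use:
lnetctl discover 10.0.0.2@o2ib

# Inspect the peer, including its discovered NIDs and state:
lnetctl peer show --nid 10.0.0.2@o2ib -v

# Discovery can also be disabled globally, in which case peers are
# known only by their configured NIDs:
lnetctl set discovery 0
```
|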
| Comments |
| Comment by Gerrit Updater [ 06/May/21 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43562 |
| Comment by Gerrit Updater [ 06/May/21 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43563 |
| Comment by Gerrit Updater [ 06/May/21 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43564 |
| Comment by Gerrit Updater [ 06/May/21 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43565 |
| Comment by Gerrit Updater [ 25/May/21 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43788 |
| Comment by Chris Horn [ 20/Jul/21 ] |
The patches associated with this ticket are built on top of […]. One question I have about this feature is how we deal with cases where servers get new IPs? Or some OSS is decommissioned and a new one is brought up with different IPs, or re-using one or more old IPs, etc. Is the capability provided by this ticket robust enough to handle that, or are the administrative procedures for doing these things such that it is a non-issue for LNet? |
| Comment by Amir Shehata (Inactive) [ 21/Jul/21 ] |
|
The intent is to have Lustre dictate the primary NID of the node. All other interfaces will be discovered in the standard way. If new NIDs are added to an existing node, then the addition of the extra NIDs will trigger a discovery round to enable LNet to use them. However, if the admin changes the primary NID of the node, i.e. the NID which Lustre was configured with, this will result in communication problems. I believe this behaviour doesn't introduce any extra regression: currently, if the NIDs with which Lustre was initially configured are changed, then tunefs will need to be re-run to update the configuration.

There is an existing patch, which needs to be updated, which brings in the functionality to handle new NIDs being added: https://review.whamcloud.com/#/c/39709/

This patch is also intended to handle the case where OSSes are decommissioned and the file system is brought down and then up again. The llog will have the NIDs of the decommissioned OSSes; currently we attempt to discover these, and we've seen that this can result in long mount times. The o2iblnd changes do not completely resolve this issue. With this feature the OSSes will be discovered in the background and will not cause the mount to wait for their discovery. Only on the first attempt to communicate with a node via real traffic will the traffic be queued until discovery is complete.

For the other cases you mentioned, i.e. IPs being re-used, this patch doesn't change the behaviour of LNet.
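For reference, the existing procedure alluded to above, when server NIDs change and the configuration must be updated, is roughly the standard writeconf sequence (a sketch; the device path and NID are hypothetical):

```
# With all targets unmounted, regenerate the configuration llog on
# each target so it records the new server NIDs:
tunefs.lustre --writeconf /dev/sdb

# If the MGS NID itself changed, also point the target at it:
tunefs.lustre --mgsnode=10.0.0.1@o2ib --writeconf /dev/sdb
```
|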
| Comment by Gerrit Updater [ 18/Aug/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43562/ |
| Comment by Gerrit Updater [ 18/Aug/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43563/ |
| Comment by Chris Horn [ 26/Oct/21 ] |
|
It seems this has caused a serious regression on master where clients are unable to mount a filesystem under routed LNet configurations. Details follow in the next comment.
|
| Comment by Chris Horn [ 26/Oct/21 ] |
|
I think the aforementioned commit will break any routed configuration where the clients mount the filesystem using non-primary NIDs. For example, the MGS has NIDs:

```
10.16.100.52@o2ib 10.16.100.53@o2ib 10.16.100.52@o2ib10 10.16.100.53@o2ib10
```

Clients have routes to the o2ib10 network, so they mount using something like:

```
mount -t lustre 10.16.100.52@o2ib10,10.16.100.53@o2ib10:/lustre ...
```

LNetPrimaryNID() on the client returns 10.16.100.52@o2ib10 as the primary NID (because of https://review.whamcloud.com/43563/ ), so the client sets up the ptlrpc connection using this NID. But incoming messages from the MGS have the actual primary NID, 10.16.100.52@o2ib. They do not match, so the incoming messages get dropped. This prevents the client from being able to mount.

```
walleye-p5:~ # !grep
grep lustre /etc/fstab
10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 /lus/kjcf05 lustre rw,flock,lazystatfs,noauto 0 0
walleye-p5:~ # mount /lus/kjcf05
mount.lustre: mount 10.16.100.52@o2ib10,10.16.100.53@o2ib10:10.16.100.54@o2ib11,10.16.100.55@o2ib11:/kjcf05 at /lus/kjcf05 failed: Input/output error
Is the MGS running?
walleye-p5:~ #
```

If I revert https://review.whamcloud.com/43563 then I'm able to mount:

```
walleye-p5:~ # mount /lus/kjcf05
walleye-p5:~ # lfs check servers
kjcf05-OST0000-osc-ffff8888361cd000 active.
kjcf05-OST0001-osc-ffff8888361cd000 active.
kjcf05-OST0002-osc-ffff8888361cd000 active.
kjcf05-OST0003-osc-ffff8888361cd000 active.
kjcf05-MDT0000-mdc-ffff8888361cd000 active.
kjcf05-MDT0001-mdc-ffff8888361cd000 active.
MGC10.16.100.52@o2ib10 active.
walleye-p5:~ #
```
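One way to observe the mismatch from the client is `lnetctl ping`, whose reply carries the peer's NID list with its actual primary NID first (a sketch against the NIDs above, assuming the ping reaches the MGS through the routers; exact output varies by version):

```
# Ping the MGS NID used for the mount; the reply reports the MGS's
# real primary NID (o2ib), not the o2ib10 NID the client mounted with:
lnetctl ping 10.16.100.52@o2ib10
# ping:
#     - primary nid: 10.16.100.52@o2ib
#       Multi-Rail: True
#       peer ni:
#         - nid: 10.16.100.52@o2ib
#         - nid: 10.16.100.52@o2ib10
```
|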
| Comment by Chris Horn [ 26/Oct/21 ] |
|
I think the regression isn't strictly limited to routed configurations; it applies to any client mount where the client's initial connection attempt goes to a non-primary NID. This would be typical for routed clients. It is less common with direct connections, but possible there too (e.g. with multi-homed servers). |
| Comment by Chris Horn [ 27/Oct/21 ] |
|
I opened https://jira.whamcloud.com/browse/LU-15169 for the regression |
| Comment by Gerrit Updater [ 22/Feb/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50106 |
| Comment by Gerrit Updater [ 27/Feb/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50149 |
| Comment by Gerrit Updater [ 28/Feb/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50159 |
| Comment by Gerrit Updater [ 08/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50106/ |
| Comment by Gerrit Updater [ 08/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/43788/ |
| Comment by Gerrit Updater [ 08/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/43565/ |
| Comment by Gerrit Updater [ 08/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50149/ |
| Comment by Gerrit Updater [ 09/Mar/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50249 |
| Comment by Gerrit Updater [ 28/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50159/ |
| Comment by Gerrit Updater [ 11/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50249/ |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51130 |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51131 |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51132 |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51133 |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51134 |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51135 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51130/ |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51131/ |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51132/ |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51133/ |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51134/ |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51135/ |
| Comment by Peter Jones [ 13/Nov/23 ] |
|
AFAICT this is merged for 2.15.4 and 2.16 (there is just one outstanding patch that should be abandoned) |