Improve mount.lustre with many MGS NIDs (LU-16738)

[LU-17379] try MGS NIDs more quickly at initial mount Created: 19/Dec/23  Updated: 07/Feb/24

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.3
Fix Version/s: None

Type: Technical task Priority: Minor
Reporter: Andreas Dilger Assignee: Mikhail Pershin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17357 Client can use incorrect sec flavor w... Open
is related to LU-17476 lnet: only report mismatched nid in M... Open
is related to LU-17505 socklnd: return LNET_MSG_STATUS_NETWO... Open
is related to LU-16738 Improve mount.lustre with many MGS NIDs Open
Severity: 3

 Description   

The MGC should try all of the MGS NIDs provided on the command line fairly rapidly the first time after mount (without RPC retry) so that the client can detect the case of the MGS running on a backup node quickly. Otherwise, if the mount command-line has many NIDs (4 is very typical, but may be 8 or even 16 in some cases) then the mount can be stuck for several minutes trying to find the MGS running on the right node.

The MGC should preferably try one NID per node first, then the second NID on each node, and so on. This can be done efficiently, assuming the NIDs are provided properly, with colon-separated <MGSNODE>:<MGSNODE> blocks and comma-separated "<MGSNID>,<MGSNID>" entries within each block.

However, the NIDs are often not listed correctly with ":" separators, so if there are more than 4 MGS NIDs on the command-line, then it would be better to do a "bisection" of the NIDs to best handle the case of 2/4/8 interfaces per node vs. 2/4/8 separate nodes. For example, try NIDs in order 0, nids/2, nids/4, nids*3/4 for 4 NIDs, then nids/8, nids*5/8, nids*3/8, nids*7/8 for 8 NIDs, then nids/16, nids*9/16, nids*5/16, nids*13/16, nids*3/16, nids*11/16, nids*7/16, nids*15/16 for 16 NIDs, and similarly for 32 NIDs (the maximum).
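
For illustration, the try order described above is just a bit-reversal permutation of the NID indices. Below is a minimal userspace sketch (not Lustre code, and assuming the NID count is a power of two) that prints that order:

    /*
     * Bit-reversed "bisection" order: 0, n/2, n/4, 3n/4, n/8, 5n/8, ...
     * so the first few attempts land on NIDs far apart in the list,
     * i.e. most likely on different nodes.
     */
    #include <stdio.h>

    static unsigned int bisect_index(unsigned int i, unsigned int nr_nids)
    {
            unsigned int rev = 0;

            for (unsigned int bit = nr_nids >> 1; bit > 0; bit >>= 1, i >>= 1)
                    if (i & 1)
                            rev |= bit;
            return rev;
    }

    int main(void)
    {
            unsigned int nr_nids = 8;       /* e.g. 8 MGS NIDs on the mount line */

            for (unsigned int i = 0; i < nr_nids; i++)
                    printf("attempt %u -> NID index %u\n", i, bisect_index(i, nr_nids));
            return 0;
    }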

It should be fairly quick to determine if the MGS is not responding on a particular NID, because the client will get a rapid error response (e.g. -ENODEV or -ENOTCONN or -EHOSTUNREACH with a short RPC timeout) so in that case it should try all of the NIDs once quickly. If it gets ETIMEDOUT that might mean the node is unavailable or overloaded, or it might mean the MGS is not running yet, so the client should back off and retry the NIDs with a longer timeout after the initial burst.

However, in most cases the MGS should be running on some node and it just needs to avoid going into slow "backoff" mode until after it has tried all of the NIDs at least once.

It would make sense in this case to quiet the "connecting to MGS not running on this node" message for MGS connections so that it doesn't spam the console.



 Comments   
Comment by Mikhail Pershin [ 22/Jan/24 ]

While investigating this more closely I've found that it is not as trivial as it seems. First of all, there is a time limit for mgc_enqueue() set inside the function itself:

    /* Limit how long we will wait for the enqueue to complete */
    req->rq_delay_limit = short_limit ? 5 : MGC_ENQUEUE_LIMIT(exp->exp_obd);

and it is the main reason why only a limited number of NIDs are checked - only as many attempts as fit within that limit. Note that the limit is just enough time to find the MGS on a second node; basically that is all we can guarantee. I suppose at the time it was introduced we were not aware of configurations with 4 failover nodes for the MGS, let alone 16.

Interestingly, with the patch from LU-17357 we would always wait for the sptlrpc config with a timeout long enough to scan all MGS nodes, regardless of the send limit values above. That in turn means there is no need to set that limit to shorter values, as that only causes more frequent re-enqueue attempts.

So the problem of exiting without scanning all MGS nodes should be resolved by that sufficiently long wait for the sptlrpc config, and this ticket is now targeted more at reducing the total amount of that waiting time.

A first important note about NID scanning: each MGS node separated by ':' is added to the import's connection list and can be managed in import_select_connection() in any way we choose, either time-based as now, by the bisection Andreas proposed, or by any other policy. Meanwhile, the MGS NIDs within a single node, separated by ',', are out of reach because they are added to the connection at the LNet level and only LNet manages them; they are not even visible via lctl get_param mgc.*.import, unlike the connections. That makes it a non-trivial task to try just the first NID on each node, then the second, and so on. Basically it means import_select_connection() would need a way to tell LNet to try just a single NID out of many.

Another way to improve things: we can reduce the timeout for the first round of connection attempts based on how many entries are in the connection list. Currently it is always 5s for the first attempt. It could be reduced to a smaller value when there are many connections, linearly or by some other rule, though it probably makes no sense to use a value less than obd_get_at_min.
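
As a rough sketch of that idea (illustrative only; the 5s base, the linear division and the names are assumptions, not existing code):

    /* shrink the first-round connect timeout when the import has many
     * connections, but never below the adaptive-timeout floor (obd_get_at_min) */
    static int first_round_timeout(int nr_connections, int at_min)
    {
            int timeout = 5;                        /* current fixed first attempt */

            if (nr_connections > 1)
                    timeout /= nr_connections;      /* linear reduction, one option */
            return timeout > at_min ? timeout : at_min;
    }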

As for the current working scheme of import_select_connection(): it uses the per-connection last_attempt time to choose the least recently used one when there is no other preference. For any new connection the last_attempt value is 0, and such a connection is used preferentially. That means the first round is linear: it walks the connection list from head to end, picks a connection with last_attempt == 0, tries it, then the next one in the list is tried as it also has 0, and so on.

So the proposed bisection approach (or any other) can be applied over the connections that still have 0 (i.e. not yet tried), as long as any remain in the list.

When all have non-zero last_attempt values we again have options: use the least recent (which roughly preserves the initial connection order, not exactly due to the 1s time granularity, but still), or consider other approaches, e.g. try just imp_last_success_conn if it exists, or perhaps remember the primary nodes and always try them first on each new round, since HA would try to fail back to them eventually.

 

Comment by Andreas Dilger [ 23/Jan/24 ]

Mike, thanks for digging into the details here.

It looks like in import_select_connection() the client could try to connect to the different NIDs to see which one is alive, rather than waiting on each one separately? That could potentially be done "semi-parallel" (a toy sketch follows the list), like:

  • send connect to first NID with 30s or longer RPC timeout (in case server is busy)
  • wait 5s for reply, check if any NID has connected
  • if no connect yet, send to next NID
  • when connect is completed to some NID, set flag in export to indicate no more connections to be tried
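
A toy userspace sketch of that control flow (try_connect_async() and any_reply_arrived() are hypothetical stand-ins for the real ptlrpc machinery, i.e. firing an MGS_CONNECT with a long deadline and checking the reply callbacks; the "alive" table just simulates which NID answers):

    #include <stdio.h>
    #include <stdbool.h>

    #define NR_NIDS 4

    static const bool nid_alive[NR_NIDS] = { false, false, true, false };
    static bool probed[NR_NIDS];

    static void try_connect_async(int nid)
    {
            probed[nid] = true;     /* would fire the connect RPC and return */
            printf("probing NID #%d\n", nid);
    }

    static int any_reply_arrived(void)
    {
            for (int i = 0; i < NR_NIDS; i++)
                    if (probed[i] && nid_alive[i])
                            return i;
            return -1;
    }

    int main(void)
    {
            int winner = -1;

            for (int next = 0; next < NR_NIDS && winner < 0; next++) {
                    try_connect_async(next);
                    /* real code would wait ~5s here before checking */
                    winner = any_reply_arrived();
            }
            if (winner >= 0)
                    printf("connected via NID #%d, stop trying the rest\n", winner);
            return 0;
    }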

We would probably want to quiet the "initial connect" messages from the clients, maybe with MSG_CONNECT_INITIAL, so that they don't spam the server logs with "LustreError: 137-5: lfs00-MDT0005_UUID: not available for connect from 10.89.104.111@tcp (no target)" when all of the clients are trying to connect to the different servers.

Comment by Mikhail Pershin [ 02/Feb/24 ]

So far the most troublesome case is when some NIDs in the NID list are unavailable, e.g. a bad address or a missing network:

# time mount -t lustre 192.168.56.11@tcp:192.168.6.12@tcp:192.168.56.13@tcp:192.168.56.14@tcp:192.168.56.101@tcp:/lustre /mnt/lustre

real    0m59.289s
user    0m0.003s
sys    0m0.051s
 

Note the second NID is 192.168.6.12, for which the current node has no matching interface. If all addresses are on the x.x.56.x network then the mount took about 12s to reach the correct address, which is the last one in the list, so as expected it takes about 4s per node to try.

But the situation gets bad if any NID is on a missing network. The request is expired by the ptlrpc timeout in about 10s as expected, in accordance with its deadline:

00000100:00100000:2.0:1706879770.016570:0:6197:0:(client.c:2337:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1706879760/real 0]  req@ffff8800a5fb93c0 x1789792650076096/t0(0) o250->MGC192.168.56.11@tcp@192.168.6.12@tcp:26/25 lens 520/544 e 0 to 1 dl 1706879770 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0

but that has no further effect, as the request then waits for LNet to unlink it, which only happens after about 40s:

00000100:00000040:1.0:1706879811.687504:0:6197:0:(lustre_net.h:2443:ptlrpc_rqphase_move()) @@@ move request phase from UnregRPC to Rpc  req@ffff8800a5fb93c0 x1789792650076096/t0(0) o250->MGC192.168.56.11@tcp@192.168.6.12@tcp:26/25 lens 520/544 e 0 to 1 dl 1706879770 ref 1 fl UnregRPC:EeXNQU/200/ffffffff rc -110/-1 job:'kworker.0' uid:0 gid:0

And the reason for this behavior is LNet peer discovery, which started when the RPC was sent:

00000100:00000200:2.0:1706879760.019547:0:6197:0:(niobuf.c:86:ptl_send_buf()) Sending 520 bytes to portal 26, xid 1789792650076096, offset 0
00000400:00000200:2.0:1706879760.019570:0:6197:0:(lib-move.c:5284:LNetPut()) LNetPut -> 12345-192.168.6.12@tcp
00000400:00000200:2.0:1706879760.019605:0:6197:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 192.168.6.12@tcp: -114

It fails and exits, but discovery keeps going in the background, and this RPC can only proceed once discovery has timed out, regardless of all the RPC deadlines and timeouts:

00000400:00000200:1.0:1706879811.687440:0:6196:0:(peer.c:3061:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:1.0:1706879811.687442:0:6196:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 192.168.6.12@tcp: -114
00000400:00000010:1.0:1706879811.687443:0:6196:0:(api-ni.c:1821:lnet_ping_buffer_free()) kfreed 'pbuf': 281 at ffff88011151fba8.
00000400:00000200:1.0:1706879811.687447:0:6196:0:(peer.c:3717:lnet_peer_ping_failed()) peer 192.168.6.12@tcp:-110
00000400:00000200:1.0:1706879811.687449:0:6196:0:(peer.c:4092:lnet_peer_discovery()) peer 192.168.6.12@tcp(ffff8800a5ea06a8) state 0x102060 rc -110
00000400:00000200:1.0:1706879811.687450:0:6196:0:(peer.c:2401:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 192.168.6.12@tcp

So in general we need some way either to prevent peer discovery, at least for particular RPCs (lnetctl set discovery 0 doesn't help here), or to cap the discovery timeout so it is no longer than the RPC deadline. We could also consider forcing discovery to stop when the request has expired; there is no need to discover anything anymore, we are just waiting for nothing.

ssmirnov, can you assist with that and perhaps propose possible solutions?


Comment by Serguei Smirnov [ 02/Feb/24 ]

Yes, I can see the same even with just two NIDs in the mount command out of which the first NID is unreachable. 

The default lnet_transaction_timeout of 150 is enough in this case to make the mount fail. Reducing lnet_transaction_timeout to 30 allows the mount to succeed. We're going to be hearing about this from the field, I'm sure.

I haven't tried a mount with lots of ":"-separated NIDs, but it looks like the supplied NIDs are being discovered in the background in parallel. In my test, discovery for the second (reachable) NID completed almost immediately. However, lustre probably just didn't know it could actually talk to it.

Aside from timeout manipulation, it seems that Lustre could benefit from knowing that the peer is reachable before firing off a request to it. I'm not sure what would be the best way to accomplish this within the current architecture. Maybe registering some sort of (optional) callback, so that after calling LNetAddPeer Lustre gets notified that the peer was added successfully, i.e. discovery is done? If there's a Lustre thread waiting on these events and acting on them in the background, this could work. There may be unwanted side effects from this: for example, if the first listed server is slower than the second, the second one will be picked for the mount ahead of the first.


Comment by Mikhail Pershin [ 03/Feb/24 ]

ssmirnov, can we avoid such behavior somehow? The current problem is that Lustre sees the RPC fail in 10s but can't proceed further until LNet discovery times out. That is unexpected behavior; we shouldn't wait about 40s for an RPC with a 10s deadline. So my question is: is it possible somehow to notify LNet that we don't need to wait for that particular lnet_libmd? Right now, when a request times out, inside ptlrpc_unregister_reply() we call LNetMDUnlink() for the reply MD, and for the request MD if it is not unlinked yet (which means it was not yet sent, as in our case). Technically that unlink of the request MD should invoke request_out_callback() and finalize RPC processing, as I'd expect. But in practice that doesn't work, because the MD is still referenced:

00000400:00000200:0.0:1706967574.984808:0:30492:0:(lib-md.c:64:lnet_md_unlink()) Queueing unlink of md ffff8800934b2f78

and that unlink happens only when discovery is done:

00000400:00000200:0.0:1706967618.538695:0:30491:0:(peer.c:3061:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:0.0:1706967618.538696:0:30491:0:(peer.c:2385:lnet_peer_queue_for_discovery()) Queue peer 192.168.6.12@tcp: -114
00000400:00000200:0.0:1706967618.538703:0:30491:0:(peer.c:4092:lnet_peer_discovery()) peer 192.168.6.12@tcp(ffff8800a1fc9dd8) state 0x102060 rc -110
00000400:00000200:0.0:1706967618.538704:0:30491:0:(peer.c:2401:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 192.168.6.12@tcp
00000400:00000200:0.0:1706967618.538706:0:30491:0:(lib-msg.c:1020:lnet_is_health_check()) msg ffff8800ba545448 not committed for send or receive
00000400:00000200:0.0:1706967618.538706:0:30491:0:(lib-md.c:68:lnet_md_unlink()) Unlinking md ffff8800934b2f7

and only then is the RPC finalized. Is there any way not to delay the MD unlink, but to abort sending that MD immediately? I mean purely technically, at the LNet level.

 

As for the idea of not even trying peers which are not yet discovered: can we just check the peer status from ptlrpc in some way? We could probably add a new primitive similar to LNetAddPeer() or LNetDebugPeer(), say LNetDiscoverPeer(), which would return the current discovery status (up to date, discovering, disabled, etc.). In that case we could skip not-yet-discovered peers, try the already-discovered ones first, and avoid getting stuck on dead peers.
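
To make the shape of that primitive concrete, a hypothetical declaration-only sketch (none of these names exist in LNet today):

    /* hypothetical API sketch only - this primitive does not exist in LNet */
    struct lnet_nid;                        /* large-address NID type */

    enum lnet_peer_dc_state {
            LNET_PEER_DC_UPTODATE,          /* discovery finished, peer answered */
            LNET_PEER_DC_DISCOVERING,       /* discovery ping still in flight */
            LNET_PEER_DC_DISABLED,          /* discovery turned off, state unknown */
            LNET_PEER_DC_FAILED,            /* discovery timed out or errored */
    };

    /* look the peer up by any of its NIDs and report its discovery state */
    enum lnet_peer_dc_state LNetDiscoverPeer(const struct lnet_nid *nid);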

Comment by Andreas Dilger [ 03/Feb/24 ]

Mike, in some cases the LNet layer is unable to complete the MD and deregister until a timeout finishes, because the RDMA address may have been given out to a remote node. That said, if the host is unreachable or returns an error immediately then that shouldn't happen.

Is it possible for the MGC to send separate RPCs asynchronously, so that it doesn't care what happens at the LNet layer? That way, the client can have short timeouts, and try the different MGS NIDs quickly (e.g. a few seconds apart), then wait for the first one to successfully reply. We do the same with STATFS and GETATTR RPCs on the OST side.

One important point is to silence the console errors on the servers for this case, so the logs are not spammed with "refusing connect from XXXX for MGS" errors (though they should still be printed for other targets, since that has been very useful for debugging network issues recently).

Comment by Mikhail Pershin [ 03/Feb/24 ]

Andreas, I don't see how that can be done: all RPCs work through an import, and an import uses only one current connection. That is what import_select_connection() does: it chooses just one from many, and then we try to send the RPC over it. For your idea we would need to set up an import for each NID to send RPCs to them in parallel, i.e. instead of one MGC import we would need 16 if there are 16 NIDs, and we would have to organize them somehow to mark only one as real while the others are 'potential'. We could think about sending an RPC with one import but over a particular connection, e.g. pings could be sent over each connection listed in the import to update its status, and the import could use that status while choosing a new connection. But that would need to be implemented from scratch, and it would effectively be the same as LNet peer discovery but at the ptlrpc level. Right now I think peer discovery already does in the background almost the same thing as you describe, so I'd try to get the peer info from LNet and use it to choose at least the alive peers.

Comment by James A Simmons [ 03/Feb/24 ]

Mikhail, what you described is very similar to the LU-10360 work.

Comment by Andreas Dilger [ 03/Feb/24 ]

Mike, you are right, I wasn't thinking about this side of things. Doing this in parallel at the LNet level would be better. Can the MGC pass all of the peer NIDs to LNet directly to speed up discovery? That would require the MGS NIDs to be specified correctly ("," vs. ":" separators), or possibly have LNet not "trust" the NIDs given on the command-line as all being from the same host.

James, I don't see how IR can help with the initial MGS connection, since LU-10360 is all about getting the current target NID(s) from the MGS. That would be a chicken-and-egg problem.

Comment by Serguei Smirnov [ 03/Feb/24 ]

LNet does appear to be discovering the provided NIDs in parallel, at least in my two-NID ":"-separated test: with the first of the two NIDs unreachable, the second NID was discovered immediately. (This may be different with ","-separated NIDs.) I don't know what happens at the Lustre layer exactly, but it looks like it needs to wait for confirmation of which NID can be used before trying to establish a (Lustre-level) connection. That's why I was proposing a thread waiting on "discovery complete" events from LNet. To avoid using the provided peers in random order, the thread could wait (for a shorter time) before picking the first available peer ahead of the first listed one. That said, it is definitely possible to add a peer status checker to the LNet API if the Lustre layer prefers polling.

Comment by Andreas Dilger [ 03/Feb/24 ]

Ideally, the MGC code could just give the full list of NIDs to ptlrpc and/or LNet along with the RPC, and LNet would handle it in the right place. I'm totally fine with changing the MGC and/or ptlrpc to notify LNet about all of the MGS NIDs in advance of sending the RPC, so that the right layer can do this best. In all likelihood we should probably do the same thing for other connections as well, but those happen in the background and are less noticeable.

I think it would be better to have the MGC and other connections just wait for the RPC reply, rather than polling. That depends (AFAIK) on LNet knowing all of the possible NIDs to try for the connection, and I don't think that happens today. It is currently the MGC code that tries all of the NIDs in sequence, and that seems redundant. Also, getting NID handling out of Lustre and into LNet would be a good thing all around.

Comment by Andreas Dilger [ 03/Feb/24 ]

Mike, is this something you could work on?

Serguei, can you provide some input on what kind of LNet API you would like in order to pass multiple NIDs from the MGC down to speed up peer discovery? There is the risk that the mount command-line contains NIDs that are not correctly specified as belonging to the same host, so there would have to be some defense against that.

I was considering a console error to provide feedback to the admin. However, if we get MGS NIDs from round-robin DNS (per LU-16738) then we wouldn't have any way to distinguish which NIDs belong to the same or different hosts. If LNet can handle this ambiguity during discovery automatically, then printing a console message is pointless, as is the need to properly separate the NIDs on the mount command-line, and we could deprecate the use of ":" to separate NIDs and simplify IPv6 address parsing (though using DNS is probably still better than specifying IP addresses directly).

Comment by Mikhail Pershin [ 03/Feb/24 ]

I don't think we really need to pass NIDs to LNet; it has all of them added already and runs discovery in the background when they are added and each time a new RPC is sent, so we can rely on the current discovery status. Note that at mount time all nodes listed on the mount command line have just been added to LNet and are all undergoing discovery already.

Right now I think the easier approach is to keep the current scheme where import_select_connection() chooses one connection from many based on information about it. So far that is just the last_attempt time; we can add the peer discovery status (as I see it, that is just passing the status up from LNet, which is easy to implement), and it could also be useful to remember how often a connection has been attempted and actually used (that would give us statistics on how often HA uses that node). To me that looks like enough to select an alive, most-used-in-the-past node to connect to, and it is generic for the MGC as well as for other imports.

I'm not sure about the other details, but it looks like this can be done incrementally.
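
A userspace sketch of such an incremental policy (the struct fields and the function are assumptions for illustration, not the real import code): prefer peers that LNet already reports as discovered, then the node HA has actually used most, then fall back to least-recently-attempted with untried connections first.

    #include <stdbool.h>
    #include <time.h>

    struct conn_info {
            bool    discovered;             /* e.g. reported by LNet peer discovery */
            long    times_connected;        /* how often HA actually used this node */
            time_t  last_attempt;           /* 0 == never tried */
    };

    static int select_connection(const struct conn_info *c, int nr)
    {
            int best = 0;

            for (int i = 1; i < nr; i++) {
                    if (c[i].discovered != c[best].discovered) {
                            if (c[i].discovered)    /* alive peers first */
                                    best = i;
                    } else if (c[i].times_connected != c[best].times_connected) {
                            if (c[i].times_connected > c[best].times_connected)
                                    best = i;       /* then most-used node */
                    } else if (c[i].last_attempt < c[best].last_attempt) {
                            best = i;               /* then LRU, untried (0) first */
                    }
            }
            return best;
    }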

Comment by Serguei Smirnov [ 03/Feb/24 ]

tappro, I think I can add something like LNetGetPeerStatus, which would return the current status of the peer given any of its NIDs. Is that something you could use for starters?

Comment by Mikhail Pershin [ 05/Feb/24 ]

ssmirnov, yes, that would be helpful.

Comment by Gerrit Updater [ 05/Feb/24 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53926
Subject: LU-17379 lnet: add LNetPeerDiscovered to LNet API
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0140bec6cfe4bfd25fbf4088c510867daab3ebb7

Comment by Serguei Smirnov [ 05/Feb/24 ]

LNetPeerDiscovered may be useful in the case of ":"-separated NIDs, but not with ","-separated NIDs in the mount string.

As far as I can see, the issue with ","-separated NIDs is that LNetPrimaryNID is called just once - it initiates discovery using the first listed NID (primary?) as a target, but doesn't do anything with the knowledge of the non-primary NIDs until the discovery issued to the peer's primary NID fails.

I'm going to experiment with modifying LNetPrimaryNID so it may handle this case better, so there may be another patch addressing that.

Comment by Serguei Smirnov [ 06/Feb/24 ]

tappro 

Modifying LNetPrimaryNID is going to take a little longer, but this change:

https://review.whamcloud.com/#/c/fs/lustre-release/+/53930/

may also be useful in your testing if you use socklnd.

 

Comment by Gerrit Updater [ 06/Feb/24 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53933
Subject: LU-17379 lnet: parallelize peer discovery via LNetAddPeer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a63ead3388b4bcac48f8cf1c9489092af0e92a46

Comment by Gerrit Updater [ 06/Feb/24 ]

"Mikhail Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53937
Subject: LU-17379 ptlrpc: fix check for callback discard
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eef6974270086c09c67914257e10732a94db5c9b

Comment by Mikhail Pershin [ 06/Feb/24 ]

While testing mounts with unavailable NIDs I've found that the attempt to call the request-out callback while unlinking the reply doesn't work. The reason is that the check for rq_reply_unlinked is done too early: that flag is set in the reply callback from LNetMDUnlink(), which is called after the discard check. So I've made the patch above in the context of this ticket. It doesn't look like a major issue, and I don't even think it will have a noticeable effect, but at least it makes the original idea work.
