Details

    • Technical task
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.14.0, Lustre 2.15.3
    • None
    • 3
    • 9223372036854775807

    Description

      The MGC should try all of the MGS NIDs provided on the command line fairly rapidly the first time after mount (without RPC retry) so that the client can detect the case of the MGS running on a backup node quickly. Otherwise, if the mount command-line has many NIDs (4 is very typical, but may be 8 or even 16 in some cases) then the mount can be stuck for several minutes trying to find the MGS running on the right node.

      The MGC should preferably try one NID per node first, then the second NID on each node, and so on. This can be done efficiently, assuming the NIDs are provided properly with colon-separated <MGSNODE>:<MGSNODE> blocks and comma-separated "<MGSNID>,<MGSNID>" entries within each block.

      However, the NIDs are often not listed correctly with ":" separators, so if there are more than 4 MGS NIDs on the command line, it would be better to do a "bisection" of the NIDs to best handle the case of 2/4/8 interfaces per node vs. 2/4/8 separate nodes. For example, try the NIDs in order 0, nids/2, nids/4, nids*3/4 for 4 NIDs; then nids/8, nids*5/8, nids*3/8, nids*7/8 for 8 NIDs; then nids/16, nids*9/16, nids*5/16, nids*13/16, nids*3/16, nids*11/16, nids*7/16, nids*15/16 for 16 NIDs; and similarly for 32 NIDs (the maximum).
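
      Since the order 0, nids/2, nids/4, nids*3/4, nids/8, nids*5/8, ... is just the bit-reversal permutation of the NID indexes, it can be generated mechanically for any power-of-two NID count. Below is a minimal standalone sketch of that ordering; the function names are illustrative, not actual Lustre code:

      #include <stdio.h>

      /* Reverse the lowest 'bits' bits of 'idx'. */
      static unsigned int bit_reverse(unsigned int idx, unsigned int bits)
      {
              unsigned int rev = 0, i;

              for (i = 0; i < bits; i++)
                      if (idx & (1U << i))
                              rev |= 1U << (bits - 1 - i);
              return rev;
      }

      /* Print the order in which 'nid_count' (a power of two) NIDs would be probed. */
      static void mgc_nid_probe_order(unsigned int nid_count)
      {
              unsigned int bits = 0, i;

              while ((1U << bits) < nid_count)
                      bits++;
              for (i = 0; i < nid_count; i++)
                      printf("%u ", bit_reverse(i, bits));
              printf("\n");
      }

      int main(void)
      {
              mgc_nid_probe_order(4);  /* 0 2 1 3 */
              mgc_nid_probe_order(8);  /* 0 4 2 6 1 5 3 7 */
              mgc_nid_probe_order(16); /* 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15 */
              return 0;
      }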

      It should be fairly quick to determine if the MGS is not responding on a particular NID, because the client will get a rapid error response (e.g. -ENODEV or -ENOTCONN or -EHOSTUNREACH with a short RPC timeout) so in that case it should try all of the NIDs once quickly. If it gets ETIMEDOUT that might mean the node is unavailable or overloaded, or it might mean the MGS is not running yet, so the client should back off and retry the NIDs with a longer timeout after the initial burst.

      However, in most cases the MGS should be running on some node and it just needs to avoid going into slow "backoff" mode until after it has tried all of the NIDs at least once.
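
      As a rough illustration of that policy, here is a minimal sketch (hypothetical helper names, not actual MGC code) in which the first sweep uses a short timeout, connection errors just mean "wrong node, try the next NID", and the timeout only grows once a full sweep has failed:

      #include <errno.h>
      #include <stdio.h>

      #define NID_COUNT 4

      /* Illustrative stand-in for sending one MGS connect RPC with a given timeout. */
      static int mgc_probe_nid(int nid_idx, int timeout)
      {
              printf("probing NID %d (timeout %ds)\n", nid_idx, timeout);
              return nid_idx == 2 ? 0 : -ENOTCONN;  /* pretend only NID 2 hosts the MGS */
      }

      static int mgc_find_mgs(void)
      {
              int short_timeout = 5;   /* fast first sweep, no RPC retry */
              int long_timeout = 25;   /* later sweeps back off */
              int timeout = short_timeout;
              int pass, i, rc;

              for (pass = 0; pass < 3; pass++) {
                      for (i = 0; i < NID_COUNT; i++) {
                              rc = mgc_probe_nid(i, timeout);
                              if (rc == 0)
                                      return i;  /* found the MGS */
                              /* -ENODEV/-ENOTCONN/-EHOSTUNREACH: try the next NID */
                      }
                      /* Nothing answered: the MGS may not be up yet, so back off. */
                      timeout = long_timeout;
              }
              return -ETIMEDOUT;
      }

      int main(void)
      {
              printf("mgc_find_mgs() returned %d\n", mgc_find_mgs());
              return 0;
      }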

      It would make sense in this case to quiet the "connecting to MGS not running on this node" message for MGS connections so that it doesn't spam the console.

      Activity

            [LU-17379] try MGS NIDs more quickly at initial mount

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53933/
            Subject: LU-17379 lnet: parallelize peer discovery via LNetAddPeer
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ae6d373bc6af6b9bb74650e27fb4c1bb87bbf4bf

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54022/
            Subject: LU-17379 mgc: try MGS nodes faster
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 94d05d0737db256a64626bfe6fa9801819230d8a


            ssmirnov Serguei Smirnov added a comment -

            Ok, so if the plan is to always call LNetAddPeer() with just a single NID, then at the LNet level the following will happen:

            If "primary NID locking" is enabled:

            • Separate peer is created every time LNetAddPeer() is called, initially containing a single NID as a locked primary.
            • Discovery starts in the background, for each of the peers created above, in parallel
            • Discovery responses come back and peer records get merged if needed. When merging, the NID which got locked the earliest will be kept "primary"

            Let's consider possible outcomes:

            • If all peer NIDs are up, the peer records will get built and the first listed NID for each peer should get "locked" as primary.
            • Same result will be seen if some peer NIDs are "down", as long as there is at least one peer NID which is "up" for every peer.
            • If all NIDs are "down" for a peer, then peer records for each such NID will remain unmerged.
            • If one of the NIDs passed to LNetAddPeer() is actually "bogus", then the corresponding peer record will also remain unmerged.

            Current assumption is that with "primary NID locking" enabled, the first listed peer NID is used as the primary for the peer. This enables Lustre to build the UUID and open the connection early, queue transactions and move on, before discovery is complete (I'm pretty sure this was the purpose of the primary NID locking feature). It appears to me that calling LNetAddPeer() with a single NID at a time takes away this benefit, because Lustre can't know the primary NID until discovery completes. I'm not sure how different it is going to be delay-wise from not having "primary NID locking" enabled.

             

            On the other hand, if LNetAddPeer() is given a list of NIDs which Lustre expects to belong to the same peer:

            • A separate peer may be created for each NID in the list, each containing a single NID, but only the first of the provided NIDs is locked as primary in its respective peer. The other peers are marked to add that first NID as a locked primary, even if discovery later reveals no such NID
            • Discovery starts in the background, for each of the peers created above, in parallel
            • Discovery responses come back, peer records get merged if needed. "Marked" peer NID is kept as "locked primary"

            The outcomes are similar, except that:

            • If first NID passed to LNetAddPeer() is "bogus", but there's at least one NID which is "up" for the peer, a peer record is still created with the first NID locked as primary

            So if LNetAddPeer() is given a list of NIDs, then we can keep using the first listed peer NID as a "peer ID" and build UUID based on it right away, without having to wait for discovery to complete.
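
            To make the contrast concrete, here is a rough sketch of the two calling patterns, assuming the LNetAddPeer(nids, num_nids) form referenced in patch 53933; the nid_t placeholder type and the stub body are illustrative only, not the real LNet types or implementation:

            typedef unsigned long long nid_t;  /* placeholder, not the real LNet NID type */

            /* Stub standing in for the real LNet call, for illustration only. */
            static int LNetAddPeer(nid_t *nids, unsigned int num_nids)
            {
                    (void)nids;
                    (void)num_nids;
                    return 0;
            }

            /* One LNetAddPeer() call per NID: every NID starts as its own peer with
             * itself locked as primary, and the records only merge (and the real
             * primary NID becomes known) after discovery completes. */
            static void add_mgs_nids_individually(nid_t *nids, unsigned int count)
            {
                    unsigned int i;

                    for (i = 0; i < count; i++)
                            LNetAddPeer(&nids[i], 1);
            }

            /* One call with all NIDs of a node: the first listed NID is locked as
             * primary right away, so the UUID can be built and RPCs queued before
             * background discovery finishes. */
            static void add_mgs_nids_grouped(nid_t *nids, unsigned int count)
            {
                    LNetAddPeer(nids, count);
            }

            int main(void)
            {
                    nid_t mgs_nids[2] = { 1, 2 };  /* fake NID values */

                    add_mgs_nids_individually(mgs_nids, 2);
                    add_mgs_nids_grouped(mgs_nids, 2);
                    return 0;
            }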


            tappro Mikhail Pershin added a comment -

            On further thought, Lustre basically needs to maintain a list of peers to connect to. First it is built from the mount options or config, triggering background discovery as is done now. Then that list should just be updated when any peer has its primary NID changed or becomes up to date. There are two options: 1) create a new event to notify Lustre about that, or 2) let Lustre poll for it when needed (usually in import_select_connection()).

            In any case, Lustre should rebuild the list of peers at that moment, removing peers with the same primary NID as duplicates.
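
            A minimal sketch of that duplicate-removal step, assuming plain strings stand in for NIDs and none of these names are real Lustre symbols: whenever the primary NID of an entry changes or becomes known, the list is rebuilt and entries resolving to the same primary NID collapse into one.

            #include <stdio.h>
            #include <string.h>

            #define MAX_PEERS 8

            /* Keep only the first entry for each distinct primary NID. */
            static int dedup_by_primary_nid(const char **primary, int count, const char **out)
            {
                    int kept = 0, i, j;

                    for (i = 0; i < count; i++) {
                            for (j = 0; j < kept; j++)
                                    if (strcmp(primary[i], out[j]) == 0)
                                            break;  /* same peer, drop the duplicate */
                            if (j == kept)
                                    out[kept++] = primary[i];
                    }
                    return kept;
            }

            int main(void)
            {
                    /* Two entries that discovery resolved to the same primary NID. */
                    const char *primary[] = { "10.0.0.1@tcp", "10.0.0.1@tcp", "10.0.0.2@tcp" };
                    const char *uniq[MAX_PEERS];
                    int i, n = dedup_by_primary_nid(primary, 3, uniq);

                    for (i = 0; i < n; i++)
                            printf("%s\n", uniq[i]);
                    return 0;
            }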

             

            tappro Mikhail Pershin added a comment - edited

            ssmirnov, I actually understood it differently:

            1. Lustre is going to handle all provided NIDs as if they were colon-separated, so it considers each one a separate peer for the same UUID. Each is treated as a 'peer to attempt to connect to'. Lustre adds each one via LNetAddPeer() and calls LNetPrimaryNID(), which at that point always returns the same single NID that was added. Its discovery state may not yet be known (discovery runs in the background).
            2. If some NIDs belong to the same peer, Lustre is able to recognize that, so eventually its list of 'peers to attempt to connect to' gets rid of duplicates (the same peer's NIDs) and will consist of the 'primary NIDs' as LNet knows them.
            3. Bad or unreachable peers stay at the Lustre level with a 'not up-to-date' status until discovered otherwise, and Lustre retries them from time to time during reconnection attempts.

            In that sense, after step 1 Lustre will have each UUID corresponding to a single NID (a 1:1 relation), and after step 2 only the UUID related to the primary NID will remain. Peers might have more NIDs attached/handled at the LNet level (if it is really able to merge them).

            What is still unclear to me: can LNet really merge NIDs belonging to the same peer? How can it notify Lustre about that, or how can Lustre discover it? In general we need a mechanism to effectively update the Lustre connection primary NID when LNet changes it for any reason.

             


            ssmirnov Serguei Smirnov added a comment -

            With "primary NID locking", the primary NID provided by Lustre serves as the UUID for the peer. Without "primary NID locking", it is the "discovered" primary NID. So in this sense, what timday is describing is already implemented, although it may be confusing to use the primary NID as the UUID.

            tappro, do I understand correctly that at a high level Lustre is going to do the following (assuming that the first listed NID in this case is to be designated "primary" when "primary NID locking" is enabled, whether or not it is configured on the peer)?

            • Lustre iterates over groups of NIDs expected to belong to different peers (":"-separated)
            • Create a list of peer NIDs (comma-separated) and use the LNetAddPeer API to create the peer.
            • If "primary NID locking" is enabled, then at this point Lustre and LNet already know the peer UUID.
            • If "primary NID locking" is disabled, Lustre uses LNetPeerDiscovered, iterating over the peer NIDs until it finds that the peer is discovered.

            Does this look correct? Do we need any additional Lustre-side changes for this? 

            Thanks,

            Serguei.
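
            Below is a rough sketch of the flow outlined above, under the stated assumptions; the add_peer_group() stub and all names are illustrative, and the real code would hand each group to LNetAddPeer with proper LNet NID structures: split the mount string into ":"-separated node groups and ","-separated NIDs within each group, then register each group as one peer identified by its first NID.

            #include <stdio.h>
            #include <string.h>

            #define MAX_NIDS 32

            /* Stub standing in for registering one peer (e.g. via LNetAddPeer). */
            static void add_peer_group(char **nids, int count)
            {
                    printf("peer with %d NID(s), identified by %s\n", count, nids[0]);
            }

            int main(void)
            {
                    char spec[] = "10.0.0.1@tcp,192.168.0.1@o2ib:10.0.0.2@tcp,192.168.0.2@o2ib";
                    char *group, *saveg = NULL;

                    for (group = strtok_r(spec, ":", &saveg); group != NULL;
                         group = strtok_r(NULL, ":", &saveg)) {
                            char *nids[MAX_NIDS];
                            char *nid, *saven = NULL;
                            int count = 0;

                            for (nid = strtok_r(group, ",", &saven);
                                 nid != NULL && count < MAX_NIDS;
                                 nid = strtok_r(NULL, ",", &saven))
                                    nids[count++] = nid;

                            if (count > 0)
                                    add_peer_group(nids, count);
                    }
                    return 0;
            }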

            timday Tim Day added a comment -

            I've looked into the ptlrpc connection code a lot (https://review.whamcloud.com/c/fs/lustre-release/+/54225 and friends). I think most Lustre peer and connection handling could be ripped out and replaced by LNet (notably lustre/ptlrpc/connection.c and perhaps lustre/obdclass/lustre_peer.c). Any time Lustre needs to communicate, it could use opaque peer handles that it gets from LNet. Lustre could 'tag' peers with UUIDs indicating which services are served from that peer. From reading the thread, I get the impression that others might be leaning in that direction.

             

            In my experience, discovery gives LNet a much better sense of the state of the world than Lustre has. Lustre seems more liable to cache bad info since it tracks NIDs in so many places (and never updates them). Cutting out a lot of that handling could make Lustre/LNet peering a lot easier to grasp, IMHO.
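
            As a purely hypothetical sketch of that direction (none of these types exist in the tree), Lustre would keep only an opaque handle to the LNet peer and tag it with the services reachable through it, instead of tracking NIDs itself:

            struct lnet_peer;                       /* opaque, owned and updated by LNet */

            struct lustre_peer_tag {
                    struct lnet_peer       *lpt_peer;      /* handle obtained from LNet */
                    char                    lpt_uuid[40];  /* service identity, e.g. an obd UUID */
                    struct lustre_peer_tag *lpt_next;      /* other services on the same peer */
            };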


            ssmirnov Serguei Smirnov added a comment -

            Andreas,

            I was probably using the wrong terminology. On the Lustre side, ptlrpc_connection_get() uses LNetPrimaryNID() to set the "peer nid" for the connection. Some other Lustre functions, for example gss_svc_upcall_handle_init(), use LNetPrimaryNID() to find out which primary NID corresponds to a given NID and later match it to internal resources using that. So it appears that Lustre is using the primary NID to define a connection.
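
            A simplified, self-contained illustration of that pattern, where strings stand in for NIDs and resolve_primary_nid() is a stand-in for LNetPrimaryNID() rather than the real API: any NID is first mapped to its primary NID, and that primary NID is the key used to look up the cached connection.

            #include <stdio.h>
            #include <string.h>

            struct conn_entry {
                    const char *primary_nid;  /* lookup key */
                    const char *uuid;         /* target this connection serves */
            };

            /* Stand-in for LNetPrimaryNID(): map any peer NID to its primary NID. */
            static const char *resolve_primary_nid(const char *nid)
            {
                    if (strcmp(nid, "192.168.0.1@o2ib") == 0)
                            return "10.0.0.1@tcp";
                    return nid;
            }

            static const struct conn_entry *conn_lookup(const struct conn_entry *tbl, int n,
                                                        const char *nid)
            {
                    const char *primary = resolve_primary_nid(nid);
                    int i;

                    for (i = 0; i < n; i++)
                            if (strcmp(tbl[i].primary_nid, primary) == 0)
                                    return &tbl[i];
                    return NULL;
            }

            int main(void)
            {
                    const struct conn_entry tbl[] = { { "10.0.0.1@tcp", "MGS" } };
                    const struct conn_entry *c = conn_lookup(tbl, 1, "192.168.0.1@o2ib");

                    printf("%s\n", c ? c->uuid : "not found");
                    return 0;
            }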


            adilger Andreas Dilger added a comment -

            if there's no "primary NID locking" is because we don't know what the primary NID is until discovery is complete. So Lustre can't initiate transactions - we need to know how to identify the peer first.

            Is that "transaction" something at the LNet level? I'm not familiar with that from the Lustre RPC level. Also, I think the "primary NID" wouldn't matter at all if we have your peer match bits patch? Otherwise we shouldn't care which NID is used, as long as it gets to the right node in the end.


            ssmirnov Serguei Smirnov added a comment -

            tappro, without "primary NID locking" lp_primary_peer is undefined until discovery is complete.

            But in that case how it is supposed to work without 'locked' option at all when lp_primary_peer can be also updated depending on network changes?

            Once discovery is complete, primary NID can only change if the peer is reconfigured.

            If there is a way to sort that out from Lustre then we can, do you have an idea how to do that?

            If LNetAddPeer operates as in the current 53933, and probably with some other tweaks, Lustre should be able to iterate over the list of NIDs and call LNetPeerDiscovered/LNetPrimaryNID on every NID. The output can be resolved to a list of "primary NIDs", and each primary NID would correspond to a distinct peer.


            People

              tappro Mikhail Pershin
              adilger Andreas Dilger
              Votes: 0
              Watchers: 13
