  1. Lustre
  2. LU-16738 Improve mount.lustre with many MGS NIDs
  3. LU-17379

try MGS NIDs more quickly at initial mount


Details

    • Technical task
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.14.0, Lustre 2.15.3

    Description

      The MGC should try all of the MGS NIDs provided on the command line fairly rapidly the first time after mount (without RPC retry) so that the client can quickly detect the case of the MGS running on a backup node. Otherwise, if the mount command line has many NIDs (4 is very typical, but there may be 8 or even 16 in some cases), the mount can be stuck for several minutes trying to find the node where the MGS is actually running.

      The MGC should preferably try one NID per node first, then the second NID on each node, etc. This can be done efficiently, assuming the NIDs are provided properly with colon-separated "<MGSNODE>:<MGSNODE>" blocks, and comma-separated "<MGSNID>,<MGSNID>" entries within each block.
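
      The per-node ordering described above can be sketched as follows. This is only an illustration in Python, not the MGC code: the function name interleave_nids and the example NIDs are hypothetical, and the sketch assumes the NID list has already been separated from the rest of the mount device string and that no NID itself contains a colon.

```python
from itertools import zip_longest


def interleave_nids(mgs_spec: str) -> list[str]:
    """Return NIDs ordered as: first NID of every node, then second NID, etc.

    Hypothetical helper. Assumes ':' separates nodes and ',' separates the
    NIDs of a single node, as described in the ticket.
    """
    # One list of NIDs per node, dropping empty entries from stray separators.
    nodes = [
        [nid for nid in block.split(",") if nid]
        for block in mgs_spec.split(":")
        if block
    ]
    order = []
    # zip_longest pads shorter nodes with None so uneven NID counts still work.
    for nid_round in zip_longest(*nodes):
        order.extend(nid for nid in nid_round if nid is not None)
    return order


if __name__ == "__main__":
    print(interleave_nids("10.0.0.1@tcp,10.1.0.1@o2ib:10.0.0.2@tcp,10.1.0.2@o2ib"))
```

      With two nodes that each have two NIDs, both nodes are probed once before any node is probed a second time.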

      However, the NIDs are often not listed correctly with ":" separators, so if there are more than 4 MGS NIDs on the command line, it would be better to do a "bisection" of the NIDs to best handle the case of 2/4/8 interfaces per node vs. 2/4/8 separate nodes. For example, try NIDs in order 0, nids/2, nids/4, nids*3/4 for 4 NIDs, then nids/8, nids*5/8, nids*3/8, nids*7/8 for 8 NIDs, then nids/16, nids*9/16, nids*5/16, nids*13/16, nids*3/16, nids*11/16, nids*7/16, nids*15/16 for 16 NIDs, and similarly for 32 NIDs (the maximum).
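
      The bisection sequence listed above matches a bit-reversal ordering of the NID indices, which can be sketched as below. This is an illustrative Python sketch, not the implemented patch: the names reverse_bits and bisection_order are hypothetical, and the handling of NID counts that are not a power of two (generate the order for the next power of two and skip out-of-range indices) is an assumption that the ticket does not specify.

```python
def reverse_bits(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bisection_order(nid_count: int) -> list[int]:
    """Return NID indices in bisection (bit-reversal) order.

    Hypothetical helper. For 4 NIDs this gives 0, 2, 1, 3, that is
    0, nids/2, nids/4, nids*3/4 as in the ticket.
    """
    if nid_count <= 0:
        return []
    # Number of bits needed to index nid_count entries.
    width = (nid_count - 1).bit_length()
    order = []
    for i in range(1 << width):
        index = reverse_bits(i, width)
        # Assumption: skip indices past the end for non-power-of-two counts.
        if index < nid_count:
            order.append(index)
    return order


if __name__ == "__main__":
    for count in (4, 8, 16):
        print(count, bisection_order(count))
```

      The order for 8 NIDs begins with the order for 4 NIDs scaled by two, so widely separated NIDs are always probed first, whether the list describes several interfaces per node or several separate nodes.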

      It should be fairly quick to determine if the MGS is not responding on a particular NID, because the client will get a rapid error response (e.g. -ENODEV or -ENOTCONN or -EHOSTUNREACH with a short RPC timeout), so in that case it should try all of the NIDs once quickly. If it gets -ETIMEDOUT, that might mean the node is unavailable or overloaded, or it might mean the MGS is not running yet, so the client should back off and retry the NIDs with a longer timeout after the initial burst.

      However, in most cases the MGS should be running on some node, and the client just needs to avoid going into slow "backoff" mode until after it has tried all of the NIDs at least once.
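
      The two phases described above (one quick pass over every NID, then backoff with longer timeouts) might look roughly like the sketch below. This is a Python illustration under stated assumptions, not the MGC implementation: find_mgs and the connect callback are hypothetical, connect is assumed to return 0 on success or a negative errno value, and the timeout values and the doubling policy are made up for the example.

```python
import errno


def find_mgs(nids, connect, short_timeout=5, max_timeout=40):
    """Return the first NID that accepts a connection, or None.

    Hypothetical sketch. `connect(nid, timeout)` is assumed to return 0 on
    success or a negative errno such as -errno.ENODEV or -errno.ETIMEDOUT.
    The timeouts (in seconds) are illustrative, not Lustre defaults.
    """
    timeout = short_timeout
    while True:
        # One full pass over the NIDs with no per-NID retry: a fast error
        # (-ENODEV, -ENOTCONN, -EHOSTUNREACH) or a timeout both just move
        # on to the next NID.
        for nid in nids:
            if connect(nid, timeout) == 0:
                return nid
        # Only after every NID has been tried once does the client back off.
        if timeout >= max_timeout:
            return None
        timeout = min(timeout * 2, max_timeout)


if __name__ == "__main__":
    def fake_connect(nid, timeout):
        # Pretend the MGS currently runs on the third node only.
        return 0 if nid == "mgs3@tcp" else -errno.ENODEV

    print(find_mgs(["mgs1@tcp", "mgs2@tcp", "mgs3@tcp"], fake_connect))
```

      The point of the structure is that the timeout only grows between passes, never between two NIDs of the same pass, so a single unresponsive node cannot delay the first attempt on the remaining NIDs.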

      It would make sense in this case to quiet the "connecting to MGS not running on this node" message for MGS connections so that it doesn't spam the console.

            People

              tappro Mikhail Pershin
              adilger Andreas Dilger