
Skip discovery in LNetPrimaryNID when lnet_peer_discovery_disabled is set

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0

    Description

      If discovery is disabled locally then the discovery thread will not
      modify any peer objects as a result of the discovery process. Thus,
      the primary NID of any peer we're asked to discover will not change
      as a result of discovery. Therefore, we do not need to actually
      perform discovery in LNetPrimaryNID() if discovery is disabled
      locally. Since this routine can result in long client mount times
      when a Lustre server is down, we should avoid this unnecessary
      discovery.
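
The reasoning above can be sketched as a small model. This is only an illustration under stated assumptions, not the code from the patch: `Peer`, `discover`, and `primary_nid` are hypothetical names, the NID is made up, and the real LNetPrimaryNID() is kernel C with locking and error handling that this sketch omits.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical stand-in for an LNet peer object."""
    primary_nid: str
    up_to_date: bool = False
    discovery_rounds: int = 0  # counts how often discovery was run


def discover(peer: Peer) -> None:
    """Hypothetical discovery round. In the real system this may block
    for a long time when the peer is down."""
    peer.discovery_rounds += 1
    peer.up_to_date = True


def primary_nid(peer: Peer, discovery_disabled: bool) -> str:
    """Model of the lookup: only run discovery when it is enabled locally.

    With discovery disabled, the discovery thread would not modify the
    peer, so the primary NID cannot change and the wait is skipped.
    """
    if not discovery_disabled:
        while not peer.up_to_date:
            discover(peer)
    return peer.primary_nid


if __name__ == "__main__":
    peer = Peer(primary_nid="10.0.0.1@o2ib")  # made-up NID
    print(primary_nid(peer, discovery_disabled=True), peer.discovery_rounds)
```

In this model the returned NID is the same on both paths; the only difference is whether the caller pays for a discovery round first.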

          Activity

            [LU-14566] Skip discovery in LNetPrimaryNID when lnet_peer_discovery_disabled is set

            eaujames Etienne Aujames added a comment -

            We encountered an issue with two LNet routes missing on the server side (OSS): the clients could communicate with the servers, but the servers could not answer.

            The clients periodically tried to connect to the servers, which kept the missing peers in the discovery list (the_lnet.ln_dc_working). As a consequence, ll_ostXX_XXX threads waited indefinitely for peer discovery, and the problem progressively spread to all the available threads (the clients keep sending connection requests).

            The server became unavailable to all clients.

            "LNet discovery" and "LNet health" are disabled on both the clients and the servers.
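
The thread exhaustion described in this comment can be illustrated with a toy model. It is a sketch under assumptions, not Lustre code: `ServicePool` and `handle_connect` are hypothetical names, the pool size is arbitrary, and a real server queues requests rather than returning a status string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServicePool:
    """Hypothetical fixed-size pool of service threads."""
    size: int
    stuck: List[str] = field(default_factory=list)  # clients holding a blocked thread

    def free_threads(self) -> int:
        return self.size - len(self.stuck)

    def handle_connect(self, client: str, peer_discoverable: bool) -> str:
        """Assign a connection request to a thread, if one is free.

        A request from a peer whose discovery never completes keeps its
        thread blocked forever in this model.
        """
        if self.free_threads() == 0:
            return "starved"      # no thread left, even for healthy clients
        if not peer_discoverable:
            self.stuck.append(client)
            return "stuck"        # thread waits on discovery indefinitely
        return "served"           # thread finishes and returns to the pool


if __name__ == "__main__":
    pool = ServicePool(size=3)
    for attempt in range(3):
        print(pool.handle_connect(f"unroutable-{attempt}", peer_discoverable=False))
    print(pool.handle_connect("healthy", peer_discoverable=True))
```

Each retry from an unroutable client consumes one more thread, so a few such clients are enough to starve every other client once the pool is full.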

            gerrit Gerrit Updater added a comment -

            Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43537
            Subject: LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 58cab53ba8c387722e638c097c0496af007f091f

            pjones Peter Jones added a comment -

            Landed for 2.15

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43141/
            Subject: LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 16264da9e3c43a6368a25b6ded4113e8cfa57427

            hornc Chris Horn added a comment -

            That change will only impact local peers. Remote clients will still have to wait for the full lnet transaction timeout.

            ashehata Amir Shehata (Inactive) added a comment -

            Do you see this issue even with LU-13972 ("o2iblnd: Don't retry indefinitely")?

            gerrit Gerrit Updater added a comment -

            Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43141
            Subject: LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: eebda0deedd9a053f432b9638c70db2ce1cdf8ca

            People

              Assignee: hornc Chris Horn
              Reporter: hornc Chris Horn
              Votes: 0
              Watchers: 6
