
Large performance degradation in routing environment.

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.13.0
    • None
    • None
    • any LST server/client and two routers
    • 3

    Description

      A large performance degradation was found during MD testing in a routed environment. While simplifying the test case, I found that asymmetric routes are the root cause. Reproduction is quite simple:
      you need an LST server and client in different logical networks, with the server routing its traffic via 'router1' while the client routes its traffic via 'router2'.
      Example results:
      the server has a route via 73@o2ib; these are the LST results when the client uses a different router:
      [LNet Rates of 172.18.1.4@o2ib1]
      [R] Avg: 8615 RPC/s Min: 8615 RPC/s Max: 8615 RPC/s
      [W] Avg: 8615 RPC/s Min: 8615 RPC/s Max: 8615 RPC/s
      but once the routing is changed so that both sides use the same router, the results are much better:
      [root@c-lmo069 ~]# lnetctl route del --net o2ib1 --gateway 172.18.2.76@o2ib10
      [root@c-lmo069 ~]# lnetctl route add --net o2ib1 --gateway 172.18.2.73@o2ib10

      [root@c-lmo069 ~]# lctl dk > log; bash /root/lnet.sh write 4k 1; lctl dk > log-r4
      Performing write
      SESSION: read/write FEATURES: 1 TIMEOUT: 300 FORCE: No
      172.18.1.4@o2ib1 are added to session
      172.18.2.69@o2ib10 are added to session
      Test was added successfully
      bulk_rw is running now
      [LNet Rates of 172.18.1.4@o2ib1]
      [R] Avg: 11349 RPC/s Min: 11349 RPC/s Max: 11349 RPC/s
      [W] Avg: 11349 RPC/s Min: 11349 RPC/s Max: 11349 RPC/s

      lnet.sh is a simple script that sends 4k RPCs with one send in flight at a time (concurrent sends == 1).

      This issue can also be reproduced with socklnd.

      The socklnd problem: the reply is scheduled on a different thread, so it cannot be sent as quickly as possible.

      The o2ib problem: it looks like additional network messages are needed to distribute credits between nodes, because the o2ib protocol sends additional credits when sending a request that expects a reply.

      The first part of the issue looks easy. It is caused by an incomplete implementation around the lnet_send() call, which never passes the right source NID to make that path preferred.

          /* NB: we probably want to use NID of msg::msg_from as 3rd
           * parameter (router NID) if it's routed message */
          rc = lnet_send(msg->msg_ev.target.nid, msg, LNET_NID_ANY);
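The idea in the NB comment above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the landed patch: `struct rx_msg`, `msg_is_routed()`, `reply_rtr_hint()`, `nid_t` and `NID_ANY` are hypothetical stand-ins for the kernel's message structure, `lnet_nid_t` and `LNET_NID_ANY`.

```c
#include <stdint.h>

typedef uint64_t nid_t;          /* hypothetical stand-in for lnet_nid_t */
#define NID_ANY ((nid_t)-1)      /* hypothetical stand-in for LNET_NID_ANY */

/* Hypothetical, simplified view of a received message. */
struct rx_msg {
	nid_t src_nid;   /* NID of the original sender (initiator) */
	nid_t from_nid;  /* NID of the peer that handed us the message */
};

/* A message was routed if the last hop is not the original sender. */
static int msg_is_routed(const struct rx_msg *m)
{
	return m->from_nid != m->src_nid;
}

/*
 * Router hint for the reply: prefer the router the request arrived
 * through, otherwise leave the choice open.
 */
static nid_t reply_rtr_hint(const struct rx_msg *m)
{
	return msg_is_routed(m) ? m->from_nid : NID_ANY;
}
```

With such a helper, the value returned by `reply_rtr_hint()` would take the place of the hard-coded `LNET_NID_ANY` third argument, so that a reply to a routed request goes back through the same router.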
      

      The second part of the problem is more complex.
      There are two problems in this area:
      1) The server can set just a single router, and we need to route all traffic through it to avoid the performance penalty.
      This can be done by adding an incoming-message counter, similar to how lpni_seq counts outgoing events.

      2) The server's router is outside the router list. Do we need to add this router so that it can be chosen in lnet_find_route_locked()?
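One possible shape for the incoming-message counter from item 1, as a minimal user-space sketch. All names here (`struct rtr`, `rx_seq`, `global_rx_seq`, `rtr_note_rx()`, `rtr_pick_recent()`) are hypothetical and only model the idea; they are not LNet structures or the landed fix.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-router state. rx_seq mirrors the idea of lpni_seq,
 * but is stamped on incoming rather than outgoing events.
 */
struct rtr {
	uint64_t nid;
	uint64_t rx_seq;  /* global counter value at the last receive */
};

static uint64_t global_rx_seq;

/* Record that a message just arrived via this router. */
static void rtr_note_rx(struct rtr *r)
{
	r->rx_seq = ++global_rx_seq;
}

/*
 * Pick the router that most recently delivered a message to us.
 * Returns NULL for an empty list; ties keep the earliest entry.
 */
static struct rtr *rtr_pick_recent(struct rtr *list, size_t n)
{
	struct rtr *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (best == NULL || list[i].rx_seq > best->rx_seq)
			best = &list[i];
	return best;
}
```

A selection like this would steer replies toward whichever router the remote side is actually using, which is the symmetric path that showed the better LST numbers above.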

        Activity

          [LU-11413] Large performance degradation in routing environment.
          pjones Peter Jones added a comment -

          Landed for 2.13


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34032/
          Subject: LU-11413 lnet: use right address for routing message
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: ad263e5d6e93e3951f3066ddec653205d6d08eae

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34031/
          Subject: LU-11413 lnet: use right rtr address
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3f45206081301508ce55b51c1c57027247bb0c1d

          shadow Alexey Lyashkov added a comment -

          The second patch fixes a regression introduced by the Multi-Rail landing.

          gerrit Gerrit Updater added a comment -

          Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34032
          Subject: LU-11413 lnet: use right address for routing message
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 99ce2bfbcf058314810b22fef3ea648fa748334f

          gerrit Gerrit Updater added a comment -

          Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34031
          Subject: LU-11413 lnet: use right rtr address
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 7cf61dc30972d5a1aba89d402f0d50b2b68a2bb9

          shadow Alexey Lyashkov added a comment -

          In fact, one more problem was found: LNet incorrectly hashes messages based on the router NID rather than the message initiator NID.
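The effect of hashing on the wrong NID can be illustrated with a small user-space model. This is a sketch under assumptions: `struct rx_hdr`, `nid_bucket()`, `bucket_by_router()` and `bucket_by_initiator()` are hypothetical names, and the multiplicative hash is a stand-in, not the hash LNet uses.

```c
#include <stdint.h>

/* Hypothetical header fields of a received, possibly routed message. */
struct rx_hdr {
	uint64_t src_nid;   /* message initiator */
	uint64_t from_nid;  /* last hop; the router for routed traffic */
};

/* Stand-in multiplicative hash into nbuckets (nbuckets must be > 0). */
static unsigned nid_bucket(uint64_t nid, unsigned nbuckets)
{
	return (unsigned)((nid * 0x9E3779B97F4A7C15ULL) >> 32) % nbuckets;
}

/* Behaviour described as incorrect: the key is the router NID. */
static unsigned bucket_by_router(const struct rx_hdr *h, unsigned n)
{
	return nid_bucket(h->from_nid, n);
}

/* Behaviour described as intended: the key is the initiator NID. */
static unsigned bucket_by_initiator(const struct rx_hdr *h, unsigned n)
{
	return nid_bucket(h->src_nid, n);
}
```

In this model, every message arriving through one router lands in the same bucket under `bucket_by_router()`, regardless of which node sent it, while `bucket_by_initiator()` keys on the sending node and so lets traffic from different initiators spread across buckets.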

          People

            shadow Alexey Lyashkov
            shadow Alexey Lyashkov
            Votes: 0
            Watchers: 9
