[LU-11413] Large performance degradation in routing environment. Created: 21/Sep/18  Updated: 03/Mar/19  Resolved: 03/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None
Environment:

any LST server/client and two routers


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A large performance degradation was found with MD testing in a routed environment. While simplifying the test case, I found that asymmetric routes are the root cause. Reproduction is quite simple:
you need an LST server and client in different logical networks, with the server configured to route traffic via 'router1' while the client routes traffic via 'router2'.
Example results:
The server has a route via 73@o2ib. LST results when the client uses a different router:
[LNet Rates of 172.18.1.4@o2ib1]
[R] Avg: 8615 RPC/s Min: 8615 RPC/s Max: 8615 RPC/s
[W] Avg: 8615 RPC/s Min: 8615 RPC/s Max: 8615 RPC/s
But once the client's routing is changed to use the same router, the results are much better:
[root@c-lmo069 ~]# lnetctl route del --net o2ib1 --gateway 172.18.2.76@o2ib10
[root@c-lmo069 ~]# lnetctl route add --net o2ib1 --gateway 172.18.2.73@o2ib10

[root@c-lmo069 ~]# lctl dk > log; bash /root/lnet.sh write 4k 1; lctl dk > log-r4
Performing write
SESSION: read/write FEATURES: 1 TIMEOUT: 300 FORCE: No
172.18.1.4@o2ib1 are added to session
172.18.2.69@o2ib10 are added to session
Test was added successfully
bulk_rw is running now
[LNet Rates of 172.18.1.4@o2ib1]
[R] Avg: 11349 RPC/s Min: 11349 RPC/s Max: 11349 RPC/s
[W] Avg: 11349 RPC/s Min: 11349 RPC/s Max: 11349 RPC/s

lnet.sh is a simple script that sends 4k RPCs with a single send in flight (concurrent sends == 1).

This issue can also be reproduced with socklnd.

The socklnd problem: the reply is scheduled on a different thread, so it cannot be sent as quickly as possible.

The o2ib problem: it looks like additional network messages are needed to distribute credits between nodes, because the o2ib protocol sends additional credits when sending a request that expects a reply.

The first part of these issues looks easy. It is caused by an incomplete implementation around the lnet_send() call, which never uses the right source NID to make it the preferred one.

                /* NB: we probably want to use NID of msg::msg_from as 3rd
                 * parameter (router NID) if it's routed message */
                rc = lnet_send(msg->msg_ev.target.nid, msg, LNET_NID_ANY);
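The idea in that NB comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch that landed: the `toy_msg` structure, the `toy_*` helpers, and the `NID_ANY` constant are hypothetical stand-ins for the real LNet message, its msg_from field, and LNET_NID_ANY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for LNet types; not the real kernel definitions. */
typedef uint64_t nid_t;
#define NID_ANY ((nid_t)~0ULL)  /* plays the role of LNET_NID_ANY */

struct toy_msg {
        nid_t initiator;        /* node that originated the request */
        nid_t from;             /* last hop that delivered it to us */
};

/* Assumption: a message was routed if the last hop is not the initiator. */
static bool toy_msg_is_routed(const struct toy_msg *msg)
{
        return msg->from != msg->initiator;
}

/*
 * Pick the value to pass as the "router NID" argument when replying:
 * prefer the router the request arrived through, otherwise no preference.
 */
static nid_t toy_reply_rtr_nid(const struct toy_msg *msg)
{
        assert(msg != NULL);
        return toy_msg_is_routed(msg) ? msg->from : NID_ANY;
}
```

With a helper like this, the reply to a routed request would go back through the same gateway instead of whichever route the local table happens to select, which is the asymmetry the measurements above point at.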

The second part of the problem is more complex.
There are two problems in this area.
1) The server may have just a single router configured, and we need to route all traffic through it to avoid the performance penalty.
This can be done by adding an incoming message counter, similar to how lpni_seq counts outgoing events.

2) The server's router is outside of the router list. Do we need to add this router so that it can be chosen in lnet_find_route_locked()?
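
One possible reading of the incoming-counter proposal in item 1 is sketched below. It is a toy model under assumptions, not LNet code: `toy_router`, its `rx_seq` field, and both helpers are hypothetical names, and the real selection logic in lnet_find_route_locked() weighs other factors that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-router state; rx_seq is an assumed counter name. */
struct toy_router {
        uint64_t nid;
        uint64_t rx_seq;        /* messages received through this router */
};

/* Record that a message from the remote network arrived via this router. */
static void toy_router_note_rx(struct toy_router *rtr)
{
        assert(rtr != NULL);
        rtr->rx_seq++;
}

/*
 * Prefer the router that has delivered the most incoming traffic, on the
 * assumption that it is the one the remote side routes through.
 * Ties keep the earliest entry; an empty list yields NULL.
 */
static struct toy_router *toy_pick_router(struct toy_router *rtrs, size_t n)
{
        struct toy_router *best = NULL;

        for (size_t i = 0; i < n; i++) {
                if (best == NULL || rtrs[i].rx_seq > best->rx_seq)
                        best = &rtrs[i];
        }
        return best;
}
```

In this model a node that only ever receives traffic through one gateway would converge on sending through that same gateway.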



 Comments   
Comment by Alexey Lyashkov [ 15/Jan/19 ]

In fact, one more problem was found: LNet incorrectly hashes messages based on the router NID rather than the message initiator NID.
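
A toy illustration of why the hash key matters, assuming messages are spread across partitions by hashing a NID. The modulo hash, the `toy_msg` structure, and the helper names are all hypothetical and do not reproduce LNet's actual hashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message carrying both NIDs of interest. */
struct toy_msg {
        uint64_t initiator;     /* node that originated the message */
        uint64_t router;        /* gateway that forwarded it to us */
};

/* Toy hash: map a NID to one of ncpt partitions. */
static unsigned toy_cpt_of_nid(uint64_t nid, unsigned ncpt)
{
        assert(ncpt > 0);
        return (unsigned)(nid % ncpt);
}

/* Count how many distinct partitions a batch of messages lands on. */
static unsigned toy_partitions_used(const struct toy_msg *msgs, size_t n,
                                    unsigned ncpt, bool by_initiator)
{
        uint32_t used = 0;      /* bitmask of partitions hit */
        unsigned count = 0;

        assert(ncpt > 0 && ncpt <= 32);
        for (size_t i = 0; i < n; i++) {
                uint64_t key = by_initiator ? msgs[i].initiator
                                            : msgs[i].router;
                used |= UINT32_C(1) << toy_cpt_of_nid(key, ncpt);
        }
        for (unsigned b = 0; b < ncpt; b++)
                count += (used >> b) & 1u;
        return count;
}
```

In this model, four initiators behind a single router all land on one partition when keyed by the router NID, but spread over four partitions when keyed by the initiator NID.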

Comment by Gerrit Updater [ 15/Jan/19 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34031
Subject: LU-11413 lnet: use right rtr address
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7cf61dc30972d5a1aba89d402f0d50b2b68a2bb9

Comment by Gerrit Updater [ 15/Jan/19 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34032
Subject: LU-11413 lnet: use right address for routing message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 99ce2bfbcf058314810b22fef3ea648fa748334f

Comment by Alexey Lyashkov [ 21/Jan/19 ]

The second patch fixes a regression introduced by the Multi-Rail landing.

Comment by Gerrit Updater [ 03/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34031/
Subject: LU-11413 lnet: use right rtr address
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3f45206081301508ce55b51c1c57027247bb0c1d

Comment by Gerrit Updater [ 03/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34032/
Subject: LU-11413 lnet: use right address for routing message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ad263e5d6e93e3951f3066ddec653205d6d08eae

Comment by Peter Jones [ 03/Mar/19 ]

Landed for 2.13

Generated at Sat Feb 10 02:43:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.