[LU-11413] Large performance degradation in routing environment. Created: 21/Sep/18 Updated: 03/Mar/19 Resolved: 03/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
any LST server/client and two routers |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Large performance degradation was found with MD testing in a routed environment. While simplifying the test case, I found that async routes are the root cause. Reproduction is quite simple:
[root@c-lmo069 ~]# lctl dk > log; bash /root/lnet.sh write 4k 1; lctl dk > log-r4
lnet.sh is a simple script that sends a 4k RPC with one send in flight (concurrent sends == 1). This issue can also be reproduced with socklnd.
socklnd problem: the reply is scheduled to a different thread and can be sent as quickly as possible.
o2ib problem: it appears to need additional network messages to distribute credits between nodes, because the o2ib protocol sends additional credits when sending a request that expects a reply.
The first part of these issues looks easy. It is caused by an incomplete implementation around the lnet_send() call, which never uses the right source NID to make it the preferred one:
/* NB: we probably want to use NID of msg::msg_from as 3rd
* parameter (router NID) if it's routed message */
rc = lnet_send(msg->msg_ev.target.nid, msg, LNET_NID_ANY);
The second part of the problem is more complex: the server router is outside of the router list. Do we need to add this router so that it can be chosen in lnet_find_route_locked()? |
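The NB comment in the snippet above suggests passing the NID of msg_from as the third argument when the message was routed. A minimal sketch of that selection follows. It assumes toy types and names (toy_msg, toy_reply_router_nid, TOY_NID_ANY) that are hypothetical stand-ins for illustration, not the real LNet structures or the merged fix:

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real LNet types. */
typedef uint64_t toy_nid_t;
#define TOY_NID_ANY ((toy_nid_t)-1)

struct toy_msg {
	toy_nid_t initiator; /* NID of the node that originated the request */
	toy_nid_t from;      /* NID of the peer that delivered it to us;
			      * equals the router NID when the message was routed */
};

/* Assumption: a message is routed if it was delivered by someone other
 * than its initiator. */
static inline bool toy_msg_is_routed(const struct toy_msg *msg)
{
	assert(msg != NULL);
	return msg->from != msg->initiator;
}

/* Pick the router NID to prefer for the reply: the router the request
 * arrived through if it was routed, otherwise no preference. */
static inline toy_nid_t toy_reply_router_nid(const struct toy_msg *msg)
{
	return toy_msg_is_routed(msg) ? msg->from : TOY_NID_ANY;
}
```

With this model, a request that arrived through a router would send its reply back through the same router, while a direct request keeps the "any NID" behaviour shown in the original call.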
| Comments |
| Comment by Alexey Lyashkov [ 15/Jan/19 ] |
|
In fact, one more problem was found: LNet incorrectly hashes messages based on the router NID rather than the message initiator NID. |
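To illustrate why the choice of hash key matters, here is a small sketch under stated assumptions. The modulo hash, the struct, and all function names are hypothetical and are not the real LNet hashing code. It only shows that keying on the router NID sends every message arriving through one router to the same partition, while keying on the initiator NID spreads them out:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real LNet types. */
typedef uint64_t toy_nid_t;

struct toy_routed_msg {
	toy_nid_t initiator_nid; /* node that originated the message */
	toy_nid_t router_nid;    /* router that forwarded it to us */
};

/* Assumed toy hash: map a NID to one of nparts partitions by modulo. */
static inline unsigned toy_partition_of_nid(toy_nid_t nid, unsigned nparts)
{
	assert(nparts > 0);
	return (unsigned)(nid % nparts);
}

/* Behaviour described as incorrect: key on the router NID. */
static inline unsigned
toy_partition_by_router(const struct toy_routed_msg *msg, unsigned nparts)
{
	return toy_partition_of_nid(msg->router_nid, nparts);
}

/* Behaviour described as intended: key on the initiator NID. */
static inline unsigned
toy_partition_by_initiator(const struct toy_routed_msg *msg, unsigned nparts)
{
	return toy_partition_of_nid(msg->initiator_nid, nparts);
}
```

In this toy model, two clients behind the same router always collide on one partition when the router NID is the key, which would serialize their traffic regardless of how many partitions exist.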
| Comment by Gerrit Updater [ 15/Jan/19 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34031 |
| Comment by Gerrit Updater [ 15/Jan/19 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34032 |
| Comment by Alexey Lyashkov [ 21/Jan/19 ] |
|
The second patch fixes a regression introduced by the Multi-Rail landing. |
| Comment by Gerrit Updater [ 03/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34031/ |
| Comment by Gerrit Updater [ 03/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34032/ |
| Comment by Peter Jones [ 03/Mar/19 ] |
|
Landed for 2.13 |