[LU-12222] LNet should select loopback NI when possible Created: 25/Apr/19 Updated: 09/Dec/20 Resolved: 20/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HPv3 | ||
| Issue Links: |
|
||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
I happened to notice that when mounting an MGT and MDT co-located on the same node that we end up sending RPCs to ourselves: 00000100:00000040:11.0:1556165595.786175:0:30688:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=0 req@ffff881f275f8000 x1631757890748560/t0(0) o502->MGC10.12.0.50@o2ib40@10.12.0.50@o2ib40:26/25 lens 272/8472 e 0 to 0 dl 1556165690 ref 3 fl Rpc:/0/ffffffff rc 0/-1 00000100:00000001:11.0:1556165595.786177:0:30688:0:(niobuf.c:54:ptl_send_buf()) Process entered 00000100:00000040:11.0:1556165595.786177:0:30688:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.12.0.50@o2ib40 00000400:00000010:11.0:1556165595.786178:0:30688:0:(lib-lnet.h:221:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff881f279d26e8. 00000100:00000200:11.0:1556165595.786178:0:30688:0:(niobuf.c:85:ptl_send_buf()) Sending 272 bytes to portal 26, xid 1631757890748560, offset 0 00000400:00000010:11.0:1556165595.786179:0:30688:0:(lib-lnet.h:472:lnet_msg_alloc()) alloc '(msg)': 440 at ffff881f279c3200 (tot 331800384). 00000400:00000200:11.0:1556165595.786181:0:30688:0:(lib-move.c:4637:LNetPut()) LNetPut -> 12345-10.12.0.50@o2ib40 00000400:00000200:11.0:1556165595.786183:0:30688:0:(lib-move.c:2545:lnet_handle_send_case_locked()) Source ANY to MR: 10.12.0.50@o2ib40 local destination 00000400:00000200:11.0:1556165595.786184:0:30688:0:(lib-move.c:1619:lnet_get_best_ni()) compare ni 0@lo [c:0, d:10, s:0] with best_ni not seleced [c:-2147483648, d:-1, s:0] 00000400:00000200:11.0:1556165595.786185:0:30688:0:(lib-move.c:1662:lnet_get_best_ni()) selected best_ni 0@lo 00000400:00000200:11.0:1556165595.786186:0:30688:0:(lib-move.c:1619:lnet_get_best_ni()) compare ni 10.12.0.50@o2ib40 [c:2045, d:10, s:26] with best_ni 0@lo [c:0, d:10, s:0] 00000400:00000200:11.0:1556165595.786188:0:30688:0:(lib-move.c:1662:lnet_get_best_ni()) selected best_ni 10.12.0.50@o2ib40 00000400:00000200:11.0:1556165595.786188:0:30688:0:(lib-move.c:1367:lnet_select_peer_ni()) 10.12.0.50@o2ib40 ni_is_pref = 0 00000400:00000200:11.0:1556165595.786189:0:30688:0:(lib-move.c:1428:lnet_select_peer_ni()) sd_best_lpni = 10.12.0.50@o2ib40 00000400:00000200:11.0:1556165595.786192:0:30688:0:(lib-move.c:1837:lnet_handle_send()) TRACE: 10.12.0.50@o2ib40(10.12.0.50@o2ib40:<?>) -> 10.12.0.50@o2ib40(10.12.0.50@o2ib40:10.12.0.50@o2ib40) : PUT try# 0 In the above excerpt we can see that LNet has enough information to determine that we're sending this to ourselves (the nid of best_ni == the nid of best_lpni). Ideally we'd just use the loopback NI whenever possible. |
| Comments |
| Comment by Andreas Dilger [ 25/Apr/19 ] |
|
This seems pretty important. It could cause problems with local communication on the MDS or OSS under load. |
| Comment by Patrick Farrell (Inactive) [ 25/Apr/19 ] |
|
Chris, Are you planning a patch for this? |
| Comment by Chris Horn [ 25/Apr/19 ] |
|
I was, but Amir tells me that he relies on this behavior for recovery. If he can provide some more details on what he means by that then I can work out a patch, otherwise I'm fine to let Amir work it out himself. |
| Comment by Patrick Farrell (Inactive) [ 25/Apr/19 ] |
|
OK, thanks! Was just curious. |
| Comment by Gerrit Updater [ 12/Aug/19 ] |
|
Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35778 |
| Comment by Gerrit Updater [ 20/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35778/ |
| Comment by Peter Jones [ 20/Dec/19 ] |
|
Landed for 2.14 |
| Comment by Alexey Lyashkov [ 23/Dec/19 ] |
|
In fact this is invalid fix, with performance impact. 00000400:00000200:5.0:1577100140.229954:0:10627:0:(peer.c:1215:LNetPrimaryNID()) NID 0@lo primary NID 198.18.0.33@tcp rc 0 00000100:00000010:5.0:1577100140.229963:0:10627:0:(connection.c:55:ptlrpc_connection_get()) kmalloced '(conn)': 88 at ffff942f2df04518. 00000100:00000001:5.0:1577100140.229965:0:10627:0:(connection.c:80:ptlrpc_connection_get()) Process leaving 00000100:00000040:5.0:1577100140.229966:0:10627:0:(connection.c:84:ptlrpc_connection_get()) conn=ffff942f2df04518 refcount 2 to 198.18.0.33@tcp 00000100:00000040:5.0:1577100140.229967:0:10627:0:(client.c:107:ptlrpc_uuid_to_connection()) MGC198.18.0.33@tcp_0 -> ffff942f2df04518 ptlrpc_uuid_to_connection - don't able to understand this is local NID and provide a replacement as needs. |
| Comment by Alexey Lyashkov [ 23/Dec/19 ] |
|
In fact this MR bug. |
| Comment by Chris Horn [ 15/Jan/20 ] |
|
pjones green Since my fix for this issue has a bad interaction with the fix for |
| Comment by Oleg Drokin [ 16/Jan/20 ] |
|
since |
| Comment by Chris Horn [ 16/Jan/20 ] |
|
Okay, that sounds like a plan. Thanks. |
| Comment by Gerrit Updater [ 16/Jan/20 ] |
|
Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/37259 |
| Comment by Andreas Dilger [ 16/Jan/20 ] |
shadow - thanks for the clear analysis. We definitely don't want to be sending messages destined for ourselves out over the network. I'm not sure what the semantics for LNetPrimaryNID() are. Should the 0@lo NID be handled specifically in that function, or would it be better to skip that function in ptlrpc_connection_get() and keep 0@lo as the target NID? |
| Comment by Alexey Lyashkov [ 17/Jan/20 ] |
|
Andreas, From my view, exceptions should be on lower level as possible - to avoid similar bugs in future, in case ptlrpc code will changed and someone forget why this exceptions was made. |
| Comment by Andreas Dilger [ 17/Jan/20 ] |
|
Shadow, I was thinking that LNetPrimaryNID() might be called to map 0@lo to the primary external NID for other reasons. However, it looks like there is only one caller of this function, in ptlrpc_connection_get(), so changing the semantic to just return 0@lo immediately is pretty safe. I don't mind to add a fast path in both LNetPrimaryNID() and ptlrpc_connection_get() that special-case this NID to avoid overhead (e.g. skipping lnet_net_lock_current() and maybe avoiding the need to look up the peer connection?). It probably also makes sense to #define LNET_NID_LO_0 (or similar) like we do for LNET_NID_ANY so that we don't have to compose this every time. That could also be used in other places, like infra_ping_nid() rather than turning it into a string and comparing that much more efficiently in that loop: - if (!strcmp(libcfs_nid2str(ping.ping_buf[i].nid), - "0@lo")) + if (ping.ping_buf[i].nid == LNET_NID_LO_0)) |
| Comment by Andreas Dilger [ 22/Jan/20 ] |
|
Any comments from LNet developers on a global LNET_NID_LO_0 constant to indicate "myself"? |
| Comment by Gerrit Updater [ 01/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37259/ |
| Comment by Chris Horn [ 21/Apr/20 ] |
adilger Sounds like a good idea to me. |
| Comment by Gerrit Updater [ 22/Apr/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38312 |
| Comment by Gerrit Updater [ 22/Apr/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38313 |
| Comment by Gerrit Updater [ 27/Apr/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38388 |
| Comment by Chris Horn [ 29/Apr/20 ] |
|
Without the fix, performing llmount.sh with -1 debug: sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep PTLDEBUG cfg/local.sh
PTLDEBUG=${PTLDEBUG:--1}
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # ./llmount.sh
...
Setting lustre.sys.jobid_var from disable to procname_uid
Waiting 90 secs for update
Updated after 4s: wanted 'procname_uid' got 'procname_uid'
disable quota as required
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # lctl dk > /tmp/dk.log
Of interest are the messages with "TRACE" string. There are two types: lnet_parse() for incoming and lnet_handle_send() for outgoing: sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -m 1 lnet_parse 00000400:00000200:19.0:1587582336.656655:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : GET - for me sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -m 1 lnet_handle_send 00000400:00000200:21.0:1587582336.657784:0:12301:0:(lib-move.c:1853:lnet_handle_send()) TRACE: 192.168.2.20@tcp(192.168.2.20@tcp:<?>) -> 192.168.2.20@tcp(192.168.2.20@tcp:192.168.2.20@tcp) <?> : PUT try# 0 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # Since this is a single node setup, all traffic should be sent via lolnd, but we can see that is not the case: sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep -c TRACE /tmp/dk.log 887 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -c 0@lo 4 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -vc 0@lo 883 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep 0@lo 00000400:00000200:19.0:1587582336.656655:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : GET - for me 00000400:00000200:19.0:1587582336.656681:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : REPLY - for me 00000400:00000200:19.0:1587582336.656763:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : PUT - for me 00000400:00000200:19.0:1587582336.656801:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : ACK - for me sles15build01:/home/hornc/lustre-filesystem/lustre/tests # In fact only 4 messages were sent over lolnd. Doing the same thing with the fix we can see all messages use the lolnd: sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep -c TRACE /tmp/dk.log 462 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -c 0@lo 462 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -vc 0@lo 0 sles15build01:/home/hornc/lustre-filesystem/lustre/tests # |
| Comment by Gerrit Updater [ 07/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38312/ |
| Comment by Gerrit Updater [ 07/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38313/ |
| Comment by Amir Shehata (Inactive) [ 15/May/20 ] |
|
An exception to this is recovery messages. Recovery messages can not use the lolnd as that breaks the purpose of the attempt to ascertain the health of a local interface. |
| Comment by Gerrit Updater [ 20/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38388/ |
| Comment by Peter Jones [ 20/May/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
[Deleted] |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
[Deleted] |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
[Deleted] |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38863 |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38864 |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38865 |
| Comment by Gerrit Updater [ 07/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38863/ |
| Comment by Gerrit Updater [ 07/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38864/ |
| Comment by Gerrit Updater [ 07/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38865/ |