[LU-12222] LNet should select loopback NI when possible Created: 25/Apr/19  Updated: 09/Dec/20  Resolved: 20/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Improvement Priority: Major
Reporter: Chris Horn Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: HPv3

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

I happened to notice that when mounting an MGT and MDT co-located on the same node that we end up sending RPCs to ourselves:

00000100:00000040:11.0:1556165595.786175:0:30688:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=0  req@ffff881f275f8000 x1631757890748560/t0(0) o502->MGC10.12.0.50@o2ib40@10.12.0.50@o2ib40:26/25 lens 272/8472 e 0 to 0 dl 1556165690 ref 3 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000001:11.0:1556165595.786177:0:30688:0:(niobuf.c:54:ptl_send_buf()) Process entered
00000100:00000040:11.0:1556165595.786177:0:30688:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.12.0.50@o2ib40
00000400:00000010:11.0:1556165595.786178:0:30688:0:(lib-lnet.h:221:lnet_md_alloc()) slab-alloced 'md' of size 136 at ffff881f279d26e8.
00000100:00000200:11.0:1556165595.786178:0:30688:0:(niobuf.c:85:ptl_send_buf()) Sending 272 bytes to portal 26, xid 1631757890748560, offset 0
00000400:00000010:11.0:1556165595.786179:0:30688:0:(lib-lnet.h:472:lnet_msg_alloc()) alloc '(msg)': 440 at ffff881f279c3200 (tot 331800384).
00000400:00000200:11.0:1556165595.786181:0:30688:0:(lib-move.c:4637:LNetPut()) LNetPut -> 12345-10.12.0.50@o2ib40
00000400:00000200:11.0:1556165595.786183:0:30688:0:(lib-move.c:2545:lnet_handle_send_case_locked()) Source ANY to MR:  10.12.0.50@o2ib40 local destination
00000400:00000200:11.0:1556165595.786184:0:30688:0:(lib-move.c:1619:lnet_get_best_ni()) compare ni 0@lo [c:0, d:10, s:0] with best_ni not seleced [c:-2147483648, d:-1, s:0]
00000400:00000200:11.0:1556165595.786185:0:30688:0:(lib-move.c:1662:lnet_get_best_ni()) selected best_ni 0@lo
00000400:00000200:11.0:1556165595.786186:0:30688:0:(lib-move.c:1619:lnet_get_best_ni()) compare ni 10.12.0.50@o2ib40 [c:2045, d:10, s:26] with best_ni 0@lo [c:0, d:10, s:0]
00000400:00000200:11.0:1556165595.786188:0:30688:0:(lib-move.c:1662:lnet_get_best_ni()) selected best_ni 10.12.0.50@o2ib40
00000400:00000200:11.0:1556165595.786188:0:30688:0:(lib-move.c:1367:lnet_select_peer_ni()) 10.12.0.50@o2ib40 ni_is_pref = 0
00000400:00000200:11.0:1556165595.786189:0:30688:0:(lib-move.c:1428:lnet_select_peer_ni()) sd_best_lpni = 10.12.0.50@o2ib40
00000400:00000200:11.0:1556165595.786192:0:30688:0:(lib-move.c:1837:lnet_handle_send()) TRACE: 10.12.0.50@o2ib40(10.12.0.50@o2ib40:<?>) -> 10.12.0.50@o2ib40(10.12.0.50@o2ib40:10.12.0.50@o2ib40) : PUT try# 0

In the above excerpt we can see that LNet has enough information to determine that we're sending this to ourselves (the nid of best_ni == the nid of best_lpni). Ideally we'd just use the loopback NI whenever possible.



 Comments   
Comment by Andreas Dilger [ 25/Apr/19 ]

This seems pretty important. It could cause problems with local communication on the MDS or OSS under load.

Comment by Patrick Farrell (Inactive) [ 25/Apr/19 ]

Chris,

Are you planning a patch for this?

Comment by Chris Horn [ 25/Apr/19 ]

I was, but Amir tells me that he relies on this behavior for recovery. If he can provide some more details on what he means by that then I can work out a patch, otherwise I'm fine to let Amir work it out himself.

Comment by Patrick Farrell (Inactive) [ 25/Apr/19 ]

OK, thanks!  Was just curious.

Comment by Gerrit Updater [ 12/Aug/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35778
Subject: LU-12222 lnet: Check if we're sending to ourselves
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee8e5be8951ed60d25b3aeae563a0f54cde59ea7

Comment by Gerrit Updater [ 20/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35778/
Subject: LU-12222 lnet: Check if we're sending to ourselves
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e4af756e1f428a9f7883bf883f66941defb1447f

Comment by Peter Jones [ 20/Dec/19 ]

Landed for 2.14

Comment by Alexey Lyashkov [ 23/Dec/19 ]

In fact this is invalid fix, with performance impact.
based on hits provided by Chirs, I able to replicate this issue and it looks like a ptlrpc bug.

00000400:00000200:5.0:1577100140.229954:0:10627:0:(peer.c:1215:LNetPrimaryNID()) NID 0@lo primary NID 198.18.0.33@tcp rc 0
00000100:00000010:5.0:1577100140.229963:0:10627:0:(connection.c:55:ptlrpc_connection_get()) kmalloced '(conn)': 88 at ffff942f2df04518.
00000100:00000001:5.0:1577100140.229965:0:10627:0:(connection.c:80:ptlrpc_connection_get()) Process leaving
00000100:00000040:5.0:1577100140.229966:0:10627:0:(connection.c:84:ptlrpc_connection_get()) conn=ffff942f2df04518 refcount 2 to 198.18.0.33@tcp
00000100:00000040:5.0:1577100140.229967:0:10627:0:(client.c:107:ptlrpc_uuid_to_connection()) MGC198.18.0.33@tcp_0 -> ffff942f2df04518

ptlrpc_uuid_to_connection - don't able to understand this is local NID and provide a replacement as needs.

Comment by Alexey Lyashkov [ 23/Dec/19 ]

In fact this MR bug.
ptlrpc_uuid_to_peer - correctly resolved 198.18.0.33@tcp as local address, and it replaced to 0@lo, but
ptlrpc_connection_get have call a LNetPrimaryNID on top, and it call make this replacement away.

Comment by Chris Horn [ 15/Jan/20 ]

pjones green Since my fix for this issue has a bad interaction with the fix for LU-12889 I think we should revert it. The LU-12889 bug is more serious, and per Shadow's last couple comments in this ticket, he thinks we can resolve this issue in a different way. What do you think?

Comment by Oleg Drokin [ 16/Jan/20 ]

since LU-12889 is not yet landed, you can submit a series consisting of reversal and then LU-12889 on top of it (or a single patch that reverts just the needed bits if desirable) with necessary explanations and all.

Comment by Chris Horn [ 16/Jan/20 ]

Okay, that sounds like a plan. Thanks.

Comment by Gerrit Updater [ 16/Jan/20 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/37259
Subject: Revert "LU-12222 lnet: Check if we're sending to ourselves"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee971ecfbeb4a8192e4ab0e2762048a89a31e339

Comment by Andreas Dilger [ 16/Jan/20 ]

In fact this MR bug.
ptlrpc_uuid_to_peer - correctly resolved 198.18.0.33@tcp as local address, and it replaced to 0@lo, but
ptlrpc_connection_get have call a LNetPrimaryNID on top, and it call make this replacement away.

shadow - thanks for the clear analysis. We definitely don't want to be sending messages destined for ourselves out over the network. I'm not sure what the semantics for LNetPrimaryNID() are. Should the 0@lo NID be handled specifically in that function, or would it be better to skip that function in ptlrpc_connection_get() and keep 0@lo as the target NID?

Comment by Alexey Lyashkov [ 17/Jan/20 ]

Andreas, From my view, exceptions should be on lower level as possible - to avoid similar bugs in future, in case ptlrpc code will changed and someone forget why this exceptions was made.

Comment by Andreas Dilger [ 17/Jan/20 ]

Shadow, I was thinking that LNetPrimaryNID() might be called to map 0@lo to the primary external NID for other reasons. However, it looks like there is only one caller of this function, in ptlrpc_connection_get(), so changing the semantic to just return 0@lo immediately is pretty safe. I don't mind to add a fast path in both LNetPrimaryNID() and ptlrpc_connection_get() that special-case this NID to avoid overhead (e.g. skipping lnet_net_lock_current() and maybe avoiding the need to look up the peer connection?).

It probably also makes sense to #define LNET_NID_LO_0 (or similar) like we do for LNET_NID_ANY so that we don't have to compose this every time. That could also be used in other places, like infra_ping_nid() rather than turning it into a string and comparing that much more efficiently in that loop:

-                        if (!strcmp(libcfs_nid2str(ping.ping_buf[i].nid),
-                                    "0@lo"))
+                       if (ping.ping_buf[i].nid == LNET_NID_LO_0))
Comment by Andreas Dilger [ 22/Jan/20 ]

Any comments from LNet developers on a global LNET_NID_LO_0 constant to indicate "myself"?

Comment by Gerrit Updater [ 01/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37259/
Subject: Revert "LU-12222 lnet: Check if we're sending to ourselves"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 275016fcae729f8467dc09aa15aa2b4bb2690875

Comment by Chris Horn [ 21/Apr/20 ]

Any comments from LNet developers on a global LNET_NID_LO_0 constant to indicate "myself"?

adilger Sounds like a good idea to me.

Comment by Gerrit Updater [ 22/Apr/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38312
Subject: LU-12222 lnet: Introduce constant for the lolnd NID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 620d738ec490c1fc475f6e2312f2290800051705

Comment by Gerrit Updater [ 22/Apr/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38313
Subject: LU-12222 lnet: Primary NID of lolnd NID is the lolnd NID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c11c50152cffea9e05fc8124f7b988fae828f1e4

Comment by Gerrit Updater [ 27/Apr/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38388
Subject: LU-12222 ptlrpc: Check if NID is local, not just lolnd NID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b17862a23870bc3be8f24ae9441dab316cfac611

Comment by Chris Horn [ 29/Apr/20 ]

Without the fix, performing llmount.sh with -1 debug:

sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep PTLDEBUG cfg/local.sh
PTLDEBUG=${PTLDEBUG:--1}
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # ./llmount.sh
...
Setting lustre.sys.jobid_var from disable to procname_uid
Waiting 90 secs for update
Updated after 4s: wanted 'procname_uid' got 'procname_uid'
disable quota as required
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # lctl dk > /tmp/dk.log

Of interest are the messages with "TRACE" string. There are two types: lnet_parse() for incoming and lnet_handle_send() for outgoing:

sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -m 1 lnet_parse
00000400:00000200:19.0:1587582336.656655:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : GET - for me
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -m 1 lnet_handle_send
00000400:00000200:21.0:1587582336.657784:0:12301:0:(lib-move.c:1853:lnet_handle_send()) TRACE: 192.168.2.20@tcp(192.168.2.20@tcp:<?>) -> 192.168.2.20@tcp(192.168.2.20@tcp:192.168.2.20@tcp) <?> : PUT try# 0
sles15build01:/home/hornc/lustre-filesystem/lustre/tests #

Since this is a single node setup, all traffic should be sent via lolnd, but we can see that is not the case:

sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep -c TRACE /tmp/dk.log
887
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -c 0@lo
4
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -vc 0@lo
883
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep 0@lo
00000400:00000200:19.0:1587582336.656655:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : GET - for me
00000400:00000200:19.0:1587582336.656681:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : REPLY - for me
00000400:00000200:19.0:1587582336.656763:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : PUT - for me
00000400:00000200:19.0:1587582336.656801:0:12299:0:(lib-move.c:4290:lnet_parse()) TRACE: 0@lo(0@lo) <- 0@lo : ACK - for me
sles15build01:/home/hornc/lustre-filesystem/lustre/tests #

In fact only 4 messages were sent over lolnd.

Doing the same thing with the fix we can see all messages use the lolnd:

sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep -c TRACE /tmp/dk.log
462
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -c 0@lo
462
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # grep TRACE /tmp/dk.log | grep -vc 0@lo
0
sles15build01:/home/hornc/lustre-filesystem/lustre/tests #
Comment by Gerrit Updater [ 07/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38312/
Subject: LU-12222 lnet: Introduce constant for the lolnd NID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 56203e4ba0a64789e42ea45946e8c51f1db351fb

Comment by Gerrit Updater [ 07/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38313/
Subject: LU-12222 lnet: Primary NID of lolnd NID is the lolnd NID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 33d2e44e5026f1e9162dd5e6b931085fdc035a34

Comment by Amir Shehata (Inactive) [ 15/May/20 ]

An exception to this is recovery messages. Recovery messages can not use the lolnd as that breaks the purpose of the attempt to ascertain the health of a local interface.

Comment by Gerrit Updater [ 20/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38388/
Subject: LU-12222 ptlrpc: Check if NID is local, not just lolnd NID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 95bcc24642c4b95d093407fef0947ee2f5a2c01a

Comment by Peter Jones [ 20/May/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 08/Jun/20 ]

[Deleted]

Comment by Gerrit Updater [ 08/Jun/20 ]

[Deleted]

Comment by Gerrit Updater [ 08/Jun/20 ]

[Deleted]

Comment by Gerrit Updater [ 08/Jun/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38863
Subject: LU-12222 lnet: Introduce constant for the lolnd NID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a0bb5d5f52414dcfc95bfa699c5d800c1d7daa7a

Comment by Gerrit Updater [ 08/Jun/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38864
Subject: LU-12222 lnet: Primary NID of lolnd NID is the lolnd NID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7e924bc9a53ad82a0c4e6da4084e4de8536e4ada

Comment by Gerrit Updater [ 08/Jun/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38865
Subject: LU-12222 ptlrpc: Check if NID is local, not just lolnd NID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 35c6cfeec9d5b9a354bc49319694d2360c48c68c

Comment by Gerrit Updater [ 07/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38863/
Subject: LU-12222 lnet: Introduce constant for the lolnd NID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: b5fb6f1fe6dedca0ba6e405e90d7ec72d9b97e83

Comment by Gerrit Updater [ 07/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38864/
Subject: LU-12222 lnet: Primary NID of lolnd NID is the lolnd NID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 4494f55aa2fe7d2ae274405fc0f2314193db1c0e

Comment by Gerrit Updater [ 07/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38865/
Subject: LU-12222 ptlrpc: Check if NID is local, not just lolnd NID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7635e6326de88bfd035c82ac9261e90e4f9b4a1b

Generated at Sat Feb 10 02:50:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.