[LU-12955] AST replies are dropped when servers are non-MR, clients and routers are MR Created: 09/Nov/19  Updated: 04/Jan/24  Resolved: 24/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12889 Do not assume peers are MR capable Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I found this issue while testing Cray's 2.12. Based on code inspection, I believe this issue also exists in 2.13/master (and maybe 2.10/11/12).

Servers are all non-MR (lnet_peer_discovery_disabled=1), with a single NID on o2ib40.

Clients are MR with NIDs on gni4 and gni99

nid00110:~ # lctl list_nids
110@gni99
110@gni4
nid00110:~ #

Routers are MR with NIDs on gni4, gni99 and o2ib40

nid00485:~ # lctl list_nids
485@gni99
485@gni4
10.12.0.1@o2ib40
nid00485:~ #

NMR Server sends BL_AST to client 110@gni4:

00000400:00000200:3.0:1573329689.897722:0:11351:0:(lib-move.c:2429:lnet_send()) TRACE: 10.12.0.52@o2ib40(<?>:10.12.0.52@o2ib40) ->(<?>)-> 110@gni4(110@gni4:10.12.0.3@o2ib40) : PUT try# 0

Router gets message:

00000400:00000200:12.0:1573329689.898682:0:8631:0:(lib-move.c:3904:lnet_parse()) TRACE: 110@gni4(10.12.0.3@o2ib40) <- 10.12.0.52@o2ib40 : PUT - routed

Since router/client are both MR, router chooses different destination NID based on the round-robin selection of the local NI:

00000400:00000200:12.0:1573329689.898693:0:8631:0:(lib-move.c:1673:lnet_get_best_ni()) compare ni 93@gni99 [c:2048, d:10, s:22285] with best_ni not seleced [c:-2147483648, d:-1, s:0]
00000400:00000200:12.0:1573329689.898695:0:8631:0:(lib-move.c:1716:lnet_get_best_ni()) selected best_ni 93@gni99
00000400:00000200:12.0:1573329689.898695:0:8631:0:(lib-move.c:1673:lnet_get_best_ni()) compare ni 93@gni4 [c:2048, d:10, s:22285] with best_ni 93@gni99 [c:2048, d:10, s:22285]
00000400:00000200:12.0:1573329689.898697:0:8631:0:(lib-move.c:1716:lnet_get_best_ni()) selected best_ni 93@gni99
00000400:00000200:12.0:1573329689.898701:0:8631:0:(lib-move.c:1441:lnet_select_peer_ni()) Selected 110@gni99 h:[1000] p:[n] c:[16], s:[4704]
00000400:00000200:12.0:1573329689.898705:0:8631:0:(lib-move.c:2429:lnet_send()) TRACE: 10.12.0.52@o2ib40(<?>:93@gni99) ->(<?>)-> 110@gni99(110@gni4:110@gni99) : PUT try# 0

Client gets this message and passes it to to ptlrpc. PtlRPC sends a reply using the 110@gni99 as a source NI (see ptlrpc_send_reply()):

00000100:00000040:16.0:1573329689.898297:0:11036:0:(lustre_net.h:2496:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret"  req@ffff880f9682a040 x1649753532140528/t0(0) o104->LOV_OSC_UUID@10.12.0.52@o2ib40:224/0 lens 296/0 e 0 to 0 dl 1573329744 ref 1 fl New:/0/ffffffff rc 0/-1 job:''
00000100:00100000:16.0:1573329689.898302:0:11036:0:(service.c:2227:ptlrpc_server_handle_request()) Handling RPC req@ffff880f9682a040 pname:cluuid+ref:pid:xid:nid:opc:job ldlm_cb01_001:LOV_OSC_UUID+4:11351:x1649753532140528:12345-10.12.0.52@o2ib40:104:
00000100:00000200:16.0:1573329689.898305:0:11036:0:(service.c:2232:ptlrpc_server_handle_request()) got req 1649753532140528
00000100:00000040:16.0:1573329689.898316:0:11036:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff880f95fd7780 refcount 10 to 10.12.0.52@o2ib40
00000100:00000040:16.0:1573329689.898318:0:11036:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.12.0.52@o2ib40
00000100:00000200:16.0:1573329689.898321:0:11036:0:(niobuf.c:85:ptl_send_buf()) Sending 192 bytes to portal 16, xid 1649753532140528, offset 192
00000400:00000200:16.0:1573329689.898323:0:11036:0:(lib-move.c:4412:LNetPut()) LNetPut -> 12345-10.12.0.52@o2ib40
00000400:00000200:16.0:1573329689.898373:0:11036:0:(lib-move.c:2429:lnet_send()) TRACE: 110@gni99(110@gni99:110@gni99) ->(<?>)-> 10.12.0.52@o2ib40(10.12.0.52@o2ib40:93@gni99) : PUT try# 0

When this PUT arrives on the server, it is dropped because the server does not know about gni99 NIDs:

00000400:00000200:19.0:1573329689.898530:0:10540:0:(lib-move.c:3904:lnet_parse()) TRACE: 10.12.0.52@o2ib40(10.12.0.52@o2ib40) <- 110@gni99 : PUT - for me
00000400:00000200:19.0:1573329689.898533:0:10540:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-110@gni99 of length 192 into portal 16 MB=0x5dc712d400bf0
00000400:00000100:19.0:1573329689.898535:0:10540:0:(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649753532140528 offset 192 length 192: 4

Issue is pretty easy to reproduce. Just perform I/O to cause AST to get sent, and watch logs on the servers for the "Dropping PUT" message:

saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0161523 s, 64.9 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
<command hangs>
Nov  9 15:28:15 snx11922n005 kernel: LNet: 29024:0:(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649753595060784 offset 192 length 192: 4
Nov  9 15:28:15 snx11922n005 kernel: LNet: 29024:0:(lib-move.c:3542:lnet_parse_put()) Skipped 1 previous similar message

I will try to reproduce this master, and I'll update the affects version field as appropriate.

Lastly, I'll note that I was running with this patch https://review.whamcloud.com/#/c/36512/ because it is necessary to correctly classify the MR capabilities of peers.



 Comments   
Comment by Chris Horn [ 09/Nov/19 ]

Reproduced with master

Server version (Note the hash)

[root@snx11922n000 ~]# pdsh -g lustre lctl get_param version
snx11922n005: version=2.12.58_196_ga6be3e5
snx11922n003: version=2.12.58_196_ga6be3e5
snx11922n004: version=2.12.58_196_ga6be3e5
snx11922n002: version=2.12.58_196_ga6be3e5
[root@snx11922n000 ~]#

Client/router version (note the hash)

saturn-p2:~ # lctl get_param version
version=2.13.50_10_ga6be3e5
saturn-p2:~ #

The version is 2.13.50 w/various patches needed to build in Cray's environment + https://review.whamcloud.com/#/c/36512/

a6be3e560a (origin/task/master/test-master, task/master/test-master, Cray-master) LU-12889 lnet: Do not assume peers are MR capable
f09d3aefb9 Add new files to OBS spec files
57c8760542 Revert "LU-12140 lnet: adds checking msg len"
3c9e94d2c6 MRP-342 lnet: add config file support
70752d3944 Adjust headers for lnet-devel
96852b2f03 Disable server build for Cray-master
acb94ecda2 LUS-7509 build: Wrong o2ib path with mofed
07a4aa24b5 LUS-7463 build: Handle removal of lustre.service file
87c01437ea LUS-7010 build: cray-obs changes for Lustre 2.12
582e5c4d3a LUS-5697 build: Add OBS config and spec files
ff7ea0aecb (tag: v2_13_50, tag: 2.13.50, wc/master, es/master, wc-master) New development branch for Lustre 2.14

For some reason I don't see the 'Dropping PUT' message in the console log, but it is present in the dk log.

saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0162186 s, 64.7 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0153246 s, 68.4 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0162957 s, 64.3 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0159397 s, 65.8 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct
<hangs>
[root@snx11922n000 ~]# pdsh -g lustre 'lctl dk > /tmp/$(hostname).dklog'
[root@snx11922n000 ~]# pdsh -g lustre 'grep Drop /tmp/$(hostname).dklog'
snx11922n005: 00000400:00000100:18.0:1573340383.094645:0:20693:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-479@gni99 portal 16 match 1649766213622528 offset 224 length 224: 4
snx11922n005: 00000400:00000100:19.0:1573340386.332208:0:20692:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649766213622656 offset 224 length 224: 4
snx11922n005: 00000400:00000100:17.0:1573340388.903397:0:20691:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649766213622720 offset 224 length 224: 4
snx11922n005: 00000400:00000100:7.0:1573340388.904505:0:20692:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-479@gni99 portal 16 match 1649766213622784 offset 224 length 224: 4
[root@snx11922n000 ~]#
Comment by Chris Horn [ 09/Nov/19 ]

Maybe the fix here is to modify ptlrpc so it does not specify the source NI for this "reply" ?

Comment by Gerrit Updater [ 26/Nov/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/36879
Subject: LU-12955 lnet: Fix dest NID if sending from non-multi-rail
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1b4112889d3ba6b852869485aeac97f1c31134e5

Comment by Chris Horn [ 26/Nov/19 ]

Seems to work. None of these dd's hung, and no 'Dropping PUT' messages were in the logs.

saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.028181 s, 37.2 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0222562 s, 47.1 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0201651 s, 52.0 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0206952 s, 50.7 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.019903 s, 52.7 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0226328 s, 46.3 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0233879 s, 44.8 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0291185 s, 36.0 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0225152 s, 46.6 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0239285 s, 43.8 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0265195 s, 39.5 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0267997 s, 39.1 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0325068 s, 32.3 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0226784 s, 46.2 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0202842 s, 51.7 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0235395 s, 44.5 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0293789 s, 35.7 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0214157 s, 49.0 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0202208 s, 51.9 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0356905 s, 29.4 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0256762 s, 40.8 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.038618 s, 27.2 MB/s
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0210197 s, 49.9 MB/s
saturn-p2:~ #
Comment by Gerrit Updater [ 04/Dec/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36916
Subject: LU-12955 lnet: fix routed final dest selection
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: feda0c39545515a6f1b84737b52e8d2854d06a13

Comment by Gerrit Updater [ 24/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36916/
Subject: LU-12955 lnet: fix routed final dest selection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8cedafff9d46c8067385d698b645bb1d0343cd57

Comment by Peter Jones [ 24/Mar/20 ]

Landed for 2.14

Generated at Sat Feb 10 02:57:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.