Details
Bug
Resolution: Fixed
Major
Lustre 2.13.0, Lustre 2.14.0
Description
I found this issue while testing Cray's 2.12. Based on code inspection, I believe this issue also exists in 2.13/master (and maybe 2.10/11/12).
Servers are all non-MR (lnet_peer_discovery_disabled=1), with a single NID on o2ib40.
Clients are MR with NIDs on gni4 and gni99:
nid00110:~ # lctl list_nids
110@gni99
110@gni4
nid00110:~ #
Routers are MR with NIDs on gni4, gni99 and o2ib40:
nid00485:~ # lctl list_nids
485@gni99
485@gni4
10.12.0.1@o2ib40
nid00485:~ #
The non-MR server sends a BL_AST to client 110@gni4:
00000400:00000200:3.0:1573329689.897722:0:11351:0:(lib-move.c:2429:lnet_send()) TRACE: 10.12.0.52@o2ib40(<?>:10.12.0.52@o2ib40) ->(<?>)-> 110@gni4(110@gni4:10.12.0.3@o2ib40) : PUT try# 0
The router receives the message:
00000400:00000200:12.0:1573329689.898682:0:8631:0:(lib-move.c:3904:lnet_parse()) TRACE: 110@gni4(10.12.0.3@o2ib40) <- 10.12.0.52@o2ib40 : PUT - routed
Since the router and client are both MR, the router chooses a different destination NID based on the round-robin selection of the local NI:
00000400:00000200:12.0:1573329689.898693:0:8631:0:(lib-move.c:1673:lnet_get_best_ni()) compare ni 93@gni99 [c:2048, d:10, s:22285] with best_ni not seleced [c:-2147483648, d:-1, s:0]
00000400:00000200:12.0:1573329689.898695:0:8631:0:(lib-move.c:1716:lnet_get_best_ni()) selected best_ni 93@gni99
00000400:00000200:12.0:1573329689.898695:0:8631:0:(lib-move.c:1673:lnet_get_best_ni()) compare ni 93@gni4 [c:2048, d:10, s:22285] with best_ni 93@gni99 [c:2048, d:10, s:22285]
00000400:00000200:12.0:1573329689.898697:0:8631:0:(lib-move.c:1716:lnet_get_best_ni()) selected best_ni 93@gni99
00000400:00000200:12.0:1573329689.898701:0:8631:0:(lib-move.c:1441:lnet_select_peer_ni()) Selected 110@gni99 h:[1000] p:[n] c:[16], s:[4704]
00000400:00000200:12.0:1573329689.898705:0:8631:0:(lib-move.c:2429:lnet_send()) TRACE: 10.12.0.52@o2ib40(<?>:93@gni99) ->(<?>)-> 110@gni99(110@gni4:110@gni99) : PUT try# 0
The client receives this message and passes it to ptlrpc. PtlRPC sends a reply using 110@gni99 as the source NI (see ptlrpc_send_reply()):
00000100:00000040:16.0:1573329689.898297:0:11036:0:(lustre_net.h:2496:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret" req@ffff880f9682a040 x1649753532140528/t0(0) o104->LOV_OSC_UUID@10.12.0.52@o2ib40:224/0 lens 296/0 e 0 to 0 dl 1573329744 ref 1 fl New:/0/ffffffff rc 0/-1 job:''
00000100:00100000:16.0:1573329689.898302:0:11036:0:(service.c:2227:ptlrpc_server_handle_request()) Handling RPC req@ffff880f9682a040 pname:cluuid+ref:pid:xid:nid:opc:job ldlm_cb01_001:LOV_OSC_UUID+4:11351:x1649753532140528:12345-10.12.0.52@o2ib40:104:
00000100:00000200:16.0:1573329689.898305:0:11036:0:(service.c:2232:ptlrpc_server_handle_request()) got req 1649753532140528
00000100:00000040:16.0:1573329689.898316:0:11036:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff880f95fd7780 refcount 10 to 10.12.0.52@o2ib40
00000100:00000040:16.0:1573329689.898318:0:11036:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.12.0.52@o2ib40
00000100:00000200:16.0:1573329689.898321:0:11036:0:(niobuf.c:85:ptl_send_buf()) Sending 192 bytes to portal 16, xid 1649753532140528, offset 192
00000400:00000200:16.0:1573329689.898323:0:11036:0:(lib-move.c:4412:LNetPut()) LNetPut -> 12345-10.12.0.52@o2ib40
00000400:00000200:16.0:1573329689.898373:0:11036:0:(lib-move.c:2429:lnet_send()) TRACE: 110@gni99(110@gni99:110@gni99) ->(<?>)-> 10.12.0.52@o2ib40(10.12.0.52@o2ib40:93@gni99) : PUT try# 0
When this PUT arrives on the server, it is dropped because the server does not know about gni99 NIDs:
00000400:00000200:19.0:1573329689.898530:0:10540:0:(lib-move.c:3904:lnet_parse()) TRACE: 10.12.0.52@o2ib40(10.12.0.52@o2ib40) <- 110@gni99 : PUT - for me
00000400:00000200:19.0:1573329689.898533:0:10540:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-110@gni99 of length 192 into portal 16 MB=0x5dc712d400bf0
00000400:00000100:19.0:1573329689.898535:0:10540:0:(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649753532140528 offset 192 length 192: 4
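The sequence traced above can be summarized with a small toy model. This is only a sketch of the NID bookkeeping, not LNet code: the Message class and the router_forward, client_reply and server_accepts helpers are hypothetical names, and the assumption that the server only accepts a reply whose source NID equals the NID it originally addressed is a simplification of whatever matching the server really performs.

```python
# Toy model of the NID mismatch described in this ticket.
# Not LNet code: every class and function name here is hypothetical.
from dataclasses import dataclass


def net_of(nid: str) -> str:
    """Return the network part of a NID, e.g. '110@gni4' -> 'gni4'."""
    return nid.split("@", 1)[1]


@dataclass
class Message:
    src: str
    dst: str


SERVER_NID = "10.12.0.52@o2ib40"
CLIENT_NIDS = ["110@gni99", "110@gni4"]  # as reported by lctl list_nids


def router_forward(msg: Message, peer_nids: list, chosen_net: str) -> Message:
    """An MR router may deliver to any NID of an MR peer (assumed model)."""
    for nid in peer_nids:
        if net_of(nid) == chosen_net:
            return Message(msg.src, nid)
    return msg


def client_reply(request: Message, arrived_on: str) -> Message:
    """The client replies using the NID the request arrived on as source."""
    return Message(arrived_on, request.src)


def server_accepts(reply: Message, addressed_nid: str) -> bool:
    """Assumption: a non-MR server only knows the NID it addressed."""
    return reply.src == addressed_nid


ast = Message(SERVER_NID, "110@gni4")

# Router picks its gni99 interface, so the request lands on 110@gni99.
forwarded = router_forward(ast, CLIENT_NIDS, "gni99")
reply = client_reply(ast, forwarded.dst)
dropped = not server_accepts(reply, ast.dst)

# If the router had kept gni4, the reply source would have matched.
forwarded_ok = router_forward(ast, CLIENT_NIDS, "gni4")
reply_ok = client_reply(ast, forwarded_ok.dst)
accepted = server_accepts(reply_ok, ast.dst)
```

In this model the drop depends only on which local NI the router happens to select, which is consistent with the problem appearing as soon as the selection lands on gni99.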
The issue is pretty easy to reproduce. Just perform I/O that causes an AST to be sent, and watch the logs on the servers for the "Dropping PUT" message:
saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct'
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0161523 s, 64.9 MB/s
saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct
<command hangs>
Nov 9 15:28:15 snx11922n005 kernel: LNet: 29024:0:(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649753595060784 offset 192 length 192: 4
Nov 9 15:28:15 snx11922n005 kernel: LNet: 29024:0:(lib-move.c:3542:lnet_parse_put()) Skipped 1 previous similar message
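When reproducing, it may help to pull the "Dropping PUT" messages out of a server log automatically. The following is a minimal sketch: the regular expression is derived only from the message shown in this ticket, so other Lustre versions may format it differently, and find_dropped_puts is a hypothetical helper name.

```python
import re

# Pattern based solely on the "Dropping PUT" console message quoted above.
DROP_RE = re.compile(
    r"Dropping PUT from (?P<peer>\S+) portal (?P<portal>\d+) "
    r"match (?P<match>\d+) offset (?P<offset>\d+) length (?P<length>\d+)"
)


def find_dropped_puts(lines):
    """Return one dict per 'Dropping PUT' message found in lines."""
    hits = []
    for line in lines:
        m = DROP_RE.search(line)
        if m is None:
            continue
        # The peer id is printed as '<pid>-<nid>', e.g. '12345-110@gni99'.
        pid, nid = m.group("peer").split("-", 1)
        hits.append({
            "pid": int(pid),
            "nid": nid,
            "portal": int(m.group("portal")),
            "match": int(m.group("match")),
            "offset": int(m.group("offset")),
            "length": int(m.group("length")),
        })
    return hits


sample = [
    "Nov 9 15:28:15 snx11922n005 kernel: LNet: 29024:0:"
    "(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 "
    "portal 16 match 1649753595060784 offset 192 length 192: 4",
    "Nov 9 15:28:15 snx11922n005 kernel: LNet: 29024:0:"
    "(lib-move.c:3542:lnet_parse_put()) Skipped 1 previous similar message",
]

hits = find_dropped_puts(sample)
for hit in hits:
    print(hit["nid"], "portal", hit["portal"], "match", hit["match"])
```

A dropped PUT whose source NID is on a network the server has no NID on (gni99 here) is the signature of this issue. Because of console rate limiting ("Skipped 1 previous similar message"), the number of matches may undercount the actual drops.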
I will try to reproduce this on master, and I'll update the affects version field as appropriate.
Lastly, I'll note that I was running with this patch https://review.whamcloud.com/#/c/36512/ because it is necessary to correctly classify the MR capabilities of peers.