[LU-12955] AST replies are dropped when servers are non-MR, clients and routers are MR Created: 09/Nov/19 Updated: 04/Jan/24 Resolved: 24/Mar/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
I found this issue while testing Cray's 2.12. Based on code inspection, I believe this issue also exists in 2.13/master (and maybe 2.10/11/12). Servers are all non-MR (lnet_peer_discovery_disabled=1), with a single NID on o2ib40. Clients are MR with NIDs on gni4 and gni99 nid00110:~ # lctl list_nids 110@gni99 110@gni4 nid00110:~ # Routers are MR with NIDs on gni4, gni99 and o2ib40 nid00485:~ # lctl list_nids 485@gni99 485@gni4 10.12.0.1@o2ib40 nid00485:~ # NMR Server sends BL_AST to client 110@gni4: 00000400:00000200:3.0:1573329689.897722:0:11351:0:(lib-move.c:2429:lnet_send()) TRACE: 10.12.0.52@o2ib40(<?>:10.12.0.52@o2ib40) ->(<?>)-> 110@gni4(110@gni4:10.12.0.3@o2ib40) : PUT try# 0 Router gets message: 00000400:00000200:12.0:1573329689.898682:0:8631:0:(lib-move.c:3904:lnet_parse()) TRACE: 110@gni4(10.12.0.3@o2ib40) <- 10.12.0.52@o2ib40 : PUT - routed Since router/client are both MR, router chooses different destination NID based on the round-robin selection of the local NI: 00000400:00000200:12.0:1573329689.898693:0:8631:0:(lib-move.c:1673:lnet_get_best_ni()) compare ni 93@gni99 [c:2048, d:10, s:22285] with best_ni not seleced [c:-2147483648, d:-1, s:0] 00000400:00000200:12.0:1573329689.898695:0:8631:0:(lib-move.c:1716:lnet_get_best_ni()) selected best_ni 93@gni99 00000400:00000200:12.0:1573329689.898695:0:8631:0:(lib-move.c:1673:lnet_get_best_ni()) compare ni 93@gni4 [c:2048, d:10, s:22285] with best_ni 93@gni99 [c:2048, d:10, s:22285] 00000400:00000200:12.0:1573329689.898697:0:8631:0:(lib-move.c:1716:lnet_get_best_ni()) selected best_ni 93@gni99 00000400:00000200:12.0:1573329689.898701:0:8631:0:(lib-move.c:1441:lnet_select_peer_ni()) Selected 110@gni99 h:[1000] p:[n] c:[16], s:[4704] 00000400:00000200:12.0:1573329689.898705:0:8631:0:(lib-move.c:2429:lnet_send()) TRACE: 10.12.0.52@o2ib40(<?>:93@gni99) ->(<?>)-> 110@gni99(110@gni4:110@gni99) : PUT try# 0 Client gets this message and passes it to to ptlrpc. PtlRPC sends a reply using the 110@gni99 as a source NI (see ptlrpc_send_reply()): 00000100:00000040:16.0:1573329689.898297:0:11036:0:(lustre_net.h:2496:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret" req@ffff880f9682a040 x1649753532140528/t0(0) o104->LOV_OSC_UUID@10.12.0.52@o2ib40:224/0 lens 296/0 e 0 to 0 dl 1573329744 ref 1 fl New:/0/ffffffff rc 0/-1 job:'' 00000100:00100000:16.0:1573329689.898302:0:11036:0:(service.c:2227:ptlrpc_server_handle_request()) Handling RPC req@ffff880f9682a040 pname:cluuid+ref:pid:xid:nid:opc:job ldlm_cb01_001:LOV_OSC_UUID+4:11351:x1649753532140528:12345-10.12.0.52@o2ib40:104: 00000100:00000200:16.0:1573329689.898305:0:11036:0:(service.c:2232:ptlrpc_server_handle_request()) got req 1649753532140528 00000100:00000040:16.0:1573329689.898316:0:11036:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff880f95fd7780 refcount 10 to 10.12.0.52@o2ib40 00000100:00000040:16.0:1573329689.898318:0:11036:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.12.0.52@o2ib40 00000100:00000200:16.0:1573329689.898321:0:11036:0:(niobuf.c:85:ptl_send_buf()) Sending 192 bytes to portal 16, xid 1649753532140528, offset 192 00000400:00000200:16.0:1573329689.898323:0:11036:0:(lib-move.c:4412:LNetPut()) LNetPut -> 12345-10.12.0.52@o2ib40 00000400:00000200:16.0:1573329689.898373:0:11036:0:(lib-move.c:2429:lnet_send()) TRACE: 110@gni99(110@gni99:110@gni99) ->(<?>)-> 10.12.0.52@o2ib40(10.12.0.52@o2ib40:93@gni99) : PUT try# 0 When this PUT arrives on the server, it is dropped because the server does not know about gni99 NIDs: 00000400:00000200:19.0:1573329689.898530:0:10540:0:(lib-move.c:3904:lnet_parse()) TRACE: 10.12.0.52@o2ib40(10.12.0.52@o2ib40) <- 110@gni99 : PUT - for me 00000400:00000200:19.0:1573329689.898533:0:10540:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-110@gni99 of length 192 into portal 16 MB=0x5dc712d400bf0 00000400:00000100:19.0:1573329689.898535:0:10540:0:(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649753532140528 offset 192 length 192: 4 Issue is pretty easy to reproduce. Just perform I/O to cause AST to get sent, and watch logs on the servers for the "Dropping PUT" message: saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0161523 s, 64.9 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct <command hangs> Nov 9 15:28:15 snx11922n005 kernel: LNet: 29024:0:(lib-move.c:3542:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649753595060784 offset 192 length 192: 4 Nov 9 15:28:15 snx11922n005 kernel: LNet: 29024:0:(lib-move.c:3542:lnet_parse_put()) Skipped 1 previous similar message I will try to reproduce this master, and I'll update the affects version field as appropriate. Lastly, I'll note that I was running with this patch https://review.whamcloud.com/#/c/36512/ because it is necessary to correctly classify the MR capabilities of peers. |
| Comments |
| Comment by Chris Horn [ 09/Nov/19 ] |
|
Reproduced with master Server version (Note the hash) [root@snx11922n000 ~]# pdsh -g lustre lctl get_param version snx11922n005: version=2.12.58_196_ga6be3e5 snx11922n003: version=2.12.58_196_ga6be3e5 snx11922n004: version=2.12.58_196_ga6be3e5 snx11922n002: version=2.12.58_196_ga6be3e5 [root@snx11922n000 ~]# Client/router version (note the hash) saturn-p2:~ # lctl get_param version version=2.13.50_10_ga6be3e5 saturn-p2:~ # The version is 2.13.50 w/various patches needed to build in Cray's environment + https://review.whamcloud.com/#/c/36512/ a6be3e560a (origin/task/master/test-master, task/master/test-master, Cray-master) LU-12889 lnet: Do not assume peers are MR capable f09d3aefb9 Add new files to OBS spec files 57c8760542 Revert "LU-12140 lnet: adds checking msg len" 3c9e94d2c6 MRP-342 lnet: add config file support 70752d3944 Adjust headers for lnet-devel 96852b2f03 Disable server build for Cray-master acb94ecda2 LUS-7509 build: Wrong o2ib path with mofed 07a4aa24b5 LUS-7463 build: Handle removal of lustre.service file 87c01437ea LUS-7010 build: cray-obs changes for Lustre 2.12 582e5c4d3a LUS-5697 build: Add OBS config and spec files ff7ea0aecb (tag: v2_13_50, tag: 2.13.50, wc/master, es/master, wc-master) New development branch for Lustre 2.14 For some reason I don't see the 'Dropping PUT' message in the console log, but it is present in the dk log. saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0162186 s, 64.7 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0153246 s, 68.4 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0162957 s, 64.3 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0159397 s, 65.8 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/text2.txt bs=1024k count=1 oflag=direct <hangs> [root@snx11922n000 ~]# pdsh -g lustre 'lctl dk > /tmp/$(hostname).dklog' [root@snx11922n000 ~]# pdsh -g lustre 'grep Drop /tmp/$(hostname).dklog' snx11922n005: 00000400:00000100:18.0:1573340383.094645:0:20693:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-479@gni99 portal 16 match 1649766213622528 offset 224 length 224: 4 snx11922n005: 00000400:00000100:19.0:1573340386.332208:0:20692:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649766213622656 offset 224 length 224: 4 snx11922n005: 00000400:00000100:17.0:1573340388.903397:0:20691:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-110@gni99 portal 16 match 1649766213622720 offset 224 length 224: 4 snx11922n005: 00000400:00000100:7.0:1573340388.904505:0:20692:0:(lib-move.c:3915:lnet_parse_put()) Dropping PUT from 12345-479@gni99 portal 16 match 1649766213622784 offset 224 length 224: 4 [root@snx11922n000 ~]# |
| Comment by Chris Horn [ 09/Nov/19 ] |
|
Maybe the fix here is to modify ptlrpc so it does not specify the source NI for this "reply" ? |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
|
| Comment by Chris Horn [ 26/Nov/19 ] |
|
Seems to work. None of these dd's hung, and no 'Dropping PUT' messages were in the logs. saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.028181 s, 37.2 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0222562 s, 47.1 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0201651 s, 52.0 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0206952 s, 50.7 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.019903 s, 52.7 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0226328 s, 46.3 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0233879 s, 44.8 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0291185 s, 36.0 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0225152 s, 46.6 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0239285 s, 43.8 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0265195 s, 39.5 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0267997 s, 39.1 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0325068 s, 32.3 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0226784 s, 46.2 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0202842 s, 51.7 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0235395 s, 44.5 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0293789 s, 35.7 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0214157 s, 49.0 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0202208 s, 51.9 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0356905 s, 29.4 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0256762 s, 40.8 MB/s saturn-p2:~ # dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.038618 s, 27.2 MB/s saturn-p2:~ # ssh nid00110 'dd if=/dev/zero of=/lus/snx11922/hornc/test.txt bs=1024k count=1 oflag=direct' 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0210197 s, 49.9 MB/s saturn-p2:~ # |
| Comment by Gerrit Updater [ 04/Dec/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36916 |
| Comment by Gerrit Updater [ 24/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36916/ |
| Comment by Peter Jones [ 24/Mar/20 ] |
|
Landed for 2.14 |