[LU-17476] lnet: only report mismatched nid in ME if bits match Created: 26/Jan/24  Updated: 08/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17379 try MGS NIDs more quickly at initial ... In Progress
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There are rare cases where a client-to-server AST reply was being dropped by the server:

lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
:
request_out_callback()) @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
lnet_is_health_check()) health check = 0, status = -2, hstatus = 0

As a part of MD matching for incoming GET or PUT from a peer with multiple NIDs, use "matchbits" only if they are available and only report an error on NID/PID mismatch. If can't use "matchbits" for matching, fail on NID/PID mismatch as before.



 Comments   
Comment by Gerrit Updater [ 27/Jan/24 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53843
Subject: LU-17476 lnet: prefer to use bits only to match ME
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6087ac44e0eb29c6129ee59e97363c33575a9cfb

Comment by Chris Horn [ 29/Jan/24 ]

Can you say more about the cases where this issue occurs? I've seen it when there is some sort of mismatch between expected and actual primary NID of a peer.

Comment by Gerrit Updater [ 29/Jan/24 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53851
Subject: LU-17476 lnet: force non-primary NID in LNetPut
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 953ff57510b1a6f10175f08bedc7a63b67fedebe

Comment by Gerrit Updater [ 31/Jan/24 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53872
Subject: LU-17476 tests: add test for MR match NIDs code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 878db375153071006346ea132d452f759cb14bd7

Merged into https://review.whamcloud.com/53843

Comment by Andreas Dilger [ 01/Feb/24 ]

Chris, this issue has been observed in the case of a Lustre server-to-client blocking AST request that cannot be replied by the client (neither client nor server have patch https://review.whamcloud.com/50530 "LU-16709 lnet: fix locking multiple peer NIDs").

The following is Oleg's analysis of the kernel debug logs on the server (with "+rpctrace+dlmtrace+lnet" enabled):

ugh, so the network seems to deliver everything fast, but lnet just does not like it?

Client side, speedy handling:

00000100:00100000:84.0:1706140870.571121:0:340548:0:(service.c:2136:ptlrpc_server_handle_req_in())
     got req x1788044801687552
00000100:00100000:84.0:1706140870.571138:0:340548:0:(nrs_fifo.c:177:nrs_fifo_req_get())
     NRS start fifo request from 12345-10.85.153.100@tcp, seq: 155892
00000100:00100000:84.0:1706140870.571140:0:340548:0:(service.c:2283:ptlrpc_server_handle_request())
     Handling RPC req@00000000a35a640a pname:cluuid+ref:pid:xid:nid:opc:job
     ldlm_cb01_010:c63d2a6a-1533-4998-b176-4f630f667a59+5:3678513:x1788044801687552:12345-10.85.153.100@tcp:104:
00000100:00000200:84.0:1706140870.571142:0:340548:0:(service.c:2298:ptlrpc_server_handle_request())
     got req 1788044801687552
00000100:00000200:84.0:1706140870.571151:0:340548:0:(niobuf.c:86:ptl_send_buf())
     Sending 224 bytes to portal 16, xid 1788044801687552, offset 224
00000400:00000200:84.0:1706140870.571153:0:340548:0:(lib-move.c:4904:LNetPut()) LNetPut -> 12345-10.85.153.101@tcp
00000400:00000200:84.0:1706140870.571154:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
     Match nid 10.85.153.101@tcp to cpt 1
00000400:00000200:84.0:1706140870.571156:0:340548:0:(lib-move.c:2731:lnet_handle_send_case_locked())
     Source Specified: 10.31.3.108@tcp to MR:  10.85.153.101@tcp local destination
00000400:00000200:84.0:1706140870.571157:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
     Match nid 10.85.153.101@tcp to cpt 1
00000400:00000200:84.0:1706140870.571160:0:340548:0:(lib-move.c:1884:lnet_handle_send())
     TRACE: 10.31.3.108@tcp(10.31.3.108@tcp:10.31.3.108@tcp) ->
     10.85.153.101@tcp(10.85.153.101@tcp:10.85.153.101@tcp) <?> : PUT try# 0
00000800:00000200:84.0:1706140870.571161:0:340548:0:(socklnd_cb.c:992:ksocknal_send())
     sending 224 bytes in 1 frags to 12345-10.85.153.101@tcp
00000800:00000200:84.0:1706140870.571162:0:340548:0:(socklnd.c:231:ksocknal_find_peer_locked())
     got peer_ni [000000007915095f] -> 12345-10.85.153.101@tcp (5)
00000800:00000200:84.0:1706140870.571166:0:340548:0:(socklnd_cb.c:758:ksocknal_queue_tx_locked())
     Sending to 12345-10.85.153.101@tcp ip 10.85.153.101:988
00000800:00000200:84.0:1706140870.571167:0:340548:0:(socklnd_cb.c:777:ksocknal_queue_tx_locked())
     Packet 000000005a9eb1fb type 1, nob 320 niov 1 nkiov 1

Server side, starting with the send completion (clocks are in sync betweem client and server):

00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:''
 

I think this is the first (instant) attempt at receiving the response on the server:

00000400:00000100:19.0:1706140870.571652:0:18948:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4

But somehow it does not really lead to anything useful server side, as we see this message being repeatedly arriving at increasing intervals:

00000100:00000200:17.0:1706140908.945592:0:18949:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0)
     o104->lfs00-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946
     ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
00000400:00000100:16.0:1706140908.945848:0:18950:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1038:lnet_is_health_check())
     Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1043:lnet_is_health_check())
     health check = 0, status = -2, hstatus = 0

inconsistent state? because we already received it once?

00000100:00000200:18.0:1706140946.314936:0:18950:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 0 dl 1706140984 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
00000100:00000200:17.0:1706140984.714683:0:18948:0:(events.c:65:request_out_callback())
     @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 0 dl 1706141022 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
00000400:00000200:17.0:1706140984.715079:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
00000400:00000200:17.0:1706140984.715079:0:18948:0:(lib-ptl.c:571:lnet_ptl_match_md())
     Request from 12345-10.31.3.108@tcp of length 224 into portal 16 MB=0x65a379f3f4000
00000400:00000200:17.0:1706140984.715081:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
00000400:00000100:17.0:1706140984.715082:0:18948:0:(lib-move.c:4092:lnet_parse_put())
     Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
00000400:00000200:17.0:1706140984.715083:0:18948:0:(lib-msg.c:1038:lnet_is_health_check())
     Msg 0000000051e9a6cd is in inconsistent state, don't perform health checking (-2, 0)

and so on until the request timed out

Jan 25 00:01:48 mds000 kernel: Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
     @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
      req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
     lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''

So with this in mind I think we might need somebody with good LNet understanding to tell us what is going on and why the seemingly received message does not get processed and passed up to Lustre?

It definitely is possible that the blocking AST request might be generated with the wrong NID for the reply buffer, we haven't yet looked into that code to confirm.

Comment by Gerrit Updater [ 05/Feb/24 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53929
Subject: LU-17476 tests: check for sanity/13a failures
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b411c5c7b817e4914ecfaa596c5e0b82106fc23c

Generated at Sat Feb 10 03:35:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.