LU-17476: lnet: only report mismatched nid in ME if bits match

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.5
    • Affects Version/s: Lustre 2.15.5
    • Severity: 3

Description

      In rare cases, a client-to-server AST reply is dropped by the server, producing messages similar to the following (with o104, o105, or o106 as the RPC type):

      Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
           @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
            req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
           lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
      

      and the kernel debug logs show that LNet is dropping the reply PUT because it does not match any posted buffer:

      lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
      :
      request_out_callback()) @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
      lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
      lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
      lnet_is_health_check()) health check = 0, status = -2, hstatus = 0
      

      As part of MD matching for an incoming GET or PUT from a peer with multiple NIDs, match on the "matchbits" alone when they are available, and only report, rather than fail on, a NID/PID mismatch. If the "matchbits" cannot be used for matching, fail on a NID/PID mismatch as before.
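
      To illustrate the intended behavior, here is a minimal standalone sketch of the matching rule described above. It is not the actual Lustre source (the real logic lives in lnet/lnet/lib-ptl.c); the struct names, the "bits are available" test, and the string NIDs are hypothetical stand-ins:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      struct sketch_me {                /* ME posted for the reply buffer */
          uint64_t match_bits;          /* XID the peer must present; 0 = unset */
          const char *nid;              /* peer NID recorded when posting */
      };

      struct sketch_msg {               /* header of an incoming PUT */
          uint64_t match_bits;
          const char *src_nid;          /* may be any NID of a multi-rail peer */
      };

      /* Return true if the PUT may consume the ME. */
      static bool sketch_match(const struct sketch_me *me,
                               const struct sketch_msg *msg)
      {
          if (me->match_bits != 0) {
              /* Bits are available: they alone decide the match. */
              if (msg->match_bits != me->match_bits)
                  return false;
              if (strcmp(msg->src_nid, me->nid) != 0)
                  /* Only report the NID mismatch instead of failing. */
                  fprintf(stderr, "bits match but NID %s != %s\n",
                          msg->src_nid, me->nid);
              return true;
          }
          /* No usable bits: strict NID matching, as before. */
          return strcmp(msg->src_nid, me->nid) == 0;
      }

      int main(void)
      {
          /* The scenario from this ticket: reply ME posted for .109,
           * reply PUT arrives from .108 with the correct XID. */
          struct sketch_me me = { 1788044801687552ULL, "10.31.3.109@tcp" };
          struct sketch_msg msg = { 1788044801687552ULL, "10.31.3.108@tcp" };

          printf("accepted: %d\n", sketch_match(&me, &msg));
          return 0;
      }

      With the old behavior the NID check alone would have caused the drop; with the fix the PUT above is accepted and the mismatch is only logged.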


Activity


            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55488
            Subject: LU-17476 lnet: prefer to use bits only to match ME
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: e10fa830fe3313768df31eee657fe1b02792c1ab

            gerrit Gerrit Updater added a comment - "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55488 Subject: LU-17476 lnet: prefer to use bits only to match ME Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: e10fa830fe3313768df31eee657fe1b02792c1ab
            hornc Chris Horn added a comment - edited

            adilger I root-caused a NID mismatch bug that is due to a bad backport that landed in 2.15.4. Details in LU-17664
            hornc Chris Horn added a comment -

            Do you have the complete log(s) from Oleg's analysis? Maybe there is a clue there.

            I think maybe there is some race between discovery and ptlrpc connection setup. The reply buffer (always?) uses whatever NID is stored in the import's (or the export's reverse import's) ptlrpc_connection.c_peer.nid. This is populated by a call to LNetPrimaryNID(). Therefore we can infer that at some point LNetPrimaryNID() returned .109, but discovery later set .108 as the primary. This shouldn't happen with the primary NID locking feature, but there is no other explanation I can think of.
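
            To make the suspected ordering concrete, here is a hedged standalone sketch of that race (hypothetical stand-ins, not actual ptlrpc or LNet code): the connection caches whatever LNetPrimaryNID() returned at setup time, so a later discovery update leaves the cached copy stale.

            #include <stdio.h>

            /* Hypothetical stand-in for the lnet_peer's primary NID. */
            static const char *lnet_primary_nid = "10.31.3.109@tcp";

            /* Mirrors ptlrpc_connection.c_peer.nid, as described above. */
            struct conn_sketch {
                const char *c_peer_nid;
            };

            int main(void)
            {
                struct conn_sketch conn;

                /* Connection setup: cache the result of LNetPrimaryNID(). */
                conn.c_peer_nid = lnet_primary_nid;

                /* Discovery later decides .108 is the primary NID... */
                lnet_primary_nid = "10.31.3.108@tcp";

                /* ...so the reply ME is posted against the stale .109 while
                 * the incoming PUT is attributed to .108, and the NID
                 * comparison fails. */
                printf("ME expects %s, PUT attributed to %s\n",
                       conn.c_peer_nid, lnet_primary_nid);
                return 0;
            }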

            adilger Andreas Dilger added a comment -

            "This message tells us the reply buffer was set up using 10.31.3.109@tcp, i.e. ptlrpc thinks .109 is the client's primary NID."

            hornc, thanks for looking into this. Can you see if this is something wrong with how the RPC is generated at the PtlRPC/LDLM layer, maybe where LDLM is getting the peer NID for the AST? We only really saw this with LDLM requests, but that might also have been self-selective, given that it is one of the few server-to-client RPC types.
            hornc Chris Horn added a comment - edited

            "It definitely is possible that the blocking AST request might be generated with the wrong NID for the reply buffer; we haven't yet looked into that code to confirm."

            From the logging you provided, this is definitely the case.

            00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:''
            

            This message tells us the reply buffer was set up using 10.31.3.109@tcp, i.e. ptlrpc thinks .109 is the client's primary NID.

            00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            

            This message tells us the lnet_peer object for this peer has .108 as the primary NID. Thus, message drop due to NID mismatch. I suspect there is a bug with the primary NID locking, or somewhere in ptlrpc connection management where it stores the primary NID.
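
            For reference, a simplified standalone sketch of the pre-fix check order that produces this drop (hypothetical names; the real check and the LNET_MATCHMD_* codes live in the lnet tree, and the values below are assumed):

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            /* Assumed to mirror LNET_MATCHMD_OK / LNET_MATCHMD_DROP. */
            enum { SKETCH_MATCHMD_OK = 2, SKETCH_MATCHMD_DROP = 4 };

            static int sketch_try_match(const char *me_nid, uint64_t me_bits,
                                        const char *src_nid, uint64_t msg_bits)
            {
                /* Pre-fix ordering (simplified): the NID gate runs before
                 * the bits are compared, so a multi-rail peer replying from
                 * a non-primary NID is dropped even with the correct bits. */
                if (strcmp(src_nid, me_nid) != 0)
                    return SKETCH_MATCHMD_DROP;  /* the trailing ": 4" above */
                if (msg_bits != me_bits)
                    return SKETCH_MATCHMD_DROP;
                return SKETCH_MATCHMD_OK;
            }

            int main(void)
            {
                /* ME posted for .109; reply PUT from .108, correct XID. */
                printf("match result: %d\n",
                       sketch_try_match("10.31.3.109@tcp", 1788044801687552ULL,
                                        "10.31.3.108@tcp", 1788044801687552ULL));
                return 0;
            }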

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54082/
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a7ae2e5515879dc31e87106314d35dc439a2c50d

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54082/ Subject: LU-17476 lnet: use bits only to match ME in all cases Project: fs/lustre-release Branch: master Current Patch Set: Commit: a7ae2e5515879dc31e87106314d35dc439a2c50d

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53843/
            Subject: LU-17476 lnet: prefer to use bits only to match ME
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0b61b7d6d7940f67b75db2f4747169478512dd09

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53843/ Subject: LU-17476 lnet: prefer to use bits only to match ME Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0b61b7d6d7940f67b75db2f4747169478512dd09

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54082
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 92c43a0653d63fa275d06acee66c3068e2e7580b

            gerrit Gerrit Updater added a comment - "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54082 Subject: LU-17476 lnet: use bits only to match ME in all cases Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 92c43a0653d63fa275d06acee66c3068e2e7580b
            gerrit Gerrit Updater added a comment - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53929
            Subject: LU-17476 tests: check for sanity/13a failures
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b411c5c7b817e4914ecfaa596c5e0b82106fc23c


            adilger Andreas Dilger added a comment -

            Chris, this issue has been observed in the case of a Lustre server-to-client blocking AST request that the client cannot successfully reply to (neither client nor server have patch https://review.whamcloud.com/50530 "LU-16709 lnet: fix locking multiple peer NIDs").

            The following is Oleg's analysis of the kernel debug logs on the server (with "+rpctrace+dlmtrace+lnet" enabled):

            ugh, so the network seems to deliver everything fast, but lnet just does not like it?

            Client side, speedy handling:

            00000100:00100000:84.0:1706140870.571121:0:340548:0:(service.c:2136:ptlrpc_server_handle_req_in())
                 got req x1788044801687552
            00000100:00100000:84.0:1706140870.571138:0:340548:0:(nrs_fifo.c:177:nrs_fifo_req_get())
                 NRS start fifo request from 12345-10.85.153.100@tcp, seq: 155892
            00000100:00100000:84.0:1706140870.571140:0:340548:0:(service.c:2283:ptlrpc_server_handle_request())
                 Handling RPC req@00000000a35a640a pname:cluuid+ref:pid:xid:nid:opc:job
                 ldlm_cb01_010:c63d2a6a-1533-4998-b176-4f630f667a59+5:3678513:x1788044801687552:12345-10.85.153.100@tcp:104:
            00000100:00000200:84.0:1706140870.571142:0:340548:0:(service.c:2298:ptlrpc_server_handle_request())
                 got req 1788044801687552
            00000100:00000200:84.0:1706140870.571151:0:340548:0:(niobuf.c:86:ptl_send_buf())
                 Sending 224 bytes to portal 16, xid 1788044801687552, offset 224
            00000400:00000200:84.0:1706140870.571153:0:340548:0:(lib-move.c:4904:LNetPut()) LNetPut -> 12345-10.85.153.101@tcp
            00000400:00000200:84.0:1706140870.571154:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
                 Match nid 10.85.153.101@tcp to cpt 1
            00000400:00000200:84.0:1706140870.571156:0:340548:0:(lib-move.c:2731:lnet_handle_send_case_locked())
                 Source Specified: 10.31.3.108@tcp to MR:  10.85.153.101@tcp local destination
            00000400:00000200:84.0:1706140870.571157:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
                 Match nid 10.85.153.101@tcp to cpt 1
            00000400:00000200:84.0:1706140870.571160:0:340548:0:(lib-move.c:1884:lnet_handle_send())
                 TRACE: 10.31.3.108@tcp(10.31.3.108@tcp:10.31.3.108@tcp) ->
                 10.85.153.101@tcp(10.85.153.101@tcp:10.85.153.101@tcp) <?> : PUT try# 0
            00000800:00000200:84.0:1706140870.571161:0:340548:0:(socklnd_cb.c:992:ksocknal_send())
                 sending 224 bytes in 1 frags to 12345-10.85.153.101@tcp
            00000800:00000200:84.0:1706140870.571162:0:340548:0:(socklnd.c:231:ksocknal_find_peer_locked())
                 got peer_ni [000000007915095f] -> 12345-10.85.153.101@tcp (5)
            00000800:00000200:84.0:1706140870.571166:0:340548:0:(socklnd_cb.c:758:ksocknal_queue_tx_locked())
                 Sending to 12345-10.85.153.101@tcp ip 10.85.153.101:988
            00000800:00000200:84.0:1706140870.571167:0:340548:0:(socklnd_cb.c:777:ksocknal_queue_tx_locked())
                 Packet 000000005a9eb1fb type 1, nob 320 niov 1 nkiov 1
            

            Server side, starting with the send completion (clocks are in sync between client and server):

            00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:''
             

            I think this is the first (instant) attempt at receiving the response on the server:

            00000400:00000100:19.0:1706140870.571652:0:18948:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            

            But somehow it does not really lead to anything useful server side, as we see this message repeatedly arriving at increasing intervals:

            00000100:00000200:17.0:1706140908.945592:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0)
                 o104->lfs00-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946
                 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
            00000400:00000100:16.0:1706140908.945848:0:18950:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1038:lnet_is_health_check())
                 Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
            00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1043:lnet_is_health_check())
                 health check = 0, status = -2, hstatus = 0
            

            inconsistent state? because we already received it once?

            00000100:00000200:18.0:1706140946.314936:0:18950:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140984 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
            00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            
            00000100:00000200:17.0:1706140984.714683:0:18948:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706141022 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
            00000400:00000200:17.0:1706140984.715079:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
            00000400:00000200:17.0:1706140984.715079:0:18948:0:(lib-ptl.c:571:lnet_ptl_match_md())
                 Request from 12345-10.31.3.108@tcp of length 224 into portal 16 MB=0x65a379f3f4000
            00000400:00000200:17.0:1706140984.715081:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
            00000400:00000100:17.0:1706140984.715082:0:18948:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            00000400:00000200:17.0:1706140984.715083:0:18948:0:(lib-msg.c:1038:lnet_is_health_check())
                 Msg 0000000051e9a6cd is in inconsistent state, don't perform health checking (-2, 0)
            

            and so on, until the request timed out:

            Jan 25 00:01:48 mds000 kernel: Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
                 @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
                  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
            

            So with this in mind, I think we need somebody with a good understanding of LNet to tell us what is going on, and why the seemingly received message does not get processed and passed up to Lustre.

            It definitely is possible that the blocking AST request might be generated with the wrong NID for the reply buffer; we haven't yet looked into that code to confirm.


People

    Assignee: Serguei Smirnov
    Reporter: Serguei Smirnov
    Votes: 0
    Watchers: 11

Dates

    Created:
    Updated:
    Resolved: