LU-17476: lnet: only report mismatched nid in ME if bits match

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.5
    • Affects Version/s: Lustre 2.15.5
    • Severity: 3

Description

      In rare cases, a client-to-server AST reply is dropped by the server, producing messages similar to the following (with o104, o105, or o106 as the RPC type):

      Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
           @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
            req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
           lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
      

      and the kernel debug logs show that LNet is dropping the reply PUT because it does not match any posted buffer:

      lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
      :
      request_out_callback()) @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
      lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
      lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
      lnet_is_health_check()) health check = 0, status = -2, hstatus = 0
      

      As part of MD matching for an incoming GET or PUT from a peer with multiple NIDs, match on the "matchbits" alone when they are available, and only report, rather than fail on, a NID/PID mismatch. If the "matchbits" cannot be used for matching, fail on a NID/PID mismatch as before.
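
      To illustrate the intended behavior, here is a minimal standalone sketch of the matching rule described above. It is not the actual Lustre source (the real logic lives in lnet/lnet/lib-ptl.c); the struct names, the "bits are available" test, and the string NIDs are hypothetical stand-ins:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      struct sketch_me {                /* ME posted for the reply buffer */
          uint64_t match_bits;          /* XID the peer must present; 0 = unset */
          const char *nid;              /* peer NID recorded when posting */
      };

      struct sketch_msg {               /* header of an incoming PUT */
          uint64_t match_bits;
          const char *src_nid;          /* may be any NID of a multi-rail peer */
      };

      /* Return true if the PUT may consume the ME. */
      static bool sketch_match(const struct sketch_me *me,
                               const struct sketch_msg *msg)
      {
          if (me->match_bits != 0) {
              /* Bits are available: they alone decide the match. */
              if (msg->match_bits != me->match_bits)
                  return false;
              if (strcmp(msg->src_nid, me->nid) != 0)
                  /* Only report the NID mismatch instead of failing. */
                  fprintf(stderr, "bits match but NID %s != %s\n",
                          msg->src_nid, me->nid);
              return true;
          }
          /* No usable bits: strict NID matching, as before. */
          return strcmp(msg->src_nid, me->nid) == 0;
      }

      int main(void)
      {
          /* The scenario from this ticket: reply ME posted for .109,
           * reply PUT arrives from .108 with the correct XID. */
          struct sketch_me me = { 1788044801687552ULL, "10.31.3.109@tcp" };
          struct sketch_msg msg = { 1788044801687552ULL, "10.31.3.108@tcp" };

          printf("accepted: %d\n", sketch_match(&me, &msg));
          return 0;
      }

      With the old behavior the NID check alone would have caused the drop; with the fix the PUT above is accepted and the mismatch is only logged.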


Activity


            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55488
            Subject: LU-17476 lnet: prefer to use bits only to match ME
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: e10fa830fe3313768df31eee657fe1b02792c1ab

            gerrit Gerrit Updater added a comment - "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55488 Subject: LU-17476 lnet: prefer to use bits only to match ME Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: e10fa830fe3313768df31eee657fe1b02792c1ab
            hornc Chris Horn added a comment - edited

            adilger I root-caused a NID mismatch bug that is due to a bad backport that landed in 2.15.4. Details in LU-17664
            hornc Chris Horn added a comment -

            Do you have the complete log(s) from Oleg's analysis? Maybe there is a clue there.

            I think maybe there is some race between discovery and ptlrpc connection setup. The reply buffer (always?) uses whatever NID is stored in the import's (or the export's reverse import's) ptlrpc_connection.c_peer.nid. This is populated by a call to LNetPrimaryNID(). Therefore we can infer that at some point LNetPrimaryNID() returned .109, but discovery later set .108 as the primary. This shouldn't happen with the primary NID locking feature, but there is no other explanation I can think of.
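
            To make the suspected ordering concrete, here is a hedged standalone sketch of that race (hypothetical stand-ins, not actual ptlrpc or LNet code): the connection caches whatever LNetPrimaryNID() returned at setup time, so a later discovery update leaves the cached copy stale.

            #include <stdio.h>

            /* Hypothetical stand-in for the lnet_peer's primary NID. */
            static const char *lnet_primary_nid = "10.31.3.109@tcp";

            /* Mirrors ptlrpc_connection.c_peer.nid, as described above. */
            struct conn_sketch {
                const char *c_peer_nid;
            };

            int main(void)
            {
                struct conn_sketch conn;

                /* Connection setup: cache the result of LNetPrimaryNID(). */
                conn.c_peer_nid = lnet_primary_nid;

                /* Discovery later decides .108 is the primary NID... */
                lnet_primary_nid = "10.31.3.108@tcp";

                /* ...so the reply ME is posted against the stale .109 while
                 * the incoming PUT is attributed to .108, and the NID
                 * comparison fails. */
                printf("ME expects %s, PUT attributed to %s\n",
                       conn.c_peer_nid, lnet_primary_nid);
                return 0;
            }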

            adilger Andreas Dilger added a comment -

            "This message tells us the reply buffer was set up using 10.31.3.109@tcp, i.e. ptlrpc thinks .109 is the client's primary NID."

            hornc, thanks for looking into this. Can you see if this is something wrong with how the RPC is generated at the PtlRPC/LDLM layer, maybe where LDLM is getting the peer NID for the AST? We only really saw this with LDLM requests, but that might also have been self-selective, given that it is one of the few server-to-client RPC types.
            hornc Chris Horn added a comment - edited

            "It definitely is possible that the blocking AST request might be generated with the wrong NID for the reply buffer; we haven't yet looked into that code to confirm."

            From the logging you provided, this is definitely the case.

            00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:''
            

            This message tells us the reply buffer was set up using 10.31.3.109@tcp, i.e. ptlrpc thinks .109 is the client's primary NID.

            00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            

            This message tells us the lnet_peer object for this peer has .108 as the primary NID. Thus, message drop due to NID mismatch. I suspect there is a bug with the primary NID locking, or somewhere in ptlrpc connection management where it stores the primary NID.
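
            For reference, a simplified standalone sketch of the pre-fix check order that produces this drop (hypothetical names; the real check and the LNET_MATCHMD_* codes live in the lnet tree, and the values below are assumed):

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            /* Assumed to mirror LNET_MATCHMD_OK / LNET_MATCHMD_DROP. */
            enum { SKETCH_MATCHMD_OK = 2, SKETCH_MATCHMD_DROP = 4 };

            static int sketch_try_match(const char *me_nid, uint64_t me_bits,
                                        const char *src_nid, uint64_t msg_bits)
            {
                /* Pre-fix ordering (simplified): the NID gate runs before
                 * the bits are compared, so a multi-rail peer replying from
                 * a non-primary NID is dropped even with the correct bits. */
                if (strcmp(src_nid, me_nid) != 0)
                    return SKETCH_MATCHMD_DROP;  /* the trailing ": 4" above */
                if (msg_bits != me_bits)
                    return SKETCH_MATCHMD_DROP;
                return SKETCH_MATCHMD_OK;
            }

            int main(void)
            {
                /* ME posted for .109; reply PUT from .108, correct XID. */
                printf("match result: %d\n",
                       sketch_try_match("10.31.3.109@tcp", 1788044801687552ULL,
                                        "10.31.3.108@tcp", 1788044801687552ULL));
                return 0;
            }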

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54082/
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a7ae2e5515879dc31e87106314d35dc439a2c50d

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54082/ Subject: LU-17476 lnet: use bits only to match ME in all cases Project: fs/lustre-release Branch: master Current Patch Set: Commit: a7ae2e5515879dc31e87106314d35dc439a2c50d

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53843/
            Subject: LU-17476 lnet: prefer to use bits only to match ME
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0b61b7d6d7940f67b75db2f4747169478512dd09

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53843/ Subject: LU-17476 lnet: prefer to use bits only to match ME Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0b61b7d6d7940f67b75db2f4747169478512dd09

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54082
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 92c43a0653d63fa275d06acee66c3068e2e7580b

            gerrit Gerrit Updater added a comment - "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54082 Subject: LU-17476 lnet: use bits only to match ME in all cases Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 92c43a0653d63fa275d06acee66c3068e2e7580b
            gerrit Gerrit Updater added a comment - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53929
            Subject: LU-17476 tests: check for sanity/13a failures
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b411c5c7b817e4914ecfaa596c5e0b82106fc23c


            adilger Andreas Dilger added a comment -

            Chris, this issue has been observed in the case of a Lustre server-to-client blocking AST request that the client cannot successfully reply to (neither client nor server have patch https://review.whamcloud.com/50530 "LU-16709 lnet: fix locking multiple peer NIDs").

            The following is Oleg's analysis of the kernel debug logs on the server (with "+rpctrace+dlmtrace+lnet" enabled):

            ugh, so the network seems to deliver everything fast, but lnet just does not like it?

            Client side, speedy handling:

            00000100:00100000:84.0:1706140870.571121:0:340548:0:(service.c:2136:ptlrpc_server_handle_req_in())
                 got req x1788044801687552
            00000100:00100000:84.0:1706140870.571138:0:340548:0:(nrs_fifo.c:177:nrs_fifo_req_get())
                 NRS start fifo request from 12345-10.85.153.100@tcp, seq: 155892
            00000100:00100000:84.0:1706140870.571140:0:340548:0:(service.c:2283:ptlrpc_server_handle_request())
                 Handling RPC req@00000000a35a640a pname:cluuid+ref:pid:xid:nid:opc:job
                 ldlm_cb01_010:c63d2a6a-1533-4998-b176-4f630f667a59+5:3678513:x1788044801687552:12345-10.85.153.100@tcp:104:
            00000100:00000200:84.0:1706140870.571142:0:340548:0:(service.c:2298:ptlrpc_server_handle_request())
                 got req 1788044801687552
            00000100:00000200:84.0:1706140870.571151:0:340548:0:(niobuf.c:86:ptl_send_buf())
                 Sending 224 bytes to portal 16, xid 1788044801687552, offset 224
            00000400:00000200:84.0:1706140870.571153:0:340548:0:(lib-move.c:4904:LNetPut()) LNetPut -> 12345-10.85.153.101@tcp
            00000400:00000200:84.0:1706140870.571154:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
                 Match nid 10.85.153.101@tcp to cpt 1
            00000400:00000200:84.0:1706140870.571156:0:340548:0:(lib-move.c:2731:lnet_handle_send_case_locked())
                 Source Specified: 10.31.3.108@tcp to MR:  10.85.153.101@tcp local destination
            00000400:00000200:84.0:1706140870.571157:0:340548:0:(api-ni.c:1404:lnet_nid_cpt_hash())
                 Match nid 10.85.153.101@tcp to cpt 1
            00000400:00000200:84.0:1706140870.571160:0:340548:0:(lib-move.c:1884:lnet_handle_send())
                 TRACE: 10.31.3.108@tcp(10.31.3.108@tcp:10.31.3.108@tcp) ->
                 10.85.153.101@tcp(10.85.153.101@tcp:10.85.153.101@tcp) <?> : PUT try# 0
            00000800:00000200:84.0:1706140870.571161:0:340548:0:(socklnd_cb.c:992:ksocknal_send())
                 sending 224 bytes in 1 frags to 12345-10.85.153.101@tcp
            00000800:00000200:84.0:1706140870.571162:0:340548:0:(socklnd.c:231:ksocknal_find_peer_locked())
                 got peer_ni [000000007915095f] -> 12345-10.85.153.101@tcp (5)
            00000800:00000200:84.0:1706140870.571166:0:340548:0:(socklnd_cb.c:758:ksocknal_queue_tx_locked())
                 Sending to 12345-10.85.153.101@tcp ip 10.85.153.101:988
            00000800:00000200:84.0:1706140870.571167:0:340548:0:(socklnd_cb.c:777:ksocknal_queue_tx_locked())
                 Packet 000000005a9eb1fb type 1, nob 320 niov 1 nkiov 1
            

            Server side, starting with the send completion (clocks are in sync between client and server):

            00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:''
             

            I think this is the first (instant) attempt at receiving the response on the server:

            00000400:00000100:19.0:1706140870.571652:0:18948:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            

            But somehow it does not really lead to anything useful server side, as we see this message repeatedly arriving at increasing intervals:

            00000100:00000200:17.0:1706140908.945592:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0)
                 o104->lfs00-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946
                 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
            00000400:00000100:16.0:1706140908.945848:0:18950:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1038:lnet_is_health_check())
                 Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
            00000400:00000200:16.0:1706140908.945850:0:18950:0:(lib-msg.c:1043:lnet_is_health_check())
                 health check = 0, status = -2, hstatus = 0
            

            inconsistent state? because we already received it once?

            00000100:00000200:18.0:1706140946.314936:0:18950:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140984 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
            00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            
            00000100:00000200:17.0:1706140984.714683:0:18948:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706141022 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
            00000400:00000200:17.0:1706140984.715079:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
            00000400:00000200:17.0:1706140984.715079:0:18948:0:(lib-ptl.c:571:lnet_ptl_match_md())
                 Request from 12345-10.31.3.108@tcp of length 224 into portal 16 MB=0x65a379f3f4000
            00000400:00000200:17.0:1706140984.715081:0:18948:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 10.31.3.108@tcp to cpt 4
            00000400:00000100:17.0:1706140984.715082:0:18948:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
            00000400:00000200:17.0:1706140984.715083:0:18948:0:(lib-msg.c:1038:lnet_is_health_check())
                 Msg 0000000051e9a6cd is in inconsistent state, don't perform health checking (-2, 0)
            

            and so on, until the request timed out:

            Jan 25 00:01:48 mds000 kernel: Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
                 @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
                  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
            

            So with this in mind, I think we need somebody with a good understanding of LNet to tell us what is going on, and why the seemingly received message does not get processed and passed up to Lustre.

            It definitely is possible that the blocking AST request might be generated with the wrong NID for the reply buffer; we haven't yet looked into that code to confirm.


People

    Assignee: Serguei Smirnov
    Reporter: Serguei Smirnov
    Votes: 0
    Watchers: 11

Dates

    Created:
    Updated:
    Resolved: