Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.15.5
Description
In rare cases, a client-to-server AST reply is dropped by the server. The client logs messages similar to the following, with o104, o105, or o106 as the RPC type:
Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870] req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
The kernel debug logs show that LNet is dropping the RPC because no matching request is found:
lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
request_out_callback()) @@@ type 5, status 0 req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
lnet_is_health_check()) health check = 0, status = -2, hstatus = 0
As part of MD matching for an incoming GET or PUT from a peer with multiple NIDs, match on "matchbits" alone when they are available, and only report an error on a NID/PID mismatch instead of failing the match. If "matchbits" cannot be used for matching, fail on a NID/PID mismatch as before.
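The matching rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual LNet code: the `sketch_*` types and function names are hypothetical, the NID is reduced to a plain integer, and the `has_match_bits` flag stands in for whatever condition the real code uses to decide that "matchbits" are usable.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; these are NOT the real LNet structures. */
struct sketch_id {
	uint64_t nid;	/* simplified network identifier */
	uint32_t pid;	/* process identifier */
};

struct sketch_md {
	struct sketch_id expected;	/* peer the MD was posted for */
	uint64_t match_bits;		/* bits the MD expects */
	bool has_match_bits;		/* assumption: "matchbits" are usable */
};

enum sketch_result { SKETCH_MATCH, SKETCH_NO_MATCH };

static bool sketch_id_equal(struct sketch_id a, struct sketch_id b)
{
	return a.nid == b.nid && a.pid == b.pid;
}

/* Decide whether an incoming GET/PUT from 'src' matches 'md'. */
static enum sketch_result
sketch_md_match(const struct sketch_md *md, struct sketch_id src,
		uint64_t msg_bits, bool multi_nid_peer)
{
	bool id_ok = sketch_id_equal(md->expected, src);

	if (multi_nid_peer && md->has_match_bits) {
		/* Match bits decide the outcome for a multi-NID peer. */
		if (md->match_bits != msg_bits)
			return SKETCH_NO_MATCH;
		/* A NID/PID mismatch is only reported, not fatal. */
		if (!id_ok)
			fprintf(stderr, "NID/PID mismatch for match bits %llu\n",
				(unsigned long long)msg_bits);
		return SKETCH_MATCH;
	}

	/* Otherwise keep the previous behaviour: fail on NID/PID mismatch. */
	if (!id_ok)
		return SKETCH_NO_MATCH;
	if (md->has_match_bits && md->match_bits != msg_bits)
		return SKETCH_NO_MATCH;
	return SKETCH_MATCH;
}
```

In this sketch, a PUT that arrives from a different NID of a multi-NID peer still matches as long as the match bits agree, which is the situation in which the reply was being dropped in the logs above.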