[LU-17476] lnet: only report mismatched nid in ME if bits match Created: 26/Jan/24 Updated: 08/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
In rare cases, a client-to-server AST reply was dropped by the server:

lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
request_out_callback()) @@@ type 5, status 0 req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
lnet_is_health_check()) health check = 0, status = -2, hstatus = 0

As part of MD matching for an incoming GET or PUT from a peer with multiple NIDs, use "matchbits" only if they are available and, in that case, only report an error on NID/PID mismatch. If "matchbits" cannot be used for matching, fail on NID/PID mismatch as before. |
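The matching rule described above can be sketched roughly as follows. This is only one reading of the description, not the code from the patch: the `sketch_*` types, fields, and function are hypothetical stand-ins rather than real LNet structures, and it assumes that "only report an error" means the mismatch is logged while the match is still accepted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a match entry (not the LNet struct). */
struct sketch_me {
	uint64_t nid;         /* source NID the entry expects */
	uint32_t pid;         /* source PID the entry expects */
	bool     use_bits;    /* assumption: true when match bits are available */
	uint64_t match_bits;
	uint64_t ignore_bits; /* bits excluded from the comparison */
};

/* Hypothetical view of the header fields of an incoming GET or PUT. */
struct sketch_msg {
	uint64_t src_nid;
	uint32_t src_pid;
	uint64_t match_bits;
};

enum sketch_result {
	SKETCH_NO_MATCH = 0,
	SKETCH_MATCH = 1,
	SKETCH_MATCH_SRC_MISMATCH = 2 /* matched on bits, NID/PID mismatch reported */
};

static enum sketch_result
sketch_try_match(const struct sketch_me *me, const struct sketch_msg *msg)
{
	bool src_ok = me->nid == msg->src_nid && me->pid == msg->src_pid;

	/* No usable match bits: fail on NID/PID mismatch as before. */
	if (!me->use_bits)
		return src_ok ? SKETCH_MATCH : SKETCH_NO_MATCH;

	/* Match bits differ (outside the ignored bits): no match, nothing to report. */
	if (((me->match_bits ^ msg->match_bits) & ~me->ignore_bits) != 0)
		return SKETCH_NO_MATCH;

	/* Bits match: a NID/PID mismatch is reported but does not fail the match. */
	if (!src_ok) {
		fprintf(stderr, "match bits %#llx matched, but source NID/PID differs\n",
			(unsigned long long)msg->match_bits);
		return SKETCH_MATCH_SRC_MISMATCH;
	}
	return SKETCH_MATCH;
}
```

Under these assumptions, a reply arriving from a different NID of the same multi-NID peer would still match on its bits instead of being dropped, while an entry without match bits keeps the strict NID/PID check.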
| Comments |
| Comment by Gerrit Updater [ 27/Jan/24 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53843 |
| Comment by Chris Horn [ 29/Jan/24 ] |
|
Can you say more about the cases where this issue occurs? I've seen it when there is some sort of mismatch between expected and actual primary NID of a peer. |
| Comment by Gerrit Updater [ 29/Jan/24 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53851 |
| Comment by Gerrit Updater [ 31/Jan/24 ] |
|
Merged into https://review.whamcloud.com/53843 |
| Comment by Andreas Dilger [ 01/Feb/24 ] |
|
Chris, this issue has been observed in the case of a Lustre server-to-client blocking AST request that cannot be replied to by the client (neither client nor server has patch https://review.whamcloud.com/50530 " The following is Oleg's analysis of the kernel debug logs on the server (with "+rpctrace+dlmtrace+lnet" enabled):
It is definitely possible that the blocking AST request is generated with the wrong NID for the reply buffer; we haven't yet looked into that code to confirm. |
| Comment by Gerrit Updater [ 05/Feb/24 ] |
|
|