[LU-14943] LNet health recovery of peer NIs on remote networks does not work correctly Created: 16/Aug/21 Updated: 30/Aug/21 Resolved: 30/Aug/21 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This is a reply to a recovery ping:

00000400:00000200:0.0:1628809321.584402:0:1949:0:(lib-move.c:3854:lnet_mt_event_handler()) Received event: 5 status: 0
00000400:00000200:0.0:1628809321.584404:0:1949:0:(lib-move.c:3869:lnet_mt_event_handler()) 192.168.2.35@tcp1 recovery message sent successfully:0
00000400:00000200:0.0:1628809321.585814:0:1949:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp2(192.168.2.39@tcp2) <- 192.168.2.35@tcp1 : REPLY - for me
00000400:00000200:0.0:1628809321.585821:0:1949:0:(lib-move.c:4199:lnet_parse_reply()) 192.168.2.39@tcp2: Reply from 12345-192.168.2.35@tcp1 of length 64/64 into md 0x25
00000400:00000200:0.0:1628809321.585827:0:1949:0:(lib-msg.c:1062:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
00000400:00000200:0.0:1628809321.585830:0:1949:0:(lib-msg.c:836:lnet_health_check()) health check: 192.168.2.39@tcp2->192.168.2.33@tcp2: REPLY: OK

Note, the reply is from 192.168.2.35@tcp1, but lnet_health_check is looking at the router NID that forwarded the message, 192.168.2.33@tcp2. |
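The mismatch described above can be checked mechanically from the debug log. The following is a minimal sketch, not part of Lustre: the helper names (`net_of`, `summarize`), the regular expressions, and the abridged log text are assumptions made for illustration, based only on the two relevant lines quoted in this ticket. It extracts the NID that sent the REPLY and the NID that the health check was applied to, then compares them.

```python
import re

# Two lines from the debug log in this ticket, abridged to the relevant fields.
LOG = """\
(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp2(192.168.2.39@tcp2) <- 192.168.2.35@tcp1 : REPLY - for me
(lib-msg.c:836:lnet_health_check()) health check: 192.168.2.39@tcp2->192.168.2.33@tcp2: REPLY: OK
"""

# Assumed NID shape for this log: IPv4 address, '@', network name (e.g. tcp1).
NID = r"[0-9.]+@[a-z]+[0-9]*"
TRACE_RE = re.compile(rf"TRACE: ({NID})\({NID}\) <- ({NID}) : (\w+)")
HEALTH_RE = re.compile(rf"health check: ({NID})->({NID}): (\w+): (\w+)")


def net_of(nid):
    """Return the network part of a NID, e.g. 'tcp1' for '192.168.2.35@tcp1'."""
    return nid.split("@", 1)[1]


def summarize(log):
    """Compare the REPLY sender with the NID the health check was applied to."""
    trace = TRACE_RE.search(log)
    health = HEALTH_RE.search(log)
    if trace is None or health is None:
        return None
    local_nid, sender, _msg_type = trace.groups()
    _local, checked, _type, _status = health.groups()
    return {
        "sender": sender,
        "checked": checked,
        # The sender is on a remote network if its net differs from the local NI's.
        "sender_is_remote": net_of(sender) != net_of(local_nid),
        "checked_is_sender": checked == sender,
    }


if __name__ == "__main__":
    print(summarize(LOG))
```

For the log above, the sender is on a different network (tcp1) than the local NI (tcp2), and the NID that was health-checked is the router's NID on tcp2 rather than the sender's.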
| Comments |
| Comment by Gerrit Updater [ 23/Aug/21 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/44727 |
| Comment by Gerrit Updater [ 23/Aug/21 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/44731 |
| Comment by Chris Horn [ 30/Aug/21 ] |
|
After further review of the LNet health architecture, it was determined that this issue is not a bug. |