[LU-14943] LNet health recovery of peer NIs on remote networks does not work correctly Created: 16/Aug/21  Updated: 30/Aug/21  Resolved: 30/Aug/21

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This is the reply to a recovery ping sent to 192.168.2.35@tcp1:

00000400:00000200:0.0:1628809321.584402:0:1949:0:(lib-move.c:3854:lnet_mt_event_handler()) Received event: 5 status: 0
00000400:00000200:0.0:1628809321.584404:0:1949:0:(lib-move.c:3869:lnet_mt_event_handler()) 192.168.2.35@tcp1 recovery message sent successfully:0
00000400:00000200:0.0:1628809321.585814:0:1949:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp2(192.168.2.39@tcp2) <- 192.168.2.35@tcp1 : REPLY - for me
00000400:00000200:0.0:1628809321.585821:0:1949:0:(lib-move.c:4199:lnet_parse_reply()) 192.168.2.39@tcp2: Reply from 12345-192.168.2.35@tcp1 of length 64/64 into md 0x25
00000400:00000200:0.0:1628809321.585827:0:1949:0:(lib-msg.c:1062:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
00000400:00000200:0.0:1628809321.585830:0:1949:0:(lib-msg.c:836:lnet_health_check()) health check: 192.168.2.39@tcp2->192.168.2.33@tcp2: REPLY: OK

Note that the reply comes from 192.168.2.35@tcp1, but lnet_health_check() is evaluating the NID of the router that forwarded the message, 192.168.2.33@tcp2.
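
To make the observation easier to check, here is a minimal Python sketch that extracts the two NIDs from the lnet_parse() and lnet_health_check() lines quoted above and compares them. The regular expressions and the compare_nids() helper are hypothetical and written only for this log excerpt; they are not part of Lustre or lnetctl. The sketch only shows what the log records. It does not by itself indicate a defect, and this ticket was ultimately resolved as Not a Bug.

```python
import re

# Two lines copied verbatim from the debug log in the description.
LOG = """\
00000400:00000200:0.0:1628809321.585814:0:1949:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp2(192.168.2.39@tcp2) <- 192.168.2.35@tcp1 : REPLY - for me
00000400:00000200:0.0:1628809321.585830:0:1949:0:(lib-msg.c:836:lnet_health_check()) health check: 192.168.2.39@tcp2->192.168.2.33@tcp2: REPLY: OK
"""

# Assumption: NIDs in this excerpt look like <IPv4>@<net type><net number>.
NID = r"[0-9.]+@[a-z]+[0-9]*"
TRACE_RE = re.compile(rf"TRACE: ({NID})\({NID}\) <- ({NID}) : (\w+)")
HEALTH_RE = re.compile(rf"health check: ({NID})->({NID}): (\w+): (\w+)")


def net_of(nid):
    """Return the network part of a NID, e.g. 'tcp2' for '192.168.2.39@tcp2'."""
    return nid.split("@", 1)[1]


def compare_nids(log):
    """Compare the message originator with the NID the health check reports on."""
    trace = TRACE_RE.search(log)
    health = HEALTH_RE.search(log)
    if trace is None or health is None:
        raise ValueError("expected one TRACE line and one 'health check:' line")

    local_ni = health.group(1)        # local NI that received the message
    health_checked = health.group(2)  # NID named in the health check line
    originator = trace.group(2)       # NID the REPLY was sent from

    return {
        "originator": originator,
        "health_checked": health_checked,
        "same_nid": originator == health_checked,
        # True when the checked NID is on the same network as the local NI
        "checked_on_local_net": net_of(local_ni) == net_of(health_checked),
    }


if __name__ == "__main__":
    for key, value in compare_nids(LOG).items():
        print(f"{key}: {value}")
```

For the excerpt above, the originator is on tcp1 while the NID named in the health check line is on tcp2, the same network as the local NI that received the reply.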



 Comments   
Comment by Gerrit Updater [ 23/Aug/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/44727
Subject: LU-14943 lnet: Allow specifying a source NID for lnetctl ping
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2ac4dac5a08bdd14cf77c2d7705e04ac48cee007

Comment by Gerrit Updater [ 23/Aug/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/44731
Subject: LU-14943 lnet: Update health of message originator on receive
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6763e07de7c69f28204e37f352788d9b2ea3ca17

Comment by Chris Horn [ 30/Aug/21 ]

After further review of the LNet health architecture, it was determined that this issue is not a bug.
