[LU-14941] Remote NI recovery pings do not work in routed environment Created: 16/Aug/21  Updated: 30/Nov/21  Resolved: 30/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Issue impacts all Lustre versions with LNet health.

The routing algorithm does not ensure that recovery pings go to correct NI. In fact, they are actively avoided since, by definition, the health of that NI is sub-optimal. Here's a simple test showing the problem (test requires LU-14939):

sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show --nid 192.168.2.35@tcp1
peer:
    - primary nid: 192.168.2.34@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.34@tcp1
          state: NA
        - nid: 192.168.2.35@tcp1
          state: NA
sles15c01:/home/hornc/lustre-filesystem # lctl list_nids
192.168.2.38@tcp2
192.168.2.39@tcp2
sles15c01:/home/hornc/lustre-filesystem # lctl show_route
net               tcp1 hops 4294967295 gw                192.168.2.33@tcp2 up pri 0
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 1000
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer set --health 0 --nid 192.168.2.35@tcp1
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 0
sles15c01:/home/hornc/lustre-filesystem # sleep 30
sles15c01:/home/hornc/lustre-filesystem # lnetctl peer show -v 2 --nid 192.168.2.35@tcp1 | egrep -e nid -e health
    - primary nid: 192.168.2.34@tcp1
        - nid: 192.168.2.34@tcp1
          health stats:
              health value: 1000
        - nid: 192.168.2.35@tcp1
          health stats:
              health value: 0
sles15c01:/home/hornc/lustre-filesystem # lctl dk > /tmp/dk.log
sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep -c TRACE
513
sles15c01:/home/hornc/lustre-filesystem # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.35@tcp1'
sles15c01:/home/hornc/lustre-filesystem #


 Comments   
Comment by Gerrit Updater [ 23/Aug/21 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/44730
Subject: LU-14941 lnet: Fix source specified to routed destination
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ebea8f20faf5ad20aace451de657452ad56aa023

Comment by Chris Horn [ 18/Oct/21 ]

Test config:

Router:

sles15s02:~ # lctl list_nids
192.168.2.32@tcp1
192.168.2.33@tcp2
sles15s02:~ #

Clients:

sles15s01:~ # lctl list_nids
192.168.2.30@tcp2
192.168.2.31@tcp2
sles15s01:~ # lctl show_route
net               tcp1 hops 4294967295 gw                192.168.2.33@tcp2 up pri 0
sles15s01:~ #

sles15c01:~ # lctl list_nids
192.168.2.38@tcp1
192.168.2.39@tcp1
sles15c01:~ # lctl show_route
net               tcp2 hops 4294967295 gw                192.168.2.32@tcp1 up pri 0
sles15c01:~ #

Send a couple messages from c01 to s01:

sles15c01:~ # lnetctl discover 192.168.2.30@tcp2
discover:
    - primary nid: 192.168.2.30@tcp2
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.30@tcp2
        - nid: 192.168.2.31@tcp2
sles15c01:~ # lnetctl discover 192.168.2.30@tcp2
discover:
    - primary nid: 192.168.2.30@tcp2
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.30@tcp2
        - nid: 192.168.2.31@tcp2
sles15c01:~ # lnetctl discover 192.168.2.30@tcp2
discover:
    - primary nid: 192.168.2.30@tcp2
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.30@tcp2
        - nid: 192.168.2.31@tcp2

Check that we've received messages from both s01's NIs:

sles15c01:~ # lctl dk | grep lnet_parse | egrep -e 192.168.2.30@tcp2 -e 192.168.2.31@tcp2
00000400:00000200:0.0:1629482198.234301:0:16222:0:(lib-move.c:4429:lnet_parse()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1) <- 192.168.2.30@tcp2 : REPLY - for me
00000400:00000200:0.0:1629482198.234307:0:16222:0:(lib-move.c:4194:lnet_parse_reply()) 192.168.2.39@tcp1: Reply from 12345-192.168.2.30@tcp2 of length 64/64 into md 0x19
00000400:00000200:0.0:1629482198.235691:0:16222:0:(lib-move.c:4429:lnet_parse()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1) <- 192.168.2.31@tcp2 : ACK - for me
00000400:00000200:0.0:1629482198.235695:0:16222:0:(lib-move.c:4247:lnet_parse_ack()) 192.168.2.39@tcp1: ACK from 12345-192.168.2.31@tcp2 into md 0x1d
00000400:00000200:0.0:1629482198.971600:0:16222:0:(lib-move.c:4429:lnet_parse()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1) <- 192.168.2.30@tcp2 : REPLY - for me
00000400:00000200:0.0:1629482198.971606:0:16222:0:(lib-move.c:4194:lnet_parse_reply()) 192.168.2.38@tcp1: Reply from 12345-192.168.2.30@tcp2 of length 64/64 into md 0x21
00000400:00000200:0.0:1629482199.940773:0:16222:0:(lib-move.c:4429:lnet_parse()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1) <- 192.168.2.31@tcp2 : REPLY - for me
00000400:00000200:0.0:1629482199.940782:0:16222:0:(lib-move.c:4194:lnet_parse_reply()) 192.168.2.39@tcp1: Reply from 12345-192.168.2.31@tcp2 of length 64/64 into md 0x25

Set health to 0:

sles15c01:~ # lnetctl peer set --health 0 --nid 192.168.2.30@tcp2
sles15c01:~ # lctl dk | grep recovery
00000400:00000200:0.0:1629482216.841269:0:16310:0:(peer.c:4035:lnet_peer_ni_add_to_recoveryq_locked()) 192.168.2.30@tcp2 added to recovery queue. ping count: 0 next ping: 671368 last alive: 671349 health: 0
00000400:00000200:0.0:1629482217.032143:0:16202:0:(peer.c:4035:lnet_peer_ni_add_to_recoveryq_locked()) 192.168.2.30@tcp2 added to recovery queue. ping count: 1 next ping: 671370 last alive: 671349 health: 0
00000400:00000200:0.0:1629482217.032203:0:16222:0:(lib-move.c:3864:lnet_mt_event_handler()) 192.168.2.30@tcp2 recovery message sent successfully:0
00000400:00000200:0.0:1629482219.080116:0:16202:0:(peer.c:4035:lnet_peer_ni_add_to_recoveryq_locked()) 192.168.2.30@tcp2 added to recovery queue. ping count: 2 next ping: 671374 last alive: 671349 health: 0
00000400:00000200:0.0:1629482219.080213:0:16222:0:(lib-move.c:3864:lnet_mt_event_handler()) 192.168.2.30@tcp2 recovery message sent successfully:0
sles15c01:~ #

We can see NI is now queued for recovery correctly. However, after some time the health still has not recovered:

sles15c01:~ # sleep 30
sles15c01:~ # lnetctl peer show -v 2 --nid 192.168.2.30@tcp2 | egrep -e nid -e health
    - primary nid: 192.168.2.30@tcp2
        - nid: 192.168.2.30@tcp2
          health stats:
              health value: 0
        - nid: 192.168.2.31@tcp2
          health stats:
              health value: 1000
sles15c01:~ #

From debug log we can see that no recovery pings are being sent to the NI in recovery. Instead they are being sent to the healthy NI:

sles15c01:~ # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.30@tcp2'
sles15c01:~ # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.31@tcp2'
00000400:00000200:0.0:1629482844.232128:0:17222:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.31@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629482846.280128:0:17222:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1:<?>) -> 192.168.2.31@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629482850.376103:0:17222:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.31@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629482858.568104:0:17222:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1:<?>) -> 192.168.2.31@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629482874.952126:0:17222:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.31@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629482906.696122:0:17222:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.31@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
sles15c01:~ #

Re-test with the fix applied

sles15c01:~ # lnetctl discover 192.168.2.30@tcp2
discover:
    - primary nid: 192.168.2.30@tcp2
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.30@tcp2
        - nid: 192.168.2.31@tcp2
sles15c01:~ # lnetctl discover 192.168.2.30@tcp2
discover:
    - primary nid: 192.168.2.30@tcp2
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.30@tcp2
        - nid: 192.168.2.31@tcp2
sles15c01:~ # lnetctl discover 192.168.2.30@tcp2
discover:
    - primary nid: 192.168.2.30@tcp2
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.30@tcp2
        - nid: 192.168.2.31@tcp2
sles15c01:~ # lctl dk | grep lnet_parse | egrep -e 192.168.2.30@tcp2 -e 192.168.2.31@tcp2
00000400:00000200:0.0:1629483126.305212:0:17591:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1) <- 192.168.2.30@tcp2 : REPLY - for me
00000400:00000200:0.0:1629483126.305218:0:17591:0:(lib-move.c:4199:lnet_parse_reply()) 192.168.2.39@tcp1: Reply from 12345-192.168.2.30@tcp2 of length 64/64 into md 0x19
00000400:00000200:0.0:1629483126.306293:0:17591:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1) <- 192.168.2.30@tcp2 : ACK - for me
00000400:00000200:0.0:1629483126.306299:0:17591:0:(lib-move.c:4252:lnet_parse_ack()) 192.168.2.39@tcp1: ACK from 12345-192.168.2.30@tcp2 into md 0x1d
00000400:00000200:0.0:1629483127.213996:0:17591:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1) <- 192.168.2.31@tcp2 : REPLY - for me
00000400:00000200:0.0:1629483127.214003:0:17591:0:(lib-move.c:4199:lnet_parse_reply()) 192.168.2.38@tcp1: Reply from 12345-192.168.2.31@tcp2 of length 64/64 into md 0x21
00000400:00000200:0.0:1629483128.216060:0:17591:0:(lib-move.c:4434:lnet_parse()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1) <- 192.168.2.30@tcp2 : REPLY - for me
00000400:00000200:0.0:1629483128.216067:0:17591:0:(lib-move.c:4199:lnet_parse_reply()) 192.168.2.39@tcp1: Reply from 12345-192.168.2.30@tcp2 of length 64/64 into md 0x25
sles15c01:~ # lnetctl peer set --health 0 --nid 192.168.2.30@tcp2
sles15c01:~ # lnetctl peer show -v 2 --nid 192.168.2.30@tcp2 | egrep -e nid -e health
    - primary nid: 192.168.2.30@tcp2
        - nid: 192.168.2.30@tcp2
          health stats:
              health value: 0
        - nid: 192.168.2.31@tcp2
          health stats:
              health value: 1000
sles15c01:~ # sleep 30
sles15c01:~ # lnetctl peer show -v 2 --nid 192.168.2.30@tcp2 | egrep -e nid -e health
    - primary nid: 192.168.2.30@tcp2
        - nid: 192.168.2.30@tcp2
          health stats:
              health value: 0
        - nid: 192.168.2.31@tcp2
          health stats:
              health value: 1000
sles15c01:~ # lctl dk > /tmp/dk.log
sles15c01:~ # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.30@tcp2'
00000400:00000200:0.0:1629483137.480098:0:17571:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.30@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629483139.528115:0:17571:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1:<?>) -> 192.168.2.30@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629483143.624105:0:17571:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.30@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629483151.816108:0:17571:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.39@tcp1(192.168.2.39@tcp1:<?>) -> 192.168.2.30@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
00000400:00000200:0.0:1629483167.176149:0:17571:0:(lib-move.c:1907:lnet_handle_send()) TRACE: 192.168.2.38@tcp1(192.168.2.38@tcp1:<?>) -> 192.168.2.30@tcp2(192.168.2.30@tcp2:192.168.2.32@tcp1) <?> : GET try# 0
sles15c01:~ # grep lnet_handle_send /tmp/dk.log | grep TRACE | grep -- '-> 192.168.2.31@tcp2'
sles15c01:~ #

We can see that the recovery pings are now going to the correct NI.

Comment by Gerrit Updater [ 30/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44730/
Subject: LU-14941 lnet: Fix source specified to routed destination
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 98da4ace43a6c4c59e7981bf0fb649005237d12f

Comment by Peter Jones [ 30/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:14:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.