[LU-14555] lnet_check_route_inconsistency() complains when hops == -1 Created: 26/Mar/21  Updated: 13/Jan/23  Resolved: 13/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Gian-Carlo Defazio
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

RHEL 8
multi-hop network
hops not set


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have the following configuration:

2.14_servers == o2ib100 == 2.12_routers == tcp129 == 2.12_routers == o2ib18 == 2.12_clients

Discovery is disabled, and the routes are configured statically, on all the systems.

This causes LNet to complain vociferously on the console from lnet_check_route_inconsistency():

LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1

If LNet is configured so that there is only one route to any given endpoint, even on a multi-hop network, there is no value in spending sysadmin time determining and setting the hop counts, as far as I can tell. Setting hops is also optional according to the Lustre Operations Manual.

Is hop count actually required in 2.14 due to LU-13029 and LU-13785?



 Comments   
Comment by Gerrit Updater [ 26/Mar/21 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/43127
Subject: LU-14555 lnet: do not complain if hops == -1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4ebbbb068cf9d2f53b6923ffe7744dc562bc94fe

Comment by Olaf Faaland [ 26/Mar/21 ]

I don't actually know if hops == 0 is either valid or possible, so in the patch I checked for that and reported it as well as reporting if hops == 1.

Comment by Olaf Faaland [ 26/Mar/21 ]

Peter,

There appears to be more to hop count than I realized.  Please assign this to an engineer.

thanks

Comment by Peter Jones [ 26/Mar/21 ]

Serguei

Could you please assist?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 27/Mar/21 ]

Olaf, the check for inconsistency should happen only when we ping the route for aliveness.

Do you see it happening more often?

We can probably reduce the severity of the debug message. But would like to make sure that it's not being printed more frequently than it should.

Comment by Olaf Faaland [ 27/Mar/21 ]

Hi Amir,

I'm not certain if this is only occurring when the route is pinged for aliveness.  I'll look.  But if setting the hop count is not required, then a console message is inappropriate.   And if setting the hop count is required, then shouldn't that be enforced at the time routes are created?

[root@garteri:~]# pdsh -N -w e1 'dmesg -T | grep -w hop | fgrep 1.54 | tail' | sed 's/is detected to be multi-hop.*$//'
[Thu Mar 25 15:21:37 2021] LNet: 29145:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:23:07 2021] LNet: 29145:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:23:37 2021] LNet: 29143:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:24:08 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:24:37 2021] LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:25:38 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:27:07 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:29:37 2021] LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 
[Thu Mar 25 15:34:08 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 

thanks
 

Comment by Olaf Faaland [ 27/Mar/21 ]

Amir,
The timing of the message does not seem to correlate well with the timing of the call to lnet_check_routers(). If I'm looking at the wrong code, let me know.

The debug log "discover" messages from lnet_check_routers for one router, .1.54:

2021-03-26 18:12:19.043797 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:12:49.798562 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:13:19.494570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:13:49.190567 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:14:19.910568 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:14:49.606563 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:15:19.302567 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:15:50.022572 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:16:20.742564 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:16:50.438577 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:17:20.134570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:17:50.854558 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:18:20.550570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:18:50.246560 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:19:20.966560 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:19:50.662570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:20:20.358566 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2 

and the console log multi-hop messages from that period:

[Fri Mar 26 18:12:18 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:12:48 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:13:18 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:13:48 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:14:19 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:14:48 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:15:18 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:16:19 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:17:50 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:20:19 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1

Comment by Amir Shehata (Inactive) [ 31/Mar/21 ]

Looks like it's getting the keep alive every 30 seconds and that's when we do the route consistency check. I think it'll be enough to reduce the message severity to just "net". It is not mandatory to set the hop count; the reason we have the check is to verify configuration consistency. However, if it's not standard to explicitly specify the hop count when configuring the route, then the check becomes less effective.
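
The 30-second cadence can be checked against the console timestamps quoted in the previous comment. The following is a minimal sketch, not part of any Lustre tooling: the function names and the 30-second default period are assumptions chosen for illustration, and the timestamps are copied from the console log above.

```python
from datetime import datetime

# Console timestamps copied from the multi-hop messages quoted earlier
# in this ticket (Fri Mar 26 2021, route via 172.19.1.54@o2ib100).
CONSOLE_TIMES = [
    "Fri Mar 26 18:12:18 2021",
    "Fri Mar 26 18:12:48 2021",
    "Fri Mar 26 18:13:18 2021",
    "Fri Mar 26 18:13:48 2021",
    "Fri Mar 26 18:14:19 2021",
    "Fri Mar 26 18:14:48 2021",
    "Fri Mar 26 18:15:18 2021",
    "Fri Mar 26 18:16:19 2021",
    "Fri Mar 26 18:17:50 2021",
    "Fri Mar 26 18:20:19 2021",
]


def gaps_in_seconds(stamps):
    """Return the gap, in whole seconds, between consecutive timestamps."""
    times = [datetime.strptime(s, "%a %b %d %H:%M:%S %Y") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def periods_elapsed(gap, period=30):
    """Express a gap as the nearest whole number of periods (hypothetical helper)."""
    return round(gap / period)


if __name__ == "__main__":
    for gap in gaps_in_seconds(CONSOLE_TIMES):
        print(f"{gap:4d} s  ~ {periods_elapsed(gap)} x 30 s")
```

Every gap in this sample is close to a whole multiple of 30 seconds, which is consistent with the message being tied to a 30-second keep-alive even when some intervals produce no console line.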

Comment by Gerrit Updater [ 23/Mar/22 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46918
Subject: LU-14555 lnet: change route inconsistency warnings
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 572b52a488ac7a186be2f808668e163e9ac850b2

Comment by Gerrit Updater [ 11/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46918/
Subject: LU-14555 lnet: asym route inconsistency warning
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6ab060e58e6b3f38b0c8d57b56fec887c6fe9fb6

Comment by Peter Jones [ 11/Jul/22 ]

Landed for 2.16

Comment by Gian-Carlo Defazio [ 08/Dec/22 ]

Just adding the requirement that avoid_asym_router_failure be true doesn't prevent the warning from being logged, because we have avoid_asym_router_failure=1.

However, due to changes in the code since the ticket was made, I've made a change which further reduces the conditions in which the warning is logged: having hops undefined for a multi-hop route is no longer considered an inconsistent configuration.
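
As a rough illustration of the narrowed condition described here, the decision can be written as a small predicate. This is a sketch under stated assumptions, not the code in router.c: the function name, its parameters, and the rule that a configured hop count of 1 is what contradicts a route detected as multi-hop are all assumptions made for the example. Only the -1 "unset" value and the avoid_asym_router_failure requirement come from this ticket.

```python
# -1 is the value shown in the console message when hops is not configured.
HOPS_UNSET = -1


def should_warn(detected_multi_hop, configured_hops, avoid_asym_router_failure):
    """Hypothetical predicate: should the route inconsistency warning be logged?

    Assumptions for illustration only:
      * the warning is considered only when avoid_asym_router_failure is enabled;
      * an unset hop count (-1) is never treated as inconsistent;
      * a configured hop count of 1 contradicts a route detected as multi-hop.
    """
    if not avoid_asym_router_failure:
        return False
    if configured_hops == HOPS_UNSET:
        return False
    return bool(detected_multi_hop and configured_hops == 1)


if __name__ == "__main__":
    # The configuration from this ticket: multi-hop route, hops unset,
    # avoid_asym_router_failure=1.
    print(should_warn(True, HOPS_UNSET, True))
```

Under these assumptions, the configuration reported in this ticket (multi-hop route, hops left unset, avoid_asym_router_failure=1) no longer triggers the warning, while an explicit hop count that disagrees with the detected topology still does.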

Comment by Gerrit Updater [ 08/Dec/22 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49352
Subject: LU-14555 lnet: asym route inconsistency warning
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: af572e46f883f8736af8055a2030eb640792130c

Comment by Gerrit Updater [ 13/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49352/
Subject: LU-14555 lnet: asym route inconsistency warning
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6aed5df1771c299b527251b0e18ff9f6cb95dd75

Comment by Peter Jones [ 13/Jan/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:10:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.