[LU-14555] lnet_check_route_inconsistency() complains when hops == -1 Created: 26/Mar/21 Updated: 13/Jan/23 Resolved: 13/Jan/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Gian-Carlo Defazio |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
RHEL 8 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have the following configuration:
2.14_servers == o2ib100 == 2.12_routers == tcp129 == 2.12_routers == o2ib18 == 2.12_clients
Discovery is disabled, and the routes are configured statically, on all the systems. This causes LNet to complain vociferously on the console from lnet_check_route_inconsistency():
LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
If LNet is configured so that there is only one route to any given endpoint, even on a multi-hop network, there is no value to spending sysadmin time determining and setting the hop counts as far as I can tell. And setting hops is optional according to the Lustre Operations Manual. Is hop count actually required in 2.14 due to |
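For anyone triaging how often this message fires per route, the console line quoted in the description can be split into its fields. The following is a minimal sketch, assuming the message text keeps exactly the wording shown in this ticket; the names `ROUTE_MSG` and `parse_route_warning` are hypothetical helpers and not part of Lustre or any LNet tool.

```python
import re

# Pattern derived only from the message text quoted in this ticket;
# other Lustre versions may word the message differently.
ROUTE_MSG = re.compile(
    r"route (?P<net>\S+)->(?P<gateway>\S+) is detected to be "
    r"(?P<kind>\S+) but hop count is set to (?P<hops>-?\d+)"
)


def parse_route_warning(line):
    """Return the route fields from one console line, or None if it does not match."""
    match = ROUTE_MSG.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["hops"] = int(fields["hops"])  # -1 is the value reported when hops is unset
    return fields


SAMPLE = (
    "LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) "
    "route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop "
    "but hop count is set to -1"
)

print(parse_route_warning(SAMPLE))
```

Feeding `dmesg` output through `parse_route_warning` line by line and counting the `(net, gateway)` pairs would show which routes produce the message.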
| Comments |
| Comment by Gerrit Updater [ 26/Mar/21 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/43127 |
| Comment by Olaf Faaland [ 26/Mar/21 ] |
|
I don't actually know if hops == 0 is either valid or possible, so in the patch I checked for that and reported it as well as reporting if hops == 1. |
| Comment by Olaf Faaland [ 26/Mar/21 ] |
|
Peter, There appears to be more to hop count than I realized. Please assign this to an engineer. thanks |
| Comment by Peter Jones [ 26/Mar/21 ] |
|
Serguei Could you please assist? Thanks Peter |
| Comment by Amir Shehata (Inactive) [ 27/Mar/21 ] |
|
Olaf, the check for inconsistency should happen only when we ping the route for aliveness. Do you see it happening more often? We can probably reduce the severity of the debug message, but I would like to make sure that it's not being printed more frequently than it should. |
| Comment by Olaf Faaland [ 27/Mar/21 ] |
|
Hi Amir, I'm not certain if this is only occurring when the route is pinged for aliveness. I'll look. But if setting the hop count is not required, then a console message is inappropriate. And if setting the hop count is required, then shouldn't that be enforced at the time routes are created?
[root@garteri:~]# pdsh -N -w e1 'dmesg -T | grep -w hop | fgrep 1.54 | tail' | sed 's/is detected to be multi-hop.*$//'
[Thu Mar 25 15:21:37 2021] LNet: 29145:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:23:07 2021] LNet: 29145:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:23:37 2021] LNet: 29143:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:24:08 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:24:37 2021] LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:25:38 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:27:07 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:29:37 2021] LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
[Thu Mar 25 15:34:08 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
thanks |
| Comment by Olaf Faaland [ 27/Mar/21 ] |
|
Amir, The debug log "discover" messages from lnet_check_routers for one router, .1.54:
2021-03-26 18:12:19.043797 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:12:49.798562 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:13:19.494570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:13:49.190567 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:14:19.910568 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:14:49.606563 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:15:19.302567 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:15:50.022572 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:16:20.742564 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:16:50.438577 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:17:20.134570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:17:50.854558 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:18:20.550570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:18:50.246560 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:19:20.966560 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:19:50.662570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
2021-03-26 18:20:20.358566 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
and the console log multi-hop messages from that period:
[Fri Mar 26 18:12:18 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:12:48 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:13:18 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:13:48 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:14:19 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:14:48 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:15:18 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:16:19 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:17:50 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
[Fri Mar 26 18:20:19 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1 |
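To make the cadence of the console messages easier to read, the gaps between them can be computed from their timestamps. This is a minimal sketch: the timestamps are copied from the console excerpt in this comment, the 30-second interval is an assumption inferred from the spacing of the "discover" lines rather than read from any configuration, and the function names are hypothetical.

```python
from datetime import datetime

# Console timestamps copied from the Fri Mar 26 2021 excerpt in this comment.
CONSOLE_STAMPS = [
    "18:12:18", "18:12:48", "18:13:18", "18:13:48", "18:14:19",
    "18:14:48", "18:15:18", "18:16:19", "18:17:50", "18:20:19",
]

# Assumption: inferred from the log spacing, not taken from LNet settings.
ASSUMED_INTERVAL_S = 30


def gaps_in_seconds(stamps):
    """Return the gap in whole seconds between each pair of consecutive timestamps."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


def nearest_multiple(gap, interval=ASSUMED_INTERVAL_S):
    """Return how many assumed intervals a gap is closest to."""
    return round(gap / interval)


for gap in gaps_in_seconds(CONSOLE_STAMPS):
    print(f"{gap:4d} s  ~ {nearest_multiple(gap)} x {ASSUMED_INTERVAL_S} s")
```

Under that assumption, every gap falls within a second or two of a whole number of 30-second intervals, even where consecutive console messages are more than a minute apart.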
| Comment by Amir Shehata (Inactive) [ 31/Mar/21 ] |
|
Looks like it's getting the keepalive every 30 seconds, and that's when we check route consistency. I think it'll be enough to reduce the message severity to just "net". It is not mandatory to set the hop count; the reason we have the check is to verify configuration consistency. However, if it's not standard to explicitly specify the hop count when configuring the route, then the check becomes less effective. |
| Comment by Gerrit Updater [ 23/Mar/22 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46918 |
| Comment by Gerrit Updater [ 11/Jul/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46918/ |
| Comment by Peter Jones [ 11/Jul/22 ] |
|
Landed for 2.16 |
| Comment by Gian-Carlo Defazio [ 08/Dec/22 ] |
|
Just adding the requirement that avoid_asym_router_failure be true doesn't prevent the warning from being logged, because we have avoid_asym_router_failure=1. However, due to changes in the code since the ticket was made, I've made a change which further reduces the conditions in which the warning is logged: having hops undefined for a multi-hop route is no longer considered an inconsistent configuration. |
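The change in behavior described in this comment can be modeled as a small predicate. This is a simplified sketch and not the code in router.c: the function name, the `undefined_is_ok` flag, and the treatment of hops values 0 and 1 are assumptions drawn from the comments in this ticket. The only values taken directly from the logs are the -1 reported for an unset hop count and the multi-hop classification of the route.

```python
HOPS_UNDEFINED = -1  # value reported in the console message when hops is not set


def multi_hop_route_is_flagged(hops, undefined_is_ok):
    """Model whether a route detected as multi-hop would be reported as inconsistent.

    undefined_is_ok=False models the behavior reported in this ticket;
    undefined_is_ok=True models the behavior after the later change.
    """
    if hops == HOPS_UNDEFINED:
        # The later change stops treating an unset hop count as inconsistent.
        return not undefined_is_ok
    # Assumption: an explicit hop count of 0 or 1 contradicts a multi-hop route.
    return hops < 2


# Behavior reported in the ticket: unset hops on a multi-hop route is flagged.
print(multi_hop_route_is_flagged(HOPS_UNDEFINED, undefined_is_ok=False))
# Behavior after the later change: unset hops is accepted.
print(multi_hop_route_is_flagged(HOPS_UNDEFINED, undefined_is_ok=True))
```

In this model an explicitly configured hop count that contradicts the detected topology is still reported, so the check keeps its value for sites that do set hops.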
| Comment by Gerrit Updater [ 08/Dec/22 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49352 |
| Comment by Gerrit Updater [ 13/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49352/ |
| Comment by Peter Jones [ 13/Jan/23 ] |
|
Landed for 2.16 |