
lnet_check_route_inconsistency() complains when hops == -1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.14.0
    • Environment: RHEL 8, multi-hop network, hops not set
    • Severity: 3

    Description

      We have the following configuration:

      2.14_servers == o2ib100 == 2.12_routers == tcp129 == 2.12_routers == o2ib18 == 2.12_clients

      Discovery is disabled, and the routes are configured statically, on all the systems.

      This causes LNet to complain vociferously on the console from lnet_check_route_inconsistency():

      LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1

      If LNet is configured so that there is only one route to any given endpoint, even on a multi-hop network, then as far as I can tell there is no value in spending sysadmin time determining and setting the hop counts. Setting hops is also optional according to the Lustre Operations Manual.

      Is hop count actually required in 2.14 due to LU-13029 and LU-13785?

        Activity

          [LU-14555] lnet_check_route_inconsistency() complains when hops == -1
          pjones Peter Jones added a comment -

          Landed for 2.16

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49352/
          Subject: LU-14555 lnet: asym route inconsistency warning
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 6aed5df1771c299b527251b0e18ff9f6cb95dd75

          gerrit Gerrit Updater added a comment -

          "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49352
          Subject: LU-14555 lnet: asym route inconsistency warning
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: af572e46f883f8736af8055a2030eb640792130c

          defazio Gian-Carlo Defazio added a comment -

          Just adding the requirement that avoid_asym_router_failure be true doesn't prevent the warning from being logged, because we have avoid_asym_router_failure=1.

          However, due to changes in the code since the ticket was made, I've made a change which further reduces the conditions under which the warning is logged: having hops undefined for a multi-hop route is no longer considered an inconsistent configuration.
          pjones Peter Jones added a comment -

          Landed for 2.16

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46918/
          Subject: LU-14555 lnet: asym route inconsistency warning
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 6ab060e58e6b3f38b0c8d57b56fec887c6fe9fb6

          gerrit Gerrit Updater added a comment -

          "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46918
          Subject: LU-14555 lnet: change route inconsistency warnings
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 572b52a488ac7a186be2f808668e163e9ac850b2

          ashehata Amir Shehata (Inactive) added a comment -

          It looks like the keepalive arrives every 30 seconds, and that is when the route consistency check runs. I think it will be enough to reduce the message severity to just "net". Setting the hop count is not mandatory. The reason we have the check is to verify configuration consistency; however, if it is not standard practice to explicitly specify the hop count when configuring a route, the check becomes less effective.
          ofaaland Olaf Faaland added a comment - - edited

          Amir,
          The timing of the message does not seem to correlate well with the timing of the call to lnet_check_routers(). If I'm looking at the wrong code, let me know.

          The debug log "discover" messages from lnet_check_routers for one router, .1.54:

          2021-03-26 18:12:19.043797 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:12:49.798562 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:13:19.494570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:13:49.190567 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:14:19.910568 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:14:49.606563 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:15:19.302567 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:15:50.022572 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:16:20.742564 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:16:50.438577 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:17:20.134570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:17:50.854558 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:18:20.550570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:18:50.246560 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:19:20.966560 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:19:50.662570 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2
          2021-03-26 18:20:20.358566 00000400:00000200:3.0::0:30243:0:(router.c:1231:lnet_check_routers()) discover 172.19.1.54@o2

          and the console log multi-hop messages from that period:

          [Fri Mar 26 18:12:18 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:12:48 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:13:18 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:13:48 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:14:19 2021] LNet: 30238:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:14:48 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:15:18 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:16:19 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:17:50 2021] LNet: 30240:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1
          [Fri Mar 26 18:20:19 2021] LNet: 30237:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100 is detected to be multi-hop but hop count is set to -1

          ofaaland Olaf Faaland added a comment -

          Hi Amir,

          I'm not certain if this is only occurring when the route is pinged for aliveness. I'll look. But if setting the hop count is not required, then a console message is inappropriate. And if setting the hop count is required, then shouldn't that be enforced at the time routes are created?

          [root@garteri:~]# pdsh -N -w e1 'dmesg -T | grep -w hop | fgrep 1.54 | tail' | sed 's/is detected to be multi-hop.*$//'
          [Thu Mar 25 15:21:37 2021] LNet: 29145:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:23:07 2021] LNet: 29145:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:23:37 2021] LNet: 29143:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:24:08 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:24:37 2021] LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:25:38 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:27:07 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:29:37 2021] LNet: 29144:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100
          [Thu Mar 25 15:34:08 2021] LNet: 29146:0:(router.c:384:lnet_check_route_inconsistency()) route o2ib18->172.19.1.54@o2ib100

          thanks

          People

            Assignee: defazio Gian-Carlo Defazio
            Reporter: ofaaland Olaf Faaland
            Votes: 0
            Watchers: 5