LU-15478: Regression in 005bd7075c LU-10391 lnet: Change lnet_send() to take large-addr nids

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.15.0
    • Severity: 3

    Description

      Routed, source-any sends were broken by https://review.whamcloud.com/43599

      lnet_handle_find_routed_path() calls lnet_find_route_locked() passing LNET_NID_NET(src_nid) as an argument.

                      best_route = lnet_find_route_locked(best_rnet,
                                                          LNET_NID_NET(src_nid),
                                                          sd->sd_best_lpni,
                                                          &last_route, &gwni);
      

      This network ID is in turn passed to lnet_find_best_lpni() where it is compared against LNET_NET_ANY:

      static inline struct lnet_peer_ni *
      lnet_find_best_lpni(struct lnet_ni *lni, lnet_nid_t dst_nid,
                          struct lnet_peer *peer, __u32 net_id)
      {
              struct lnet_peer_net *peer_net;
      
              /* find the best_lpni on any local network */
              if (net_id == LNET_NET_ANY) {
      

      Where

      #define LNET_NET_ANY LNET_NIDNET(LNET_NID_ANY)
       == LNET_NIDNET(-1)
       == 0xffffffff
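
The expansion above can be checked in isolation. The following is a minimal userspace sketch, assuming the legacy 64-bit NID keeps the network number in its upper 32 bits. The `sketch_` and `SKETCH_` names are hypothetical stand-ins, not the LNet definitions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the legacy 64-bit lnet_nid_t. */
typedef uint64_t sketch_nid_t;

/* Assumed layout: network number in the upper 32 bits, address in the lower 32. */
#define SKETCH_NID_ANY      ((sketch_nid_t)-1)
#define SKETCH_NIDNET(nid)  ((uint32_t)(((nid) >> 32) & 0xffffffffu))
#define SKETCH_NET_ANY      SKETCH_NIDNET(SKETCH_NID_ANY)

/* Compile-time check that the all-ones NID maps to the all-ones network. */
static_assert(SKETCH_NET_ANY == 0xffffffffu,
              "any-NID must map to the all-ones network number");

/* Returns the "any network" value derived from the all-ones NID. */
static inline uint32_t sketch_net_any(void)
{
        return SKETCH_NET_ANY;
}
```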
      

      When no source NID is specified, the network ID passed to lnet_find_best_lpni() is equal to

      LNET_NID_NET(&LNET_ANY_NID)
      

      Where:

      static inline __u32 LNET_NID_NET(const struct lnet_nid *nid)
      {
              return LNET_MKNET(nid->nid_type, __be16_to_cpu(nid->nid_num));
      }
      

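      LNET_NID_NET() builds its result from nid_type and nid_num, so for LNET_ANY_NID it does not yield LNET_NET_ANY and the comparison in lnet_find_best_lpni() never matches. I think we need an "extended NID" version of LNET_NET_ANY (call it LNET_ANY_NET) such that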

      #define LNET_ANY_NET LNET_NID_NET(&LNET_ANY_NID)
      

      Alternatively, LNET_NID_NET() could be modified to check for LNET_ANY_NID and return LNET_NET_ANY.
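
To see why the comparison fails and how the second option behaves, here is a minimal userspace sketch. It assumes that `nid_type` is an 8-bit field, that `LNET_MKNET()` places the type above a 16-bit network number, and that `LNET_ANY_NID` sets every field to all ones. The struct layout and all `sketch_` names are hypothetical simplifications, and the byte-order conversion of `nid_num` is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NET_ANY 0xffffffffu  /* value of LNET_NET_ANY per the description */

/* Simplified, hypothetical large-address NID; the real struct may differ. */
struct sketch_nid {
        uint8_t  nid_size;
        uint8_t  nid_type;
        uint16_t nid_num;       /* kept in host order here for simplicity */
        uint32_t nid_addr[4];
};

/* Assumed "any" NID: every field set to all ones. */
static const struct sketch_nid sketch_any_nid = {
        0xff, 0xff, 0xffff, { ~0u, ~0u, ~0u, ~0u }
};

/* Assumed LNET_MKNET(): type above bit 16, number in the low 16 bits. */
static inline uint32_t sketch_mknet(uint8_t type, uint16_t num)
{
        return ((uint32_t)type << 16) | num;
}

/* True only when every field matches the assumed "any" NID. */
static inline bool sketch_nid_is_any(const struct sketch_nid *nid)
{
        if (nid->nid_size != 0xff || nid->nid_type != 0xff ||
            nid->nid_num != 0xffff)
                return false;
        for (int i = 0; i < 4; i++)
                if (nid->nid_addr[i] != ~0u)
                        return false;
        return true;
}

/* Before the fix: an 8-bit type can never fill the top byte of the result. */
static inline uint32_t sketch_nid_net_old(const struct sketch_nid *nid)
{
        return sketch_mknet(nid->nid_type, nid->nid_num);
}

/* Second option: map the "any" NID to SKETCH_NET_ANY explicitly. */
static inline uint32_t sketch_nid_net_fixed(const struct sketch_nid *nid)
{
        if (sketch_nid_is_any(nid))
                return SKETCH_NET_ANY;
        return sketch_mknet(nid->nid_type, nid->nid_num);
}

/* Exercises both variants against the "any" NID. */
static inline void sketch_self_check(void)
{
        assert(sketch_nid_net_old(&sketch_any_nid) != SKETCH_NET_ANY);
        assert(sketch_nid_net_fixed(&sketch_any_nid) == SKETCH_NET_ANY);
}
```

Under these assumptions the old variant returns 0x00ffffff for the "any" NID, which never equals 0xffffffff. The subject of the merged patch, "Check LNET_NID_IS_ANY in LNET_NID_NET", suggests the second option is the one that landed.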

          Activity

            hornc Chris Horn added a comment -

            Test report for LU-15478

            Build w/o the fix:

            [hornc@ct7-adm lustre-filesystem]$ git reset --hard 78be823f33
            HEAD is now at 78be823f33 LU-15218 quota: delete unused quota ID
            [hornc@ct7-adm lustre-filesystem]$ make -j 32
            ...
            

            Show bug:

            [root@ct7-adm tests]# lctl list_nids
            10.73.10.10@tcp
            [root@ct7-adm tests]# lctl show_route
            net               tcp1 hops 4294967295 gw                  10.73.10.11@tcp up pri 0
            [root@ct7-adm tests]# lctl ping 10.73.10.12@tcp1
            failed to ping 10.73.10.12@tcp1: Input/output error
            [root@ct7-adm tests]# dmesg | tail
            [11269.739430] Lustre: DEBUG MARKER: == sanity-lnet test complete, duration 7 sec ============= 01:17:25 (1643419045)
            [11270.753951] LNet: Removed LNI 10.73.10.10@tcp
            [11271.691683] LNet: Removed LNI 10.73.10.10@tcp1
            [12797.050427] LNet: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 1
            [12797.066287] alg: No test for adler32 (adler32-zlib)
            [12797.853703] Lustre: DEBUG MARKER: Starting lnet
            [12797.902831] LNet: Added LNI 10.73.10.10@tcp [8/256/0/180]
            [12797.903911] LNet: Accept secure, port 988
            [12797.920874] LNet: 30464:0:(router.c:798:lnet_add_route()) Use hops = 1 for a single-hop route when avoid_asym_router_failure feature is enabled
            [12815.937019] LNetError: 30522:0:(lib-move.c:2285:lnet_handle_find_routed_path()) no route to 10.73.10.12@tcp1 from <?>
            [root@ct7-adm tests]#
            

            Apply fix:

            [hornc@ct7-adm lustre-filesystem]$ git reset --hard fbbc1258a0
            HEAD is now at fbbc1258a0 LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
            [hornc@ct7-adm lustre-filesystem]$ make -j 32
            ...
            

            Now ping works as expected:

            [root@ct7-adm tests]# lctl ping 10.73.10.12@tcp1
            12345-0@lo
            12345-10.73.10.12@tcp1
            [root@ct7-adm tests]#
            
            pjones Peter Jones added a comment -

            Landed for 2.15

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46292/
            Subject: LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fbbc1258a057ff718dd9ba41dc32faf2aadc3a90

            gerrit Gerrit Updater added a comment -

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46292
            Subject: LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7e21df0eaaf29326e51a2dc5dfccff1689adb9e1

            People

              Assignee: hornc Chris Horn
              Reporter: hornc Chris Horn