[LU-12132] Lustre causes ARP requests for LNet routers to be sent on the wrong PKey IPoIB interface

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • Lustre 2.11.0
    • None
    • Ubuntu 16.04 with HWE kernel (4.15.0-46), MOFED 4.4, Lustre 2.11.56-32
    • 3

    Description

      On some of our systems we see ARP requests being sent over the wrong PKey IPoIB interface when mounting a Lustre filesystem.

      The ARP requests are for the LNet routers.

       

      Setup (historical reasons...):

      Two PKeys on the server side, both on the same physical IB fabric (two separate cards):

       - ib0: PKey 8001 -> IP net 172.27.1.21/24 (one of the MDSes)

       - ib1: PKey 8002 -> IP net 172.27.2.21/24 (same MDS)

      MDS01:

      options lnet routes="o2ib240 172.27.1.[201-214]@o2ib1; o2ib244 172.27.2.[201-214]@o2ib2; o2ib4 172.27.1.[101-108]@o2ib1; o2ib8 172.27.2.[101-108]@o2ib2; tcp1 172.27.1.[91-92]@o2ib1; tcp2 172.27.2.[91-92]@o2ib2"

      options lnet networks="o2ib1(ib0.8001), o2ib2(ib1.8002)"

       

      OSS01:

      options lnet routes="o2ib240 172.27.1.[201-214]@o2ib1; o2ib244 172.27.2.[201-214]@o2ib2; o2ib4 172.27.1.[101-108]@o2ib1; o2ib8 172.27.2.[101-108]@o2ib2; tcp1 172.27.1.[91-92]@o2ib1; tcp2 172.27.2.[91-92]@o2ib2"

      options lnet networks="o2ib1(ib1.8001), o2ib2(ib3.8002)"
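The bracketed ranges in the routes option are compact but hard to compare against a packet capture. A minimal sketch, assuming only the single `[lo-hi]` range form used in this ticket (real LNet route syntax allows more, such as hop counts and priorities, which this does not handle), expands a routes string into explicit gateway NIDs. The function name `expand_routes` is hypothetical.

```python
import re

def expand_routes(routes):
    """Expand an lnet 'routes' string into (remote_net, gateway_nid) pairs.

    Only handles entries shaped like 'o2ib4 172.27.1.[101-108]@o2ib1'.
    """
    pairs = []
    for entry in routes.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        remote_net, gateway = entry.split()
        m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", gateway)
        if m is None:
            # No bracketed range: the gateway is already a single NID.
            pairs.append((remote_net, gateway))
            continue
        prefix, lo, hi, suffix = m.groups()
        for n in range(int(lo), int(hi) + 1):
            pairs.append((remote_net, f"{prefix}{n}{suffix}"))
    return pairs

# Two of the server-side routes from the configuration above.
routes = "o2ib4 172.27.1.[101-108]@o2ib1; o2ib8 172.27.2.[101-108]@o2ib2"
for net, gw in expand_routes(routes):
    print(net, "via", gw)
```

The expanded list can then be compared with the gateway NIDs that actually show up in an ibdump.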

       

      LNet routers Lustre server side:

       - ib1: PKey 8001 -> IP net 172.27.1.101/24

       - ib1: PKey 8002 -> IP net 172.27.2.101/24

      Same Lnet router client side:

       - ib0: PKey 8004 -> IP net 172.27.4.11/23

       - ib0: PKey 8008 -> IP net 172.27.8.11/23

      options lnet networks="o2ib1(ib1.8001), o2ib2(ib1.8002),o2ib4(ib0.8004), o2ib8(ib0.8008)"

       

      Clients:

       - ib0: PKey 8004 -> IP net 172.27.5.198/23

       - ib0: PKey 8008 -> IP net 172.27.9.198/23

      options lnet routes="o2ib1 172.27.4.[11-18]@o2ib4; o2ib2 172.27.8.[11-18]@o2ib8"
      options lnet networks="o2ib4(ib0.8004), o2ib8(ib0.8008)"
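With /23 netmasks it is not obvious at a glance which PKey interface a given router address belongs to. A minimal sketch using Python's standard `ipaddress` module, with the client addresses taken from the list above, works out which interface an ARP request for a given router IP should leave on. The helper name `expected_iface` is hypothetical.

```python
import ipaddress

# Client PKey interfaces and addresses as listed in the description.
CLIENT_IFACES = {
    "ib0.8004": ipaddress.ip_interface("172.27.5.198/23"),
    "ib0.8008": ipaddress.ip_interface("172.27.9.198/23"),
}

def expected_iface(target):
    """Return the interface whose subnet contains target, or None."""
    addr = ipaddress.ip_address(target)
    for name, iface in CLIENT_IFACES.items():
        if addr in iface.network:
            return name
    return None

for router_ip in ("172.27.4.11", "172.27.8.11"):
    print(router_ip, "->", expected_iface(router_ip))
```

By this check, a request for 172.27.4.x belongs on ib0.8004 only, since 172.27.5.198/23 covers 172.27.4.0 through 172.27.5.255.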

       

      We have two sets of client nodes that are (as far as we can see) identical in configuration.

      When doing "modprobe lustre" everything is OK: the ARP table is created with correct mappings. (An ibdump of this shows everything working as it should; the file is available if needed.)

      But when mounting the file systems there are some problems.

      One set works perfectly; the other generates ARP requests for 172.27.4.x with source IP 172.27.8.x (or 172.27.9.x, depending on the host's IP).

       

      The bad ARPs are generated over the 8008 PKey interface, never over the 8004 interface.

      We have an ibdump of this problem available if it helps.

       

      The end result (since we currently have arp_ignore=1 on the ib0, ib0.8004, and ib0.8008 interfaces on the LNet routers) is that we get the following ARP table on the problematic clients:

      172.27.4.11 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE
      172.27.4.11 dev ib0.8008 FAILED
      172.27.8.11 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE
      172.27.4.12 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:11 STALE
      172.27.4.12 dev ib0.8008 FAILED
      172.27.8.12 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:11 STALE
      172.27.4.13 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:a1 STALE
      172.27.4.13 dev ib0.8008 FAILED
      172.27.8.13 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:a1 STALE
      ...
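Entries like these can be picked out mechanically. A minimal sketch, assuming neighbour entries in the "IP dev IFACE ... STATE" layout shown above and the two client subnets from the description, flags any entry whose IP address does not belong to the subnet of the interface it is listed on. The function name `misplaced_entries` is hypothetical.

```python
import ipaddress

# Subnets of the client PKey interfaces, derived from the /23 addresses
# in the description.
IFACE_NETS = {
    "ib0.8004": ipaddress.ip_network("172.27.4.0/23"),
    "ib0.8008": ipaddress.ip_network("172.27.8.0/23"),
}

def misplaced_entries(neigh_lines):
    """Return (ip, iface, state) for entries on an interface of another subnet."""
    bad = []
    for line in neigh_lines:
        fields = line.split()
        if len(fields) < 4 or fields[1] != "dev":
            continue  # skip "..." and anything not shaped like a neighbour entry
        ip, iface, state = fields[0], fields[2], fields[-1]
        net = IFACE_NETS.get(iface)
        if net is not None and ipaddress.ip_address(ip) not in net:
            bad.append((ip, iface, state))
    return bad

sample = [
    "172.27.4.11 dev ib0.8004 lladdr a0:00:02:20:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE",
    "172.27.4.11 dev ib0.8008 FAILED",
    "172.27.8.11 dev ib0.8008 lladdr a0:00:03:00:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:1d:26:61 STALE",
    "...",
]
for entry in misplaced_entries(sample):
    print(entry)
```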

       

      This causes severe problems with Lustre file system access.

       

      The bad ARPs seem to be generated when trying to read a file, not when actually mounting the filesystem.

          Activity


            ake_s Åke Sandgren added a comment -

            Yes, we knew this when we started using it, but it was the first available version which supported the kernel version we had, and 2.12 wasn't out yet.

            pjones Peter Jones added a comment -

            BTW I hope that it was clearly conveyed to you that 2.11.56 is a dev build not intended for production usage.

            pjones Peter Jones added a comment -

            ok


            ake_s Åke Sandgren added a comment -

            The problem appears to have been some incompatibility between the 2.7 (DDN) based servers and the 2.11.56-ish based LNet routers.

            Downgrading the LNet routers to 2.7 (DDN) made the problem go away (the clients are still at 2.11.56).

            So close this.

            ake_s Åke Sandgren added a comment -

            And no, I haven't tried using the 2.12 version on the clients yet.

            We got caught by an inconvenient upgrade to the Ubuntu Xenial kernel (4.4.0-143) that changed the definition of get_user_pages, so both the NVIDIA driver and MOFED failed to build. That forced us to make an unplanned upgrade to the HWE kernel (4.15), MOFED 4.4, and Lustre 2.11 (community version). The reason for 2.11 was that we had already made a deb package of it.

            Will test with 2.12 (and MOFED 4.5) later when we've finished upgrading everything...

            ake_s Åke Sandgren added a comment -

            This started out as a problem on the client side, which is running the community version of Lustre (and so are the LNet routers).

            I.e. the fact that the client is generating weird and incorrect ARP requests.

            Only when digging into the ibdump packets did I notice the odd behaviour on the server side.

            Those packets may very well be the reason that the client gets confused and sends the incorrect ARPs, but the ARPs shouldn't be sent even if the server is sending replies in an odd way.

            So there are likely two different problems interacting here: a client bug that generates ARP requests on the wrong IPoIB interface, and a server bug that sends replies over an impossible route.

            I've tried to follow the code on the client side to see where and why the ARP requests are generated this way, but I've failed completely.
            pjones Peter Jones added a comment -

            ake_s this project is for tracking issues affecting the community releases of Lustre. Have you reproduced this same behaviour on 2.12 or other newer releases? If not then please can you direct this enquiry through your DDN support channels and we'll handle the interaction with the community releases.


            ake_s Åke Sandgren added a comment -

            In the ibdump I can see more than one packet being sent from the servers that does not follow the routing info. For instance, an LDLM_ENQUEUE reply from 172.27.1.20@o2ib1 to 172.27.9.198@o2ib8 through 172.27.8.16@o2ib8.

            This looks wrong to me.

            The servers are running DDN Exascaler 3.3, MOFED 4.3, Lustre 2.7.21.3-268.ddn24.ga49e28a
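To make "not following the routing info" concrete: the server routes in the description reach o2ib8 through 172.27.2.[101-108]@o2ib2, so a next hop of 172.27.8.16@o2ib8 is not in the configured set. A minimal sketch of that comparison follows. The table below is a hand-transcribed subset of the MDS01 routes option, and `hop_matches_routes` is a hypothetical helper; it only compares NIDs against the configuration and says nothing about how LNet actually selects a route.

```python
# Remote net -> (gateway address prefix, gateway host range, local net),
# transcribed from the MDS01 "routes" option in the description.
SERVER_ROUTES = {
    "o2ib4": ("172.27.1.", range(101, 109), "o2ib1"),
    "o2ib8": ("172.27.2.", range(101, 109), "o2ib2"),
}

def hop_matches_routes(dest_nid, gateway_nid):
    """True if gateway_nid is a configured gateway for dest_nid's network."""
    dest_net = dest_nid.split("@")[1]
    gw_addr, gw_net = gateway_nid.split("@")
    route = SERVER_ROUTES.get(dest_net)
    if route is None:
        return False
    prefix, hosts, local_net = route
    if gw_net != local_net or not gw_addr.startswith(prefix):
        return False
    last_octet = gw_addr[len(prefix):]
    return last_octet.isdigit() and int(last_octet) in hosts

# The hop observed in the ibdump, then one the configuration would allow.
print(hop_matches_routes("172.27.9.198@o2ib8", "172.27.8.16@o2ib8"))
print(hop_matches_routes("172.27.9.198@o2ib8", "172.27.2.101@o2ib2"))
```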
            ake_s Åke Sandgren added a comment - edited

            And this is an example of the error message we get from Lustre due to this.

            [Fri Mar 29 13:33:20 2019] LNet: 5266:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 172.27.4.18@o2ib4: 32 seconds
            [Fri Mar 29 13:33:24 2019] LNet: 4835:0:(router.c:1762:lnet_notify()) Ignoring notification of 172.27.4.18@o2ib4 death by 172.27.9.198@o2ib8 (different net)

             

            This time the problem didn't show up until I tried to read a file from the file system. And only one Lnet router had a bad ARP entry.
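When many clients are affected, it helps to tally which router NIDs are timing out. A minimal sketch, assuming kernel log lines in the same format as the two quoted above, counts "Timed out tx" messages per peer NID. The names `TIMEOUT_RE` and `timed_out_peers` are hypothetical.

```python
import re

# Matches the kiblnd_check_conns() timeout message quoted above.
TIMEOUT_RE = re.compile(r"Timed out tx for ([\d.]+@\w+): (\d+) seconds")

def timed_out_peers(log_lines):
    """Count 'Timed out tx' messages per peer NID."""
    counts = {}
    for line in log_lines:
        m = TIMEOUT_RE.search(line)
        if m:
            nid = m.group(1)
            counts[nid] = counts.get(nid, 0) + 1
    return counts

log = [
    "[Fri Mar 29 13:33:20 2019] LNet: 5266:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 172.27.4.18@o2ib4: 32 seconds",
    "[Fri Mar 29 13:33:24 2019] LNet: 4835:0:(router.c:1762:lnet_notify()) Ignoring notification of 172.27.4.18@o2ib4 death by 172.27.9.198@o2ib8 (different net)",
]
print(timed_out_peers(log))
```

The NIDs it reports can be checked against the neighbour table to see whether each has a FAILED entry on the wrong interface.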


            ake_s Åke Sandgren added a comment -

            Ignore my last sentence; this time the bad ARPs happened while mounting the filesystem.

            People

              wc-triage WC Triage
              ake_s Åke Sandgren