Lustre / LU-17889

Autotest nodes should not use 'fe80:' link local (loopback) IPv6 address

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.16.0
    • Labels: None
    • Environment: Maloo test bed nodes
    • Severity: 3

    Description

      Pushing patches to Maloo to run the sanity-lnet test with IPv6, which works locally for me, revealed that the IPv6 address setup is link-local only (fe80:.....). These addresses are only visible to the local node, so the more complex LNet tests can't run; LNet also ignores these addresses. Proper IPv6 addresses need to be set up.
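For context (not from the ticket): link-local addresses are exactly those in fe80::/10, so test setup could screen candidate addresses before configuring them as NIDs. A minimal shell sketch, where is_link_local is a hypothetical helper and not part of any Lustre tooling:

```shell
# Hypothetical helper: succeeds if the given address string is an IPv6
# link-local address (fe80::/10, i.e. fe80: through febf:), which LNet
# cannot use for inter-node traffic.
is_link_local() {
    case "$1" in
        [Ff][Ee][89AaBb]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, the fe80::5054:ff:fece:6bb5 address seen on the Maloo nodes matches, while a routable ULA (fd00::/8) address does not.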

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            Excellent news - thanks for the update James!


            simmonsja James A Simmons added a comment -

            Just pushed the patch and it's looking good with Maloo. I do see one test failing, which is valid; I just need to figure out a solution.

            adilger Andreas Dilger added a comment -

            I've asked that a clearer error message be added to mount.lustre when a link-local fe80: address is used, but something should likely also be added to lnetctl or similar to print a very clear message like "Unsupported link-local IPv6 NID type 'fe80:...'" instead of just returning "-22", so this can be caught very early during configuration instead of later on when these addresses are being used.
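A rough sketch of the early check being suggested; validate_nid is a hypothetical wrapper, not an existing lnetctl subcommand, and the 22 mirrors the EINVAL errno shown elsewhere in this ticket:

```shell
# Hypothetical pre-flight check for a NID string such as "fe80::1@tcp".
# Prints the clear message suggested above instead of a bare -22.
validate_nid() {
    nid="$1"
    addr="${nid%@*}"    # strip the "@tcp" / "@o2ib" network suffix
    case "$addr" in
        [Ff][Ee][89AaBb]*)
            echo "Unsupported link-local IPv6 NID type '$addr'" >&2
            return 22 ;;    # matches EINVAL (-22)
    esac
    return 0
}
```

A wrapper like this could run during configuration, before any address is handed to LNet.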

            mkvardakov Michael Kvardakov added a comment -

            simmonsja

            [root@onyx-79vm20 ~]# ip a
            1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
                link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
                inet 127.0.0.1/8 scope host lo
                   valid_lft forever preferred_lft forever
                inet6 ::1/128 scope host
                   valid_lft forever preferred_lft forever
            2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
                link/ether 52:54:00:ce:6b:b5 brd ff:ff:ff:ff:ff:ff
                altname enp0s3f0
                altname ens3f0
                inet 10.240.26.175/20 brd 10.240.31.255 scope global noprefixroute eth0
                   valid_lft forever preferred_lft forever
                inet6 fd33:3981:3213:f010:0:5254:ce:6bb5/64 scope global noprefixroute
                   valid_lft forever preferred_lft forever
                inet6 fe80::5054:ff:fece:6bb5/64 scope link noprefixroute
                   valid_lft forever preferred_lft forever
            [root@onyx-79vm20 ~]# lctl net up -l
            Writer error: failed to resolve Netlink family id
            LNET configure error 22: (null)
            [root@onyx-79vm20 ~]# lnetctl lnet configure -l
            Writer error: failed to resolve Netlink family id
            ---
            configure:
            -     lnet:
                  errno: 0
                  descr: ! "LNet configure error: (null)"
            ...

            simmonsja James A Simmons added a comment -

            LU-16822 hasn't landed yet, so there is no IPv6 by default. You can run lnetctl lnet configure -l or lctl net up -l to see if fd33:XXX ends up in lnetctl net show.
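One way to script that check, assuming the `lnetctl net show` output has been saved to a file; has_ipv6_nid is a hypothetical helper, and the 'nid: fd' match simply looks for a ULA-prefixed (fd00::/8) NID like the node's fd33: address:

```shell
# Hypothetical helper: does a saved `lnetctl net show` dump contain a
# NID on a unique-local (fd00::/8) IPv6 address?
has_ipv6_nid() {
    grep -q 'nid: fd' "$1"
}
```

Usage would be along the lines of `lnetctl net show > /tmp/netshow.yaml && has_ipv6_nid /tmp/netshow.yaml`.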

            colmstea Charlie Olmstead added a comment -

            Michael pushed a patch to correct the IPv6 issue (https://review.whamcloud.com/#/c/private/lab/+/55584/).

            I retested one of the sessions from patch https://review.whamcloud.com/#/c/fs/lustre-release/+/55435/3 and the fd33 address is now listed:

            /usr/sbin/lnetctl net show
            net:
            -     net type: lo
                  local NI(s):
                  -     nid: 0@lo
                        status: up
            -     net type: tcp
                  local NI(s):
                  -     nid: 10.240.24.248@tcp
                        status: up
                        interfaces:
                              0: eth0
            1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
                link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
                inet 127.0.0.1/8 scope host lo
                   valid_lft forever preferred_lft forever
                inet6 ::1/128 scope host
                   valid_lft forever preferred_lft forever
            2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
                link/ether 52:54:00:72:a5:1e brd ff:ff:ff:ff:ff:ff
                altname enp0s3f0
                altname ens3f0
                inet 10.240.24.248/20 brd 10.240.31.255 scope global noprefixroute eth0
                   valid_lft forever preferred_lft forever
                inet6 fd33:3981:3213:f010:0:5254:72:a51e/64 scope global noprefixroute
                   valid_lft forever preferred_lft forever
                inet6 fe80::5054:ff:fe72:a51e/64 scope link noprefixroute
                   valid_lft forever preferred_lft forever
            3: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
                link/ether 3e:a2:28:ad:93:2b brd ff:ff:ff:ff:ff:ff link-netns test_ns
                inet6 fe80::3ca2:28ff:fead:932b/64 scope link tentative
                   valid_lft forever preferred_lft forever

            adilger Andreas Dilger added a comment -

            Michael, I'm not an IPv6 expert; I was only passing on information given to me on another call about this topic. I think the main interest in this ticket for James and Chris is that the test nodes have proper IPv6 addresses configured, and they can work out issues in the test script itself.

            I would suggest using the most recent version of the IPv6 test patch.

            As for why Janitor is running tests on your patch, this is normal for all patches submitted to Gerrit.

            colmstea Charlie Olmstead added a comment -

            mkvardakov - chef does run on all test nodes. AT deletes all chef data after ljb runs.

            mkvardakov Michael Kvardakov added a comment -

            The more I dig into the issue, the more questions I get.

            First, adilger, I updated the patch in line with your comments, and the sanity-lnet test does not seem to run anymore. That is probably because the whole block handling LARGE_ADDR_ENABLE in lustre/tests/test-framework.sh is somehow missing in master, so it now runs without the -l option:

            /usr/sbin/lnetctl lnet configure -l

            Second, after comparing successful and failed runs, I noticed that

            /usr/sbin/lnetctl net add --nid @tcp
            ---
            add:
            -     net:
                  errno: -22
                  descr: ! "unsupported NID"
            ...

            is wrong. It should take a valid IP address, like:

            /lnet/utils/lnetctl net add --nid 192.168.202.8@tcp

            Third, I don't quite understand why tests started by my commits run on some oleg* servers in the 192.168.* network.

            Fourth, colmstea, last but not least, how come chef is not configured on the target nodes throughout all runs? They just don't have /etc/chef/client.rb, so I cannot check why IPv6 is:
            a) disabled completely in initial test runs
            b) configured as expected in recent test runs
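Since the failing `net add --nid @tcp` suggests an empty address was substituted into the NID, a setup script could pull a global-scope IPv6 address out of `ip a` output before building it. A sketch under that assumption (global_ipv6 is a hypothetical helper keyed to the `ip a` output format shown earlier in this ticket):

```shell
# Hypothetical helper: print the first global-scope IPv6 address found on
# stdin (expects `ip -6 addr show` / `ip a` style output), without the
# /prefix length, so it can be embedded in a NID like "<addr>@tcp".
global_ipv6() {
    awk '$1 == "inet6" && / scope global/ { sub(/\/.*/, "", $2); print $2; exit }'
}
```

Usage might look like `addr=$(ip a | global_ipv6)` followed by `lnetctl net add --nid "${addr}@tcp"` only when `$addr` is non-empty.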

            mkvardakov Michael Kvardakov added a comment -

            Thanks adilger, now I can see that the test node did not get an IPv6 address as expected. I have retriggered the test session and am investigating.

            People

              Assignee: mkvardakov Michael Kvardakov
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: