Bad ethernet network after LNet is loaded

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0, Lustre 2.17.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      In my test environment (eth0 for ssh, eth1 for LNet), the node stops replying to ping once LNet is loaded.

      After an initial analysis, ksocklnd-config seems to be the culprit (in commit v2_15_57-117-g7f60b2b558).

          Activity

            [LU-18199] Bad ethernet network after LNet is loaded

            ssmirnov Serguei Smirnov added a comment -

            adilger, thanks for clarifying. It looks like I should have excluded the leading "g" when searching.

            adilger Andreas Dilger added a comment -

            ssmirnov, FYI the "git describe" version is very descriptive of the change. This is 18 patches beyond the 2.15.91 tag (which contains the 56321 patch), and the last part is the commit hash 1a4df98 "LU-18217 build: Ensure LINUX_RELEASE is defined". For patches that have never landed to the git repo, it is usually possible to find intermediate/in-progress patches via the commit hash.
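
The breakdown described in the comment above can be done mechanically. A minimal sketch in POSIX shell, assuming the string has the layout `<tag><sep><commits><sep>g<hash>`; the function name `parse_describe` and its explicit separator argument are illustrative only and are not part of Lustre or git:

```shell
# parse_describe VERSION SEP
# Split a "git describe"-style string of the form <tag><SEP><count><SEP>g<hash>.
# SEP is "_" for the string reported by "lctl get_param version" in this ticket
# and "-" for the commit reference quoted in the description.
# Hypothetical helper, written for illustration.
parse_describe() (
    version=$1
    sep=$2
    hash=${version##*"$sep"g}      # text after the last "<SEP>g"
    rest=${version%"$sep"g*}       # drop the trailing "<SEP>g<hash>"
    count=${rest##*"$sep"}         # number of commits since the tag
    tag=${rest%"$sep"*}            # the tag itself
    printf 'tag=%s commits=%s hash=%s\n' "$tag" "$count" "$hash"
)

parse_describe "2.15.91_18_g1a4df98" "_"
parse_describe "v2_15_57-117-g7f60b2b558" "-"
```

For the two strings quoted in this ticket this prints `tag=2.15.91 commits=18 hash=1a4df98` and `tag=v2_15_57 commits=117 hash=7f60b2b558`. The hash part, without the leading "g", is what to search for or pass to `git show` in a checkout of the repository.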

            ssmirnov Serguei Smirnov added a comment -

            scherementsev, I'm not sure which version 2.15.91_18_g1a4df98 is. I'm assuming it includes the 56321 change (and that the outputs you provided were taken before LNet was loaded or ksocklnd-config was run).

            Are you using a console session or ssh to enp0s3 in your reproducer?

            scherementsev Sergey Cheremencev added a comment -

            Hi ssmirnov,

            [root@vm1 ~]# uname -a
            Linux vm1 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Fri Jun 17 18:46:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
            [root@vm1 ~]# cat /etc/modprobe.d/lustre.conf
            cat: /etc/modprobe.d/lustre.conf: No such file or directory
            [root@vm1 ~]# cat /etc/modprobe.d/lnet.conf
            cat: /etc/modprobe.d/lnet.conf: No such file or directory
            [root@vm1 ~]# ip a
            1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
                link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
                inet 127.0.0.1/8 scope host lo
                   valid_lft forever preferred_lft forever
                inet6 ::1/128 scope host
                   valid_lft forever preferred_lft forever
            2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
                link/ether 08:00:27:b2:02:f0 brd ff:ff:ff:ff:ff:ff
                inet 192.168.1.82/24 brd 192.168.1.255 scope global noprefixroute enp0s3
                   valid_lft forever preferred_lft forever
                inet6 fe80::a00:27ff:feb2:2f0/64 scope link
                   valid_lft forever preferred_lft forever
            [root@vm1 ~]# ip route show table all
            default via 192.168.1.1 dev enp0s3 proto dhcp metric 100
            192.168.1.0/24 dev enp0s3 proto kernel scope link src 192.168.1.82 metric 100
            broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
            local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
            local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
            broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
            broadcast 192.168.1.0 dev enp0s3 table local proto kernel scope link src 192.168.1.82
            local 192.168.1.82 dev enp0s3 table local proto kernel scope host src 192.168.1.82
            broadcast 192.168.1.255 dev enp0s3 table local proto kernel scope link src 192.168.1.82
            unreachable default dev lo proto kernel metric 4294967295 error -101 pref medium
            fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
            unreachable default dev lo proto kernel metric 4294967295 error -101 pref medium
            local ::1 dev lo table local proto unspec metric 0 pref medium
            local fe80::a00:27ff:feb2:2f0 dev lo table local proto unspec metric 0 pref medium
            ff00::/8 dev enp0s3 table local metric 256 pref medium
            unreachable default dev lo proto kernel metric 4294967295 error -101 pref medium
            [root@vm1 ~]# ip -4 rule list
            0:	from all lookup local
            32765:	from 192.168.1.82 lookup enp0s3
            32766:	from all lookup main
            32767:	from all lookup default
            [root@vm1 ~]# ip -6 rule list
            0:	from all lookup local
            32765:	from fe80::a00:27ff:feb2:2f0 lookup enp0s3
            32766:	from all lookup main
            [root@vm1 ~]#
            [root@vm1 ~]# lctl get_param version
            version=2.15.91_18_g1a4df98

            It is a VirtualBox virtual machine with a bridged connection.

            ssmirnov Serguei Smirnov added a comment -

            scherementsev, could you please provide the outputs of

            uname -a
            cat /etc/modprobe.d/lustre.conf
            cat /etc/modprobe.d/lnet.conf
            ip a
            ip route show table all
            ip -4 rule list
            ip -6 rule list

            from your system?
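
The outputs requested above can be gathered in one pass. A small sketch in POSIX shell; the function names `run_and_label` and `collect_net_info` are made up for this example, and the command list is exactly the one from the comment:

```shell
# run_and_label CMD [ARGS...]
# Print the command line as a header, then its output (stdout and stderr).
# A failing command is reported with its exit status but does not stop
# the rest of the collection.
run_and_label() {
    printf '### %s\n' "$*"
    "$@" 2>&1 || printf '(exit status %s)\n' "$?"
}

# Hypothetical wrapper: run the commands requested in this ticket, in order.
collect_net_info() {
    run_and_label uname -a
    run_and_label cat /etc/modprobe.d/lustre.conf
    run_and_label cat /etc/modprobe.d/lnet.conf
    run_and_label ip a
    run_and_label ip route show table all
    run_and_label ip -4 rule list
    run_and_label ip -6 rule list
}
```

Calling `collect_net_info > net-info.txt` writes everything to one file that can be pasted into the ticket. A missing file such as `/etc/modprobe.d/lnet.conf` shows up as the `cat` error followed by the exit status instead of aborting the run.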

            scherementsev Sergey Cheremencev added a comment -

            https://review.whamcloud.com/c/fs/lustre-release/+/56344 is the only thing that helped in my case: master on 3.10.0-1160.49.1.el7

            pjones Peter Jones added a comment -

            Merged for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56321/
            Subject: LU-18199 scripts: fix ksocklnd-config gateway selection logic
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b4802e3ee3389a9715ffaa34239e4f4b28446edb

            gerrit Gerrit Updater added a comment -

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56344
            Subject: LU-18199 socklnd: change skip_mr_route_setup default to 1
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6de2f07a02aa7217499a754a1ac52fdc5106a06c

            gerrit Gerrit Updater added a comment -

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56321
            Subject: LU-18199 scripts: fix ksocklnd-config gateway selection logic
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 392307945c256702727590d5f5d0e1372a7fa230

            sebastien Sebastien Buisson added a comment -

            OK, with the correct version ksocklnd-config-1 it works!

            People

              ssmirnov Serguei Smirnov
              cbordage Cyril Bordage