Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.16.0
- None
- 3
Description
In my test environment (eth0 for ssh and eth1 for LNet), the node stops replying to ping once LNet is loaded.
Initial analysis points to ksocklnd-config as the culprit (in commit v2_15_57-117-g7f60b2b558).
Attachments
- ksocklnd-config (12 kB)
- ksocklnd-config-1 (13 kB)
Issue Links
- is related to: LU-17006 socklnd: modify ksocklnd-config (Resolved)
Activity
Hi scherementsev, could you please try ksocklnd-config-ignore_ipv6_link_local_addr_no_flush?
(This checks whether flushing the routing table for the interface causes the issue.)
Hey ssmirnov, I probably forgot to attach it to the ticket. Please check again for bash-xv-ksocklnd-config.
Hi scherementsev,
You mentioned the attachment "bash-xv-ksocklnd-config", but I can't find it. Can you please point me to where to look?
Thanks,
Serguei
Hi ssmirnov,
The output of bash -vx ksocklnd-config enp0s3 is in the attachment bash-xv-ksocklnd-config. The Lustre version is 2.16.0_RC1_8_g13fd5eb; the kernel is the same (3.10.0-1160.49.1.el7_lustre.x86_64). If I start bash -vx ksocklnd-config enp0s3 through ssh, the last commands printed are:
+ [[ 192.168.1.1 == \0\.\0\.\0\.\0 ]]
+ routecmd_ipv4=(/sbin/ip route add default via ${gw_ipv4} dev ${i} table ${i})
+ ruledelcmd_ipv4=(/sbin/ip rule del from ${addr_ipv4[0]} table ${i} '&>/dev/null')
+ ruleaddcmd_ipv4=(/sbin/ip rule add from ${addr_ipv4[0]} table ${i})
++ eval /sbin/ip route add default via 192.168.1.1 dev enp0s3 table enp0s3
+ routeerr_ipv4='+++ /sbin/ip route add default via 192.168.1.1 dev enp0s3 table enp0s3'
++ eval /sbin/ip rule del from 192.168.1.82 table enp0s3 '&>/dev/null'
+ ruledelerr_ipv4='+++ /sbin/ip rule del from 192.168.1.82 table enp0s3'
++ eval /sbin/ip rule add from 192.168.1.82 table enp0s3
I tried ksocklnd-config-ignore_ipv6_link_local_addr but it didn't help.
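For readers following the trace, the three policy-routing commands it shows can be reproduced as a dry run. This is only a sketch reconstructed from the trace output, not the actual ksocklnd-config code: the function name print_policy_route_cmds is hypothetical, and what the real script does when the gateway is 0.0.0.0 is not visible in the trace, so the sketch simply prints nothing in that case.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints (does not execute) the three commands
# that appear in the bash -vx trace for a given interface, address and gateway.
print_policy_route_cmds() {
    local iface="$1" addr="$2" gw="$3"

    # The trace shows the gateway being compared against 0.0.0.0.
    # Assumption: treat that value as "no gateway" and emit nothing.
    if [[ "$gw" == "0.0.0.0" ]]; then
        return 1
    fi

    # Default route in a per-interface routing table named after the interface.
    echo "/sbin/ip route add default via ${gw} dev ${iface} table ${iface}"
    # Remove any stale source rule, then add a rule sending traffic that
    # originates from this address to the per-interface table.
    echo "/sbin/ip rule del from ${addr} table ${iface}"
    echo "/sbin/ip rule add from ${addr} table ${iface}"
}

# Values taken from the trace above.
print_policy_route_cmds enp0s3 192.168.1.82 192.168.1.1
```

Printing the commands instead of running them makes it possible to compare the intended routing changes with the existing tables before anything is modified on a node that is only reachable over the same interface.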
ssmirnov, I have the same problem. The patch "LU-18199 scripts: fix ksocklnd-config gateway selection logic" is already in my code tree.
$ uname -a
Linux node04.local 5.4.83-v8-64k #10 SMP PREEMPT Sat Aug 31 21:13:30 BST 2024 aarch64 GNU/Linux
$ cat /etc/modprobe.d/lustre.conf
cat: /etc/modprobe.d/lustre.conf: No such file or directory
$ cat /etc/modprobe.d/lnet.conf
cat: /etc/modprobe.d/lnet.conf: No such file or directory
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d8:3a:dd:66:98:dd brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.252/24 brd 192.168.1.255 scope global dynamic noprefixroute eth0
       valid_lft 79939sec preferred_lft 79939sec
    inet6 fdd4:5c6c:215f:ce59:8d2d:656b:7ee9:5e62/64 scope global dynamic noprefixroute
       valid_lft 1791sec preferred_lft 1791sec
    inet6 fe80::cff2:b31b:1815:1ad5/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: wlan0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 86:c8:d8:3d:02:7c brd ff:ff:ff:ff:ff:ff
4: brint: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether d2:06:fc:a7:30:60 brd ff:ff:ff:ff:ff:ff
    inet 172.19.180.254/24 brd 172.19.180.255 scope global brint
       valid_lft forever preferred_lft forever
    inet6 fe80::d006:fcff:fea7:3060/64 scope link
       valid_lft forever preferred_lft forever
5: br0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether d8:3a:dd:66:98:dd brd ff:ff:ff:ff:ff:ff
$ ip route show table all
default via 192.168.1.63 dev eth0 table eth0
default via 192.168.1.63 dev eth0 proto dhcp metric 100
default via 192.168.1.63 dev eth0 proto dhcp src 192.168.1.252 metric 202
172.19.180.0/24 dev brint proto kernel scope link src 172.19.180.254
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.252 metric 100
192.168.1.0/24 dev eth0 proto dhcp scope link src 192.168.1.252 metric 202
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 172.19.180.0 dev brint table local proto kernel scope link src 172.19.180.254
local 172.19.180.254 dev brint table local proto kernel scope host src 172.19.180.254
broadcast 172.19.180.255 dev brint table local proto kernel scope link src 172.19.180.254
broadcast 192.168.1.0 dev eth0 table local proto kernel scope link src 192.168.1.252
local 192.168.1.252 dev eth0 table local proto kernel scope host src 192.168.1.252
broadcast 192.168.1.255 dev eth0 table local proto kernel scope link src 192.168.1.252
::1 dev lo proto kernel metric 256 pref medium
fd9e:fb91:6fa0:1::/64 via fe80::9239:5fff:feb9:e737 dev eth0 proto ra metric 100 pref medium
fdd4:5c6c:215f:ce59::/64 dev eth0 proto ra metric 100 pref medium
fdd4:5c6c:215f:ce59::/64 dev eth0 proto ra metric 202 pref medium
fe80::/64 dev brint proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
local ::1 dev lo table local proto kernel metric 0 pref medium
local fdd4:5c6c:215f:ce59:8d2d:656b:7ee9:5e62 dev eth0 table local proto kernel metric 0 pref medium
local fe80::cff2:b31b:1815:1ad5 dev eth0 table local proto kernel metric 0 pref medium
local fe80::d006:fcff:fea7:3060 dev brint table local proto kernel metric 0 pref medium
ff00::/8 dev brint table local metric 256 pref medium
ff00::/8 dev br0 table local metric 256 linkdown pref medium
ff00::/8 dev eth0 table local metric 256 pref medium
$ ip -4 rule list
0: from all lookup local
32765: from 192.168.1.252 lookup eth0
32766: from all lookup main
32767: from all lookup default
$ ip -6 rule list
0: from all lookup local
32766: from all lookup main
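In a dump like this, the entry of interest is the source-based rule (priority 32765 here) that sends traffic from the LNet address to a per-interface table. The following is a minimal sketch for picking such rules out of "ip -4 rule list" output; the function name list_source_rules is hypothetical, and it assumes the plain "from ADDR lookup TABLE" layout seen in the listing above.

```bash
#!/usr/bin/env bash
# Hypothetical filter: reads "ip rule list" output on stdin and prints
# "<source address> <table>" for every rule with a specific source,
# skipping the generic "from all" rules.
list_source_rules() {
    awk '$2 == "from" && $3 != "all" && $4 == "lookup" { print $3, $5 }'
}

# Sample input copied from the rule listing above.
printf '%s\n' \
    '0: from all lookup local' \
    '32765: from 192.168.1.252 lookup eth0' \
    '32766: from all lookup main' \
    '32767: from all lookup default' | list_source_rules
```

On a live node the same filter could be fed directly, e.g. ip -4 rule list | list_source_rules, and each reported table can then be inspected with ip route show table followed by the table name.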
scherementsev, could you please try reproducing with "bash -vx ksocklnd-config enp0s3" and provide the output?
The only issue I have found so far is that the script doesn't distinguish between link-local and global IPv6 addresses. In case it matters in your environment, I'm attaching a modified version of the script which doesn't attempt to set up routes for link-local IPv6 addresses: ksocklnd-config-ignore_ipv6_link_local_addr
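To illustrate the distinction being described: IPv6 link-local addresses fall in fe80::/10, so a check on the first hextet is enough to tell them apart from global or unique-local addresses. The sketch below is illustrative only and is not taken from the attached script; the function name is_ipv6_link_local is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeeds (exit 0) if the given IPv6 address is
# link-local (fe80::/10), i.e. its first hextet is fe80 through febf.
is_ipv6_link_local() {
    local addr="${1%%/*}"    # drop a trailing /prefix-length, if any
    local re='^[fF][eE][89aAbB][0-9a-fA-F]:'
    [[ "$addr" =~ $re ]]
}

# Example with addresses that appear elsewhere in this ticket.
for a in fe80::cff2:b31b:1815:1ad5/64 fdd4:5c6c:215f:ce59:8d2d:656b:7ee9:5e62/64; do
    if is_ipv6_link_local "$a"; then
        echo "$a: link-local, skip route setup"
    else
        echo "$a: not link-local"
    fi
done
```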
ssmirnov, it happens over ssh as well as on the console directly in my VM. Here is the last thing I see if I start it through ssh:
[root@vm1 tests]# bash llmount.sh
...
Stopping clients: vm1 /mnt/lustre (opts:-f)
Stopping clients: vm1 /mnt/lustre2 (opts:-f)
vm1: executing set_hostid
Loading modules from /root/src/lustre-release/lustre/tests/..
detected 6 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Formatting mgs, mds, osts
Format mds1: /tmp/lustre-mdt1
I.e., load_modules has been called by this point. I also checked after the network failure: lnet and ksocklnd were loaded.
adilger, thanks for clarifying. It looks like I should have excluded the leading "g" when searching.
ssmirnov, FYI the "git describe" version is very descriptive of the change. This is 18 patches beyond the 2.15.91 tag (which contains the 56321 patch), and the last part is the commit hash 1a4df98 "LU-18217 build: Ensure LINUX_RELEASE is defined". For patches that have never landed to the git repo, it is usually possible to find intermediate/in-progress patches via the commit hash.
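The breakdown described here can be done mechanically. The sketch below splits a "git describe"-style version string into its tag, the number of commits since that tag, and the abbreviated commit hash with the leading "g" removed. The function name parse_describe is hypothetical, and accepting both "_" and "-" as separators is an assumption based on the two styles of version string that appear in this ticket.

```bash
#!/usr/bin/env bash
# Hypothetical parser for strings such as 2.15.91_18_g1a4df98 or
# v2_15_57-117-g7f60b2b558: <tag><sep><commits><sep>g<hash>.
parse_describe() {
    local re='^(.+)[-_]([0-9]+)[-_]g([0-9a-f]+)$'
    if [[ "$1" =~ $re ]]; then
        printf 'tag=%s commits=%s hash=%s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        # Not in the expected <tag>_<n>_g<hash> form.
        return 1
    fi
}

parse_describe 2.15.91_18_g1a4df98
```

The hash printed this way is the form to search for, for example with git log -1 followed by the hash in a checkout that contains the commit.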
scherementsev, I'm not sure which version 2.15.91_18_g1a4df98 is. I'm assuming it includes the 56321 change (and that the outputs you provided were taken before LNet was loaded or ksocklnd-config was run).
Are you using a console session or ssh to enp0s3 in your reproducer?
Hi ssmirnov, sorry, but ksocklnd-config-ignore_ipv6_link_local_addr_no_flush didn't help. Attaching bash-xv-ksocklnd-config-ignore_ipv6_link_local_addr_no_flush.