[LU-15512] Infinite loop in lnet_discover_peer_locked()
Created: 02/Feb/22  Updated: 26/Aug/22  Resolved: 23/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Blocker
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3

 Description   

The fix from LU-13895 was incomplete. There is still a case where lnet_discover_peer_locked() can enter an infinite loop. The function needs to check whether the peer NI undergoing discovery has been deleted, and stop discovery if it has.
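
The failure mode described above can be illustrated with a toy model. This is a minimal sketch in Python, not the LNet code: PeerNI, discover(), delete_during_discovery() and the max_rounds cap are all hypothetical names introduced for illustration, and the cap only stands in for a loop that would otherwise never return. It assumes the general shape of the bug is a retry loop whose exit condition can never be met once the peer NI has been deleted.

```python
from dataclasses import dataclass


@dataclass
class PeerNI:
    """Hypothetical stand-in for a peer NI; not the LNet structure."""
    nid: str
    uptodate: bool = False
    deleted: bool = False


def discover(ni, run_discovery, check_deleted, max_rounds=1000):
    """Retry discovery until the peer NI is up to date.

    Returns the number of rounds used, or None if max_rounds was
    exhausted (standing in for a loop that never terminates).
    """
    for rounds in range(1, max_rounds + 1):
        run_discovery(ni)
        if ni.uptodate:
            return rounds
        # The guard: a deleted peer NI can never become up to date,
        # so retrying is pointless and the loop must stop.
        if check_deleted and ni.deleted:
            return rounds
    return None


def delete_during_discovery(ni):
    """Simulate the peer NI being removed while discovery is in flight."""
    ni.deleted = True


if __name__ == "__main__":
    nid = "10.73.20.11@tcp1"
    print(discover(PeerNI(nid), delete_during_discovery, check_deleted=False))
    print(discover(PeerNI(nid), delete_during_discovery, check_deleted=True))
```

In this model, the call without the deletion check exhausts the round cap and returns None, while the call with the check returns after the first round.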



 Comments   
Comment by Gerrit Updater [ 02/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46429
Subject: LU-15512 lnet: Stop discovery on deleted peer NI
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b2d8c8c10f560426864407d8bf0f1e84aa431aef

Comment by Chris Horn [ 02/Feb/22 ]

Test notes for LU-15512

Execute the test case without the fix (the patch's sanity-lnet.sh change applied, lnet/lnet/peer.c reverted):

[hornc@ct7-adm lustre-filesystem]$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/29/46429/1 && git checkout FETCH_HEAD
From https://review.whamcloud.com/fs/lustre-release
 * branch                  refs/changes/29/46429/1 -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at b2d8c8c10f LU-15512 lnet: Stop discovery on deleted peer NI
[hornc@ct7-adm lustre-filesystem]$ git reset --soft HEAD^
[hornc@ct7-adm lustre-filesystem]$ git status
HEAD detached from FETCH_HEAD
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   lnet/lnet/peer.c
	modified:   lustre/tests/sanity-lnet.sh

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	lustre/tests/lutf/Makefile.in
	lustre/tests/lutf/src/Makefile.in

[hornc@ct7-adm lustre-filesystem]$ git reset HEAD lnet/lnet/peer.c
Unstaged changes after reset:
M	lnet/lnet/peer.c
[hornc@ct7-adm lustre-filesystem]$ git checkout lnet/lnet/peer.c
Updated 1 path from the index
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...
[root@ct7-mds1 tests]# ./auster -N -v sanity-lnet --only 219
Started at Wed Feb  2 18:57:04 UTC 2022
ct7-mds1: executing check_logdir /tmp/test_logs/2022-02-02/185704
Logging to shared log directory: /tmp/test_logs/2022-02-02/185704
ct7-mds1: executing yml_node
IOC_LIBCFS_GET_NI error 22: Invalid argument
Client: 2.14.56.37
MDS: 2.14.56.37
OSS: 2.14.56.37
running: sanity-lnet ONLY=219
run_suite sanity-lnet /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
-----============= acceptance-small: sanity-lnet ============----- Wed Feb  2 18:57:06 UTC 2022
Running: bash /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
excepting tests:
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
Stopping clients: ct7-mds1 /mnt/lustre (opts:-f)
Stopping clients: ct7-mds1 /mnt/lustre2 (opts:-f)
modules unloaded.
ip netns exec test_ns ip addr add 10.1.2.3/31 dev test1pg
ip netns exec test_ns ip link set test1pg up
Loading modules from /home/hornc/lustre-filesystem/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks=tcp(eth2) accept=all'
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
quota/lquota options: 'hash_lqs_cur_bits=3'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.73.20.11@tcp
          status: up
          interfaces:
              0: eth2
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
       valid_lft 80909sec preferred_lft 80909sec
    inet6 fe80::5054:ff:fe4d:77d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:3d:67:a8 brd ff:ff:ff:ff:ff:ff
    inet 10.73.10.11/24 brd 10.73.10.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe3d:67a8/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:00:c6:d1 brd ff:ff:ff:ff:ff:ff
    inet 10.73.20.11/24 brd 10.73.20.255 scope global noprefixroute eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe00:c6d1/64 scope link
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:69:cc:a5 brd ff:ff:ff:ff:ff:ff
    inet 10.73.230.11/24 brd 10.73.230.255 scope global noprefixroute eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe69:cca5/64 scope link
       valid_lft forever preferred_lft forever
8: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether f6:00:25:a7:3e:ba brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f400:25ff:fea7:3eba/64 scope link
       valid_lft forever preferred_lft forever
Cleaning up LNet
modules unloaded.

== sanity-lnet test 219: Consolidate peer entries ======== 18:57:11 (1643828231)
Loading LNet and configuring DLC
../lnet/lnet/lnet options: 'networks=tcp(eth2) accept=all'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl lnet configure
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp --if eth2
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp1 --if eth2
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.73.20.11@tcp
ping:
    - primary nid: 10.73.20.11@tcp
      Multi-Rail: False
      peer ni:
        - nid: 10.73.20.11@tcp
        - nid: 10.73.20.11@tcp1
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.73.20.11@tcp1
ping:
    - primary nid: 10.73.20.11@tcp1
      Multi-Rail: False
      peer ni:
        - nid: 10.73.20.11@tcp
        - nid: 10.73.20.11@tcp1
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl discover 10.73.20.11@tcp1
Connection to 127.0.0.1 closed by remote host.
Connection to 127.0.0.1 closed.

The test hangs indefinitely until the node is stopped.

Apply fix and re-test:

[hornc@ct7-adm lustre-filesystem]$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/29/46429/1 && git checkout FETCH_HEAD
From https://review.whamcloud.com/fs/lustre-release
 * branch                  refs/changes/29/46429/1 -> FETCH_HEAD
Previous HEAD position was b79f82c23c LU-15446 lnet: Don't use pref NI for reserved portal
HEAD is now at b2d8c8c10f LU-15512 lnet: Stop discovery on deleted peer NI
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...
[root@ct7-mds1 tests]# ./auster -N -v sanity-lnet --only 219
Started at Wed Feb  2 19:07:03 UTC 2022
ct7-mds1: executing check_logdir /tmp/test_logs/2022-02-02/190702
Logging to shared log directory: /tmp/test_logs/2022-02-02/190702
ct7-mds1: executing yml_node
IOC_LIBCFS_GET_NI error 22: Invalid argument
Client: 2.14.56.37
MDS: 2.14.56.37
OSS: 2.14.56.37
running: sanity-lnet ONLY=219
run_suite sanity-lnet /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
-----============= acceptance-small: sanity-lnet ============----- Wed Feb  2 19:07:05 UTC 2022
Running: bash /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
excepting tests:
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
Stopping clients: ct7-mds1 /mnt/lustre (opts:-f)
Stopping clients: ct7-mds1 /mnt/lustre2 (opts:-f)
modules unloaded.
ip netns exec test_ns ip addr add 10.1.2.3/31 dev test1pg
ip netns exec test_ns ip link set test1pg up
Loading modules from /home/hornc/lustre-filesystem/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks=tcp(eth2) accept=all'
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
quota/lquota options: 'hash_lqs_cur_bits=3'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.73.20.11@tcp
          status: up
          interfaces:
              0: eth2
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
       valid_lft 85931sec preferred_lft 85931sec
    inet6 fe80::5054:ff:fe4d:77d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:3d:67:a8 brd ff:ff:ff:ff:ff:ff
    inet 10.73.10.11/24 brd 10.73.10.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe3d:67a8/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:00:c6:d1 brd ff:ff:ff:ff:ff:ff
    inet 10.73.20.11/24 brd 10.73.20.255 scope global noprefixroute eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe00:c6d1/64 scope link
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:69:cc:a5 brd ff:ff:ff:ff:ff:ff
    inet 10.73.230.11/24 brd 10.73.230.255 scope global noprefixroute eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe69:cca5/64 scope link
       valid_lft forever preferred_lft forever
6: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:64:da:f2:d4:fd brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::8c64:daff:fef2:d4fd/64 scope link
       valid_lft forever preferred_lft forever
Cleaning up LNet
modules unloaded.

== sanity-lnet test 219: Consolidate peer entries ======== 19:07:10 (1643828830)
Loading LNet and configuring DLC
../lnet/lnet/lnet options: 'networks=tcp(eth2) accept=all'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl lnet configure
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp --if eth2
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp1 --if eth2
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.73.20.11@tcp
ping:
    - primary nid: 10.73.20.11@tcp
      Multi-Rail: False
      peer ni:
        - nid: 10.73.20.11@tcp
        - nid: 10.73.20.11@tcp1
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.73.20.11@tcp1
ping:
    - primary nid: 10.73.20.11@tcp1
      Multi-Rail: False
      peer ni:
        - nid: 10.73.20.11@tcp
        - nid: 10.73.20.11@tcp1
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl discover 10.73.20.11@tcp1
discover:
    - primary nid: 10.73.20.11@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.73.20.11@tcp
        - nid: 10.73.20.11@tcp1
PASS 219 (2s)
== sanity-lnet test complete, duration 7 sec ============= 19:07:12 (1643828832)
Cleaning up LNet
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
modules unloaded.
sanity-lnet returned 0
Finished at Wed Feb  2 19:07:14 UTC 2022 in 12s
./auster: completed with rc 0
[root@ct7-mds1 tests]#

Comment by Cory Spitz [ 09/Feb/22 ]

Raising as a Blocker for 2.15.0.

Comment by Gerrit Updater [ 23/Feb/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46429/
Subject: LU-15512 lnet: Stop discovery on deleted peer NI
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 94f4e1f517d71ffd6662fb4a82e3dee9aa8f6796

Comment by Peter Jones [ 23/Feb/22 ]

Landed for 2.15
