[LU-15440] Memory leak in discovery
Created: 11/Jan/22  Updated: 29/Jul/23  Resolved: 07/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Minor
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Unlikely to be hit in the real world, but there is a potential memory leak in lnet_peer_data_present(). If the ping buffer has nnis <= 1, the function exits without dropping its reference on the ping buffer, so the buffer's memory is leaked:

        if (pbuf->pb_info.pi_nnis <= 1)
                goto out;
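
To illustrate the pattern described above, here is a minimal, self-contained sketch of a reference-counted buffer whose early-exit path skips the put. It is not the LNet code: struct demo_pbuf, demo_pbuf_alloc(), demo_pbuf_put(), demo_live_bufs and both demo_data_present_*() functions are hypothetical names invented for this example, and the "fixed" variant only shows one plausible shape of a fix (route the early exit through the put). The merged patch may be structured differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a ref-counted ping buffer; not the LNet type. */
struct demo_pbuf {
	int refcount;	/* outstanding references */
	int nnis;	/* number of interfaces reported by the peer */
};

/* Buffers allocated and not yet freed; used here to observe the leak. */
int demo_live_bufs;

struct demo_pbuf *demo_pbuf_alloc(int nnis)
{
	struct demo_pbuf *pbuf = calloc(1, sizeof(*pbuf));

	if (pbuf == NULL)
		return NULL;
	pbuf->refcount = 1;	/* the caller owns the initial reference */
	pbuf->nnis = nnis;
	demo_live_bufs++;
	return pbuf;
}

/* Drop one reference and free the buffer when the last one goes away. */
void demo_pbuf_put(struct demo_pbuf *pbuf)
{
	assert(pbuf->refcount > 0);
	if (--pbuf->refcount == 0) {
		demo_live_bufs--;
		free(pbuf);
	}
}

/* Leaky shape: the nnis <= 1 exit jumps past the put. */
void demo_data_present_leaky(struct demo_pbuf *pbuf)
{
	if (pbuf->nnis <= 1)
		goto out;	/* reference is never dropped on this path */

	/* ... process the peer's interfaces ... */
	demo_pbuf_put(pbuf);
out:
	return;
}

/* One possible fix: every exit path passes through the put. */
void demo_data_present_fixed(struct demo_pbuf *pbuf)
{
	if (pbuf->nnis <= 1)
		goto out_put;

	/* ... process the peer's interfaces ... */
out_put:
	demo_pbuf_put(pbuf);
}
```

In this sketch, calling demo_data_present_leaky() on a buffer allocated with nnis == 1 leaves demo_live_bufs at 1, while demo_data_present_fixed() returns it to 0. Discovering 0@lo is one way to reach the nnis <= 1 path in practice.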


 Comments   
Comment by Gerrit Updater [ 11/Jan/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46052
Subject: LU-15440 lnet: lnet_peer_data_present() memory leak
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b7a11fde4d687a92d31313f4c815599c6bcdbf2a

Comment by Chris Horn [ 31/Jan/22 ]

Test report for LU-15440:

Build/execute test case from patch:

[hornc@ct7-adm lustre-filesystem]$ git le HEAD^..HEAD
fbbc1258a0 (HEAD) LU-15478 lnet: Check LNET_NID_IS_ANY in LNET_NID_NET
[hornc@ct7-adm lustre-filesystem]$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/52/46052/2 && git cherry-pick FETCH_HEAD
From https://review.whamcloud.com/fs/lustre-release
 * branch                  refs/changes/52/46052/2 -> FETCH_HEAD
Auto-merging lnet/lnet/peer.c
[detached HEAD 6c7815e9e1] LU-15440 lnet: lnet_peer_data_present() memory leak
 Date: Tue Jan 11 16:19:16 2022 -0600
 2 files changed, 16 insertions(+), 1 deletion(-)
[hornc@ct7-adm lustre-filesystem]$ git reset --soft HEAD^
[hornc@ct7-adm lustre-filesystem]$ git status
HEAD detached from fb5d7036ec
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   lnet/lnet/peer.c
	modified:   lustre/tests/sanity-lnet.sh

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	lustre/tests/lutf/Makefile.in
	lustre/tests/lutf/src/Makefile.in

[hornc@ct7-adm lustre-filesystem]$ git reset HEAD lnet/lnet/peer.c
Unstaged changes after reset:
M	lnet/lnet/peer.c
[hornc@ct7-adm lustre-filesystem]$ g co lnet/lnet/peer.c
Updated 1 path from the index
[hornc@ct7-adm lustre-filesystem]$ git --no-pager diff --cached
diff --git a/lustre/tests/sanity-lnet.sh b/lustre/tests/sanity-lnet.sh
index 860f712da7..72e28eb497 100755
--- a/lustre/tests/sanity-lnet.sh
+++ b/lustre/tests/sanity-lnet.sh
@@ -2335,6 +2335,19 @@ test_216() {
 }
 run_test 216 "Failed send to peer NI owned by local host should not trigger peer NI recovery"

+test_217() {
+	reinit_dlc || return $?
+
+	[[ $($LNETCTL net show | grep -c nid) -ne 1 ]] &&
+		error "Unexpected number of NIs after initalizing DLC"
+
+	do_lnetctl discover 0@lo ||
+		error "Failed to discover 0@lo"
+
+	unload_modules
+}
+run_test 217 "Don't leak memory when discovering peer with nnis <= 1"
+
 test_230() {
 	# LU-12815
 	echo "Check valid values; Should succeed"
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...
[root@ct7-adm tests]# ./auster -N -v sanity-lnet --only 217
Started at Sat Jan 29 01:55:10 UTC 2022
ct7-adm: executing check_logdir /tmp/test_logs/2022-01-29/015510
Logging to shared log directory: /tmp/test_logs/2022-01-29/015510
ct7-adm: executing yml_node
IOC_LIBCFS_GET_NI error 22: Invalid argument
Client: 2.14.57.60
MDS: 2.14.57.60
OSS: 2.14.57.60
running: sanity-lnet ONLY=217
run_suite sanity-lnet /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
-----============= acceptance-small: sanity-lnet ============----- Sat Jan 29 01:55:12 UTC 2022
Running: bash /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
excepting tests:
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
Stopping clients: ct7-adm /mnt/lustre (opts:-f)
Stopping clients: ct7-adm /mnt/lustre2 (opts:-f)
modules unloaded.
ip netns exec test_ns ip addr add 10.1.2.3/31 dev test1pg
ip netns exec test_ns ip link set test1pg up
Loading modules from /home/hornc/lustre-filesystem/lustre
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'networks=tcp(eth1) accept=all'
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
quota/lquota options: 'hash_lqs_cur_bits=3'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.73.10.10@tcp
          status: up
          interfaces:
              0: eth1
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
       valid_lft 73891sec preferred_lft 73891sec
    inet6 fe80::5054:ff:fe4d:77d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:27:de:86 brd ff:ff:ff:ff:ff:ff
    inet 10.73.10.10/24 brd 10.73.10.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe27:de86/64 scope link
       valid_lft forever preferred_lft forever
6: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:fb:87:4d:80:ea brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::90fb:87ff:fe4d:80ea/64 scope link
       valid_lft forever preferred_lft forever
Cleaning up LNet
modules unloaded.

== sanity-lnet test 217: Don't leak memory when discovering peer with nnis <= 1 ========================================================== 01:55:17 (1643421317)
Loading LNet and configuring DLC
../lnet/lnet/lnet options: 'networks=tcp(eth1) accept=all'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl lnet configure
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl discover 0@lo
discover:
    - primary nid: 0@lo
      Multi-Rail: True
      peer ni:
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded

[13367.438649] LNetError: 12268:0:(module.c:919:libcfs_exit()) Portals memory leaked: 322 bytes
mv: cannot stat '/tmp/debug': No such file or directory
Memory leaks detected
 sanity-lnet test_217: @@@@@@ FAIL: test_217 failed with 254
  Trace dump:
  = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6386:error()
  = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6690:run_one()
  = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6737:run_one_logged()
  = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6563:run_test()
  = /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh:2349:main()
Dumping lctl log to /tmp/test_logs/2022-01-29/015510/sanity-lnet.test_217.*.1643421319.log
Dumping logs only on local client.
test_217 returned 1
FAIL 217 (2s)
Cleaning up LNet

[13367.438649] LNetError: 12268:0:(module.c:919:libcfs_exit()) Portals memory leaked: 322 bytes
mv: cannot stat '/tmp/debug': No such file or directory
Memory leaks detected
sanity-lnet returned 254
Finished at Sat Jan 29 01:55:19 UTC 2022 in 9s
./auster: completed with rc 0
[root@ct7-adm tests]#

Apply fix and re-test:

[hornc@ct7-adm lustre-filesystem]$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/52/46052/2 && git reset --hard FETCH_HEAD
From https://review.whamcloud.com/fs/lustre-release
 * branch                  refs/changes/52/46052/2 -> FETCH_HEAD
HEAD is now at 1eecd524de LU-15440 lnet: lnet_peer_data_present() memory leak
[hornc@ct7-adm lustre-filesystem]$ make -j 32
...
[root@ct7-adm tests]# ./auster -N -v sanity-lnet --only 217
Started at Sat Jan 29 02:00:49 UTC 2022
ct7-adm: executing check_logdir /tmp/test_logs/2022-01-29/020048
Logging to shared log directory: /tmp/test_logs/2022-01-29/020048
ct7-adm: executing yml_node
IOC_LIBCFS_GET_NI error 22: Invalid argument
Client: 2.14.57.60
MDS: 2.14.57.60
OSS: 2.14.57.60
running: sanity-lnet ONLY=217
run_suite sanity-lnet /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
-----============= acceptance-small: sanity-lnet ============----- Sat Jan 29 02:00:51 UTC 2022
Running: bash /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
excepting tests:
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
Stopping clients: ct7-adm /mnt/lustre (opts:-f)
Stopping clients: ct7-adm /mnt/lustre2 (opts:-f)
modules unloaded.
ip netns exec test_ns ip addr add 10.1.2.3/31 dev test1pg
ip netns exec test_ns ip link set test1pg up
Loading modules from /home/hornc/lustre-filesystem/lustre
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'networks=tcp(eth1) accept=all'
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
quota/lquota options: 'hash_lqs_cur_bits=3'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.73.10.10@tcp
          status: up
          interfaces:
              0: eth1
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
       valid_lft 73552sec preferred_lft 73552sec
    inet6 fe80::5054:ff:fe4d:77d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:27:de:86 brd ff:ff:ff:ff:ff:ff
    inet 10.73.10.10/24 brd 10.73.10.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe27:de86/64 scope link
       valid_lft forever preferred_lft forever
7: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 5e:ee:6e:41:3f:ff brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::5cee:6eff:fe41:3fff/64 scope link
       valid_lft forever preferred_lft forever
Cleaning up LNet
modules unloaded.

== sanity-lnet test 217: Don't leak memory when discovering peer with nnis <= 1 ========================================================== 02:00:55 (1643421655)
Loading LNet and configuring DLC
../lnet/lnet/lnet options: 'networks=tcp(eth1) accept=all'
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl lnet configure
/home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl discover 0@lo
discover:
    - primary nid: 0@lo
      Multi-Rail: True
      peer ni:
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
modules unloaded.
PASS 217 (2s)
== sanity-lnet test complete, duration 6 sec ============= 02:00:57 (1643421657)
Cleaning up LNet
modules unloaded.
sanity-lnet returned 0
Finished at Sat Jan 29 02:00:58 UTC 2022 in 10s
./auster: completed with rc 0
[root@ct7-adm tests]#

Comment by Gerrit Updater [ 07/Feb/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46052/
Subject: LU-15440 lnet: lnet_peer_data_present() memory leak
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 56384a4fc39ff99c8abb3538f93d303f2be6ab45

Comment by Peter Jones [ 07/Feb/22 ]

Landed for 2.15
