[LU-9913] conf-sanity tests 31 and 35a fail with “LNetError: 8653:0:(module.c:689:libcfs_exit()) Portals memory leaked: 184 bytes” Created: 24/Aug/17  Updated: 25/Aug/17  Resolved: 25/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Several conf-sanity tests are failing with an LNet memory leak.

conf-sanity test_31 fails with the following error found in the test_log:

[13800.637071] LNetError: 14072:0:(module.c:689:libcfs_exit()) Portals memory leaked: 184 bytes
mv: cannot stat '/tmp/debug': No such file or directory
Memory leaks detected
 conf-sanity test_31: @@@@@@ FAIL: cleanup failed with rc 203 

We also see conf-sanity tests 0, 35a, and 78 fail with the same memory leak error.
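
Because the leak message has a regular shape, checking a batch of test logs for it can be scripted. The following is a minimal sketch in Python, not part of the Lustre test framework: the function name `find_portals_leaks` is hypothetical, and the pattern is modeled only on the single log line quoted above, so other Lustre versions may format the message differently.

```python
import re

# Pattern derived from the one log line quoted in this ticket (assumption:
# the message keeps this layout in other runs).
LEAK_RE = re.compile(
    r"LNetError: (?P<pid>\d+):\d+:\((?P<src>.+?)\) "
    r"Portals memory leaked: (?P<bytes>\d+) bytes"
)


def find_portals_leaks(log_text):
    """Return a list of (pid, source_location, leaked_bytes) tuples.

    Hypothetical helper for illustration; it only scans text and does not
    interact with Lustre or LNet in any way.
    """
    leaks = []
    for match in LEAK_RE.finditer(log_text):
        leaks.append(
            (
                int(match.group("pid")),
                match.group("src"),
                int(match.group("bytes")),
            )
        )
    return leaks
```

Passing the contents of a test_log to `find_portals_leaks` yields one tuple per leak report, which makes it easy to list the affected test sets.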

These tests started failing on August 22, 2017. The logs for the first few failures are at
https://testing.hpdd.intel.com/test_sets/b644f444-87ab-11e7-b4b0-5254006e85c2
https://testing.hpdd.intel.com/test_sets/17c3eaf0-87c9-11e7-b3ca-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7c038e52-87ca-11e7-b4b0-5254006e85c2



 Comments   
Comment by Gerrit Updater [ 24/Aug/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/28695
Subject: LU-9913 lnet: balance references in lnet_discover_peer_locked()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 966e33c077f504b32df3687b103173f6eb2fb35f

Comment by John Hammond [ 24/Aug/17 ]

Copied over from LU-9909.

I bisected this locally by running conf-sanity 35a. The leak was introduced by commit 0f1aaad4c1b4447ee5097b8bb79a49d09eaa23c2 (https://review.whamcloud.com/25789, LU-9480 lnet: implement Peer Discovery). Unfortunately, the leak finder doesn't work for LNet allocations, but the leaked object is an LNet peer.
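
For readers unfamiliar with this class of bug, here is a generic illustration of how an unbalanced reference leaks an object. It is a toy sketch in Python, not LNet code: the `Peer` class and both `discover_*` functions are hypothetical, and the actual change made to lnet_discover_peer_locked() in patch 28695 is not reproduced in this ticket.

```python
class Peer:
    """Toy reference-counted object (hypothetical stand-in, not an LNet peer)."""

    def __init__(self):
        self.refcount = 1   # reference held by whoever created the object
        self.freed = False

    def addref(self):
        self.refcount += 1

    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True   # in C, the memory would be released here


def discover_leaky(peer, fail):
    """Takes a reference but forgets to drop it on the error path."""
    peer.addref()
    if fail:
        return -1           # BUG: early return skips decref()
    peer.decref()
    return 0


def discover_balanced(peer, fail):
    """Drops the reference it took on every exit path."""
    peer.addref()
    try:
        if fail:
            return -1
        return 0
    finally:
        peer.decref()
```

With `discover_leaky`, the object is never freed after a failure, even once the creator drops its own reference. At module unload that kind of stray reference shows up as leaked memory, as in the libcfs_exit() message above.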

Comment by Olaf Weber [ 25/Aug/17 ]

I agree with your diagnosis.

Comment by Quentin Bouget [ 25/Aug/17 ]

test_90a seems affected too: https://testing.hpdd.intel.com/test_sets/c0ce9d7c-8916-11e7-b50a-5254006e85c2

Comment by Quentin Bouget [ 25/Aug/17 ]

It would seem some test suites are not even able to launch due to this same bug: https://testing.hpdd.intel.com/test_sets/e7473514-8915-11e7-b94a-5254006e85c2. The same LNetError happens and fails the test suite early (e.g. https://testing.hpdd.intel.com/test_logs/ebb5656c-8915-11e7-b94a-5254006e85c2/show_text)

Comment by Gerrit Updater [ 25/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28695/
Subject: LU-9913 lnet: balance references in lnet_discover_peer_locked()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1c45d9051764e0637ba90b3db06ba8fa37722916

Comment by Peter Jones [ 25/Aug/17 ]

Landed for 2.11
