[LU-9094] OOM caused by huge number of peers in case of INVALID_SERVICE_ID Created: 09/Feb/17  Updated: 01/Mar/17  Resolved: 01/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Sergey Cheremencev Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Comments   
Comment by Sergey Cheremencev [ 09/Feb/17 ]

Please change the Topic to something like "OOM caused by huge number of peers in case of INVALID_SERVICE_ID".

Comment by Sergey Cheremencev [ 09/Feb/17 ]

A peer shouldn't be killed every time a connection fails with INVALID_SERVICE_ID. Doing so produces
a huge number of peers for the same nid and may cause an OOM.

The issue can be reproduced simply by running lctl ping against a node where lnet is not loaded (ib should be up).

[root@pink03 tests]# cat ~/oom.sh
while true; do
	lctl ping 172.18.56.129@o2ib0
done
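
A bounded variant of the loop above may make the memory growth easier to observe. This is only a sketch and not part of the original reproducer: the function names `slab_kb` and `run_pings`, the `PING_CMD` override, and the use of the `Slab:` line in /proc/meminfo as an indicator are all assumptions made for illustration.

```shell
# Hypothetical helper, not from the original report.
# PING_CMD defaults to "lctl ping"; override it to dry-run without LNet.
PING_CMD=${PING_CMD:-"lctl ping"}

# Print current slab usage in kB (assumed to be a useful growth indicator).
slab_kb() {
	awk '/^Slab:/ {print $2}' /proc/meminfo 2>/dev/null
}

# run_pings <nid> <count>: issue <count> pings and report slab usage
# sampled before and after. Runs in a subshell so variables stay local.
run_pings() (
	nid=$1
	count=$2
	before=$(slab_kb)
	i=0
	while [ "$i" -lt "$count" ]; do
		# PING_CMD is intentionally unquoted so "lctl ping" splits into words.
		$PING_CMD "$nid" >/dev/null 2>&1
		i=$((i + 1))
	done
	after=$(slab_kb)
	echo "pings=$count slab_before_kB=$before slab_after_kB=$after"
)
```

For example, `run_pings 172.18.56.129@o2ib0 10000` on the client would print the two slab samples. A large gap between them only suggests peer accumulation; it does not prove it.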

The OOM was frequently seen with mlnx-ofa-kernel-2.3, where mlx4_cq_free used
an RCU mechanism. In older mlx4 versions, mlx4_cq_free was reworked to mitigate
the issue and doesn't use RCU anymore.
In any case, we shouldn't create and remove tons of peers with the same nid, so that performance and memory are not affected.

The OOM issue should also be reproducible on all mlx5 setups, regardless of the mlnx-ofa-kernel version.
I reproduced it on mlnx-ofa_kernel-3.4 with mlx5.

I have prepared and tested a set of patches for it and will send them shortly.

Comment by Gerrit Updater [ 10/Feb/17 ]

Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25375
Subject: LU-9094 lnet: remove ni from lnet_finalize
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a312fcd635e63a79c53dd072ece9a066c1baf342

Comment by Gerrit Updater [ 10/Feb/17 ]

Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25376
Subject: LU-9094 o2iblnd: kill timedout txs from ibp_tx_queue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fdee92bf4859793dc3fe4911b491ad9d0b21533e

Comment by Gerrit Updater [ 10/Feb/17 ]

Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25378
Subject: LU-9094 o2iblnd: reconnect peer for REJ_INVALID_SERVICE_ID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 496c48a3daa21d0423a387625821de03a57db443

Comment by Gerrit Updater [ 18/Feb/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25378/
Subject: LU-9094 o2iblnd: reconnect peer for REJ_INVALID_SERVICE_ID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 603aa7a1df6ee6ce6fe0d501a8b2bd1bfdf43bb8

Comment by Sergey Cheremencev [ 20/Feb/17 ]

Please note that "reconnect peer for REJ_INVALID_SERVICE_ID" without "kill timedout txs from ibp_tx_queue" causes lctl ping to hang when lnet is not loaded on the target node (lctl ping waits indefinitely).
That is why I pushed all 3 patches together.
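
One way to check for the hang described above is to wrap the ping in the coreutils `timeout` utility, which exits with status 124 when the time limit expires. The sketch below is illustrative only: the function name `ping_check`, the `PING_CMD` override, and the 60-second default limit are assumptions, not values from the patches.

```shell
# Hypothetical helper, not from the original patches.
# PING_CMD defaults to "lctl ping"; override it for a dry run.
PING_CMD=${PING_CMD:-"lctl ping"}

# ping_check <nid> [limit_seconds]: report whether the ping returned
# within the limit. Runs in a subshell so variables stay local.
ping_check() (
	nid=$1
	limit=${2:-60}
	# timeout(1) exits with 124 if the command is still running at the limit.
	timeout "$limit" $PING_CMD "$nid" >/dev/null 2>&1
	rc=$?
	if [ "$rc" -eq 124 ]; then
		echo "hung"
	else
		echo "returned rc=$rc"
	fi
)
```

With only the reconnect patch applied, `ping_check 172.18.56.129@o2ib0` against a node without lnet would be expected to print `hung`. With all three patches, the ping should return with a non-zero status instead.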

Comment by Gerrit Updater [ 01/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25375/
Subject: LU-9094 lnet: remove ni from lnet_finalize
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dab78a9efd05e4f22fc83232bdadce347d3dafda

Comment by Gerrit Updater [ 01/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25376/
Subject: LU-9094 o2iblnd: kill timedout txs from ibp_tx_queue
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 824120da92fe8feb4b4308a136e33ec65fe3b635
