A peer shouldn't be killed every time INVALID_SERVICE_ID occurs. Doing so produces
a huge number of peers for the same nid and may cause an OOM.
The issue can be easily reproduced by running lctl ping against a node where lnet is not loaded (IB should be up).
The OOM was frequently seen with mlnx-ofa-kernel-2.3, where an RCU
mechanism was used in mlx4_cq_free. In older mlx4 versions, mlx4_cq_free
was reworked to mitigate the issue and doesn't use RCU anymore.
In any case, we shouldn't create and remove tons of peers with the same nid, so that performance and memory are not affected.
Also, the OOM issue should be reproducible on all mlx5, regardless of the mlnx-ofa-kernel version.
I reproduced it on mlnx-ofa_kernel-3.4 with mlx5.
I prepared and tested a set of patches for it and will send them soon.
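To illustrate the peer churn described above, here is a minimal toy model in C. It is not the o2iblnd code and not the patches mentioned here: `struct toy_peer`, `TOY_NID`, `simulate_peak_peers` and the `grace` parameter are all hypothetical names, and the "grace period" only loosely imitates memory that is reclaimed late (as with an RCU-style deferred free). It counts the largest number of peer objects held in memory at once when every connection attempt to one nid fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy types, not taken from o2iblnd. */
#define TOY_NID 0x1234ULL

struct toy_peer {
    uint64_t nid;
    struct toy_peer *next; /* link in the deferred-free list */
};

/* Free every peer on the deferred list and return how many were freed. */
static size_t reclaim(struct toy_peer **deferred)
{
    size_t freed = 0;

    while (*deferred != NULL) {
        struct toy_peer *next = (*deferred)->next;

        free(*deferred);
        *deferred = next;
        freed++;
    }
    return freed;
}

/*
 * Simulate `attempts` failed connection attempts to a single nid.
 * kill_on_failure != 0: the peer is dropped after every failure and a new
 *                       one is allocated for the next attempt.
 * kill_on_failure == 0: the same peer is kept and reused.
 * Dropped peers are only reclaimed every `grace` attempts (0 = never
 * during the run). Returns the peak number of peers in memory.
 */
size_t simulate_peak_peers(size_t attempts, size_t grace, int kill_on_failure)
{
    struct toy_peer *peer = NULL;     /* current peer for TOY_NID */
    struct toy_peer *deferred = NULL; /* dropped, not yet reclaimed */
    size_t live = 0, peak = 0;

    for (size_t i = 1; i <= attempts; i++) {
        if (peer == NULL) {
            peer = calloc(1, sizeof(*peer));
            assert(peer != NULL); /* toy code: abort on allocation failure */
            peer->nid = TOY_NID;
            live++;
            if (live > peak)
                peak = live;
        }
        /* The attempt fails here; decide what happens to the peer. */
        if (kill_on_failure) {
            peer->next = deferred;
            deferred = peer;
            peer = NULL;
        }
        /* End of a simulated grace period: dropped peers are reclaimed. */
        if (grace != 0 && i % grace == 0)
            live -= reclaim(&deferred);
    }

    /* Clean up whatever is left so the simulation itself does not leak. */
    reclaim(&deferred);
    free(peer);
    return peak;
}
```

With 1000 attempts and a grace period of 100, `simulate_peak_peers(1000, 100, 1)` peaks at 100 peers, while `simulate_peak_peers(1000, 100, 0)` never holds more than 1. The longer reclamation is delayed, the more the kill-on-failure policy accumulates.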
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25376/
Subject: LU-9094 o2iblnd: kill timedout txs from ibp_tx_queue
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 824120da92fe8feb4b4308a136e33ec65fe3b635