[LU-9094] OOM caused by huge number of peers in case of INVALID_SERVICE_ID Created: 09/Feb/17 Updated: 01/Mar/17 Resolved: 01/Mar/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergey Cheremencev | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Comments |
| Comment by Sergey Cheremencev [ 09/Feb/17 ] |
|
Please change the Topic to something like "OOM caused by huge number of peers in case of INVALID_SERVICE_ID". |
| Comment by Sergey Cheremencev [ 09/Feb/17 ] |
|
The peer shouldn't be killed each time in case of INVALID_SERVICE_ID. This produces

The issue can be reproduced simply by running lctl ping against a node where lnet is not loaded (ib should be up).

[root@pink03 tests]# cat ~/oom.sh
while true; do
lctl ping 172.18.56.129@o2ib0
done

The OOM was frequently seen with mlnx-ofa-kernel-2.3 where was used

The OOM issue should also be reproducible on all mlx5, regardless of the mlnx-ofa-kernel version. I have prepared and tested a set of patches for it and will send them shortly. |
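A bounded variant of the reproducer loop can be handy when checking a fix, since it stops on its own and reports how many pings failed. This is only a sketch: the helper name `ping_loop` and the iteration count are illustrative and not part of the original report; only the `lctl ping <nid>` invocation is taken from it.

```shell
# Hypothetical helper (not from the original report): run "lctl ping" a fixed
# number of times and print how many attempts returned a non-zero status.
# The body runs in a subshell so its variables do not leak into the caller.
ping_loop() (
    nid=$1
    count=$2
    failed=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # Any non-zero exit status from lctl ping is counted as a failure.
        lctl ping "$nid" >/dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "$failed"
)
```

For example, `ping_loop 172.18.56.129@o2ib0 1000` could be run while watching memory usage on the pinging node (for instance via `/proc/meminfo`).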
| Comment by Gerrit Updater [ 10/Feb/17 ] |
|
Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25375 |
| Comment by Gerrit Updater [ 10/Feb/17 ] |
|
Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25376 |
| Comment by Gerrit Updater [ 10/Feb/17 ] |
|
Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25378 |
| Comment by Gerrit Updater [ 18/Feb/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25378/ |
| Comment by Sergey Cheremencev [ 20/Feb/17 ] |
|
Please note that "reconnect peer for REJ_INVALID_SERVICE_ID" without "kill timedout txs from ibp_tx_queue" causes lctl ping to hang when lnet is not loaded on the target node (lctl ping waits indefinitely). |
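To notice a hang like this without blocking a terminal, the ping can be wrapped in a deadline. The sketch below assumes the coreutils `timeout` utility is available; the helper name `run_with_deadline` and the 10-second limit in the usage line are illustrative choices, not something taken from the patches.

```shell
# Hypothetical helper: run a command under a time limit and report whether it
# finished, failed, or had to be stopped. Relies on coreutils "timeout", which
# exits with status 124 when the limit expires.
run_with_deadline() {
    secs=$1
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hung"      # the command was still running when the limit expired
    elif [ "$rc" -eq 0 ]; then
        echo "ok"        # the command completed successfully in time
    else
        echo "failed"    # the command returned an error before the limit
    fi
}
```

A possible invocation is `run_with_deadline 10 lctl ping 172.18.56.129@o2ib0`. If the process is stuck in the kernel and does not react to the signal sent by `timeout`, the wrapper itself may not return.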
| Comment by Gerrit Updater [ 01/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25375/ |
| Comment by Gerrit Updater [ 01/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25376/ |