[LU-7390] Router memory leak if we start a new router on an operational configuration Created: 05/Nov/15 Updated: 28/Jun/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Antoine Percher | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
redhat7 mlx5 EDR and Connect-IB |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Router memory leak if we start a new router on an operational configuration.

Configuration:
    Lustre server 2.5.3.90 with one IB card and 2 IP addresses: QQ.P.BBO.SY QQ.P.BBB.SY
    2 Lustre routers 2.7 with 4 IB cards and 4 IP addresses
    ~130 Lustre clients in 2.7 with one IB card and 2 IP addresses: JO.BOO.CX.CY JO.BOB.CX.CY

We start all servers, one router and all clients, and wait. We then start the other router with modprobe lustre; the router never starts.

    KERNEL: /usr/lib/debug/lib/modules/3.10.0-229.7.2.el7.x86_64/vmlinux
    crash> kmem -i
    TOTAL SWAP 0 0
    ----

There are a lot of zombie connections on the list:

    crash> p kiblnd_data.kib_connd_zombies

All the connections have ibc_state = 0x5, and the Lustre debug trace shows some faulted connections:

    [root@neel121 127.0.0.1-2015.09.23-09:54:45]# grep kiblnd_rx_complete lustre.log

I don't understand why a lot of connections have an EIO error, but that explains the memory leak ....

The routers work fine if we start all routers before starting Lustre on the clients.

I find on Jira lustre Intel database the

Lustre version :
Lustre configuration router :
Client:
Server:

On the server side, there are a lot of other routes that I did not report in LNET_ROUTER_OPTIONS. |
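The grep over lustre.log in the description can be extended to count failed receives per peer, which may help show whether the EIO errors come from a few clients or from all of them. The following is a minimal sketch only. It assumes the debug log contains lines of the form "... kiblnd_rx_complete()) Rx from <nid> failed: <status>"; that message shape is an assumption and is not shown in this ticket. The function name count_rx_failures and the sample NIDs are hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: failed receives are logged as
#   "...kiblnd_rx_complete()) Rx from <nid> failed: <status>"
# Adjust the pattern if the actual debug log differs.
RX_FAILED = re.compile(r"kiblnd_rx_complete.*?Rx from (\S+) failed: (-?\d+)")


def count_rx_failures(lines):
    """Count failed receives, keyed by (peer NID, status code)."""
    counts = Counter()
    for line in lines:
        match = RX_FAILED.search(line)
        if match:
            nid, status = match.group(1), int(match.group(2))
            counts[(nid, status)] += 1
    return counts


# Hypothetical sample lines, for illustration only (not taken from the ticket).
sample = [
    "(o2iblnd_cb.c:kiblnd_rx_complete()) Rx from 10.0.0.1@o2ib failed: 5",
    "(o2iblnd_cb.c:kiblnd_rx_complete()) Rx from 10.0.0.1@o2ib failed: 5",
    "(o2iblnd_cb.c:kiblnd_rx_complete()) Rx from 10.0.0.2@o2ib failed: 5",
    "(o2iblnd_cb.c:kiblnd_post_rx()) unrelated message",
]

for (nid, status), n in count_rx_failures(sample).most_common():
    print(n, nid, status)
```

To run it on a real log, pass an open file object instead of the sample list, for example count_rx_failures(open("lustre.log", errors="replace")).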
| Comments |
| Comment by Bruno Faccini (Inactive) [ 05/Nov/15 ] |
|
Since I was also involved in this problem while on-site, I can complete the problem description by adding what occurs on the Clients after the problem on the new router.

First of all, it seems that the problem can be reproduced simply by [re-]starting one of the LNET routers in the config. When this occurs, the router quite quickly triggers an OOM situation, which is likely caused by a huge number of allocs in the LNet/ko2iblnd layers seen in the Lustre debug trace, possibly corresponding to numerous buffers in the IB layer ….

Clients then end up in a state where Lustre cannot be fully shut down: the filesystem can be unmounted OK, but lustre_rmmod then stalls because the underlying rmmod is stuck with the following stack trace:

    __schedule()
    schedule()
    schedule_timeout()
    __down_common()
    __down()
    lnet_router_checker_stop()
    LNetNIFini()
    ptlrpc_ni_fini()
    ptlrpc_exit_portals()
    cleanup_module()
    sys_delete_module()
    system_call_fastpath()

while at the same time the « router_checker » thread is stuck with the following stack trace:

    __schedule()
    schedule()
    schedule_timeout()
    lnet_prune_rc_data()
    lnet_router_checker()
    kthread()

At the same time, the « Waiting for rc buffers to unlink\n » msg is repeatedly printed on the Console of the Clients where the [lustre_]rmmod is stuck.

It is also interesting that this same behavior can occur even if the new router being started has been dynamically removed from the Clients' LNET config using both "lnetctl route del [--net,--gateway]" cmds.

Amir has already suggested running the "lctl net down" cmd before the lustre_rmmod cmd to see if it helps, but the site presently has no more dedicated test slots to give it a try.

Last, I had an interesting update from the site about their current network config/cabling which may be of interest : |
| Comment by Bruno Travouillon (Inactive) [ 05/Nov/15 ] |
|
FTR, we hit a similar issue with RHEL6/OFED3.12/Lustre 2.5.3.90 on some OSSs. Each OSS has 2 IB boards/attachments, both connected to the same fabric but each with an IP address in a different Client network. The memory leak happened during production. We were not trying to fail over the OSTs or stop the Lustre filesystem. During my research, I found I have not been able to reproduce yet, nor to test the proposed patch. |
| Comment by Joseph Gmitter (Inactive) [ 05/Nov/15 ] |
|
Hi Amir, |
| Comment by Amir Shehata (Inactive) [ 05/Nov/15 ] |
|
I agree with Bruno. When a router is started it'll get many connection requests, which could exploit the issue fixed in Could you apply this patch and see if it resolves the issue? |
| Comment by Amir Shehata (Inactive) [ 06/Nov/15 ] |
|
After discussing it internally, it seems that we're seeing these OOM issues on multiple different sites. All sites are using the mlx5 stack. Is it possible to roll back to mlx4 on the routers? |
| Comment by James A Simmons [ 06/Nov/15 ] |
|
Not if you have a Connect-IB card. Mind you, most of our systems use mlx5 and we don't see this problem. |
| Comment by Bruno Travouillon (Inactive) [ 08/Nov/15 ] |
|
We have ConnectX-4 and Connect-IB cards on the routers, so we are stuck with mlx5. The similar memory leaks reported in As per your request, our engineering should apply http://review.whamcloud.com/#/c/14600 quickly to see if it can solve the issue. |
| Comment by Doug Oucharek (Inactive) [ 09/Nov/15 ] |
|
Does this system have the patch to |
| Comment by Gregoire Pichon [ 13/Nov/15 ] |
|
No, the Lustre version installed on this system does not include the patch to |
| Comment by Chris Horn [ 17/Dec/15 ] |
|
FWIW, we (Cray) seem to be hitting this issue as well and the patch http://review.whamcloud.com/#/c/14600 did not resolve the issue. |
| Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ] |
|
Patch 14600 has been "reinvented" as: http://review.whamcloud.com/#/c/17661. This is new and needs validation. I need to spend some time to determine if it can address this ticket. However, if you have time, please remove 14600 and apply 17661 and see if this addresses your problem. |