[LU-16530] OOM on routers with a faulty link/interface with 1 node Created: 03/Feb/23 Updated: 26/Jul/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Etienne Aujames | Assignee: | Cyril Bordage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | LNet, ko2iblnd, lnet, router | ||
| Environment: |
Production, Lustre 2.12.7 on router and computes, Lustre 2.12.9 + patches on servers |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
An LNet router crashes regularly with OOM on a compute partition at the CEA.

Environment:

x32 infiniband x12 infiniband ~ x100 computes <--o2ib1--> routers <--o2ib0--> servers

peer_credits = 42
discovery = 0
health_sensitivity = 0
transaction_timeout = 50
retry_count = 0

Router RAM amount: 48GB

Kdumps information:

I found a peer interface with a server NID that has 42 messages blocked on tx:

So it seems to be a connection leak.

Analysis

1. The compute node has an issue and does not answer (or only partially answers) the router.

Can someone help me with this? |
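The tuning values in the description are given with short names. As a rough sketch, assuming they are applied through lnet and ko2iblnd module parameters (the ticket does not say how the production routers were configured), the helper below renders them as modprobe options lines. The function name `modprobe_lines` and the `reported` dictionary are hypothetical and only serve the illustration.

```python
# Hypothetical helper: maps the short names used in the description to
# lnet/ko2iblnd module parameters and renders modprobe "options" lines.
# Assumption: the settings are applied as module parameters.

# Values as reported in the description.
reported = {
    "peer_credits": 42,
    "discovery": 0,
    "health_sensitivity": 0,
    "transaction_timeout": 50,
    "retry_count": 0,
}


def modprobe_lines(settings):
    """Return modprobe 'options' lines for the given settings."""
    lnet_opts = [
        # Discovery is expressed inversely as a module parameter:
        # discovery = 0 means lnet_peer_discovery_disabled=1.
        f"lnet_peer_discovery_disabled={0 if settings['discovery'] else 1}",
        f"lnet_health_sensitivity={settings['health_sensitivity']}",
        f"lnet_transaction_timeout={settings['transaction_timeout']}",
        f"lnet_retry_count={settings['retry_count']}",
    ]
    ko2iblnd_opts = [f"peer_credits={settings['peer_credits']}"]
    return [
        "options lnet " + " ".join(lnet_opts),
        "options ko2iblnd " + " ".join(ko2iblnd_opts),
    ]


if __name__ == "__main__":
    print("\n".join(modprobe_lines(reported)))
```

Keeping the values in one dictionary makes it easier to compare the router settings with those on the computes and servers.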
| Comments |
| Comment by Peter Jones [ 03/Feb/23 ] |
|
Cyril, can you please advise? Thanks, Peter |
| Comment by Etienne Aujames [ 01/Mar/23 ] |
|
Hi,

We successfully reproduced the issue on a test filesystem with InfiniBand.

Configuration

Lustre 2.12.7 LTS on all nodes.

LNet configuration:
options lnet lnet_peer_discovery_disabled=1 lnet_health_sensitivity=0
options ko2iblnd peer_credits=42

Servers use o2ib50.

Reproducer

That's it!

On the router:

client2 is not able to communicate with the servers. All the Rx peer_ni of the servers are saturated on the router (peer_buffer_credit < 0). The servers keep trying to reconnect to the clients.

Remarks

I tried increasing peer_buffer_credit and setting lnet_health_sensitivity; this does not change the behavior. |
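The saturation described in the comment above (negative router buffer credits for the server peers) could be spotted with a small script. This is only a sketch: it assumes the LNet peers table has the column layout `nid refs state last max rtr min tx min queue` and that the `rtr` column holds the per-peer router buffer credits. The function name `saturated_peers`, the sample table, and the NIDs in it are hypothetical and are not taken from the crash dumps.

```python
# Sketch: flag peers whose router-buffer ("rtr") or tx credits are negative.
# Assumption: the input has the column layout
#   nid refs state last max rtr min tx min queue
# The sample data below is made up for illustration.

SAMPLE = """\
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.0.1@o2ib50             4    up    -1    42    -3   -10    42    40 0
10.0.1.2@o2ib1              1    up    -1    42    42    42    42    42 0
"""


def saturated_peers(peers_text):
    """Return (nid, rtr, tx) for peers with negative rtr or tx credits."""
    flagged = []
    for line in peers_text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 10 or fields[0] == "nid":
            continue
        try:
            rtr, tx = int(fields[5]), int(fields[7])
        except ValueError:
            # Ignore rows that do not match the assumed layout.
            continue
        if rtr < 0 or tx < 0:
            flagged.append((fields[0], rtr, tx))
    return flagged


if __name__ == "__main__":
    for nid, rtr, tx in saturated_peers(SAMPLE):
        print(f"{nid}: rtr={rtr} tx={tx}")
```

Running such a check periodically on the router would show whether the server peers go negative before memory usage starts to grow.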
| Comment by Cyril Bordage [ 01/Mar/23 ] |
|
Hello Etienne, thank you for the reproducer. I will look into it when I am back in one week. |
| Comment by Etienne Aujames [ 30/Mar/23 ] |
|
Hi Cyril, have you had time to look into this issue? |
| Comment by Cyril Bordage [ 30/Mar/23 ] |
|
Hello Etienne, I did take a look but then was on something else… Sorry about that. I plan to work on it again very soon. Thank you. |
| Comment by Cyril Bordage [ 25/Apr/23 ] |
|
Hello Etienne, do you have logs of your tests? Is your setup still available? Thank you. |
| Comment by Etienne Aujames [ 27/Apr/23 ] |
|
Hi Cyril, I can't get you a debug_log (maybe some dmesg output, if you want). |
| Comment by Cyril Bordage [ 27/Apr/23 ] |
|
Hello Etienne, yes, dmesg could be useful. Thank you. |
| Comment by Etienne Aujames [ 26/Jul/23 ] |
|
Hi Cyril, Sorry for the delay. I have submitted 2 dmesg logs:
Those are logs from 2 crashes of the router in production. The situation was stabilized by changing the CPU of the faulty client node. |
| Comment by Etienne Aujames [ 26/Jul/23 ] |
|
Here is some context for the logs:
|