[LU-6251] Mellanox / o2iblnd causes an OOM on OST node Created: 16/Feb/15 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.5.1-based Lustre code. |
||
| Severity: | 3 |
| Rank (Obsolete): | 17504 |
| Description |
|
While investigating an OOM on an OST node, we found a large number of allocations of 532480 and 266240 bytes. Example of the vm_struct for a memory region of size 266240:
crash> vm_struct ffff880019c542c0
struct vm_struct {
next = 0xffff880588f29900,
addr = 0xffffc904a626d000,
size = 266240,
flags = 4,
pages = 0x0,
nr_pages = 0,
phys_addr = 0,
caller = 0xffffffffa00b7136 <mlx4_buf_alloc+870>
}
99% of the memory regions of size 266240 and 532480 have caller = 0xffffffffa00b7136 <mlx4_buf_alloc+870>: 31042 of 31296 regions. The kiblnd_connd thread is blocked in the following path:
PID: 83859 TASK: ffff8807d64ca040 CPU: 0 COMMAND: "kiblnd_connd"
#0 [ffff8807b2835a90] schedule at ffffffff815253c0
#1 [ffff8807b2835b58] schedule_timeout at ffffffff815262a5
#2 [ffff8807b2835c08] wait_for_common at ffffffff81525f23
#3 [ffff8807b2835c98] wait_for_completion at ffffffff8152603d
#4 [ffff8807b2835ca8] synchronize_sched at ffffffff81096e88
#5 [ffff8807b2835cf8] mlx4_cq_free at ffffffffa00bf188 [mlx4_core]
#6 [ffff8807b2835d68] mlx4_ib_destroy_cq at ffffffffa04725f5 [mlx4_ib]
#7 [ffff8807b2835d88] ib_destroy_cq at ffffffffa043de99 [ib_core]
#8 [ffff8807b2835d98] kiblnd_destroy_conn at ffffffffa0acbafc [ko2iblnd]
#9 [ffff8807b2835dd8] kiblnd_connd at ffffffffa0ad5fe1 [ko2iblnd]
#10 [ffff8807b2835ee8] kthread at ffffffff8109ac66
#11 [ffff8807b2835f48] kernel_thread at ffffffff8100c20a
So the thread is blocked on something while destroying an IB connection. The scheduler statistics show how long it has been waiting:
crash> p ((struct task_struct *)0xffff8807d64ca040)->se.cfs_rq->rq->clock
$25 = 230339336880160
crash> p ((struct task_struct *)0xffff8807d64ca040)->se.block_start
$26 = 230337329685261
>>> (230339336880160-230337329685261)/10**9
2
i.e. about 2 seconds per connection. More interesting are the o2iblnd statistics:
crash> kib_net 0xffff8808325e9dc0
struct kib_net {
ibn_list = {
next = 0xffff8807b40a2f40,
prev = 0xffff8807b40a2f40
},
ibn_incarnation = 1423478059211439,
ibn_init = 2,
ibn_shutdown = 0,
ibn_npeers = {
counter = 31042
},
ibn_nconns = {
counter = 31041
},
So there are ~31k peers, yet the tests run on a cluster with 14 real clients and 5 server nodes, where no more than ~20 connections should exist. The zombie list shows where they are:
crash> p &kiblnd_data.kib_connd_zombies
$7 = (struct list_head *) 0xffffffffa0ae7e70 <kiblnd_data+112>
crash> list -H 0xffffffffa0ae7e70 -o kib_conn.ibc_list | wc -l
31030
So essentially all of the memory is consumed by zombie connections, each of which needs more than 2 s to destroy. At 266240-532480 bytes per region, ~31k such regions amount to roughly 8-16 GB of vmalloc'ed memory. |
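The slow teardown matches the stack above: each zombie's CQ teardown waits for a grace period inside mlx4_cq_free(). Below is a minimal sketch of that drain pattern (hypothetical struct, not the real kib_conn or ko2iblnd code), illustrating why a serial drain cannot keep up with a reconnect storm:

/*
 * Minimal sketch (hypothetical struct zombie_conn, not the real kib_conn)
 * of the drain loop implied by the stack trace above.  Each ib_destroy_cq()
 * ends up in mlx4_cq_free(), which waits in synchronize_sched() for a full
 * grace period, so the loop frees connections far more slowly than a
 * reconnect storm can queue them on kib_connd_zombies.
 */
#include <linux/list.h>
#include <linux/slab.h>
#include <rdma/ib_verbs.h>

struct zombie_conn {
	struct list_head ibc_list;  /* linked on kiblnd_data.kib_connd_zombies */
	struct ib_cq	*ibc_cq;    /* CQ whose ~260 KB buffer was vmapped
				     * by mlx4_buf_alloc() */
};

static void connd_drain_zombies(struct list_head *zombies)
{
	while (!list_empty(zombies)) {
		struct zombie_conn *conn;

		conn = list_first_entry(zombies, struct zombie_conn, ibc_list);
		list_del(&conn->ibc_list);

		/* Blocks for ~ms..s per CQ inside the mlx4 driver (see stack). */
		ib_destroy_cq(conn->ibc_cq);
		kfree(conn);
	}
}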
| Comments |
| Comment by Sergey Cheremencev [ 16/Feb/15 ] |
|
mlnx version: mlnx-ofa_kernel-2.3 |
| Comment by Bruno Travouillon (Inactive) [ 21/Aug/15 ] |
|
Alexey, Sergey, Have you been able to troubleshoot this issue? We are hitting a similar issue with 2.5.3.90 and OFED 3.12. |
| Comment by Alexey Lyashkov [ 24/Aug/15 ] |
|
Bruno, not yet. We have hit this issue only once. Liang created patch http://review.whamcloud.com/#/c/14600/ which may help with it, but it doesn't address the original bug this ticket was created for. So if you can reproduce the issue, try that patch and let us know whether it helps. |
| Comment by Alexey Lyashkov [ 25/Aug/15 ] |
|
I had a discussion with the IB guys today. They say LNet has a bug in its IB connect error handling: it may put a connection into the wrong state if some packets are lost during the connect handshake. |
| Comment by Bruno Travouillon (Inactive) [ 27/Aug/15 ] |
|
Thanks Alexey. We still need to reproduce the issue on a test cluster; I will test with the patch afterwards. However, the patch will only relieve the memory pressure and won't solve the underlying issue with the zombie connections. |
| Comment by Alexey Lyashkov [ 27/Aug/15 ] |
|
From my point of view, the zombies are the result of lost handshake packets, so if we fix that problem we will also fix the zombies. |
| Comment by Wesley Yu (Inactive) [ 11/Sep/15 ] |
|
Alexey, Bruno, |
| Comment by Bruno Travouillon (Inactive) [ 30/Sep/15 ] |
|
Wesley, We are still investigating. I should try http://review.whamcloud.com/#/c/14600/ soon. |
| Comment by Doug Oucharek (Inactive) [ 26/Nov/15 ] |
|
Hi Alexey, Has there been any progress on this issue? We are seeing more cases of this OOM with mlx5. I'm starting to suspect that mlx5 is even more aggressive in memory usage, making this problem even worse with newer Mellanox cards. Patch http://review.whamcloud.com/#/c/14600/ does not seem to be enough of a break to allow memory to be released. |
| Comment by Sergey Cheremencev [ 14/Dec/15 ] |
|
The issue is fixed at Seagate after updating mlnx-ofa_kernel from 2.3 to 3.1-1.0.3.
[root@windu-head ~]# pdsh -S -w windu-client[02-11] perfquery | grep PortXmitWait
windu-client02: PortXmitWait:....................0
windu-client07: PortXmitWait:....................0
windu-client03: PortXmitWait:....................0
windu-client10: PortXmitWait:....................0
windu-client06: PortXmitWait:....................0
windu-client11: PortXmitWait:....................298906
windu-client08: PortXmitWait:....................0
windu-client04: PortXmitWait:....................0
windu-client09: PortXmitWait:....................0
windu-client05: PortXmitWait:....................0
# ibqueryerrors
Errors for "winduoem mlx4_0"
   GUID 0x2c903004cb42d port 1: [LinkDownedCounter == 16] [PortXmitDiscards == 32]
Errors for "windu02 mlx4_0"
   GUID 0x1e67030066eb15 port 1: [PortXmitWait == 1]
Errors for "windu07 mlx4_0"
   GUID 0x1e67030066ed9d port 1: [PortXmitWait == 1]
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2590ffffdfac0d port 1: [PortXmitWait == 298906]
Errors for 0xf452140300838740 "SwitchX - Mellanox Technologies"
   GUID 0xf452140300838740 port ALL: [LinkDownedCounter == 255] [PortRcvSwitchRelayErrors == 4951] [PortXmitWait == 4294967295]
   GUID 0xf452140300838740 port 1: [LinkDownedCounter == 1] [PortRcvSwitchRelayErrors == 1]
   GUID 0xf452140300838740 port 4: [LinkDownedCounter == 38] [PortRcvSwitchRelayErrors == 485] [PortXmitWait == 25194846]
   GUID 0xf452140300838740 port 6: [LinkDownedCounter == 75] [PortXmitWait == 2742]
   GUID 0xf452140300838740 port 11: [LinkDownedCounter == 98] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 4484903]
   GUID 0xf452140300838740 port 12: [LinkDownedCounter == 96] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 95352145]
   GUID 0xf452140300838740 port 18: [PortXmitWait == 3689472946]
   GUID 0xf452140300838740 port 20: [LinkDownedCounter == 27]
   GUID 0xf452140300838740 port 22: [LinkDownedCounter == 204] [PortXmitWait == 206951]
   GUID 0xf452140300838740 port 26: [LinkDownedCounter == 42] [PortRcvSwitchRelayErrors == 1009] [PortXmitWait == 7327592]
   GUID 0xf452140300838740 port 28: [LinkDownedCounter == 42] [PortRcvSwitchRelayErrors == 1006] [PortXmitWait == 7555602]
   GUID 0xf452140300838740 port 30: [LinkDownedCounter == 41] [PortRcvSwitchRelayErrors == 1000] [PortXmitWait == 6546015]
   GUID 0xf452140300838740 port 32: [LinkDownedCounter == 194]
   GUID 0xf452140300838740 port 34: [LinkDownedCounter == 36] [PortRcvSwitchRelayErrors == 746] [PortXmitWait == 6957849]
   GUID 0xf452140300838740 port 36: [LinkDownedCounter == 184] [PortXmitWait == 4294967295]
Errors for "windu08 mlx4_0"
   GUID 0x1e670300670add port 1: [PortXmitWait == 1]
Errors for 0xf4521403008386c0 "SwitchX - Mellanox Technologies"
   GUID 0xf4521403008386c0 port ALL: [LinkDownedCounter == 255] [PortRcvSwitchRelayErrors == 5742] [PortXmitWait == 4294967295]
   GUID 0xf4521403008386c0 port 6: [LinkDownedCounter == 26] [PortXmitWait == 2687]
   GUID 0xf4521403008386c0 port 8: [LinkDownedCounter == 176] [PortXmitWait == 1849083011]
   GUID 0xf4521403008386c0 port 10: [LinkDownedCounter == 71] [PortXmitWait == 3901]
   GUID 0xf4521403008386c0 port 11: [LinkDownedCounter == 94] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 14950798]
   GUID 0xf4521403008386c0 port 12: [LinkDownedCounter == 98] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 5555905]
   GUID 0xf4521403008386c0 port 13: [LinkDownedCounter == 38] [PortRcvSwitchRelayErrors == 1007] [PortXmitWait == 7829732]
   GUID 0xf4521403008386c0 port 14: [LinkDownedCounter == 38] [PortRcvSwitchRelayErrors == 1014] [PortXmitWait == 6837648]
   GUID 0xf4521403008386c0 port 15: [LinkDownedCounter == 39] [PortRcvSwitchRelayErrors == 1006] [PortXmitWait == 7558563]
   GUID 0xf4521403008386c0 port 16: [LinkDownedCounter == 42] [PortRcvSwitchRelayErrors == 1004] [PortXmitWait == 6831755]
   GUID 0xf4521403008386c0 port 17: [LinkDownedCounter == 39] [PortRcvSwitchRelayErrors == 1006] [PortXmitWait == 7336329]
   GUID 0xf4521403008386c0 port 20: [PortXmitWait == 3261894682]
   GUID 0xf4521403008386c0 port 30: [LinkDownedCounter == 10] [PortRcvSwitchRelayErrors == 1]
## Summary: 26 nodes checked, 7 bad nodes found
##          96 ports checked, 31 ports have errors beyond threshold
## Thresholds:
## Suppressed:
On the other hand, each connection should be destroyed and freed much faster than it currently is. We could reproduce the problem on the latest master with kernel 2.6.32-431.17.1 and the default mlnx-ofa_kernel-2.3. Also note that mlnx-ofa_kernel 3.1-1.0.3 for unknown reasons has the needed change from RCU to spin locking only for mlx4. |
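For reference, here is a generic sketch of the locking difference described above (illustrative names, not the actual mlx4_core code): with an RCU-protected CQ lookup table, every removal has to wait out a grace period before the buffer can be freed, which is exactly where mlx4_cq_free() blocks in the stack in the description, while a spinlock-protected table lets the destroy path free immediately.

/*
 * Generic illustration of RCU vs. spinlock protection for a CQ lookup
 * table.  Not the actual mlx4_core code; struct demo_cq is hypothetical.
 */
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

struct demo_cq { int cqn; };

/* RCU variant: every destroy pays a full grace period per CQ. */
static void cq_free_rcu(struct radix_tree_root *tree, struct demo_cq *cq)
{
	radix_tree_delete(tree, cq->cqn);
	synchronize_rcu();		/* wait for all readers: the slow path */
	kfree(cq);
}

/* Spinlock variant: readers take the lock, destroy frees immediately. */
static DEFINE_SPINLOCK(cq_table_lock);

static void cq_free_spinlock(struct radix_tree_root *tree, struct demo_cq *cq)
{
	spin_lock_irq(&cq_table_lock);
	radix_tree_delete(tree, cq->cqn);
	spin_unlock_irq(&cq_table_lock);
	kfree(cq);			/* no grace period needed */
}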
| Comment by Doug Oucharek (Inactive) [ 14/Dec/15 ] |
|
I've been investigating a similar issue. Here is what I think I am seeing: two nodes have a race condition as they try to connect to each other (a reconnect, actually), and over a short time the number of reconnects to one node generates a huge number of zombies which occupy all the memory. I'm hypothesizing a two-part fix for now, given that mlx5 is not fixed as you mention above:
1- When we fail a connection (i.e. due to the race), immediately close the RDMA cmid so it cannot trigger a bunch of reconnects (sketched below).
In theory, 1 should prevent 2, but I think doing 2 is good programming to prevent any unexpected failures of this sort. Comments? |
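A rough sketch of part 1 of the fix proposed above (hypothetical helper name; the real change would live in ko2iblnd's connection-failure path). The idea is that when a connection attempt loses the connect race, the RDMA CM id is torn down immediately so it cannot keep generating CM events and zombie connections.

/*
 * Sketch only: conn_failed_close_cmid() is a hypothetical helper, not an
 * existing ko2iblnd function.
 */
#include <rdma/rdma_cm.h>

static void conn_failed_close_cmid(struct rdma_cm_id *cmid)
{
	/* Detach the connection so later CM callbacks see no stale state. */
	cmid->context = NULL;

	/* Close the queue pair, if one was created, then the cmid itself. */
	if (cmid->qp)
		rdma_destroy_qp(cmid);

	/*
	 * Note: rdma_destroy_id() must not be called from inside this
	 * cmid's own event handler; there the handler should instead
	 * return non-zero so the CM destroys the id itself.
	 */
	rdma_destroy_id(cmid);
}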
| Comment by Andreas Dilger [ 13/Oct/21 ] |
|
MOFED 2.x is no longer of interest. |