[LU-6251] Mellanox / o2ib LND causes an OOM on an OST node Created: 16/Feb/15  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

2.5.1 based Lustre code.


Severity: 3
Rank (Obsolete): 17504

 Description   

While investigating an OOM on a node, we found a large number of allocations of 532480 and 266240 bytes.

Example of a vm_struct for a memory region of size 266240:
crash> vm_struct ffff880019c542c0
struct vm_struct {
  next = 0xffff880588f29900, 
  addr = 0xffffc904a626d000, 
  size = 266240, 
  flags = 4, 
  pages = 0x0, 
  nr_pages = 0, 
  phys_addr = 0, 
  caller = 0xffffffffa00b7136 <mlx4_buf_alloc+870>
}

99% of the memory regions of size 266240 and 532480 have caller = 0xffffffffa00b7136 <mlx4_buf_alloc+870>.

The number of such regions is 31042 out of 31296.
I also found strange backtraces in the kernel:

PID: 83859  TASK: ffff8807d64ca040  CPU: 0   COMMAND: "kiblnd_connd"
 #0 [ffff8807b2835a90] schedule at ffffffff815253c0
 #1 [ffff8807b2835b58] schedule_timeout at ffffffff815262a5
 #2 [ffff8807b2835c08] wait_for_common at ffffffff81525f23
 #3 [ffff8807b2835c98] wait_for_completion at ffffffff8152603d
 #4 [ffff8807b2835ca8] synchronize_sched at ffffffff81096e88
 #5 [ffff8807b2835cf8] mlx4_cq_free at ffffffffa00bf188 [mlx4_core]
 #6 [ffff8807b2835d68] mlx4_ib_destroy_cq at ffffffffa04725f5 [mlx4_ib]
 #7 [ffff8807b2835d88] ib_destroy_cq at ffffffffa043de99 [ib_core]
 #8 [ffff8807b2835d98] kiblnd_destroy_conn at ffffffffa0acbafc [ko2iblnd]
 #9 [ffff8807b2835dd8] kiblnd_connd at ffffffffa0ad5fe1 [ko2iblnd]
#10 [ffff8807b2835ee8] kthread at ffffffff8109ac66
#11 [ffff8807b2835f48] kernel_thread at ffffffff8100c20a

So the thread is blocked on something while destroying an IB connection.
Inspecting the task:

crash> p ((struct task_struct *)0xffff8807d64ca040)->se.cfs_rq->rq->clock
$25 = 230339336880160
 crash> p ((struct task_struct *)0xffff8807d64ca040)->se.block_start
$26 = 230337329685261
 >>> (230339336880160-230337329685261)/10**9
2
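
Both values are nanosecond timestamps, so this single destroy has already been blocked for about 2 seconds. Since teardown is done serially by the one kiblnd_connd thread, every such stall delays all the destroys queued behind it. A rough sketch of that pattern (illustrative only, with made-up names, not the actual ko2iblnd source):

struct example_conn;                          /* stands in for kib_conn           */
struct example_conn *pop_next_zombie(void);   /* take one entry off the dead list */
void example_destroy_conn(struct example_conn *conn);
int shutting_down(void);
void wait_for_more_work(void);

int example_connd(void *arg)
{
        while (!shutting_down()) {
                struct example_conn *conn;

                while ((conn = pop_next_zombie()) != NULL)
                        /* each call can block for a full RCU grace period
                         * inside ib_destroy_cq() -> mlx4_cq_free() */
                        example_destroy_conn(conn);

                wait_for_more_work();
        }
        return 0;
}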

More interesting, though, is the o2iblnd statistic I found:

crash> kib_net 0xffff8808325e9dc0
struct kib_net {
  ibn_list = {
    next = 0xffff8807b40a2f40, 
    prev = 0xffff8807b40a2f40
  }, 
  ibn_incarnation = 1423478059211439, 
  ibn_init = 2, 
  ibn_shutdown = 0, 
  ibn_npeers = {
    counter = 31042
  }, 
  ibn_nconns = {
    counter = 31041
  },

So 31k peers, yet the tests run on a cluster with 14 real clients and 5 server nodes, so no more than about 20 connections should exist.
Where are they all, then?

crash> p &kiblnd_data.kib_connd_zombies
$7 = (struct list_head *) 0xffffffffa0ae7e70 <kiblnd_data+112>
crash> list -H 0xffffffffa0ae7e70 -o kib_conn.ibc_list | wc -l
31030

So all the memory is consumed by zombie connections, each of which takes more than 2 s to destroy.
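
A back-of-envelope check ties the two observations together. Assuming one 266240-byte and one 532480-byte mlx4 buffer per zombie connection (which is what the vm_struct counts above suggest, so the numbers are only rough):

#include <stdio.h>

int main(void)
{
        long zombies  = 31030;              /* entries on kib_connd_zombies  */
        long per_conn = 266240L + 532480L;  /* vmalloc'd mlx4 buffers, bytes */
        double secs   = 2.0;                /* observed time per destroy     */

        printf("vmalloc held by zombies: ~%.1f GiB\n",
               (double)zombies * per_conn / (1024.0 * 1024 * 1024));
        printf("time to drain the list serially: ~%.1f hours\n",
               zombies * secs / 3600.0);
        return 0;
}

That is roughly 23 GiB tied up in zombie buffers and about 17 hours to drain the backlog at 2 s per connection, consistent with the node running out of memory long before the list ever empties.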



 Comments   
Comment by Sergey Cheremencev [ 16/Feb/15 ]

mlnx version: mlnx-ofa_kernel-2.3

Comment by Bruno Travouillon (Inactive) [ 21/Aug/15 ]

Alexey, Sergey,

Have you been able to troubleshoot this issue? We are hitting a similar issue with 2.5.3.90 and OFED 3.12.

Comment by Alexey Lyashkov [ 24/Aug/15 ]

Bruno,

Not yet; we have hit this issue only once. Liang created the patch http://review.whamcloud.com/#/c/14600/ which may help with it, but it does not fix the original bug it was created for. So if you can reproduce the issue, try that patch and let us know whether it helps.

Comment by Alexey Lyashkov [ 25/Aug/15 ]

I had a discussion with the IB guys today. They say LNet has a bug in its handling of IB connect errors, as it lacks something like
http://sourceforge.net/p/scst/svn/HEAD/tree/trunk/iscsi-scst/kernel/isert-scst/iser_rdma.c
line 782

It may leave the connection in a wrong state if some packets are lost during the connect handshake.

Comment by Bruno Travouillon (Inactive) [ 27/Aug/15 ]

Thanks Alexey. We still need to reproduce the issue on a test cluster; I will test with the patch afterwards.

However, it will only avoid the memory pressure and won't solve the underlying issue with the zombie connections.

Comment by Alexey Lyashkov [ 27/Aug/15 ]

From my point of view, the zombies are the result of handshake packet loss, so if we fix that problem we will fix the zombies.

Comment by Wesley Yu (Inactive) [ 11/Sep/15 ]

Alexey, Bruno,
We also encountered this OOM recently; memory was exhausted by mlx4_buf_alloc.
Do you know how the issue is triggered, and is there anything we can do to help?

Comment by Bruno Travouillon (Inactive) [ 30/Sep/15 ]

Wesley,

We are still investigating. I should try http://review.whamcloud.com/#/c/14600/ soon.

Comment by Doug Oucharek (Inactive) [ 26/Nov/15 ]

Hi Alexey, has there been any progress on this issue? We are seeing more cases of this OOM with mlx5. I'm starting to suspect that mlx5 is even more aggressive in memory usage, making this problem even worse with newer Mellanox cards. Patch http://review.whamcloud.com/#/c/14600/ does not seem to be enough of a break to allow memory to be released.

Comment by Sergey Cheremencev [ 14/Dec/15 ]

The issue is fixed at Seagate after updating mlnx-ofa_kernel from 2.3 to 3.1-1.0.3.
In short, the problem is caused by an internal mlx4 IB driver error.
This error causes the creation of a huge number of zombie connections (about 300,000).
These connections consume all the memory on the server.
Below is an example of the error symptoms:

[root@windu-head ~]# pdsh -S -w windu-client[02-11] perfquery | grep PortXmitWait
windu-client02: PortXmitWait:....................0
windu-client07: PortXmitWait:....................0
windu-client03: PortXmitWait:....................0
windu-client10: PortXmitWait:....................0
windu-client06: PortXmitWait:....................0
windu-client11: PortXmitWait:....................298906
windu-client08: PortXmitWait:....................0
windu-client04: PortXmitWait:....................0
windu-client09: PortXmitWait:....................0
windu-client05: PortXmitWait:....................0
# ibqueryerrors
Errors for "winduoem mlx4_0"
   GUID 0x2c903004cb42d port 1: [LinkDownedCounter == 16] [PortXmitDiscards == 32]
Errors for "windu02 mlx4_0"
   GUID 0x1e67030066eb15 port 1: [PortXmitWait == 1]
Errors for "windu07 mlx4_0"
   GUID 0x1e67030066ed9d port 1: [PortXmitWait == 1]
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2590ffffdfac0d port 1: [PortXmitWait == 298906]
Errors for 0xf452140300838740 "SwitchX -  Mellanox Technologies"
   GUID 0xf452140300838740 port ALL: [LinkDownedCounter == 255] [PortRcvSwitchRelayErrors == 4951] [PortXmitWait == 4294967295]
   GUID 0xf452140300838740 port 1: [LinkDownedCounter == 1] [PortRcvSwitchRelayErrors == 1]
   GUID 0xf452140300838740 port 4: [LinkDownedCounter == 38] [PortRcvSwitchRelayErrors == 485] [PortXmitWait == 25194846]
   GUID 0xf452140300838740 port 6: [LinkDownedCounter == 75] [PortXmitWait == 2742]
   GUID 0xf452140300838740 port 11: [LinkDownedCounter == 98] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 4484903]
   GUID 0xf452140300838740 port 12: [LinkDownedCounter == 96] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 95352145]
   GUID 0xf452140300838740 port 18: [PortXmitWait == 3689472946]
   GUID 0xf452140300838740 port 20: [LinkDownedCounter == 27]
   GUID 0xf452140300838740 port 22: [LinkDownedCounter == 204] [PortXmitWait == 206951]
   GUID 0xf452140300838740 port 26: [LinkDownedCounter == 42] [PortRcvSwitchRelayErrors == 1009] [PortXmitWait == 7327592]
   GUID 0xf452140300838740 port 28: [LinkDownedCounter == 42] [PortRcvSwitchRelayErrors == 1006] [PortXmitWait == 7555602]
   GUID 0xf452140300838740 port 30: [LinkDownedCounter == 41] [PortRcvSwitchRelayErrors == 1000] [PortXmitWait == 6546015]
   GUID 0xf452140300838740 port 32: [LinkDownedCounter == 194]
   GUID 0xf452140300838740 port 34: [LinkDownedCounter == 36] [PortRcvSwitchRelayErrors == 746] [PortXmitWait == 6957849]
   GUID 0xf452140300838740 port 36: [LinkDownedCounter == 184] [PortXmitWait == 4294967295]
Errors for "windu08 mlx4_0"
   GUID 0x1e670300670add port 1: [PortXmitWait == 1]
Errors for 0xf4521403008386c0 "SwitchX -  Mellanox Technologies"
   GUID 0xf4521403008386c0 port ALL: [LinkDownedCounter == 255] [PortRcvSwitchRelayErrors == 5742] [PortXmitWait == 4294967295]
   GUID 0xf4521403008386c0 port 6: [LinkDownedCounter == 26] [PortXmitWait == 2687]
   GUID 0xf4521403008386c0 port 8: [LinkDownedCounter == 176] [PortXmitWait == 1849083011]
   GUID 0xf4521403008386c0 port 10: [LinkDownedCounter == 71] [PortXmitWait == 3901]
   GUID 0xf4521403008386c0 port 11: [LinkDownedCounter == 94] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 14950798]
   GUID 0xf4521403008386c0 port 12: [LinkDownedCounter == 98] [PortRcvSwitchRelayErrors == 352] [PortXmitWait == 5555905]
   GUID 0xf4521403008386c0 port 13: [LinkDownedCounter == 38] [PortRcvSwitchRelayErrors == 1007] [PortXmitWait == 7829732]
   GUID 0xf4521403008386c0 port 14: [LinkDownedCounter == 38] [PortRcvSwitchRelayErrors == 1014] [PortXmitWait == 6837648]
   GUID 0xf4521403008386c0 port 15: [LinkDownedCounter == 39] [PortRcvSwitchRelayErrors == 1006] [PortXmitWait == 7558563]
   GUID 0xf4521403008386c0 port 16: [LinkDownedCounter == 42] [PortRcvSwitchRelayErrors == 1004] [PortXmitWait == 6831755]
   GUID 0xf4521403008386c0 port 17: [LinkDownedCounter == 39] [PortRcvSwitchRelayErrors == 1006] [PortXmitWait == 7336329]
   GUID 0xf4521403008386c0 port 20: [PortXmitWait == 3261894682]
   GUID 0xf4521403008386c0 port 30: [LinkDownedCounter == 10] [PortRcvSwitchRelayErrors == 1]
## Summary: 26 nodes checked, 7 bad nodes found
##          96 ports checked, 31 ports have errors beyond threshold
## Thresholds: 
## Suppressed:

On the other hand, each connection should be destroyed and freed much faster.
According to my investigation, kiblnd_connd spends about 2 seconds destroying each connection! (See the description of this ticket, LU-6251.)
A possible reason is the RCU mechanism used in mlx4_cq_free.
The RCU locking in mlx4_cq_free is replaced by spin locks in newer MLNX drivers.
I found a discussion of a similar problem at http://permalink.gmane.org/gmane.linux.drivers.rdma/22243.
There is a patch there to solve the issue, but it cannot be applied to 2.3 or to 3.1.
In any case, changing the RCU locking to spin locking, as is done in 3.1, is enough here.
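
To illustrate the difference (a schematic sketch with made-up names, not the actual mlx4_core source): with RCU-protected CQ lookup, the free path has to wait out a grace period before the CQ buffers can be released, which is the synchronize_sched/wait_for_completion seen in the backtrace above, while a spinlock-protected table can delete the entry and free immediately:

#include <linux/spinlock.h>
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

/* Hypothetical table/CQ types standing in for the mlx4 ones. */
struct demo_cq_table {
        spinlock_t lock;
        struct radix_tree_root tree;
};

struct demo_cq {
        int cqn;
};

void demo_free_cq_buf(struct demo_cq *cq);

/* Old scheme (mlnx-ofa_kernel 2.3 era): lookups run under RCU, so the
 * free path must wait for a full grace period before the buffers can
 * be released -- roughly the stall the connd thread is stuck in. */
void demo_cq_free_rcu(struct demo_cq_table *tbl, struct demo_cq *cq)
{
        spin_lock_irq(&tbl->lock);
        radix_tree_delete(&tbl->tree, cq->cqn);
        spin_unlock_irq(&tbl->lock);

        synchronize_rcu();              /* grace-period wait per CQ */
        demo_free_cq_buf(cq);
}

/* Newer scheme (3.1, mlx4 only): lookups take the spinlock as well, so
 * once the entry is deleted under the lock no reader can still hold
 * the CQ, and the buffers can be freed right away. */
void demo_cq_free_spinlock(struct demo_cq_table *tbl, struct demo_cq *cq)
{
        spin_lock_irq(&tbl->lock);
        radix_tree_delete(&tbl->tree, cq->cqn);
        spin_unlock_irq(&tbl->lock);

        demo_free_cq_buf(cq);           /* no grace-period wait */
}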

We could reproduce the problem on the latest master with kernel 2.6.32-431.17.1 and the default mlnx-ofa_kernel-2.3.
After updating mlnx-ofa_kernel from 2.3 to 3.1 the problem is no longer seen.

I also want to point out that, for unknown reasons, mlnx-ofa_kernel 3.1-1.0.3 has the needed change from RCU to spin locking only for mlx4.
mlx5 is still not fixed in 3.1-1.0.3!

Comment by Doug Oucharek (Inactive) [ 14/Dec/15 ]

I've been investigating a similar issue. Here is what I think I am seeing:

1- Two nodes have a race condition as they try to connect to each other (a reconnect actually).
2- Liang's patch http://review.whamcloud.com/#/c/14600/ has the lower NID side back off so the higher NID can reconnect successfully.
3- The connection cleanup is delayed on the passive side due to 14600.
4- Leaving the connection around in a closing state means the RDMA queue and CQ are both still in play.
5- All the Rx buffers begin to fail due to IB_WC_WR_FLUSH_ERROR (Alexey refers to this in his link above).
6- At the same time as the Rx buffers are failing on the CQ, we are seeing stale connection failures on the RDMA queue (they are matching up with the Rx buffer failures).
7- Because of the logic in the code, we do a reconnect for each RDMA queue stale-connection failure. Since there are many Rx buffers, we end up with many reconnects occurring at the same time, and these are themselves failing (not sure why), which triggers a new batch of escalating reconnects.

Over a short time, the number of reconnects to one node is generating a huge number of zombies which are occupying all the memory.

I'm hypothesizing a two-part fix for now, given that mlx5 is not fixed, as you mention above:

1- When we fail a connection (i.e. due to race), immediately close the RDMA queue (cmid) so it cannot trigger a bunch of reconnects.
2- Check the connecting counter for the peer and only allow one reconnect to be in flight at any given moment.

In theory, 1 should prevent the situation that 2 guards against, but I think doing 2 is good defensive programming against any unexpected failures of this sort. A minimal sketch of what I mean by 2 follows.
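
This is only a sketch of the guard idea with hypothetical names, not a patch against ko2iblnd: a per-peer flag ensures that only one reconnect can be scheduled at a time, no matter how many Rx flush errors or stale-connection events arrive for that peer.

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical per-peer state standing in for kib_peer. */
struct demo_peer {
        atomic_t reconnecting;  /* 0 = idle, 1 = a reconnect is in flight */
};

void demo_schedule_reconnect_work(struct demo_peer *peer);

/* Called for every event that would normally trigger a reconnect
 * (Rx flush error, stale-connection reject, ...).  Only the first
 * caller actually schedules one; the rest are ignored. */
static bool demo_try_schedule_reconnect(struct demo_peer *peer)
{
        if (atomic_cmpxchg(&peer->reconnecting, 0, 1) != 0)
                return false;   /* a reconnect is already in flight */

        demo_schedule_reconnect_work(peer);
        return true;
}

/* Called when the reconnect attempt completes, whether it succeeded
 * or failed, so the next attempt can be scheduled if still needed. */
static void demo_reconnect_done(struct demo_peer *peer)
{
        atomic_set(&peer->reconnecting, 0);
}

Combined with 1 (destroying the cmid as soon as the connection is failed), this would bound the number of zombies a single misbehaving peer can generate.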

Comments?

Comment by Andreas Dilger [ 13/Oct/21 ]

MOFED 2.x is no longer of interest.
