[LU-7054] ib_cm scaling issue when lustre clients connect to OSS Created: 28/Aug/15  Updated: 12/May/16  Resolved: 27/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

OFED3.5, MOFED241, and MOFED3.5


Attachments: PDF File load.pdf     Text File lustre-log.1445147654.68807.gz     Text File lustre-log.1445147717.68744.gz     Text File lustre-log.1445147754.68673.gz     File nbp8-os11.var.log.messages.oct.17.gz     PDF File opensfs-HLDForSMPnodeaffinity-060415-1623-4.pdf     PDF File read.pdf     Text File service104.+net+malloc.gz     File service115.+net.gz     File trace.ib_cm_1rack.out.gz     PDF File write.pdf    
Issue Links:
Related
is related to LU-7290 lock callback not getting to client Resolved
is related to LU-7676 OSS Servers stuck in connecting/disco... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When a large number of lustre clients (>3000) try to connect to an OSS/MDS at the same time, the ib_cm threads on the OSS/MDS are unable to service the incoming connections in time. Using ibdump we have seen server replies taking 30 seconds; by that time the clients have timed out the request and are retrying, which results in even more work for ib_cm.

ib_cm is never able to catch up and usually requires a reboot of the server. Sometimes we have been able to recover by ifdown-ing the IB interface, to give ib_cm time to 'catch up', and then ifup-ing the interface.

Most of the threads will be in 'D' state; here is an example stack trace:

0xffff88062f3c0aa0     1655        2  0    1   D  0xffff88062f3c1140  ib_cm/1
sp                ip                Function (args)
0xffff880627237a90 0xffffffff81559b50 thread_return
0xffff880627237b58 0xffffffff8155b30e __mutex_lock_slowpath+0x13e (0xffff88062f76d260)
0xffff880627237bc8 0xffffffff8155b1ab mutex_lock+0x2b (0xffff88062f76d260)
0xffff880627237be8 0xffffffffa043f23e [rdma_cm]cma_disable_callback+0x2e (0xffff88062f76d000, unknown)
0xffff880627237c18 0xffffffffa044440f [rdma_cm]cma_req_handler+0x8f (0xffff880365eec200, 0xffff880494844698)
0xffff880627237d28 0xffffffffa0393e37 [ib_cm]cm_process_work+0x27 (0xffff880365eec200, 0xffff880494844600)
0xffff880627237d78 0xffffffffa0394aaa [ib_cm]cm_req_handler+0x6ba (0xffff880494844600)
0xffff880627237de8 0xffffffffa0395735 [ib_cm]cm_work_handler+0x145 (0xffff880494844600)
0xffff880627237e38 0xffffffff81093f30 worker_thread+0x170 (0xffffe8ffffc431c0)
0xffff880627237ee8 0xffffffff8109a106 kthread+0x96 (0xffff880627ae5da8)
0xffff880627237f48 0xffffffff8100c20a child_rip+0xa (unknown, unknown)

Using systemtap I was able to get a trace of ib_cm; it shows that a great deal of time is spent in spin_lock_irq. See the attached file.



 Comments   
Comment by James A Simmons [ 28/Aug/15 ]

Can you try this patch http://review.whamcloud.com/#/c/14600 to see if it helps.

Comment by Joseph Gmitter (Inactive) [ 28/Aug/15 ]

Amir,
Can you investigate this issue to see if it is related to the patch for LU-5718?
Thanks.
Joe

Comment by Amir Shehata (Inactive) [ 03/Sep/15 ]

This does look similar to LU-5718. Have you tried the patch indicated above?
Is there a way to install multiple IB cards on the MDS/OSS, set them up on different LNet networks, and try to spread the traffic between the different IB cards? That'll spread out the load.

Comment by Mahmoud Hanafi [ 12/Sep/15 ]

We have tried the patch from LU-5718 without much help. We also upgraded to MOFED 3.0 on both client and server.

Still getting the high load on the OSS when clients connect. We have been using a trick of ifdown-ing ib1, waiting for the load to drop, and then ifup-ing the interface. This works sometimes, but we have started to hit this LBUG:

LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) ASSERTION( peer->ibp_connecting == 0 ) failed: 
LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) LBUG
Pid: 6521, comm: kiblnd_connd

Call Trace:

Entering kdb (current=0xffff880b1f6faaa0, pid 6521) on processor 3 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[3]kdb> [-- dtalcott@localhost attached -- Fri Sep 11 22:18:02 2015]
set LINES 50000
[3]kdb> btc
btc: cpu status: Currently on cpu 3
Available cpus: 0-2(I), 3, 4-6(I), 7
Stack traceback for pid 0
0xffffffff81a2d020        0        0  1    0   I  0xffffffff81a2d6c0  swapper
sp                ip                Function (args)
0xffffffff81a01e28 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffffffff81a01fd8, 0x0, 0x0, 0x0, 0x7fffffffffffffff)
0xffffffff81a01e88 0xffffffff81016757 mwait_idle+0x77
0xffffffff81a01ed0 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 0
0xffff880c2da5a040        0        0  1    1   I  0xffff880c2da5a6e0  swapper
sp                ip                Function (args)
0xffff880c2da87e58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2da87fd8, 0x0, 0x0, 0x0, 0xffff880028251608)
0xffff880c2da87eb8 0xffffffff81016757 mwait_idle+0x77
0xffff880c2da87f00 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 0
0xffff880c2dab7500        0        0  1    2   I  0xffff880c2dab7ba0  swapper
sp                ip                Function (args)
0xffff880c2dab9e58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2dab9fd8, 0x0, 0x0, 0x0, 0x7fffffffffffffff)
0xffff880c2dab9eb8 0xffffffff81016757 mwait_idle+0x77
0xffff880c2dab9f00 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 6521
0xffff880b1f6faaa0     6521        2  1    3   R  0xffff880b1f6fb140 *kiblnd_connd
kdba_bt_stack: null regs - should never happen
Process did not save state, cannot backtrace
0xffff880b1f6faaa0     6521        2  1    3   R  0xffff880b1f6fb140 *kiblnd_connd
Stack traceback for pid 0
0xffff880c2db27500        0        0  1    4   I  0xffff880c2db27ba0  swapper
sp                ip                Function (args)
0xffff880c2db29e58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2db29fd8, 0x0, 0x0, 0x0, 0x7fffffffffffffff)
0xffff880c2db29eb8 0xffffffff81016757 mwait_idle+0x77
0xffff880c2db29f00 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 0
0xffff880c2db5d500        0        0  1    5   I  0xffff880c2db5dba0  swapper
sp                ip                Function (args)
0xffff880c2db5fe58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2db5ffd8, 0x0, 0x0, 0x0, 0xffff880028351608)
Stack traceback for pid 6521
0xffff880b1f6faaa0     6521        2  1    3   R  0xffff880b1f6fb140 *kiblnd_connd
kdba_bt_stack: null regs - should never happen
Process did not save state, cannot backtrace
0xffff880b1f6faaa0     6521        2  1    3   R  0xffff880b1f6fb140 *kiblnd_connd
Stack traceback for pid 0
0xffff880c2db27500        0        0  1    4   I  0xffff880c2db27ba0  swapper
sp                ip                Function (args)
0xffff880c2db29e58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2db29fd8, 0x0, 0x0, 0x0, 0x7fffffffffffffff)
0xffff880c2db29eb8 0xffffffff81016757 mwait_idle+0x77
0xffff880c2db29f00 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 0
0xffff880c2db5d500        0        0  1    5   I  0xffff880c2db5dba0  swapper
sp                ip                Function (args)
0xffff880c2db5fe58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2db5ffd8, 0x0, 0x0, 0x0, 0xffff880028351608)
0xffff880c2db5feb8 0xffffffff81016757 mwait_idle+0x77
0xffff880c2db5ff00 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 0
0xffff880c2db97500        0        0  1    6   I  0xffff880c2db97ba0  swapper
sp                ip                Function (args)
0xffff880c2db99e58 0xffffffff8100c453 kdb_interrupt+0x13 (0xffffffff81d3f228, 0xffff880c2db99fd8, 0x0, 0x0, 0x0, 0x0)
0xffff880c2db99eb8 0xffffffff81016757 mwait_idle+0x77
0xffff880c2db99f00 0xffffffff81009fd6 cpu_idle+0xb6
Stack traceback for pid 2638
0xffff880c1b91f540     2638     2637  1    7   R  0xffff880c1b91fbe0  syslog-ng
sp                ip                Function (args)
0xffff880b65e35c18 0xffffffff8100c453 kdb_interrupt+0x13 (0x246, 0x2c9ec, 0xffffffff, 0x0, 0xffffffff81ba95c0, 0x0)
0xffff880b65e35c78 0xffffffff81071643 release_console_sem+0x53
0xffff880b65e35ce0 0xffffffff8136d003 do_con_write+0x8a3 (0xffff880c1bb29800, 0xffff880b6fbd3800, unknown)
0xffff880b65e35dd0 0xffffffff8136e83e con_write+0x1e (0xffff880c1bb29800)
0xffff880b65e35df0 0xffffffff8135ac70 n_tty_write+0x1c0 (0xffff880c1bb29800, 0xffff880c2585c980, 0xffff880b6fbd3800, unknown)
0xffff880b65e35e80 0xffffffff81357b61 tty_write+0x1b1 (0xffff880c2585c980, unknown, unknown)
0xffff880b65e35ef0 0xffffffff81185aa8 vfs_write+0xb8 (0xffff880c2585c980, 0x86c340, unknown, 0xffff880b65e35f48)
0xffff880b65e35f30 0xffffffff81186471 sys_write+0x51 (unknown, 0x86c340, 0x40)
bb_sanity_check: Expected rsp, got osp-0x50
Comment by Mahmoud Hanafi [ 12/Sep/15 ]

The above crash also had 100s of ll_ost* threads spinning at:

<4> [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
<4> [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
<4> [<ffffffffa0a3d5f4>] kiblnd_pool_alloc_node+0x204/0x290 [ko2iblnd]
<4> [<ffffffffa0a48189>] kiblnd_get_idle_tx+0x29/0x2c0 [ko2iblnd]
<4> [<ffffffffa0a4b7e5>] kiblnd_check_sends+0x425/0x610 [ko2iblnd]
<4> [<ffffffffa0a4ddfe>] kiblnd_post_rx+0x15e/0x3b0 [ko2iblnd]
<4> [<ffffffffa0a4e166>] kiblnd_recv+0x116/0x560 [ko2iblnd]
<4> [<ffffffffa064cd8f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
<4> [<ffffffffa064eeeb>] lnet_ni_recv+0xbb/0x320 [lnet]
<4> [<ffffffffa064f1d3>] lnet_recv_put+0x83/0xb0 [lnet]
<4> [<ffffffffa064f32a>] lnet_recv_delayed_msg_list+0x12a/0x210 [lnet]
<4> [<ffffffffa064b1e7>] LNetMDAttach+0x427/0x5a0 [lnet]
<4> [<ffffffffa087a8bc>] ptlrpc_register_rqbd+0x10c/0x390 [ptlrpc]
<4> [<ffffffffa0888845>] ptlrpc_server_post_idle_rqbds+0x75/0xe0 [ptlrpc]
<4> [<ffffffffa08918c1>] ptlrpc_main+0xb21/0x1780 [ptlrpc]
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffffa0890da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

I uploaded a lustre debug file to ftp:/uploads/LU7054/lustre-log.1442024446.8745.txt.gz

Is this related to http://review.whamcloud.com/#/c/12852/?

Comment by Amir Shehata (Inactive) [ 15/Sep/15 ]

The crash might be due to LU-5718. Is it possible to back it out? There is probably a change of behavior there that's causing the assert. I'll investigate further.

In the meantime, the original issue is that when more than 3000 clients connect to a lustre OSS/MDS, the load increases beyond what the server can handle.

I would also like to clarify: are all 3000 clients connected to the same OSS/MDS? Is it possible to commission multiple nodes and spread the traffic?

Comment by Bob Ciotti [ 15/Sep/15 ]

The OSSes seem to get stuck here:

00000800:00000200:0.0:1442094541.632085:0:15400:0:(o2iblnd.c:1890:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting for her to complete
00000800:00000200:7.0:1442094541.632086:0:15290:0:(o2iblnd.c:1890:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting for her to complete
00000800:00000200:4.0:1442094541.632087:0:15280:0:(o2iblnd.c:1890:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting for her to complete
00000800:00000200:3.0:1442094541.632087:0:15410:0:(o2iblnd.c:1890:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting for her to complete
00000800:00000200:6.0:1442094541.632088:0:15346:0:(o2iblnd.c:1890:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting for her to complete

(10's of millions of these messages)

From o2iblnd.c: the threads are spinning and waiting for ps->ps_increasing to become 0.
--------------------------------------------------------------------
	if (ps->ps_increasing) {
		/* another thread is allocating a new pool */
		spin_unlock(&ps->ps_lock);
		CDEBUG(D_NET, "Another thread is allocating new "
		       "%s pool, waiting for her to complete\n", ps->ps_name);
		schedule();
		goto again;
	}

	if (cfs_time_before(cfs_time_current(), ps->ps_next_retry)) {
		/* someone failed recently */
		spin_unlock(&ps->ps_lock);
		return NULL;
	}

	ps->ps_increasing = 1;
	spin_unlock(&ps->ps_lock);

	CDEBUG(D_NET, "%s pool exhausted, allocate new pool\n", ps->ps_name);

	rc = ps->ps_pool_create(ps, ps->ps_pool_size, &pool);
--------------------------------------------------------------------

and so either ps_pool_create() doesn't return quickly enough, or:

--------------------------------------------------------------------
if (cfs_time_before(cfs_time_current(), ps->ps_next_retry)) {
/* someone failed recently */
--------------------------------------------------------------------

is always true.

Note that we typically do not see the message:

CDEBUG(D_NET, "%s pool exhausted, allocate new pool\n", ps->ps_name)

on the servers that are hanging, so this may be a limitation of the kernel debug buffer, or the thread dispatched to allocate the pool may be stuck.

I have lctl set_param debug=+net and lctl set_param debug=+malloc output from all servers, but cannot attach it to the case because the files are >150MB compressed.

One thing to note is that this seems to happen more frequently on our older 'harpertown' systems. They have older ConnectX-2 MT26428 HCAs. The harpertowns are the last FSB Intel chips, with memory controllers in the 'Seaberg' MCH northbridge. As such, certain types of locking operations may be very slow/sluggish, causing this issue to happen more frequently on these systems. I suppose it's possible that forward progress is happening, but so slowly at this point that it will never complete. This is an 8-processor system and there are ~300 threads in 'R' state. All these threads are hammering spin locks in this routine to interrogate the state of ps->ps_increasing.

At this point we ifconfig down the ib1 interface, and this will eventually bring the system back to idle after several minutes. Our trick is simply to wait until the load average is < 1. At that point, we ifconfig up the ib1 interface, and most often the system will then recover. If the system does recover, there will not be an excessive number of the CDEBUG(D_NET, "Another thread is allocating new TX pool, waiting for her to complete\n") messages, but there may be a few of them and a few of the CDEBUG(D_NET, "%s pool exhausted, allocate new pool\n", ps->ps_name) messages.

We have also recently modified the rdma_cm so that the timeouts within ib_cm are now much longer. I believe this has helped since Mahmoud initially reported this problem, by reducing the number of times that the rdma_cm hits its internal timeout and returns a timeout error. An rdma_cm timeout must then bubble up into Lustre and maybe into the pool allocation mechanism, causing some sort of thrash in that code.
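For reference only, a hedged illustration of the kind of knobs involved (this is not NASA's actual modification): the stock constants in drivers/infiniband/core/cma.c are handed to ib_cm when a CM REQ is built, and the response timeout is an IB-spec exponent, where a value t means 4.096 us * 2^t per attempt.

/*
 * Illustrative stock values from rdma_cm (drivers/infiniband/core/cma.c),
 * not the NASA patch.  Raising CMA_CM_RESPONSE_TIMEOUT from 20 (~4.3 s per
 * attempt) to, say, 23 (~34 s) gives a heavily loaded server much longer
 * to answer a CM REQ before the peer's CM layer retries.
 */
#define CMA_CM_RESPONSE_TIMEOUT 20	/* exponent: 4.096 us * 2^20 ~= 4.3 s */
#define CMA_MAX_CM_RETRIES      15	/* CM REQ retries before timing out */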

Comment by Bob Ciotti [ 15/Sep/15 ]

To answer the earlier question from Amir: we have ~12,000 clients all connecting directly to each OSS/MDS in the system. There are another 1000 clients connecting via routers. We want it to work that way. We could consider alternatives, but that would require some changes/development and is not the preferred alternative.

Also, I wrote a lossy compressor for the kernel logs and was able to get between 40:1 and 200:1 compression, so I can now upload some of the kernel debug files. It at least gives you an idea of what's happening.

The uploaded files are from two different times. service104.+net+malloc was earlier. We could have waited longer for things to degrade on this OSS - it was at about +200 load average, but when it gets to this point there is typically no recovering. The service115.+net file only had +net kernel debug for that OSS, but was a little further along; it had about 300 ll_ost threads in 'R'.

You can search through the files for 'Error', 'error', 'failed', etc. There are often large numbers of kiblnd_cm_callback()) 10.151.53.146@o2ib: UNREACHABLE -110 messages, just not in these traces, so the involvement of this cm callback is unclear.

BTW, the compressor simply tells you how many lines are repeats of the previous line once you remove any numbers [0-9] from them, printing only the first line in the series:

gawk '
{
	if (not_first_line) {
		CLINE = $0
		gsub(/[0-9]/, "", $0)
		CLINE_DNUM = $0

		if (CLINE_DNUM == LLINE_DNUM) {
			repeated++
			if (repeated == 1)
				printf("%s\n", LLINE)
		} else {
			if (repeated > 1) {
				TLINE_DNUM = LLINE
				gsub(/[0-9]/, " ", TLINE_DNUM)
				printf("%s\n%d similar lines repeated (de-num)\n",
				       TLINE_DNUM, repeated)
			} else {
				if (repeated <= 1)
					printf("%s\n", LLINE)
			}
			repeated = 0
		}
	}
	not_first_line = 1
	LLINE = CLINE
	LLINE_DNUM = CLINE_DNUM
}
END {
	if (repeated > 1) {
		TLINE_DNUM = LLINE
		gsub(/[0-9]/, " ", TLINE_DNUM)
		printf("%s\n%d similar lines repeated (de-num)\n", TLINE_DNUM, repeated)
	} else {
		if (repeated <= 1)
			printf("%s\n", LLINE)
	}
}'

Comment by Mahmoud Hanafi [ 16/Sep/15 ]

We had a node hit this LBUG again. Please identify the root cause of this LBUG.

LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) ASSERTION( peer->ibp_connecting == 0 ) failed: 
LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) LBUG
Pid: 6521, comm: kiblnd_connd
Comment by Bob Ciotti [ 16/Sep/15 ]

FWIW, I think this LBUG may be separate from the scaling issue. Possibly the scaling performance issue degrades to the point of eventually generating this LBUG?

Comment by Gerrit Updater [ 16/Sep/15 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/16454
Subject: LU-7054 o2iblnd: less intense allocating retry
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 3aa4bc3cf0e15ae8e4775e30a426d582b591e8e5

Comment by Bob Ciotti [ 16/Sep/15 ]

Nice! Thank you.
I might cap the retry at some large number - like 30 seconds, or 60 seconds... thoughts?

if (interval < cfs_time_seconds(1) && interval < 32)
interval *= 2;

It would also be nice to get a count of the number of times the thread passed through the goto section looking for ps->ps_increasing to be cleared.

Comment by Amir Shehata (Inactive) [ 16/Sep/15 ]

From the latest logs attached I see:

(o2iblnd_cb.c:1011:kiblnd_tx_complete()) Tx -> 10.151.37.216@o2ib cookie 0x277097 sending 1 waiting 0: failed 12
(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.151.37.216@o2ib: error -5(waiting)
(o2iblnd_cb.c:912:kiblnd_post_tx_locked()) Error -103 posting transmit to 10.151.37.216@o2ib

The connection has already been torn down because of a failure to transmit (reason for failure: IB_WC_RETRY_EXC_ERR), causing the -103 error when another transmit is attempted on a connection that has been closed or is closing.

I'm investigating whether this could lead to retries, which cause the issue to escalate to the point we see, where the pools are exhausted and all threads are stuck in the spinlock.

Comment by Bob Ciotti [ 17/Sep/15 ]

The other thing to consider here is profiling the OSS to see where all the time is being spent. We believe that a thread gets stuck down in rc = ps->ps_pool_create(ps, ps->ps_pool_size, &pool);, but where? That might be useful to do in the upcoming testing that sets NTX to some small value.

Also, the suspicion is that memory registration/qp deallocation may be contending over something.

Comment by Amir Shehata (Inactive) [ 17/Sep/15 ]

To summarize the conclusions from the discussion today:

1. NASA to try and bump the ntx tunable to 2048 to allocate more tx pools
2. Intel to update the patch to record how many times we wait for longer than 30 seconds.
3. NASA to determine the impact of ntx tunable separately from the patch for this LU.
4. Intel still needs to investigate reason for LBUG.
5. Intel to try internal testing with the ntx pushed to 1 to see if we can recreate some of the issues NASA is seeing.

For more details on the memory impact of increasing ntx, please take a look at kiblnd_create_tx_pool(). The size == ntx/ncpts.

I would also suggest looking at the reason we're seeing IB_WC_RETRY_EXC_ERR on transmit; it could point us to another aspect of the problem.

I looked at the code to find out how the cpts are assigned. Here is my preliminary understanding.

When LNet comes up it figures out the number of CPTs through kernel calls. Then, when the pools are created, we iterate through that number of CPTs and allocate pools per CPT. On outgoing messages the CPT is determined from the NID using lnet_cpt_of_nid(), which I think is what's of interest for us here, since we determine the pool set to look through by matching its CPT with the CPT of the NID. So if we ever transmit to the same NID again, we keep depleting the same pool.
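A minimal sketch of that selection path (assumed and simplified, not the verbatim Lustre source; kiblnd_get_idle_tx() is the real entry point): the peer NID picks the CPT, and the CPT picks the per-CPT tx pool set, so repeated sends to the same NID keep drawing from, and depleting, the same pool.

static void *sketch_pick_tx(kib_net_t *net, lnet_nid_t target)
{
	int		 cpt = lnet_cpt_of_nid(target);		/* NID -> CPT */
	kib_poolset_t	*ps  = &net->ibn_tx_ps[cpt]->tps_poolset; /* CPT -> pool set */

	/* may spin/back off here when the per-CPT pool is exhausted */
	return kiblnd_pool_alloc_node(ps);
}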

From what we discussed, we reached the conclusion that the high load depletes the pools, which causes us to allocate more and do RDMA mapping for the newly allocated pools; these are expensive operations and can take a while. We're thinking of mitigating this by increasing the number of tx in the pool to avoid having to expand it frequently. And when we do expand it, we will reduce the number of times we go back and check for a free node to grab.

One thing to note, however, is that in the time we wait, other threads could come in and grab the free node, so I wonder if that would cause some threads to starve, or to wait for a long time before getting a node, under heavy load.

If I missed something please let me know.

Comment by Bob Ciotti [ 17/Sep/15 ]

Looks good.
I would add that there are two areas where pool allocations may be under allocation pressure.

1) Long round trip times, system operating normally. My understanding at this point is that the active transmit pool size correlates to the number of incomplete/active messages. During the time when we bring an interface up, there will be 13,000 clients all wanting to establish an RC connection, so naturally these will queue to the responsible thread, generating a need of TX == 1625/cpt. Recommend we establish a relationship between the NTX value and the client count, like NTX >= (client count/N), with N small, like 2 or even 1, and document it in the system setup documentation.

2) rdma_cm connection failures: either a timeout or another issue completing successful creation of the RC channel. rdma_cm has an internal retry that should continue for 5 minutes, so the pool connection is locked down for this entire period. Bottom line is that 2048 is likely not enough to cover this, since it assumes that each pool has a reuse factor of ~7 (NTX = 13000/2048). In other words, exhausting 256 (2048/8) entries seems likely, and given the small amount of memory associated with each pool NTX allocation (64 bytes), there may not be a good reason to be tight with this memory given that 13000 x 64 bytes == 832k, less than 1 MB (a rough sizing sketch follows below). Needs verification here. Basically, I never want to be in a slow pool allocation routine.
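A self-contained back-of-envelope check of the arithmetic above; all of the numbers are the assumptions stated in this comment (13,000 direct clients, ntx=2048, 8 CPTs, ~64 bytes per tx entry), not measurements:

#include <stdio.h>

int main(void)
{
	int clients  = 13000;	/* clients connecting directly (assumed, from this comment) */
	int ntx      = 2048;	/* proposed ntx tunable */
	int ncpts    = 8;	/* CPTs on the problem OSS nodes */
	int tx_bytes = 64;	/* assumed per-tx bookkeeping cost */

	printf("tx per CPT pool       : %d\n", ntx / ncpts);		/* 256 */
	printf("clients per CPT       : %d\n", clients / ncpts);	/* 1625 */
	printf("implied reuse factor  : %d\n", clients / ntx);		/* ~6-7 */
	printf("memory if ntx=clients : %d bytes (~%d KB)\n",
	       clients * tx_bytes, clients * tx_bytes / 1000);		/* 832000, ~832 KB */
	return 0;
}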

The explanation of why the ifdown/ifup works is that this stops rdma_cm from establishing new connections, so current in-flight ones either complete or fail; by the time the interface is down and the system idles out, several client connections have already been successfully established. Then, bringing up the interface leaves fewer clients to deal with and less pressure through rdma_cm connection establishment.

There is another issue of concern. I am unfamiliar with best practices in regard to hyperthreading. The most problematic servers are two 4-core sockets, so 8 cpt. However, we have some 8-core, two-socket OSSes with hyperthreading enabled, or potentially 32 cpt if the OS does not distinguish. This may not be ideal for register state, coherence or locking overheads, depending on assumptions made regarding thread state. Does Intel have a recommendation and/or performance measurements regarding the use of HT?

Comment by Mahmoud Hanafi [ 17/Sep/15 ]

One thing remains unexplained:

Why does a reboot/crash of an OSS on filesystem A cause client timeouts and evictions on filesystem B?

Comment by Bob Ciotti [ 17/Sep/15 ]

Is it possible to have the same problem in the pool allocation mechanism on the client too? That would make sense. If the same behaviour occurred on the client, that is, the pool allocation function gets stuck, then that might effectively mute the client from any TX to any server, assuming a TX pool buffer is required to send a (any?) message. This could cause a different filesystem's OSS to time out the same client. Interestingly enough, because the pool allocation mechanism splits work based on the NID-to-cpt mapping, when any pool runs out of TX, any TX to any NID that maps to that cpt backs up behind the pool allocator. I have an idea about how this might happen but need to look into the code further.

In fact, the Haswell client nodes have 48 CPU threads. That means the pool size/allocation size is just 5 or 6 entries (if I understand correctly). So it's very likely that instability of one filesystem will lead to calls into the pool allocation routine, which we know to be problematic/slow in some cases. If stuck in the pool allocation routine, clients can't talk to anyone. Clients should not need a lot of these, but the NID balancing might not be perfect.

Further, on our big SGI systems that have 1024 cores, this same issue may be even worse. I wonder if this could explain some of the issues we see on the UV systems? If pool allocation slowed down for some reason between OS releases, that would make this particular problem even worse.

There are several tunables here, but it may be an issue to have such a small pool when dividing by CPU threads (cpt). Also, we need guidance here regarding any tunables:

/* Pools (shared by connections on each CPT) */
/* These pools can grow at runtime, so don't need give a very large value */
#define IBLND_TX_POOL 256
#define IBLND_PMR_POOL 256
#define IBLND_FMR_POOL 256
#define IBLND_FMR_POOL_FLUSH 192

/* TX messages (shared by all connections) */
#define IBLND_TX_MSGS() (*kiblnd_tunables.kib_ntx)

/* RX messages (per connection) */
#define IBLND_RX_MSGS(v) (IBLND_MSG_QUEUE_SIZE(v) * 2 + IBLND_OOB_MSGS(v))
#define IBLND_RX_MSG_BYTES(v) (IBLND_RX_MSGS(v) * IBLND_MSG_SIZE)
#define IBLND_RX_MSG_PAGES(v) ((IBLND_RX_MSG_BYTES(v) + PAGE_SIZE - 1) / PAGE_SIZE)

/* WRs and CQEs (per connection) */
#define IBLND_RECV_WRS(v) IBLND_RX_MSGS(v)
#define IBLND_SEND_WRS(v) ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
#define IBLND_CQ_ENTRIES(v) (IBLND_RECV_WRS(v) + IBLND_SEND_WRS(v))

Comment by Bob Ciotti [ 17/Sep/15 ]

Regarding the testing Amir planned to do with the small NTX value: you may also want to try setting the client peer timeout value to some small value as well. Reboot a server and see if the pool allocation delays can result in client timeouts where the underlying RDMA QP is still connected, but Lustre believes the client is unreachable. What does the server do in this case? Try to tear down the existing RDMA QP?

Comment by Liang Zhen (Inactive) [ 17/Sep/15 ]

Amir, here is a patch that could be helpful: http://review.whamcloud.com/#/c/16454/
I suspect the problem is that when the system load is high (and under memory pressure), if ko2iblnd runs out of TXs and one thread (or a few) is allocating a pool and waiting, all other threads will spin and occupy all CPUs, so the system can't release memory (or does so very slowly) and the first thread can't finish the allocation.

This patch will change the behaviour of o2iblnd: schedulers will wait longer and longer if an allocation is already in progress.

Comment by Bob Ciotti [ 17/Sep/15 ]

Liang, we are going to test this. We're waiting for an update to print out trip counts. I can't comment on the patch, so I am putting it here. Also, this needs to be applied to the client code as well (I think), since the same behaviour is suspected on the client.

#define POOL_BACKOFF_TIMELIMIT 32

kiblnd_pool_alloc_node(kib_poolset_t *ps)
{
	cfs_list_t	*node;
	kib_pool_t	*pool;
	int		 rc;
	unsigned int	 interval = 1;
	unsigned int	 trips = 1;

again:
	spin_lock(&ps->ps_lock);
	cfs_list_for_each_entry(pool, &ps->ps_pool_list, po_list) {
		if (cfs_list_empty(&pool->po_free_list))
			continue;

		pool->po_allocated++;
		pool->po_deadline = cfs_time_shift(IBLND_POOL_DEADLINE);
		node = pool->po_free_list.next;
		cfs_list_del(node);

		if (ps->ps_node_init != NULL) {
			/* still hold the lock */
			ps->ps_node_init(pool, node);
		}

		spin_unlock(&ps->ps_lock);
		return node;
	}

	/* no available tx pool and ... */
	if (ps->ps_increasing) {
		/* another thread is allocating a new pool */
		spin_unlock(&ps->ps_lock);
		CDEBUG(D_NET, "Another thread is allocating new "
		       "%s pool, waiting %d HZs for her to complete. "
		       "Tried %d times\n", ps->ps_name, interval, trips);
		schedule_timeout(interval);
		if ((interval < cfs_time_seconds(1)) &&
		    (interval < POOL_BACKOFF_TIMELIMIT))
			interval *= 2;
		trips++;
		goto again;
	}

	if (cfs_time_before(cfs_time_current(), ps->ps_next_retry)) {
		/* someone failed recently */
		spin_unlock(&ps->ps_lock);
		return NULL;
	}

	ps->ps_increasing = 1;

Comment by Amir Shehata (Inactive) [ 17/Sep/15 ]

As I looked at the code in more detail, I'm not sure we need to define

#define POOL_BACKOFF_TIMELIMIT 32

In

		schedule_timeout(interval);
		if (interval < cfs_time_seconds(1))
			interval *= 2;

Interval is actually in jiffies. So we're pretty much saying back off incrementally until you wait for a maximum of 1 second. And from that point we'd wait for 1 second before going back and checking if there is a free node to allocate from the pool.

The patch update I'm pushing momentarily also adds some time profiling around ps_pool_create() to see how long it takes us to allocate new pools and DMA map the memory. That would be interesting to see as the system comes under more pressure.

I also added a counter to count the number of trips we make through the scheduler as requested.

Comment by Gerrit Updater [ 17/Sep/15 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/16470
Subject: LU-7054 o2iblnd: less intense allocating retry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8861aae7ebdcc564fa47cd84ace253e62bafef4e

Comment by Jay Lan (Inactive) [ 17/Sep/15 ]

Amir,

Your patch was generated from the master branch and caused a conflict in 2.5.3:

<<<<<<< HEAD
cfs_list_t *node;
kib_pool_t *pool;
int rc;
=======
struct list_head *node;
kib_pool_t *pool;
int rc;
unsigned int interval = 1;
cfs_time_t time_before;
unsigned int trips = 0;
>>>>>>> 9be6d5c... LU-7054 o2iblnd: less intense allocating retry

I need to change "struct list_head" to "cfs_list_t".

Also, interval and trips do not need to be 'unsigned int'. Actually they were used as int in CDEBUG.

Comment by Amir Shehata (Inactive) [ 17/Sep/15 ]

Jay,
I updated Liang's original check in on b2_5.

Comment by Amir Shehata (Inactive) [ 18/Sep/15 ]

I updated the b2_5 patch to add some more debug info:
1. print the number of pools in the pool set when allocating new pools
2. print the size of each pool
3. when you do lctl list_nids the number of cpts is printed at D_NET level.

This should clarify how many times we're allocating pools over time. It will also give us some insight into which CPTs the pools are getting associated with.

I have also been discussing internally the impact of hyperthreading on performance. There is a tendency to think that it could negatively impact performance. Would it be possible to turn it off and find out if there is any performance improvement?

However, I do print the number of CPTs as indicated above, so that should tell us whether we're counting HT siblings as logical cores.

I have also attached the HLD for the SMP node affinity for your reference. It describes how the CPT is implemented in the system.

I'll continue looking at how memory is allocated and dma mapped and provide an explanation.

Comment by Amir Shehata (Inactive) [ 21/Sep/15 ]

When a pool is created, enough kernel pages are allocated to cover the transmits. The number of pages is determined by the size of the tx message and the size of the page. For example, if the maximum message size is 4K and the page size is 4K, then we would allocate one page per message; and if we are allocating 256 tx, we'd allocate 256 pages.

In kiblnd_map_tx_pool() each tx->tx_msg is set up to point to the page allocated for that tx. Then dma_map_single() is used to map this kernel page to a DMA address.

The tx is then added to the pool tx free list ready for use.
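A minimal sketch of that per-tx setup, assuming the generic IB DMA-mapping API rather than Lustre's internal wrappers, so this is illustrative rather than the actual kiblnd_map_tx_pool() code:

/* Illustrative only -- generic IB DMA API, not Lustre's internal wrappers. */
static int sketch_map_one_tx(struct ib_device *ibdev, kib_tx_t *tx,
			     struct page *page, struct list_head *free_list)
{
	/* tx_msg points into the kernel page pre-allocated for this tx */
	tx->tx_msg = (kib_msg_t *)page_address(page);

	/* map the page to a DMA address once, at pool-creation time */
	tx->tx_msgaddr = ib_dma_map_single(ibdev, tx->tx_msg,
					   IBLND_MSG_SIZE, DMA_TO_DEVICE);
	if (ib_dma_mapping_error(ibdev, tx->tx_msgaddr))
		return -ENOMEM;

	/* ready for use: park the tx on the pool's free list */
	list_add_tail(&tx->tx_list, free_list);
	return 0;
}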

In the earlier discussion a question arose as to whether pinning memory is related to a connection. As described above, it doesn't look like that is the case.

However, allocating pages and DMA mapping them looks like it can take time to complete as the memory pressure grows on the system, as in the case here.

Comment by Mahmoud Hanafi [ 13/Oct/15 ]

There is a small bug in patch http://review.whamcloud.com/16470:

CDEBUG(D_NET, "ps_pool_create took %lu HZ to complete",
           cfs_time_current() - time_before);

This should end with a newline.

Comment by James A Simmons [ 13/Oct/15 ]

Outside of that bug, does it resolve your issues?

Comment by Mahmoud Hanafi [ 15/Oct/15 ]

I was able to get some timing info for kiblnd_create_tx_pool(). It can take on average 12-13 seconds to complete the call. Most of the time is spent inside the for loop in:

         LIBCFS_CPT_ALLOC(tx->tx_wrq, lnet_cpt_table(), ps->ps_cpt,
                 (1 + IBLND_MAX_RDMA_FRAGS) *
                 sizeof(*tx->tx_wrq));
Comment by Mahmoud Hanafi [ 18/Oct/15 ]

Yesterday we had an OSS that lost connection to the cluster; we couldn't find any IB-related issue. In the logs there are a few times where ldlm_cn threads dump call traces, for example:

Oct 17 12:38:53 nbp8-oss11 kernel: LustreError: 58897:0:(ldlm_lockd.c:435:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-nbp8-OST00a6_UUID lock: ffff8807b3a6cb40/0xec5cb59ce7507b6a lrc: 2/0,0 mode: --/PW res: [0x1bbc9af:0x0:0x0].0 rrc: 5 type: EXT [311853056->18446744073709551615] (req 311853056->18446744073709551615) flags: 0x74801000000020 nid: 10.151.13.98@o2ib remote: 0x1ee2c1eb7b3a5e70 expref: 5 pid: 17316 timeout: 5055271427 lvb_type: 0
Oct 17 12:38:53 nbp8-oss11 kernel: Pid: 58897, comm: ldlm_cn00_026
Oct 17 12:38:53 nbp8-oss11 kernel:
Oct 17 12:38:53 nbp8-oss11 kernel: Call Trace:
Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa054f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa080986b>] ldlm_add_waiting_lock+0x1db/0x310 [ptlrpc]
Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa080b068>] ldlm_server_completion_ast+0x598/0x770 [ptlrpc]
Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa080aad0>] ? ldlm_server_completion_ast+0x0/0x770 [ptlrpc]
Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa07de15c>] ldlm_work_cp_ast_lock+0xcc/0x200 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa081fb1c>] ptlrpc_set_wait+0x6c/0x860 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa081b28a>] ? ptlrpc_prep_set+0xfa/0x2f0 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa07de090>] ? ldlm_work_cp_ast_lock+0x0/0x200 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa07e10ab>] ldlm_run_ast_work+0x1bb/0x470 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa07e1475>] ldlm_reprocess_all+0x115/0x300 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa0802ff7>] ldlm_request_cancel+0x277/0x410 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa08032cd>] ldlm_handle_cancel+0x13d/0x240 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa08091c9>] ldlm_cancel_handler+0x1e9/0x500 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa08390c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa05618d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa0831a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa083b89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa083ada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Oct 17 12:38:54 nbp8-oss11 kernel:
Oct 17 12:57:00 nbp8-oss11 kernel: LNet: 3458:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.151.15.171@o2ib: error -116(waiting)

and again at 22:22:37
and there are call trace dumps for ll_ost_io at 22:54:24.

The major outage occurred at Oct 17 22:52:32.

See the attached nbp8-os11.var.log.messages.oct.17.gz
and 3 lustre debug dumps:
lustre-log.1445147654.68807.gz
lustre-log.1445147717.68744.gz
lustre-log.1445147754.68673.gz

I have debug dumps for some of the clients, but we didn't have net debugging enabled on them.

Comment by Mahmoud Hanafi [ 20/Oct/15 ]

Uploaded lustre debug dump to ftpsite:/uploads/LU7054/nbp9-oss16.ldebug.gz

It shows a lock callback timer eviction at

Oct 19 22:20:39 nbp9-oss16 kernel: LustreError: 11651:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.6.134@o2ib  ns: filter-nbp9-OST003f_UUID lock: ffff8800810877c0/0x15ed8b9f78013cb2 lrc: 3/0,0 mode: PW/PW res: [0x85f72a:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000000020 nid: 10.151.6.134@o2ib remote: 0x810eb6843fa5e1d1 expref: 9 pid: 11647 timeout: 5261835188 lvb_type: 0

From the debug dump we see that, starting 151 seconds before that (22:18:08), the server was running in

o2iblnd_cb.c 3306 kiblnd_cq_completion()) conn[ffff880f0c5d7000] (20)++

for a long time.

Is it possible that the callback request was never sent?

Comment by Mahmoud Hanafi [ 30/Oct/15 ]

Uploaded 2 new debug logs and /var/log/messages to ftp:/uploads/LU7054/
messages.gz
sec.20151029.21.03.25.gz
sec.20151029.21.05.55.gz

At Oct 29 21:03:25 we got this in /var/log/messages

Oct 29 21:03:25 nbp8-oss14 kernel: [1828390.200665] LustreError: 68632:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff8812e7162c00 x1516263417841768/t0(0) o3->d07e5b2b-77f6-1c68-f1fd-e6e4f8f614d7@10.151.57.146@o2ib:0/0 lens 4568/432 e 1 to 0 dl 1446177826 ref 1 fl Interpret:/0/0 rc 0/0

We dumped debug logs to sec.20151029.21.03.25.gz; it shows all the clients connecting.

00000800:00000200:0.2:1446177579.541489:0:0:0:(o2iblnd_cb.c:3306:kiblnd_cq_completion()) conn[ffff8804e04ad400] (20)++
....
00000800:00000200:1.0:1446177803.579278:0:11134:0:(o2iblnd_cb.c:993:kiblnd_check_sends()) conn[ffff880da89faa00] (31)--

I don't understand why the debug logs don't show any other activity before this, other than

00000400 00000001 18.1 Thu Oct 29 20:59:39 PDT 2015 0 0 0 (watchdog.c 123 lcw_cb()) Process entered
00000400 00000001 18.1 Thu Oct 29 20:59:39 PDT 2015 0 0 0 (watchdog.c 126 lcw_cb()) Process leaving

Hoping you guys can make sense of it.

Comment by Doug Oucharek (Inactive) [ 30/Oct/15 ]

Mahmoud: On Oct.14, you indicated the time was spent in a for loop but put the macro LIBCFS_CPT_ALLOC() in the copied code. Was the time spent in this macro call?

If the allocation code is freezing, there could be one of two reasons: memory has been exhausted or memory has become so fragmented we cannot get the number of contiguous pages we are requesting. I have seen our allocation macros freeze the calling kernel thread until such a time as the memory can be allocated. This could be a very long time and absolutely needs to be avoided. This could explain the 12-13 second delays.

So, it is important to figure out whether the freezing is in the allocation call.

Looking at the pool code, I see something that worries me. When a new pool is allocated (not the first one), it is given a deadline of 300 seconds. Once the 300 seconds has expired, whenever buffers are freed to that pool, it is checked if any buffers are still in use. If none are, we free the pool and all associated memory. This potential "yoyo" effect could be causing a problem. It might be interesting to try and set the deadline value to something enormous so we never free up pools allocated and just leave them around to be used.
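A minimal sketch of the expiry behaviour being described, using the pool fields already quoted in this ticket (po_allocated, po_deadline); this is assumed and simplified, not the verbatim free path, and "zombies" is a hypothetical local list for pools to be freed:

/* Assumed, simplified sketch of the free path described above. */
spin_lock(&ps->ps_lock);

pool->po_allocated--;
cfs_list_add(node, &pool->po_free_list);

/* past its 300 s deadline and nothing left in flight: tear the pool down,
 * which is the potential "yoyo" if load immediately forces a fresh,
 * slow, pool allocation again */
if (pool->po_allocated == 0 &&
    cfs_time_after(cfs_time_current(), pool->po_deadline))
	cfs_list_move(&pool->po_list, &zombies);	/* freed outside the lock */

spin_unlock(&ps->ps_lock);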

That being said, if a large enough pool is allocated at initialization time, it never gets freed and would solve this issue. Was the test to increase NTX suggested above ever tried? Do the servers have enough memory for a large pool allocation?

If the system is truly running out of memory due to a large load, the peer_credits of the clients can be lowered from the default of 8 to something like 6 or 4. That will cause the large number of clients you have to back off, reducing the load on the servers and thereby forgoing the need for such a big pool allocation. Worst case for a server: number of clients * peer_credits = max number of tx buffers that could be used.

Another thing to check is memory statistics once a system gets into trouble. On Linux, if you "echo m > /proc/sysrq-trigger" and look into /var/log/messages, there are some very useful memory statistics there. If you don't get such memory statistics, then SysRq is not enabled. You can do so by "echo 1 > /proc/sys/kernel/sysrq". On newer kernels, you can also "cat /proc/buddyinfo" to get a subset of the same information.

Comment by Mahmoud Hanafi [ 30/Oct/15 ]

I guess there are 2 major issues here:
1. Why does the server get into a state that requires all clients to reconnect at the same time?
2. The server handling of, and recovery from, the connection storm.

The connection back-off patch and raising the ntx and interface credits have helped with item 2, but I think we need more tuning here: enable FMR? Higher peer_credits? Higher ntx and interface credits? The pool allocation (LIBCFS_CPT_ALLOC) happens during this part.

But the root cause is item 1: why do all the clients need to reconnect at the same time? We have ruled out the IB fabric because it isn't logging any errors during this event.

From the logs I uploaded yesterday (sec.20151029.21.03.25.gz) the connections start at 20:59 (1446177579.541489). Can you see anything in the logs to indicate why?

The load and utilization on the server were light before the connections started.
I uploaded load.pdf, write.pdf, and read.pdf

Comment by Mahmoud Hanafi [ 31/Oct/15 ]

Just uploaded a new debug dump (ftp:/uploads/LU7054/sec.20151031.11.31.38.gz). The OSS experienced the issue, with the first log entry at 1446316297.011251 showing a bulk timeout.

00010000:00020000:11.0:1446316297.011251:0:20421:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff8809c4ae0400 x1516094020471028/t0(0) o3->236aaab5-0544-9964-0bd9-544f78fd896b@10.151.11.64@o2ib:0/0 lens 488/432 e 2 to 0 dl 1446316318 ref 1 fl Interpret:/0/0 rc 0/0
00010000:00020000:6.0:1446316297.011279:0:15047:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff880cf9000c00 x1515275583687516/t0(0) o3->f89cd266-d025-1f42-d13b-d6a51a9b5e01@10.151.11.151@o2ib:0/0 lens 488/432 e 2 to 0 dl 1446316318 ref 1 fl Interpret:/0/0 rc 0/0
Comment by Doug Oucharek (Inactive) [ 03/Nov/15 ]

With regards to the two issues: could not the same cause be behind both? We are speculating that memory pool allocation is the cause of issue 2 (many reconnections). Could not the original problem be triggered by the same thing? With so many clients aggressively sending to the OSS, could it not be freezing on memory allocation for several seconds causing evictions?

This is why I have been focusing so much on addressing the pool allocation problem. If the two problems are truly independent, then I suggest we start a different Jira ticket for problem 1 and use this ticket only for problem 2. That way the evidence behind each problem does not get mingled.

With the changes made (raising NTX, back-off, and interface credits), has problem 2 happened again, or just at a reduced frequency? Can NTX be increased even more if the frequency is only reduced?

Comment by Mahmoud Hanafi [ 03/Nov/15 ]

I think you are correct; the two issues are the same.

The changes we have made (raising NTX, back-off, and interface credits) have allowed the OSS to recover and not get stuck with ib_cm in 'D' state. I think the frequency is lower. We are going to test increasing NTX and interface credits by 2x. How much will enabling FMR help us?

Comment by Doug Oucharek (Inactive) [ 03/Nov/15 ]

I updated LU-7224 with some tuning recommendations for this situation.

We were looking at the code to see if FMR saves us on memory allocation. It does not; it might actually use a bit more memory. With Mellanox cards, we have not noticed much improvement in performance except in high-latency cases (like WAN conditions). It benefits TrueScale IB cards most. Note: Mellanox has discontinued support for FMR in their latest mlx5-based cards.

So, I'm not optimistic that FMR will help in this case. See LU-7224 for a summary of tuning suggestions.

Comment by Gerrit Updater [ 26/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16470/
Subject: LU-7054 o2iblnd: less intense allocating retry
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d96879f34ce229695566a3e5de1f5160f4c9ef02

Comment by Joseph Gmitter (Inactive) [ 27/Jan/16 ]

Patch has landed to master for 2.8.
