[LU-2432] ptlrpc_alloc_rqbd spinning on vmap_area_lock on MDS Created: 05/Dec/12  Updated: 18/Mar/13  Resolved: 18/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: sequoia

Issue Links:
Related
is related to LU-2708 MDS thrashing in ptlrpc_alloc_rqbd Resolved
is related to LU-2424 add memory limits for ptlrpc service Resolved
Severity: 3
Rank (Obsolete): 5764

 Description   

vmalloc-based allocations can potentially take a very long time to complete due to a regression in the kernel. As a result, I've seen our MDS "lock up" for certain periods of time while all of the cores spin on the vmap_area_lock down in ptlrpc_alloc_rqbd.

For example:

    2012-11-01 11:34:28 Pid: 34505, comm: mdt02_051
    2012-11-01 11:34:28 
    2012-11-01 11:34:28 Call Trace:
    2012-11-01 11:34:28  [<ffffffff81273155>] ? rb_insert_color+0x125/0x160
    2012-11-01 11:34:28  [<ffffffff81149f1f>] ? __vmalloc_area_node+0x5f/0x190
    2012-11-01 11:34:28  [<ffffffff810609ea>] __cond_resched+0x2a/0x40
    2012-11-01 11:34:28  [<ffffffff814efa60>] _cond_resched+0x30/0x40
    2012-11-01 11:34:28  [<ffffffff8115fa88>] kmem_cache_alloc_node_notrace+0xa8/0x130
    2012-11-01 11:34:28  [<ffffffff8115fc8b>] __kmalloc_node+0x7b/0x100
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffff81149f1f>] __vmalloc_area_node+0x5f/0x190
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffff81149eb2>] __vmalloc_node+0xa2/0xb0
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffff8114a199>] vmalloc_node+0x29/0x30
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffffa0922ffe>] ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa09235b5>] ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa0927fbd>] ptlrpc_main+0xd0d/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffff8100c14a>] child_rip+0xa/0x20
    2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffff8100c140>] ? child_rip+0x0/0x20
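
The request buffers here are large enough that the allocation goes through cfs_cpt_vmalloc() rather than kmalloc(), and every vmalloc() in the kernel takes the global vmap_area_lock. A minimal sketch of that kind of size-based split (the threshold and helper name below are illustrative assumptions, not the actual Lustre code):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /*
     * Illustrative sketch only: the usual pattern is to kmalloc() small
     * buffers and fall back to vmalloc() for big ones.  The threshold and
     * helper name are assumptions, not the Lustre macros.
     */
    static void *rqbd_buffer_alloc(size_t size)
    {
            /* Small buffers: kmalloc() is cheap and mostly per-CPU. */
            if (size <= 4 * PAGE_SIZE)
                    return kmalloc(size, GFP_NOFS);

            /*
             * ~110KB request buffers land here.  Every vmalloc() takes the
             * global vmap_area_lock, so many service threads growing the
             * buffer pool at once simply spin on that one lock, which is
             * the behaviour shown in the backtrace above.
             */
            return vmalloc(size);
    }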

Here are a couple of links regarding the kernel regression:



 Comments   
Comment by Prakash Surya (Inactive) [ 05/Dec/12 ]

See: http://review.whamcloud.com/4439

Comment by Peter Jones [ 06/Dec/12 ]

Thanks Prakash!

Bobijam could you please review this patch?

Comment by Zhenyu Xu [ 06/Dec/12 ]

svc->srv_buf_size can be MDS_BUFSIZE = (362 + LOV_MAX_STRIPE_COUNT * 56 + 1024) ~= 110KB for the MDS service; could that be problematic?
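
As a quick sanity check of that estimate (assuming LOV_MAX_STRIPE_COUNT = 2000, the value in the Lustre 2.x headers), the expression does come out to roughly 110KB, well into vmalloc territory:

    #include <stdio.h>

    /*
     * Quick arithmetic check of the MDS_BUFSIZE estimate above, assuming
     * LOV_MAX_STRIPE_COUNT = 2000.
     */
    int main(void)
    {
            const long lov_max_stripe_count = 2000;
            const long mds_bufsize = 362 + lov_max_stripe_count * 56 + 1024;

            /* prints: MDS_BUFSIZE ~= 113386 bytes (110.7 KiB) */
            printf("MDS_BUFSIZE ~= %ld bytes (%.1f KiB)\n",
                   mds_bufsize, mds_bufsize / 1024.0);
            return 0;
    }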

Comment by Andreas Dilger [ 11/Dec/12 ]

We discussed at LAD that one problem with the request buffers is that the incoming LNET buffers (sorry, I don't have the correct LNET terms here) are allocated only large enough for the largest single request, though most requests are smaller than this. Unfortunately, as soon as a single RPC is waiting in the incoming buffer, there is no longer enough space in the buffer to receive a maximum-sized incoming request. This means that each buffer is only ever used for a single message, regardless of how many might fit.

A solution that was discussed was to make the request buffer be 2x as large as the maximum request size and/or rounded up to the next power-of-two boundary. That would at least increase the buffer utilization to 50%, and would likely allow tens of requests per LNET buffer.

It may be that the patch for LU-2424 will already address this issue?
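
A minimal sketch of the sizing rule being discussed, i.e. doubling the maximum request size and rounding up to the next power of two (the function name is hypothetical; the real change is in the patches referenced in the later comments):

    #include <linux/log2.h>      /* roundup_pow_of_two() */

    /*
     * Hypothetical sketch of the proposed sizing: make each request buffer
     * at least twice the maximum request size, rounded up to a power of
     * two, so one buffer can hold several typical (much smaller) requests
     * instead of being retired after a single message.
     */
    static unsigned long rqbd_buffer_size(unsigned long max_req_size)
    {
            return roundup_pow_of_two(2 * max_req_size);
    }

    /* e.g. a ~110KB maximum request -> 220KB doubled -> 256KB buffer */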

Comment by Prakash Surya (Inactive) [ 11/Dec/12 ]

I wasn't at LAD, so I'm unaware of that discussion. But what trade-offs are being made between the number of buffers used and the size of each? I.e., why can't we just have one huge buffer, increasing the utilization to (BUFFER_SIZE - REQUEST_SIZE) / BUFFER_SIZE (trending towards 100% as BUFFER_SIZE grows large)? Granted, I don't understand the LNET code well, so I must be missing something that makes that obviously the wrong thing to do.
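
A rough illustration of that utilization argument, assuming a ~110KB maximum request (the buffer sizes below are made-up examples): once one maximum-sized request has to be accommodated, the usable fraction of the buffer grows with the buffer size.

    #include <stdio.h>

    /*
     * Rough illustration of the utilization formula above, assuming a
     * ~110KB maximum request; the buffer sizes are arbitrary examples.
     */
    int main(void)
    {
            const double req_kib = 110.0;
            const double buf_kib[] = { 128.0, 256.0, 512.0, 1024.0 };
            int i;

            for (i = 0; i < 4; i++)
                    printf("%6.0f KiB buffer -> %4.0f%% of the buffer left "
                           "after one max-sized request\n",
                           buf_kib[i],
                           100.0 * (buf_kib[i] - req_kib) / buf_kib[i]);
            return 0;
    }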

Comment by Andreas Dilger [ 11/Dec/12 ]

My (imperfect) understanding is that the receive buffers cannot be re-used until all of the requests therein are processed. That means the buffers are filled from the start, processed, and then returned to the incoming buffer list. If the buffer is too large, then requests sitting in the buffer may wait too long to be processed, or the buffer still will not be fully utilized if there is an upper limit on how long a request will wait.

Comment by Liang Zhen (Inactive) [ 12/Dec/12 ]

I didn't realize that we still don't have the "big request buffer" fix; that should be the right way to fix both this problem and LU-2424.
I would suggest 512K or 1M as the request buffer size. As Andreas said, a very large request buffer can't be reused if any of its (thousands or more) requests is pending on something, so it might introduce other issues.
And I still think it's a nice improvement if we only allow one thread (per CPT) to enter the allocating path.
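
A minimal sketch of that "one allocating thread per CPT" idea (the structure and field names are hypothetical; the actual implementation landed via the patches referenced in the later comments). A thread that finds a grow already in flight simply goes back to serving requests instead of joining the vmalloc() pile-up:

    #include <linux/spinlock.h>

    /* Hypothetical per-CPT service partition state; scp_lock is assumed to
     * be initialised elsewhere. */
    struct svc_part {
            spinlock_t      scp_lock;
            int             scp_growing;    /* a buffer grow is in flight */
    };

    static int try_grow_req_bufs(struct svc_part *part)
    {
            int mine = 0;

            spin_lock(&part->scp_lock);
            if (!part->scp_growing) {
                    part->scp_growing = 1;
                    mine = 1;
            }
            spin_unlock(&part->scp_lock);

            /* Someone else is already allocating: keep serving requests. */
            if (!mine)
                    return 0;

            /* ... allocate more request buffers (the slow vmalloc() path) ... */

            spin_lock(&part->scp_lock);
            part->scp_growing = 0;
            spin_unlock(&part->scp_lock);
            return 1;
    }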

Comment by Liang Zhen (Inactive) [ 01/Jan/13 ]

I posted a patch for this:
http://review.whamcloud.com/#change,4939
and another patch to resolve the buffer utilization issue:
http://review.whamcloud.com/#change,4940

Comment by Prakash Surya (Inactive) [ 04/Jan/13 ]

Liang, why would limiting the vmalloc calls to a single thread fix the issue? That one thread will still be affected by the regression. Will the other threads still be able to service requests despite needing more request buffers, or will they all have to wait for this single thread to finish the allocations?

Comment by Liang Zhen (Inactive) [ 04/Jan/13 ]

I think we might not care about one thread (or very few threads) spinning, because each service has tens or even hundreds of threads, and servers normally have many CPU cores, so all other threads can serve requests; they will not wait for buffer allocation at all.
The key issue in this ticket is that vmalloc can't be parallelized, so it's a waste if all threads/CPUs try to allocate buffers at the same time.

Comment by Prakash Surya (Inactive) [ 07/Jan/13 ]

all other threads can serve requests; they will not wait for buffer allocation at all

Perfect, that's what I wanted to verify with you. Thanks for the clarification!

Comment by Andreas Dilger [ 18/Mar/13 ]

Both http://review.whamcloud.com/4939 and http://review.whamcloud.com/4940 have landed, so I think this bug could be closed. There should only be a single thread calling vmalloc() now.
