[LU-2432] ptlrpc_alloc_rqbd spinning on vmap_area_lock on MDS - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.4.0
Affects Version/s: Lustre 2.4.0
Labels: sequoia
Severity: 3
Rank (Obsolete): 5764

Description

vmalloc-based allocations can potentially take a very long time to complete due to a regression in the kernel. As a result, I've seen our MDS "lock up" for certain periods of time while all of the cores spin on the vmap_area_lock down in ptlrpc_alloc_rqbd.

For example:

    2012-11-01 11:34:28 Pid: 34505, comm: mdt02_051
    2012-11-01 11:34:28 
    2012-11-01 11:34:28 Call Trace:
    2012-11-01 11:34:28  [<ffffffff81273155>] ? rb_insert_color+0x125/0x160
    2012-11-01 11:34:28  [<ffffffff81149f1f>] ? __vmalloc_area_node+0x5f/0x190
    2012-11-01 11:34:28  [<ffffffff810609ea>] __cond_resched+0x2a/0x40
    2012-11-01 11:34:28  [<ffffffff814efa60>] _cond_resched+0x30/0x40
    2012-11-01 11:34:28  [<ffffffff8115fa88>] kmem_cache_alloc_node_notrace+0xa8/0x130
    2012-11-01 11:34:28  [<ffffffff8115fc8b>] __kmalloc_node+0x7b/0x100
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffff81149f1f>] __vmalloc_area_node+0x5f/0x190
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffff81149eb2>] __vmalloc_node+0xa2/0xb0
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffff8114a199>] vmalloc_node+0x29/0x30
    2012-11-01 11:34:28  [<ffffffffa05a2a40>] cfs_cpt_vmalloc+0x20/0x30 [libcfs]
    2012-11-01 11:34:28  [<ffffffffa0922ffe>] ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa09235b5>] ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa0927fbd>] ptlrpc_main+0xd0d/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffff8100c14a>] child_rip+0xa/0x20
    2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
    2012-11-01 11:34:28  [<ffffffff8100c140>] ? child_rip+0x0/0x20

Here's a couple links regarding the kernel regression:

Attachments

Issue Links

is related to

LU-2708 MDS thrashing in ptlrpc_alloc_rqbd

Resolved

LU-2424 add memory limits for ptlrpc service

Resolved

Activity


Liang Zhen (Inactive) added a comment - 04/Jan/13 10:33 PM

I think we might not care about one thread (or very few threads) spinning, because each service has tens or even hundreds of threads and servers normally have many CPU cores; all other threads can serve requests and will not wait for buffer allocation at all.
The key issue of this ticket is that vmalloc can't be parallelized, so it's a waste if all threads/CPUs try to allocate buffers at the same time.
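
As a rough illustration of the "only one thread allocates" idea, here is a minimal userspace sketch in Python. It is not the code from the patch: the class name `AllocGate` and the method `try_grow` are hypothetical, and a real implementation would live in the ptlrpc service code and be kept per CPT. The point it shows is that threads which lose the race skip the allocation and go back to serving requests instead of queueing up behind the allocator.

```python
import threading


class AllocGate:
    """Hypothetical gate: at most one thread runs the allocation path at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_grow(self, allocate):
        """Run allocate() only if no other thread is already allocating.

        Returns True if this caller performed the allocation, False if it
        was skipped because another allocation was in progress.
        """
        # Non-blocking acquire: losers do not spin or sleep, they just return.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            allocate()
            return True
        finally:
            self._lock.release()
```

In this sketch one gate would be created per partition, so allocations on different partitions are still independent of each other.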


Prakash Surya (Inactive) added a comment - 04/Jan/13 4:09 PM

Liang, why would limiting the vmalloc calls to a single thread fix the issue? That one thread will still be affected by the regression. Will the other threads still be able to service requests despite needing more request buffers? Or will they all have to wait for this single thread to finish the allocations?


Liang Zhen (Inactive) added a comment - 01/Jan/13 4:09 AM

I posted a patch for this:
http://review.whamcloud.com/#change,4939
and another patch to resolve the buffer utilization issue:
http://review.whamcloud.com/#change,4940


Liang Zhen (Inactive) added a comment - 12/Dec/12 1:19 AM

I didn't realize that we still don't have the "big request buffer" fix. In that case it should be the right way to fix this problem and ~~LU-2424~~.
I would suggest 512K or 1M as the request buffer size. As Andreas said, a very large request buffer can't be reused if any of those (thousands or more) requests is pending on something, so it might have some other issues.
And I still think it's a nice improvement if we only allow one thread (per CPT) to enter the allocation path.


Andreas Dilger added a comment - 11/Dec/12 9:21 PM

My (imperfect) understanding is that the receive buffers cannot be re-used until all of the requests therein are processed. That means the buffers are filled from the start, processed, and then returned to the incoming buffer list. If the buffer is too large, then requests sitting in the buffer may wait too long to be processed, or the buffer still will not be fully utilized if there is an upper limit for how long a request will wait.


Prakash Surya (Inactive) added a comment - 11/Dec/12 2:16 PM

I wasn't at LAD, so I'm unaware of that discussion. But what trade-offs are being made between the number of buffers used and the size of each? I.e., why can't we just have one huge buffer, increasing the utilization to (BUFFER_SIZE-REQUEST_SIZE)/BUFFER_SIZE percent (trending towards 100% as BUFFER_SIZE grows large)? Granted, I don't understand the LNET code well, so I must be missing something which makes that obviously the wrong thing to do.


Andreas Dilger added a comment - 11/Dec/12 1:37 PM

We discussed at LAD that one problem with the request buffers is that the incoming LNET buffers (sorry, I don't have the correct LNET terms here) are allocated only large enough for the largest single request, though most requests are smaller than this. Unfortunately, as soon as a single RPC is waiting in the incoming buffer, there is no longer enough space in the buffer to receive a maximum-sized incoming request. This means that each buffer is only ever used for a single message, regardless of how many might fit.

A solution that was discussed was to make the request buffer be 2x as large as the maximum request size and/or rounded up to the next power-of-two boundary. That would at least increase the buffer utilization to 50%, and would likely allow tens of requests per LNET buffer.
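
The utilization argument above can be illustrated with a small model. This is a simplified sketch, not LNET code: it assumes a buffer keeps accepting requests only while its free space can still hold a maximum-sized request, and the function names and the sizes used in the note below are hypothetical.

```python
def requests_per_buffer(buf_size, max_req_size, req_size):
    """Count requests that land in one buffer under the simplified model.

    The buffer accepts another request only while the remaining free space
    is at least max_req_size; every request is assumed to be req_size bytes.
    """
    if not 0 < req_size <= max_req_size <= buf_size:
        raise ValueError("expected 0 < req_size <= max_req_size <= buf_size")
    used = 0
    count = 0
    while buf_size - used >= max_req_size:
        used += req_size
        count += 1
    return count


def utilization(buf_size, max_req_size, req_size):
    """Fraction of the buffer that ends up holding request data."""
    return requests_per_buffer(buf_size, max_req_size, req_size) * req_size / buf_size
```

With a hypothetical maximum request size of 110 KiB (112640 bytes) and 4 KiB requests, a buffer of exactly the maximum size holds 1 request, while a buffer of twice that size holds 28 requests and is about 51% utilized, which matches the "at least 50%" and "tens of requests" estimates.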

It may be that the patch for ~~LU-2424~~ will already address this issue?


Zhenyu Xu added a comment - 06/Dec/12 9:45 PM

svc->srv_buf_size can be MDS_BUFSIZE = (362 + LOV_MAX_STRIPE_COUNT * 56 + 1024) ~= 110KB for MDS service, could it be problematic?
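
The arithmetic behind that estimate can be checked with a short sketch. It assumes LOV_MAX_STRIPE_COUNT is 2000 and a 4 KiB page size; neither value is stated in this ticket, and the helper names are hypothetical.

```python
LOV_MAX_STRIPE_COUNT = 2000  # assumption, not stated in the ticket
PAGE_SIZE = 4096             # assumption: 4 KiB pages


def mds_bufsize(stripe_count=LOV_MAX_STRIPE_COUNT):
    """Evaluate the MDS_BUFSIZE expression quoted in the comment above."""
    return 362 + stripe_count * 56 + 1024


def pages_needed(size, page_size=PAGE_SIZE):
    """Number of whole pages required to hold size bytes (rounded up)."""
    return -(-size // page_size)


if __name__ == "__main__":
    size = mds_bufsize()
    print(f"MDS_BUFSIZE = {size} bytes ({size / 1024:.1f} KiB), "
          f"{pages_needed(size)} pages")
```

Under these assumptions the buffer is 113386 bytes, roughly 110.7 KiB, which spans 28 pages per request buffer.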


Peter Jones added a comment - 06/Dec/12 10:28 AM

Thanks Prakash!

Bobijam could you please review this patch?


Prakash Surya (Inactive) added a comment - 05/Dec/12 6:26 PM

See: http://review.whamcloud.com/4439


People

Assignee: Zhenyu Xu

Reporter: Prakash Surya (Inactive)

Votes: 0

Watchers: 9

Dates

Created: 05/Dec/12 6:08 PM

Updated: 18/Mar/13 9:25 AM

Resolved: 18/Mar/13 9:25 AM