[LU-2432] ptlrpc_alloc_rqbd spinning on vmap_area_lock on MDS Created: 05/Dec/12 Updated: 18/Mar/13 Resolved: 18/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sequoia |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5764 |
| Description |
|
vmalloc-based allocations can potentially take a very long time to complete due to a regression in the kernel. As a result, I've seen our MDS "lock up" for certain periods of time while all of the cores spin on the vmap_area_lock down in ptlrpc_alloc_rqbd. For example:

2012-11-01 11:34:28 Pid: 34505, comm: mdt02_051
2012-11-01 11:34:28
2012-11-01 11:34:28 Call Trace:
2012-11-01 11:34:28 [<ffffffff81273155>] ? rb_insert_color+0x125/0x160
2012-11-01 11:34:28 [<ffffffff81149f1f>] ? __vmalloc_area_node+0x5f/0x190
2012-11-01 11:34:28 [<ffffffff810609ea>] __cond_resched+0x2a/0x40
2012-11-01 11:34:28 [<ffffffff814efa60>] _cond_resched+0x30/0x40
2012-11-01 11:34:28 [<ffffffff8115fa88>] kmem_cache_alloc_node_notrace+0xa8/0x130
2012-11-01 11:34:28 [<ffffffff8115fc8b>] __kmalloc_node+0x7b/0x100
2012-11-01 11:34:28 [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
2012-11-01 11:34:28 [<ffffffff81149f1f>] __vmalloc_area_node+0x5f/0x190
2012-11-01 11:34:28 [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
2012-11-01 11:34:28 [<ffffffff81149eb2>] __vmalloc_node+0xa2/0xb0
2012-11-01 11:34:28 [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
2012-11-01 11:34:28 [<ffffffff8114a199>] vmalloc_node+0x29/0x30
2012-11-01 11:34:28 [<ffffffffa05a2a40>] cfs_cpt_vmalloc+0x20/0x30 [libcfs]
2012-11-01 11:34:28 [<ffffffffa0922ffe>] ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
2012-11-01 11:34:28 [<ffffffffa09235b5>] ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
2012-11-01 11:34:28 [<ffffffffa0927fbd>] ptlrpc_main+0xd0d/0x19f0 [ptlrpc]
2012-11-01 11:34:28 [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
2012-11-01 11:34:28 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-01 11:34:28 [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
2012-11-01 11:34:28 [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
2012-11-01 11:34:28 [<ffffffff8100c140>] ? child_rip+0x0/0x20
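For context, a minimal sketch of the try-kmalloc-then-vmalloc fallback commonly used for large buffers (names here are hypothetical; the real path in the trace goes through cfs_cpt_vmalloc). The point is that every vmalloc() caller must take the kernel-global vmap_area_lock to find and insert a vmap_area, so concurrent allocators serialize on that one spinlock:

```c
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *rqbd_buffer_alloc(size_t size)
{
        /* Try physically contiguous memory first, but give up quickly:
         * high-order kmalloc often fails under memory fragmentation. */
        void *buf = kmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_NORETRY);

        if (buf == NULL)
                /* Slow path: serializes on the global vmap_area_lock. */
                buf = vmalloc(size);
        return buf;
}
```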
Here are a couple of links regarding the kernel regression:
|
| Comments |
| Comment by Prakash Surya (Inactive) [ 05/Dec/12 ] |
| Comment by Peter Jones [ 06/Dec/12 ] |
|
Thanks Prakash! Bobijam, could you please review this patch? |
| Comment by Zhenyu Xu [ 06/Dec/12 ] |
|
svc->srv_buf_size can be MDS_BUFSIZE = (362 + LOV_MAX_STRIPE_COUNT * 56 + 1024) ~= 110KB for the MDS service; could that be problematic? |
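For reference, the arithmetic behind that estimate, assuming LOV_MAX_STRIPE_COUNT = 2000 (an assumption based on the value in the tree at the time):

```c
#include <stdio.h>

int main(void)
{
        const long stripe_count = 2000;  /* assumed LOV_MAX_STRIPE_COUNT */
        const long mds_bufsize  = 362 + stripe_count * 56 + 1024;

        /* Prints: MDS_BUFSIZE = 113386 bytes (~110.7 KiB) */
        printf("MDS_BUFSIZE = %ld bytes (~%.1f KiB)\n",
               mds_bufsize, mds_bufsize / 1024.0);
        return 0;
}
```

An allocation of that size would need an order-5 (128 KiB) contiguous chunk from kmalloc(), which is presumably why these buffers end up on the vmalloc() path in the first place.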
| Comment by Andreas Dilger [ 11/Dec/12 ] |
|
We discussed at LAD that one problem with the request buffers is that the incoming LNET buffers (sorry, I don't have the correct LNET terms here) are allocated only large enough for the largest single request, though most requests are smaller than this. Unfortunately, as soon as a single RPC is waiting in the incoming buffer, there is no longer enough space in the buffer to receive a maximum-sized incoming request. This means that each buffer is only ever used for a single message, regardless of how many might fit. A solution that was discussed was to make the request buffer be 2x as large as the maximum request size and/or rounded up to the next power-of-two boundary. That would at least increase the buffer utilization to 50%, and would likely allow tens of requests per LNET buffer. It may be that the patch for |
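A minimal sketch of that sizing rule (the function name is hypothetical; this is not the landed patch):

```c
#include <linux/log2.h>

/* 2x the maximum request size guarantees room for a second
 * maximum-sized request even after one has already landed in the
 * buffer; rounding up to a power of two improves allocator fit. */
static unsigned long rqbd_buffer_size(unsigned long max_req_size)
{
        return roundup_pow_of_two(2 * max_req_size);
}
```

With the ~110 KiB maximum MDS request size discussed above, this would give 256 KiB buffers.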
| Comment by Prakash Surya (Inactive) [ 11/Dec/12 ] |
|
I wasn't at LAD, so I'm unaware of that discussion. But what trade-offs are being made between the number of buffers used and the size of each? I.e., why can't we just have one huge buffer, increasing the utilization to (BUFFER_SIZE - REQUEST_SIZE)/BUFFER_SIZE (trending toward 100% as BUFFER_SIZE grows large)? Granted, I don't understand the LNET code well, so I must be missing something that makes that obviously the wrong thing to do. |
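Plugging the ~110 KiB maximum request size into that bound gives a feel for the numbers (a quick user-space check; the buffer sizes are purely illustrative):

```c
#include <stdio.h>

int main(void)
{
        const double req_kib   = 110.7;               /* max request size */
        const double buf_kib[] = { 256, 1024, 4096 }; /* candidate buffers */

        /* Utilization bound: (BUFFER_SIZE - REQUEST_SIZE) / BUFFER_SIZE */
        for (int i = 0; i < 3; i++)
                printf("%5.0f KiB buffer -> %4.1f%% usable\n", buf_kib[i],
                       100.0 * (buf_kib[i] - req_kib) / buf_kib[i]);
        return 0;
}
```

Roughly 57% usable at 256 KiB, 89% at 1 MiB, and 97% at 4 MiB; the next comment explains why utilization is not the only constraint.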
| Comment by Andreas Dilger [ 11/Dec/12 ] |
|
My (imperfect) understanding is that the receive buffers cannot be re-used until all of the requests therein are processed. That means the buffers are filled from the start, processed, and then returned to the incoming buffer list. If the buffer is too large, then requests sitting in the buffer may wait too long to be processed, or the buffer still will not be fully utilized if there is an upper limit on how long a request will wait. |
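A sketch of that reuse constraint with hypothetical names (the real accounting lives in ptlrpc's request buffer descriptors):

```c
#include <linux/atomic.h>

struct rx_buffer {
        atomic_t refcount;      /* requests still parked in this buffer */
};

/* Hypothetical: hand the buffer back to LNet, resetting the count. */
void repost_buffer(struct rx_buffer *buf);

/* Called as each request in the buffer finishes processing.  The
 * buffer can only be reposted once the last request drops its
 * reference, so a huge buffer is pinned by its slowest request. */
static void rx_buffer_put(struct rx_buffer *buf)
{
        if (atomic_dec_and_test(&buf->refcount))
                repost_buffer(buf);
}
```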
| Comment by Liang Zhen (Inactive) [ 12/Dec/12 ] |
|
I didn't realize that we still don't have the "big request buffer" fix; then this should be the right way to fix this problem and
| Comment by Liang Zhen (Inactive) [ 01/Jan/13 ] |
|
I posted a patch for this: |
| Comment by Prakash Surya (Inactive) [ 04/Jan/13 ] |
|
Liang, why would limiting the vmalloc calls to a single thread fix the issue? That one thread will still be affected by the regression. Will the other threads still be able to service requests despite needing more request buffers? Or will they all have to wait for this single thread to finish the allocations? |
| Comment by Liang Zhen (Inactive) [ 04/Jan/13 ] |
|
I think we need not worry about one thread (or a very few threads) spinning: each service has tens or even hundreds of threads, and servers normally have many CPU cores, so all the other threads can keep serving requests; they will not wait for the buffer allocation at all. |
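A minimal sketch of that single-allocator idea with hypothetical names (the landed patch differs in detail): a thread that finds an allocation already in flight simply returns to serving requests instead of piling onto vmap_area_lock.

```c
#include <linux/spinlock.h>

struct svc_part {
        spinlock_t lock;
        int        growing;     /* nonzero while a thread is in vmalloc() */
};

/* Hypothetical slow path that does the actual vmalloc() calls. */
void alloc_more_buffers(struct svc_part *svcpt);

static void maybe_grow_req_bufs(struct svc_part *svcpt)
{
        spin_lock(&svcpt->lock);
        if (svcpt->growing) {
                /* Someone else is already allocating; go serve requests. */
                spin_unlock(&svcpt->lock);
                return;
        }
        svcpt->growing = 1;
        spin_unlock(&svcpt->lock);

        alloc_more_buffers(svcpt);      /* only one thread gets here */

        spin_lock(&svcpt->lock);
        svcpt->growing = 0;
        spin_unlock(&svcpt->lock);
}
```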
| Comment by Prakash Surya (Inactive) [ 07/Jan/13 ] |
Perfect, that's what I wanted to verify with you. Thanks for the clarification! |
| Comment by Andreas Dilger [ 18/Mar/13 ] |
|
Both http://review.whamcloud.com/4939 and http://review.whamcloud.com/4940 have landed, so I think this bug could be closed. There should only be a single thread calling vmalloc() now. |