Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Fix Version: Lustre 2.4.0
- 3
- 5764
Description
vmalloc-based allocations can potentially take a very long time to complete due to a regression in the kernel. As a result, I've seen our MDS "lock up" for periods of time while all of the cores spin on the vmap_area_lock down in ptlrpc_alloc_rqbd.
For example:
2012-11-01 11:34:28 Pid: 34505, comm: mdt02_051
Call Trace:
 [<ffffffff81273155>] ? rb_insert_color+0x125/0x160
 [<ffffffff81149f1f>] ? __vmalloc_area_node+0x5f/0x190
 [<ffffffff810609ea>] __cond_resched+0x2a/0x40
 [<ffffffff814efa60>] _cond_resched+0x30/0x40
 [<ffffffff8115fa88>] kmem_cache_alloc_node_notrace+0xa8/0x130
 [<ffffffff8115fc8b>] __kmalloc_node+0x7b/0x100
 [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
 [<ffffffff81149f1f>] __vmalloc_area_node+0x5f/0x190
 [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
 [<ffffffff81149eb2>] __vmalloc_node+0xa2/0xb0
 [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
 [<ffffffff8114a199>] vmalloc_node+0x29/0x30
 [<ffffffffa05a2a40>] cfs_cpt_vmalloc+0x20/0x30 [libcfs]
 [<ffffffffa0922ffe>] ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
 [<ffffffffa09235b5>] ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
 [<ffffffffa0927fbd>] ptlrpc_main+0xd0d/0x19f0 [ptlrpc]
 [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
 [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Here are a couple of links regarding the kernel regression: