[LU-1212] On MDS startup upon client connection mdt_xx threads Consume All Available CPU Created: 13/Mar/12 Updated: 27/Sep/12 Resolved: 08/Apr/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ian Colle (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4687 |
| Description |
|
Found during IR testing at ORNL. On MDS startup, soon after clients start hitting it, all mdt_xx threads start consuming all available CPU. We ran sysrq-t and all of them are in grow_rqbd. The condition to enter there is racy: the number of posted rqbds < nbuf_group/2. We have a kdump log, but it still needs to be transported. |
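The racy entry condition described above can be illustrated with a small deterministic model. This is only a sketch of a generic check-then-act race, not Lustre code: the function `simulate`, its parameters, and the `locked` variant are hypothetical names introduced for illustration, and the serialized variant is one common way to close such a race, not necessarily what the patch for this ticket does.

```python
def simulate(nthreads, nposted, nbuf_group, locked):
    """Model service threads deciding whether to grow request buffers.

    Hypothetical model: each thread grows the pool by one group of
    nbuf_group buffers when it sees nposted < nbuf_group // 2.
    Returns (number of groups allocated, final posted count).
    """
    allocations = 0
    if locked:
        # Check and grow happen atomically, one thread at a time,
        # so later threads see the buffers posted by earlier ones.
        for _ in range(nthreads):
            if nposted < nbuf_group // 2:
                nposted += nbuf_group
                allocations += 1
    else:
        # Racy case: every thread evaluates the condition before any
        # thread has posted its new buffers, so all of them decide to grow.
        decisions = [nposted < nbuf_group // 2 for _ in range(nthreads)]
        for grow in decisions:
            if grow:
                nposted += nbuf_group
                allocations += 1
    return allocations, nposted


if __name__ == "__main__":
    print(simulate(8, 10, 64, locked=False))  # all 8 threads allocate
    print(simulate(8, 10, 64, locked=True))   # only the first thread allocates
```

In the racy case every thread that observes the low count performs its own allocation, which is consistent with many mdt_xx threads being busy in the grow path at once.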
| Comments |
| Comment by Oleg Drokin [ 14/Mar/12 ] |
|
Now that I have them, here are the traces from the first occurrence (all mdts, as far as the eye can see, are stuck in this way):

mdt_10 R running task 0 28386 2 0x00000080
 ffff8806b38d3b70 0000000000000046 0000000000000002 ffff880000069b00
 ffff8806b38d1500 00000000000000d2 ffff8806b38d3c00 ffffffff811238a1
 ffff8806b38d1ab8 ffff8806b38d3fd8 000000000000f4e8 ffff8806b38d1ac0
Call Trace:
 [<ffffffff811238a1>] ? __alloc_pages_nodemask+0x111/0x940
 [<ffffffff8127150d>] ? rb_insert_color+0x9d/0x160
 [<ffffffff81149bff>] ? __vmalloc_area_node+0x5f/0x190
 [<ffffffff81061a5a>] __cond_resched+0x2a/0x40
 [<ffffffff814ed750>] _cond_resched+0x30/0x40
 [<ffffffff8115f658>] kmem_cache_alloc_node_notrace+0xa8/0x130
 [<ffffffff8115f85b>] __kmalloc_node+0x7b/0x100
 [<ffffffffa05c0a7e>] ? cfs_alloc_large+0xe/0x10 [libcfs]
 [<ffffffff81149bff>] __vmalloc_area_node+0x5f/0x190
 [<ffffffffa05c0a7e>] ? cfs_alloc_large+0xe/0x10 [libcfs]
 [<ffffffff81149b92>] __vmalloc_node+0xa2/0xb0
 [<ffffffffa05c0a7e>] ? cfs_alloc_large+0xe/0x10 [libcfs]
 [<ffffffff81149f7c>] vmalloc+0x2c/0x30
 [<ffffffffa05c0a7e>] cfs_alloc_large+0xe/0x10 [libcfs]
 [<ffffffffa07e4325>] ptlrpc_alloc_rqbd+0x105/0x560 [ptlrpc]
 [<ffffffffa07e0460>] ? ptlrpc_server_post_idle_rqbds+0x40/0xf0 [ptlrpc]
 [<ffffffffa07e47e9>] ptlrpc_grow_req_bufs+0x69/0x170 [ptlrpc]
 [<ffffffffa07e7988>] ptlrpc_main+0xc18/0x1a60 [ptlrpc]
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffffa07e6d70>] ? ptlrpc_main+0x0/0x1a60 [ptlrpc]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa07e6d70>] ? ptlrpc_main+0x0/0x1a60 [ptlrpc]
 [<ffffffffa07e6d70>] ? ptlrpc_main+0x0/0x1a60 [ptlrpc]

This situation continues for 25-35 minutes, then the load dies down and never repeats. I cannot decode the kdump at the moment because I have not yet received the debug kernel from ORNL; as soon as I get it I'll poke inside (I have the kdump itself here now). |
| Comment by Liang Zhen (Inactive) [ 14/Mar/12 ] |
|
I've posted a patch here for review: http://review.whamcloud.com/#change,2308 |
| Comment by Peter Jones [ 14/Mar/12 ] |
|
Not a blocker for RC1; will include if we have an RC2. |
| Comment by Ian Colle (Inactive) [ 14/Mar/12 ] |
|
Understood |
| Comment by Peter Jones [ 14/Mar/12 ] |
|
Ian, understood that it would be a blocker for ORNL and several other very large sites. If any of these sites were to run in production on 2.2 RC1, they would certainly want to include this fix. Peter |
| Comment by Build Master (Inactive) [ 16/Mar/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Christopher Morrone [ 16/Mar/12 ] |
|
Needed in 2.1 as well, I should think. |
| Comment by Build Master (Inactive) [ 20/Mar/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 26/Mar/12 ] |
|
Landed for 2.1.2 and 2.2 |
| Comment by Peter Jones [ 26/Mar/12 ] |
|
Sorry, not landed for 2.1.2 yet |
| Comment by Build Master (Inactive) [ 08/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
|