[LU-1548] Test failure on test suite mds-survey, subtest test_1 Created: 21/Jun/12 Updated: 13/Dec/16 Resolved: 09/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 4490 |
| Description |
|
This issue was created by maloo for Li Wei <liwei@whamcloud.com>.
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e80242ee-bb86-11e1-95bf-52540035b04c.
The sub-test test_1 failed with the following error:
Info required for matching: mds-survey 1
00:18:25:Lustre: DEBUG MARKER: == mds-survey test 1: Metadata survey with zero-stripe =============================================== 00:18:24 (1340263104)
00:18:28:Lustre: Echo OBD driver; http://www.lustre.org/
00:18:29:LustreError: 11224:0:(echo_client.c:1590:echo_md_lookup()) lookup tests: rc = -2
00:18:29:LustreError: 11224:0:(echo_client.c:1787:echo_md_destroy_internal()) Can't find child tests: rc = -2
00:18:30:LustreError: 11246:0:(echo_client.c:1590:echo_md_lookup()) lookup tests1: rc = -2
00:18:30:LustreError: 11246:0:(echo_client.c:1787:echo_md_destroy_internal()) Can't find child tests1: rc = -2
00:18:31:Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:0
00:26:34:vmstat invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
00:26:34:vmstat cpuset=/ mems_allowed=0
00:26:34:Pid: 11134, comm: vmstat Not tainted 2.6.32-220.17.1.el6_lustre.x86_64 #1
00:26:34:Call Trace:
00:26:34: [<ffffffff810c2f21>] ? cpuset_print_task_mems_allowed+0x91/0xb0
00:26:34: [<ffffffff81113c80>] ? dump_header+0x90/0x1b0
00:26:34: [<ffffffff8120dcbc>] ? security_real_capable_noaudit+0x3c/0x70
00:26:34: [<ffffffff8111410a>] ? oom_kill_process+0x8a/0x2c0
00:26:34: [<ffffffff81114041>] ? select_bad_process+0xe1/0x120
00:26:34: [<ffffffff81114560>] ? out_of_memory+0x220/0x3c0
00:26:34: [<ffffffff8112427e>] ? __alloc_pages_nodemask+0x89e/0x940
00:26:34: [<ffffffff8115857a>] ? alloc_pages_current+0xaa/0x110
00:26:34: [<ffffffff811110a7>] ? __page_cache_alloc+0x87/0x90
00:26:34: [<ffffffff81126ccb>] ? __do_page_cache_readahead+0xdb/0x210
00:26:34: [<ffffffff81126e21>] ? ra_submit+0x21/0x30
00:26:34: [<ffffffff81112373>] ? filemap_fault+0x4c3/0x500
00:26:34: [<ffffffff8113b514>] ? __do_fault+0x54/0x510
00:26:34: [<ffffffff8113bac7>] ? handle_pte_fault+0xf7/0xb50
00:26:34: [<ffffffff8113c704>] ? handle_mm_fault+0x1e4/0x2b0
00:26:35: [<ffffffff81042c29>] ? __do_page_fault+0x139/0x480
00:26:35: [<ffffffff811daf35>] ? pde_users_dec+0x25/0x60
00:26:35: [<ffffffff814f2e3e>] ? do_page_fault+0x3e/0xa0
00:26:35: [<ffffffff814f01f5>] ? page_fault+0x25/0x30
00:26:35:Mem-Info:
00:26:35:Node 0 DMA per-cpu:
00:26:35:CPU 0: hi: 0, btch: 1 usd: 0
00:26:35:Node 0 DMA32 per-cpu:
00:26:35:CPU 0: hi: 186, btch: 31 usd: 51
00:26:40:active_anon:34 inactive_anon:34 isolated_anon:0
00:26:40: active_file:68408 inactive_file:44890 isolated_file:0
00:26:40: unevictable:0 dirty:0 writeback:0 unstable:0
00:26:40: free:13270 slab_reclaimable:330063 slab_unreclaimable:11604
00:26:40: mapped:2 shmem:35 pagetables:1018 bounce:0
00:26:40:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:1112kB inactive_file:1144kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4580kB slab_unreclaimable:448kB kernel_stack:96kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:20898 all_unreclaimable? yes
00:26:41:lowmem_reserve[]: 0 2003 2003 2003
00:26:41:Node 0 DMA32 free:44728kB min:44720kB low:55900kB high:67080kB active_anon:136kB inactive_anon:136kB active_file:272520kB inactive_file:178416kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:140kB slab_reclaimable:1315672kB slab_unreclaimable:45968kB kernel_stack:1352kB pagetables:4072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:4804688 all_unreclaimable? yes
00:26:41:lowmem_reserve[]: 0 0 0 0
00:26:41:Node 0 DMA: 58*4kB 83*8kB 52*16kB 21*32kB 9*64kB 4*128kB 1*256kB 3*512kB 1*1024kB 1*2048kB 0*4096kB = 8352kB
00:26:41:Node 0 DMA32: 8406*4kB 862*8kB 5*16kB 3*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 44728kB
00:26:41:86630 total pagecache pages
00:26:41:17 pages in swap cache
00:26:41:Swap cache stats: add 6145, delete 6128, find 161/219
00:26:41:Free swap = 4106228kB
00:26:41:Total swap = 4128760kB
00:26:41:524284 pages RAM
00:26:41:10578 pages reserved
00:26:41:86646 pages shared
00:26:41:408800 pages non-shared
00:26:42:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
00:26:42:[ 464] 0 464 2662 0 0 -17 -1000 udevd
00:26:42:[ 791] 0 791 2727 0 0 -17 -1000 udevd
00:26:42:[ 792] 0 792 2727 0 0 -17 -1000 udevd
00:26:42:[ 1151] 0 1151 23295 1 0 -17 -1000 auditd
00:26:42:[ 1167] 0 1167 62170 1 0 0 0 rsyslogd
00:26:43:[ 1209] 32 1209 4739 1 0 0 0 rpcbind
00:26:43:[ 1227] 29 1227 5783 1 0 0 0 rpc.statd
00:26:43:[ 1259] 0 1259 7360 1 0 0 0 rpc.idmapd
00:26:43:[ 1291] 81 1291 5859 1 0 0 0 dbus-daemon
00:26:43:[ 1315] 0 1315 1016 0 0 0 0 acpid
00:26:43:[ 1324] 68 1324 6756 2 0 0 0 hald
00:26:43:[ 1325] 0 1325 4522 1 0 0 0 hald-runner
00:26:43:[ 1353] 0 1353 5051 1 0 0 0 hald-addon-inpu
00:26:48:[ 1363] 68 1363 4447 1 0 0 0 hald-addon-acpi
00:26:48:[ 1382] 0 1382 167501 1 0 0 0 automount
00:26:49:[ 1424] 0 1424 26823 0 0 0 0 rpc.rquotad
00:26:49:[ 1427] 0 1427 5349 0 0 0 0 rpc.mountd
00:26:49:[ 1466] 0 1466 44091 1 0 0 0 sssd
00:26:49:[ 1470] 0 1470 49941 1 0 0 0 sssd_be
00:26:49:[ 1476] 0 1476 43291 1 0 0 0 sssd_nss
00:26:49:[ 1477] 0 1477 42811 1 0 0 0 sssd_pam
00:26:49:[ 1480] 0 1480 16007 0 0 0 0 sshd
00:26:49:[ 1488] 0 1488 5519 1 0 0 0 xinetd
00:26:49:[ 1496] 38 1496 6999 1 0 0 0 ntpd
00:26:49:[ 1512] 0 1512 22177 0 0 0 0 sendmail
00:26:49:[ 1520] 51 1520 19521 0 0 0 0 sendmail
00:26:49:[ 1542] 0 1542 2287 1 0 0 0 abrt-dump-oops
00:26:53:[ 1550] 0 1550 29295 1 0 0 0 crond
00:26:53:[ 1561] 0 1561 5356 0 0 0 0 atd
00:26:53:[ 1586] 0 1586 1015 1 0 0 0 agetty
00:26:53:[ 1588] 0 1588 1012 1 0 0 0 mingetty
00:26:53:[ 1590] 0 1590 1012 1 0 0 0 mingetty
00:26:53:[ 1592] 0 1592 1012 1 0 0 0 mingetty
00:26:53:[ 1594] 0 1594 1012 1 0 0 0 mingetty
00:26:53:[ 1596] 0 1596 1012 1 0 0 0 mingetty
00:26:53:[ 1598] 0 1598 1012 1 0 0 0 mingetty
00:26:53:[11093] 0 11093 25480 1 0 0 0 sshd
00:26:53:[11097] 0 11097 2304 1 0 0 0 bash
00:26:53:[11134] 0 11134 1528 7 0 0 0 vmstat
00:26:53:[11644] 0 11644 25480 1 0 0 0 sshd
00:26:54:[11646] 0 11646 2304 1 0 0 0 bash
00:26:55:[11665] 0 11665 2305 1 0 0 0 bash
00:26:55:[11666] 0 11666 3934 2 0 0 0 lctl
00:26:55:[11667] 0 11667 3934 2 0 0 0 lctl
00:26:55:[11668] 0 11668 3934 2 0 0 0 lctl
00:26:55:[11669] 0 11669 3934 10 0 0 0 lctl
00:26:55:[11670] 0 11670 3934 11 0 0 0 lctl
00:26:55:[11671] 0 11671 3934 11 0 0 0 lctl
00:26:55:[11672] 0 11672 3934 2 0 0 0 lctl
00:26:55:[11673] 0 11673 3934 2 0 0 0 lctl
00:26:55:[11674] 0 11674 3934 2 0 0 0 lctl
00:26:55:Out of memory: Kill process 1167 (rsyslogd) score 1 or sacrifice child [...] |
| Comments |
| Comment by Peter Jones [ 23/Jul/12 ] |
|
Minh, can you explain this failure? Peter |
| Comment by Minh Diep [ 24/Jul/12 ] |
|
This is due to an out-of-memory condition on the VM. |
| Comment by Minh Diep [ 25/Jul/12 ] |
|
I found a similar setup that passed: https://maloo.whamcloud.com/test_sessions/a8af3fc4-d248-11e1-8823-52540035b04c This OOM has only occurred once. I am adding Wangdi for further comment. I don't think this is a blocker. |
| Comment by Di Wang [ 25/Jul/12 ] |
|
Basically, I would not run the mds-survey tests on a virtual machine with 2 GB of memory, since the test itself needs a lot of memory. So I agree with Minh that this is not a blocker. Do we also use a virtual disk here, or a real disk? According to the bug itself, it seems the echo threads were stuck in these two places:
1.
2. 01:01:18: [<ffffffff8115eb08>] ? ____cache_alloc_node+0x108/0x160
As for 1, in osd_key_init a few allocations are indeed unnecessary for the echo thread; I can remove those. But to be honest, I am not sure whether this fix is necessary, because we are running this test in an environment that we do not recommend at all. |
| Comment by Peter Jones [ 30/Jul/12 ] |
|
Chris, can the default memory allocated to running mds-survey be increased to something more realistic? Peter |
| Comment by Andreas Dilger [ 06/Aug/12 ] |
|
Di, can you explain the specific details of why mds_survey shouldn't be able to run with 2GB of RAM? That is a LOT of RAM for the purposes of testing. While it may cause slowdowns and inaccurate performance results for mds_survey, since this is being run in a VM it will not produce accurate performance results anyway. What IS important is that the test actually run and pass, and that we figure out why there is so much memory used, and what can be done to reduce this memory requirement.

I suspect that we will have similar memory constraints on larger systems with more service threads and more clients, so we can't always just increase the memory usage of Lustre. We also need to look at where there is too much memory being allocated and/or pinned and reduce it. That will also improve the MDS performance in real production usage, since the MDS could then handle more requests from the same sized cache.

Also, it may be that this is exposing a deadlock, where a thread is blocked in allocation while holding a lock that another thread needs in order to make progress. |
| Comment by Di Wang [ 06/Aug/12 ] |
|
The major problem is in osd_key_init, which tries to allocate 1 + 1/4 pages each time:
static void *osd_key_init(const struct lu_context *ctx,
................
#define OSD_IT_EA_BUFSIZE (CFS_PAGE_SIZE + CFS_PAGE_SIZE/4)
In the real MDS stack this is fine, since these thread keys are allocated only once, during initialization. But for mds_survey we may call it many times, which fragments memory and then makes it harder to satisfy this larger-than-one-page allocation. Actually, this memory is only used by the index iterate method (readdir etc.), and we probably should not allocate it at all for mds_survey. I will cook a patch here. According to the trace, another place that slows things down is in jbd (see trace), though I am not sure whether it is related to the above problem. |
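The deferred-allocation idea described above could look roughly like the following. This is a minimal kernel-style C sketch, not the actual Lustre patch: the signatures are simplified, and osd_key_init_sketch/osd_it_ea_buf_get are illustrative names (the real code lives in the osd-ldiskfs layer).

/* Sketch: allocate the OSD_IT_EA_BUFSIZE iterator buffer lazily, on
 * first use in the readdir/iterate path, instead of unconditionally
 * in the thread-key constructor.  Short-lived echo-client threads
 * then never pay for (or fragment memory with) the > 1 page buffer. */
#include <linux/slab.h>

#define OSD_IT_EA_BUFSIZE (PAGE_SIZE + PAGE_SIZE / 4)

struct osd_thread_info {
        char *oti_it_ea_buf;    /* stays NULL until the iterator needs it */
};

static void *osd_key_init_sketch(void)
{
        /* No large allocation here any more. */
        return kzalloc(sizeof(struct osd_thread_info), GFP_NOFS);
}

static int osd_it_ea_buf_get(struct osd_thread_info *info)
{
        if (info->oti_it_ea_buf == NULL) {
                info->oti_it_ea_buf = kmalloc(OSD_IT_EA_BUFSIZE, GFP_NOFS);
                if (info->oti_it_ea_buf == NULL)
                        return -ENOMEM;
        }
        return 0;
}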
| Comment by Peter Jones [ 07/Aug/12 ] |
|
Reassigning to Di, who is working on a patch. |
| Comment by Di Wang [ 07/Aug/12 ] |
|
I investigated the problem a bit further and found that ldiskfs_inode_cache is the biggest memory consumer here (cat /proc/slabinfo):
......
There are 570635 ldiskfs_inode objects, each one 1416 bytes, and each page holds 2 ldiskfs_inode, so 570635 / 2 = 285318 pages = 1114 MB.
Looking at the test log:
file_count=150000 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=0 rslt_loc=/tmp targets="client-14vm3:lustre-MDT0000" /usr/bin/mds-survey
It tries to create 150000 * 8 = 1.2M inodes, and it seems ldiskfs is trying to hold all of these inodes in memory; is this expected? Also, file_count should probably be the total number of files to be created, rather than files per thread as in the current implementation. |
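For reference, the arithmetic above checks out. A tiny standalone C sketch (not part of any Lustre tool; the numbers are hard-coded from the slabinfo output quoted in this comment) reproduces it:

#include <stdio.h>

int main(void)
{
        long objs = 570635;      /* active ldiskfs_inode_cache objects */
        long obj_size = 1416;    /* bytes per object (with SLAB_DEBUG, see below) */
        long page_size = 4096;
        long objs_per_page = page_size / obj_size;              /* = 2 */
        long pages = (objs + objs_per_page - 1) / objs_per_page;
        printf("%ld pages ~= %ld MB\n",
               pages, pages * page_size / (1024 * 1024));       /* ~1114 MB */
        return 0;
}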
| Comment by Andreas Dilger [ 07/Aug/12 ] |
|
I see patch http://review.whamcloud.com/3549 has already been submitted to change the mds-survey script to use files_per_thread.

It is interesting to learn that struct ldiskfs_inode_info is using 1416 bytes, and only fitting 2 inodes per page. Reducing the size of ldiskfs_inode by 60 bytes, either by packing it better to avoid holes due to 64-bit alignment, and/or removing some unnecessary fields, or making a union of fields that are not used together, would allow 3 inodes to fit into a single 4096-byte page. That would increase the number of inodes that can fit into RAM by 50%, which would be a good benefit regardless of how much RAM is available.

The output of "pahole" (part of the dwarves package) shows ldiskfs_inode_info is only 1096 bytes, and in fact has 24 bytes of wasted space due to field alignment:
/* size: 1096, cachelines: 18, members: 38 */
/* sum members: 1072, holes: 7, sum holes: 24 */
/* last cacheline: 8 bytes */
It looks trivial to move i_dir_start_lookup to follow i_block_group to remove 8 bytes of padding. Similarly, moving i_extra_isize after i_delalloc_reserved_flag would remove 8 more bytes of padding. However, this isn't enough in itself to remove 60 bytes from the struct, so it is worthwhile to understand why you see 1416 bytes for this struct compared to the 1096 bytes that I see. Best would be to run
pahole -E -r --class_name=ldiskfs_inode_info ldiskfs/ldiskfs/ldiskfs.ko
on your local system. I suspect there are more similar savings to be had, so better to fix the actual problem than to just hide it away during testing. |
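To illustrate the kind of alignment holes pahole reports, here is a toy C example; these are NOT the real ldiskfs_inode_info fields, just a minimal demonstration of the reordering technique Andreas describes.

#include <stdio.h>

/* On x86_64, each 4-byte field followed by an 8-byte-aligned field
 * leaves a 4-byte hole. */
struct with_holes {
        unsigned int i_block_group;             /* 4 bytes + 4-byte hole */
        void        *i_pointer_field;           /* 8 bytes, 8-byte aligned */
        unsigned int i_dir_start_lookup;        /* 4 bytes + 4-byte hole */
        long         i_long_field;              /* 8 bytes */
};

/* Pairing the two 4-byte fields, as suggested above, removes both holes. */
struct repacked {
        unsigned int i_block_group;
        unsigned int i_dir_start_lookup;
        void        *i_pointer_field;
        long         i_long_field;
};

int main(void)
{
        printf("with_holes: %zu bytes, repacked: %zu bytes\n",
               sizeof(struct with_holes), sizeof(struct repacked)); /* 32 vs 24 */
        return 0;
}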
| Comment by Di Wang [ 07/Aug/12 ] |
|
Hmm, I used the latest el6 kernel (linux-2.6.32-279.2.1) with current master. The pahole output (nice tool, btw!) shows that even if we remove those holes, we are still 7 bytes short. I enabled SLAB_DEBUG in my test environment, which seems to add another 24 bytes to each object, which is why it is 1416 here. |
| Comment by Di Wang [ 08/Aug/12 ] |
|
After some investigation, it seems i_htree_lock could be moved from ldiskfs_inode_info to iam_container, which will save us 48 bytes. Also, after repositioning some items inside ldiskfs, we can remove those holes. Then ldiskfs_inode_info comes to:
/* size: 1328, cachelines: 21, members: 37 */
/* padding: 4 */
/* last cacheline: 48 bytes */
http://review.whamcloud.com/3561 |
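The shape of that change, as a hedged user-space sketch: instead of embedding the htree lock in every cached inode, it lives once in the container that wraps the inode. The types below are illustrative stand-ins, not the real ldiskfs_inode_info/iam_container from the patch above, and glibc's pthread_rwlock_t merely stands in for the kernel lock (the ticket quotes 48 bytes for the latter).

#include <stdio.h>
#include <pthread.h>

struct inode_info_old {
        pthread_rwlock_t i_htree_lock;  /* embedded: paid by every cached inode */
        char i_other_fields[1024];      /* stand-in for the remaining fields */
};

struct inode_info_new {
        char i_other_fields[1024];      /* lock no longer embedded */
};

struct iam_container_sketch {
        struct inode_info_new *ic_object;       /* the inode this container wraps */
        pthread_rwlock_t       ic_htree_lock;   /* one lock per container instead */
};

int main(void)
{
        printf("per-inode saving: %zu bytes\n",
               sizeof(struct inode_info_old) - sizeof(struct inode_info_new));
        return 0;
}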
| Comment by James A Simmons [ 16/Aug/12 ] |
|
Someone forgot to add the patch to the SLES11 series file. I will submit a patch shortly. |
| Comment by James A Simmons [ 16/Aug/12 ] |
|
Patch for SLES11 SP1 is at http://review.whamcloud.com/3700 |
| Comment by Peter Jones [ 17/Aug/12 ] |
|
Landed for 2.3 |
| Comment by James A Simmons [ 17/Aug/12 ] |
|
Thanks. The patch for SLES11 was determined not to be needed. |
| Comment by Sarah Liu [ 13/Dec/12 ] |
|
Hit this issue again in the latest tag-2.3.57 RHEL6 testing: it may be related to |
| Comment by Jodi Levi (Inactive) [ 09/Jan/13 ] |
|
This landed in August, and Andreas believes it can be closed as fixed for now. |