[LU-1548] Test failure on test suite mds-survey, subtest test_1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.4
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • None
    • 3
    • 4490

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e80242ee-bb86-11e1-95bf-52540035b04c.

      The sub-test test_1 failed with the following error:

      mds-survey failed

      Info required for matching: mds-survey 1

      00:18:25:Lustre: DEBUG MARKER: == mds-survey test 1: Metadata survey with zero-stripe =============================================== 00:18:24 (1340263104)
      00:18:28:Lustre: Echo OBD driver; http://www.lustre.org/
      00:18:29:LustreError: 11224:0:(echo_client.c:1590:echo_md_lookup()) lookup tests: rc = -2
      00:18:29:LustreError: 11224:0:(echo_client.c:1787:echo_md_destroy_internal()) Can't find child tests: rc = -2
      00:18:30:LustreError: 11246:0:(echo_client.c:1590:echo_md_lookup()) lookup tests1: rc = -2
      00:18:30:LustreError: 11246:0:(echo_client.c:1787:echo_md_destroy_internal()) Can't find child tests1: rc = -2
      00:18:31:Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:0
      00:26:34:vmstat invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
      00:26:34:vmstat cpuset=/ mems_allowed=0
      00:26:34:Pid: 11134, comm: vmstat Not tainted 2.6.32-220.17.1.el6_lustre.x86_64 #1
      00:26:34:Call Trace:
      00:26:34: [<ffffffff810c2f21>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      00:26:34: [<ffffffff81113c80>] ? dump_header+0x90/0x1b0
      00:26:34: [<ffffffff8120dcbc>] ? security_real_capable_noaudit+0x3c/0x70
      00:26:34: [<ffffffff8111410a>] ? oom_kill_process+0x8a/0x2c0
      00:26:34: [<ffffffff81114041>] ? select_bad_process+0xe1/0x120
      00:26:34: [<ffffffff81114560>] ? out_of_memory+0x220/0x3c0
      00:26:34: [<ffffffff8112427e>] ? __alloc_pages_nodemask+0x89e/0x940
      00:26:34: [<ffffffff8115857a>] ? alloc_pages_current+0xaa/0x110
      00:26:34: [<ffffffff811110a7>] ? __page_cache_alloc+0x87/0x90
      00:26:34: [<ffffffff81126ccb>] ? __do_page_cache_readahead+0xdb/0x210
      00:26:34: [<ffffffff81126e21>] ? ra_submit+0x21/0x30
      00:26:34: [<ffffffff81112373>] ? filemap_fault+0x4c3/0x500
      00:26:34: [<ffffffff8113b514>] ? __do_fault+0x54/0x510
      00:26:34: [<ffffffff8113bac7>] ? handle_pte_fault+0xf7/0xb50
      00:26:34: [<ffffffff8113c704>] ? handle_mm_fault+0x1e4/0x2b0
      00:26:35: [<ffffffff81042c29>] ? __do_page_fault+0x139/0x480
      00:26:35: [<ffffffff811daf35>] ? pde_users_dec+0x25/0x60
      00:26:35: [<ffffffff814f2e3e>] ? do_page_fault+0x3e/0xa0
      00:26:35: [<ffffffff814f01f5>] ? page_fault+0x25/0x30
      00:26:35:Mem-Info:
      00:26:35:Node 0 DMA per-cpu:
      00:26:35:CPU    0: hi:    0, btch:   1 usd:   0
      00:26:35:Node 0 DMA32 per-cpu:
      00:26:35:CPU    0: hi:  186, btch:  31 usd:  51
      00:26:40:active_anon:34 inactive_anon:34 isolated_anon:0
      00:26:40: active_file:68408 inactive_file:44890 isolated_file:0
      00:26:40: unevictable:0 dirty:0 writeback:0 unstable:0
      00:26:40: free:13270 slab_reclaimable:330063 slab_unreclaimable:11604
      00:26:40: mapped:2 shmem:35 pagetables:1018 bounce:0
      00:26:40:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:1112kB inactive_file:1144kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4580kB slab_unreclaimable:448kB kernel_stack:96kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:20898 all_unreclaimable? yes
      00:26:41:lowmem_reserve[]: 0 2003 2003 2003
      00:26:41:Node 0 DMA32 free:44728kB min:44720kB low:55900kB high:67080kB active_anon:136kB inactive_anon:136kB active_file:272520kB inactive_file:178416kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:140kB slab_reclaimable:1315672kB slab_unreclaimable:45968kB kernel_stack:1352kB pagetables:4072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:4804688 all_unreclaimable? yes
      00:26:41:lowmem_reserve[]: 0 0 0 0
      00:26:41:Node 0 DMA: 58*4kB 83*8kB 52*16kB 21*32kB 9*64kB 4*128kB 1*256kB 3*512kB 1*1024kB 1*2048kB 0*4096kB = 8352kB
      00:26:41:Node 0 DMA32: 8406*4kB 862*8kB 5*16kB 3*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 44728kB
      00:26:41:86630 total pagecache pages
      00:26:41:17 pages in swap cache
      00:26:41:Swap cache stats: add 6145, delete 6128, find 161/219
      00:26:41:Free swap  = 4106228kB
      00:26:41:Total swap = 4128760kB
      00:26:41:524284 pages RAM
      00:26:41:10578 pages reserved
      00:26:41:86646 pages shared
      00:26:41:408800 pages non-shared
      00:26:42:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      00:26:42:[  464]     0   464     2662        0   0     -17         -1000 udevd
      00:26:42:[  791]     0   791     2727        0   0     -17         -1000 udevd
      00:26:42:[  792]     0   792     2727        0   0     -17         -1000 udevd
      00:26:42:[ 1151]     0  1151    23295        1   0     -17         -1000 auditd
      00:26:42:[ 1167]     0  1167    62170        1   0       0             0 rsyslogd
      00:26:43:[ 1209]    32  1209     4739        1   0       0             0 rpcbind
      00:26:43:[ 1227]    29  1227     5783        1   0       0             0 rpc.statd
      00:26:43:[ 1259]     0  1259     7360        1   0       0             0 rpc.idmapd
      00:26:43:[ 1291]    81  1291     5859        1   0       0             0 dbus-daemon
      00:26:43:[ 1315]     0  1315     1016        0   0       0             0 acpid
      00:26:43:[ 1324]    68  1324     6756        2   0       0             0 hald
      00:26:43:[ 1325]     0  1325     4522        1   0       0             0 hald-runner
      00:26:43:[ 1353]     0  1353     5051        1   0       0             0 hald-addon-inpu
      00:26:48:[ 1363]    68  1363     4447        1   0       0             0 hald-addon-acpi
      00:26:48:[ 1382]     0  1382   167501        1   0       0             0 automount
      00:26:49:[ 1424]     0  1424    26823        0   0       0             0 rpc.rquotad
      00:26:49:[ 1427]     0  1427     5349        0   0       0             0 rpc.mountd
      00:26:49:[ 1466]     0  1466    44091        1   0       0             0 sssd
      00:26:49:[ 1470]     0  1470    49941        1   0       0             0 sssd_be
      00:26:49:[ 1476]     0  1476    43291        1   0       0             0 sssd_nss
      00:26:49:[ 1477]     0  1477    42811        1   0       0             0 sssd_pam
      00:26:49:[ 1480]     0  1480    16007        0   0       0             0 sshd
      00:26:49:[ 1488]     0  1488     5519        1   0       0             0 xinetd
      00:26:49:[ 1496]    38  1496     6999        1   0       0             0 ntpd
      00:26:49:[ 1512]     0  1512    22177        0   0       0             0 sendmail
      00:26:49:[ 1520]    51  1520    19521        0   0       0             0 sendmail
      00:26:49:[ 1542]     0  1542     2287        1   0       0             0 abrt-dump-oops
      00:26:53:[ 1550]     0  1550    29295        1   0       0             0 crond
      00:26:53:[ 1561]     0  1561     5356        0   0       0             0 atd
      00:26:53:[ 1586]     0  1586     1015        1   0       0             0 agetty
      00:26:53:[ 1588]     0  1588     1012        1   0       0             0 mingetty
      00:26:53:[ 1590]     0  1590     1012        1   0       0             0 mingetty
      00:26:53:[ 1592]     0  1592     1012        1   0       0             0 mingetty
      00:26:53:[ 1594]     0  1594     1012        1   0       0             0 mingetty
      00:26:53:[ 1596]     0  1596     1012        1   0       0             0 mingetty
      00:26:53:[ 1598]     0  1598     1012        1   0       0             0 mingetty
      00:26:53:[11093]     0 11093    25480        1   0       0             0 sshd
      00:26:53:[11097]     0 11097     2304        1   0       0             0 bash
      00:26:53:[11134]     0 11134     1528        7   0       0             0 vmstat
      00:26:53:[11644]     0 11644    25480        1   0       0             0 sshd
      00:26:54:[11646]     0 11646     2304        1   0       0             0 bash
      00:26:55:[11665]     0 11665     2305        1   0       0             0 bash
      00:26:55:[11666]     0 11666     3934        2   0       0             0 lctl
      00:26:55:[11667]     0 11667     3934        2   0       0             0 lctl
      00:26:55:[11668]     0 11668     3934        2   0       0             0 lctl
      00:26:55:[11669]     0 11669     3934       10   0       0             0 lctl
      00:26:55:[11670]     0 11670     3934       11   0       0             0 lctl
      00:26:55:[11671]     0 11671     3934       11   0       0             0 lctl
      00:26:55:[11672]     0 11672     3934        2   0       0             0 lctl
      00:26:55:[11673]     0 11673     3934        2   0       0             0 lctl
      00:26:55:[11674]     0 11674     3934        2   0       0             0 lctl
      00:26:55:Out of memory: Kill process 1167 (rsyslogd) score 1 or sacrifice child
      [...]
      

      Attachments

        Issue Links

          Activity

            [LU-1548] Test failure on test suite mds-survey, subtest test_1

            jlevi Jodi Levi (Inactive) added a comment -

            This landed in August, and Andreas believes it can be closed as fixed for now.
            The last problem was not related to this bug.

            sarah Sarah Liu added a comment - edited

            Hit this issue again in the latest tag-2.3.57 RHEL6 testing:
            https://maloo.whamcloud.com/test_sets/50b7d5b2-44b0-11e2-8c8b-52540035b04c

            It may be related to LU-2468.

            simmonsja James A Simmons added a comment -

            Thanks. The patch for SLES11 was determined not to be needed.

            pjones Peter Jones added a comment -

            Landed for 2.3

            simmonsja James A Simmons added a comment - edited

            Patch for SLES11 SP1 is at http://review.whamcloud.com/3700

            simmonsja James A Simmons added a comment -

            Someone forgot to add the patch to the SLES11 series file. I will submit a patch shortly.

            di.wang Di Wang added a comment - edited

            After some investigation, it seems i_htree_lock could be moved from ldiskfs_inode_info to iam_container, which would save us 48 bytes. Also, after repositioning some members inside ldiskfs, we can remove those holes. Then ldiskfs_inode_info comes to:

                    /* size: 1328, cachelines: 21, members: 37 */
                    /* padding: 4 */
                    /* last cacheline: 48 bytes */
            

            http://review.whamcloud.com/3561
            http://review.whamcloud.com/3562
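
            For illustration only, a rough sketch of the lock move described above; the struct and field names here are stand-ins based on this comment, not the actual Lustre/ldiskfs definitions:

                #include <stdint.h>

                /* Stand-in for the embedded lock; the 48-byte size is taken
                 * from the comment above, not from the real definition. */
                struct htree_lock_stub { unsigned char opaque[48]; };

                /* Before: every cached inode carries its own embedded lock. */
                struct inode_info_before {
                        uint64_t               other_fields[4]; /* placeholder   */
                        struct htree_lock_stub i_htree_lock;    /* 48 bytes each */
                };

                /* After: the lock lives once in the shared container, so each
                 * per-inode structure shrinks by 48 bytes. */
                struct iam_container_stub {
                        struct htree_lock_stub ic_htree_lock;
                        uint64_t               other_fields[4]; /* placeholder   */
                };

                struct inode_info_after {
                        uint64_t               other_fields[4]; /* placeholder   */
                };

                _Static_assert(sizeof(struct inode_info_before) -
                               sizeof(struct inode_info_after) == 48,
                               "moving the lock saves 48 bytes per inode");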

            di.wang Di Wang added a comment -

            Hmm, I used the latest el6 kernel (linux-2.6.32-279.2.1) with current master. Here is the pahole output (nice tool, by the way):

                    /* size: 1392, cachelines: 22, members: 38 */
                    /* sum members: 1372, holes: 5, sum holes: 20 */
                    /* last cacheline: 48 bytes */

            Even if we remove those holes, we are still 7 bytes short (fitting three inodes per page needs 4096 / 3 ≈ 1365 bytes per inode, while removing all the holes leaves 1392 - 20 = 1372 bytes). I enabled SLAB_DEBUG in my test environment, which seems to add another 24 bytes to each object, so the object size is 1416 here.

            adilger Andreas Dilger added a comment -

            I see patch http://review.whamcloud.com/3549 has already been submitted to change the mds-survey script to use files_per_thread.

            It is interesting to learn that struct ldiskfs_inode_info is using 1416 bytes, so only 2 inodes fit per page. Reducing the size of ldiskfs_inode by 60 bytes, whether by packing it better to avoid holes due to 64-bit alignment, removing some unnecessary fields, or making a union of fields that are not used together, would allow 3 inodes to fit into a single 4096-byte page. That would increase the number of inodes that can fit into RAM by 50%, which would be a good benefit regardless of how much RAM is available.

            The output of "pahole" (part of the dwarves package) shows ldiskfs_inode_info is only 1096 bytes, and in fact has 24 bytes of wasted space due to field alignment:

                    /* size: 1096, cachelines: 18, members: 38 */
                    /* sum members: 1072, holes: 7, sum holes: 24 */
                    /* last cacheline: 8 bytes */
            

            It looks trivial to move i_dir_start_lookup to follow i_block_group to remove 8 bytes of padding. Similarly, moving i_extra_isize after i_delalloc_reserved_flag would remove 8 more bytes of padding. However, this isn't enough in itself to remove 60 bytes from the struct, so it is worthwhile to understand why you see 1416 bytes for this struct compared to the 1096 bytes that I see. Best would be to run pahole -E -r --class_name=ldiskfs_inode_info ldiskfs/ldiskfs/ldiskfs.ko on your local system.
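
            Here is a minimal, self-contained sketch of that kind of reordering; the structs below are simplified stand-ins on a typical LP64 ABI, not the real ldiskfs_inode_info layout:

                #include <stdint.h>

                /* Simplified stand-ins that only show how reordering members
                 * removes alignment holes; not the real ldiskfs_inode_info. */
                struct unpacked {
                        uint32_t i_block_group;      /* 4 bytes                 */
                        /* 4-byte hole: next member needs 8-byte alignment      */
                        uint64_t i_file_acl;         /* 8 bytes                 */
                        uint32_t i_dir_start_lookup; /* 4 bytes                 */
                        /* 4 bytes of tail padding for 8-byte struct alignment  */
                };                                   /* sizeof == 24            */

                struct packed {
                        uint32_t i_block_group;      /* 4 bytes                 */
                        uint32_t i_dir_start_lookup; /* 4 bytes, fills the hole */
                        uint64_t i_file_acl;         /* 8 bytes                 */
                };                                   /* sizeof == 16            */

                _Static_assert(sizeof(struct unpacked) == 24, "hole + padding");
                _Static_assert(sizeof(struct packed) == 16, "8 bytes saved");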

            I suspect there are more similar savings to be had, so it is better to fix the actual problem than to just hide it away during testing.

            di.wang Di Wang added a comment -

            I investigated the problem a bit further and found that ldiskfs_inode_cache is the biggest memory consumer here.

            cat /proc/slabinfo

            ......
            ldiskfs_inode_cache 570635 570636 1416 2 1 : tunables 24 12 8 : slabdata 285318 285318 0 : globalstat 1403791 600086 735202 76 0 18 66687 0 0 : cpustat 673887 796773 825049 74975

            There are 570635 ldiskfs_inode objects, each one 1416 bytes, and each page holds 2 of them, so 570635 / 2 = 285318 pages ≈ 1114 MB.
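
            Here is a tiny C sketch of that arithmetic; the counts are copied from the slabinfo line above (object size 1416, 2 objects per slab, 1 page per slab), and the 4096-byte page size is an assumption:

                #include <stdio.h>

                int main(void)
                {
                        /* Values copied from the ldiskfs_inode_cache line above. */
                        unsigned long num_objs      = 570635; /* active objects    */
                        unsigned long objs_per_slab = 2;      /* 2 inodes per slab */
                        unsigned long page_size     = 4096;   /* 1 page per slab   */

                        unsigned long pages = (num_objs + objs_per_slab - 1)
                                              / objs_per_slab;
                        printf("%lu pages, about %lu MB\n",
                               pages, pages * page_size >> 20);
                        return 0;
                }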

            And looking at the test log,

            file_count=150000 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=0 rslt_loc=/tmp targets="client-14vm3:lustre-MDT0000" /usr/bin/mds-survey

            It tries to create 150000 * 8 = 1.2M inodes, and it seems ldiskfs is trying to hold all of these inodes in memory; is this expected?

            Also, file_count should probably be the total number of files to be created, instead of files_per_thread as in the current implementation.


            People

              Assignee: di.wang Di Wang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8
