
LU-12505: mounting bigalloc enabled large OST takes a long time

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor

    Description

      Not only when mounting as a Lustre OST, but also when the OSS mounts a large OST device with 'bigalloc' enabled directly as ldiskfs, the mount takes a huge amount of time to complete.

      # time mount -t ldiskfs /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000
      
      real    12m32.153s
      user    0m0.000s
      sys     11m49.887s
      
      # dumpe2fs -h /dev/ddn/scratch0_ost0000
      dumpe2fs 1.45.2.wc1 (27-May-2019)
      Filesystem volume name:   scratch0-OST0000
      Last mounted on:          /
      Filesystem UUID:          1ca9dd81-8b70-4805-a430-78b0eafc1c45
      Filesystem magic number:  0xEF53
      Filesystem revision #:    1 (dynamic)
      Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery meta_bg extent 64bit mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota bigalloc
      Filesystem flags:         signed_directory_hash 
      Default mount options:    user_xattr acl
      Filesystem state:         clean
      Errors behavior:          Continue
      Filesystem OS type:       Linux
      Inode count:              1074397184
      Block count:              275045679104
      Reserved block count:     2750456791
      Free blocks:              274909403680
      Free inodes:              1074396851
      First block:              0
      Block size:               4096
      Cluster size:             131072
      Group descriptor size:    64
      Blocks per group:         1048576
      Clusters per group:       32768
      Inodes per group:         4096
      Inode blocks per group:   512
      RAID stride:              512
      RAID stripe width:        512
      Flex block group size:    256
      Filesystem created:       Mon Jul  1 00:43:14 2019
      Last mount time:          Wed Jul  3 05:55:22 2019
      Last write time:          Wed Jul  3 05:55:22 2019
      Mount count:              8
      Maximum mount count:      -1
      Last checked:             Mon Jul  1 00:43:14 2019
      Check interval:           0 (<none>)
      Lifetime writes:          2693 GB
      Reserved blocks uid:      0 (user root)
      Reserved blocks gid:      0 (group root)
      First inode:              11
      Inode size:               512
      Required extra isize:     32
      Desired extra isize:      32
      Journal inode:            8
      Default directory hash:   half_md4
      Directory Hash Seed:      4eeb2234-062d-4af5-8973-872baabd2e9f
      Journal backup:           inode blocks
      MMP block number:         131680
      MMP update interval:      5
      User quota inode:         3
      Group quota inode:        4
      Journal features:         journal_incompat_revoke journal_64bit
      Journal size:             4096M
      Journal length:           1048576
      Journal sequence:         0x00000494
      Journal start:            0
      MMP_block:
          mmp_magic: 0x4d4d50
          mmp_check_interval: 10
          mmp_sequence: 0x0000cd
          mmp_update_date: Wed Jul  3 06:00:33 2019
          mmp_update_time: 1562133633
          mmp_node_name: es18k-vm11
          mmp_device_name: sda
      

      Without bigalloc

      # time mount -t ldiskfs /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000
      
      real	0m6.484s
      user	0m0.000s
      sys	0m4.954s
      

      Attachments

        1. dumpe2fs.out.gz
          14.48 MB
          Shuichi Ihara


          Activity


            Patch was landed upstream for 1.46 via commit 59037c5357d39c6d0f14a0aff70e67dc13eafc84

            adilger Andreas Dilger added a comment

            To answer my own question, the bigalloc patches are on the master branch of the e2fsprogs repo, but not in the maint branch for 1.45.6.

            adilger Andreas Dilger added a comment

            Dongyang, have these patches been submitted upstream yet?

            adilger Andreas Dilger added a comment

            Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/35781
            Subject: LU-12505 mke2fs: set overhead in super block for bigalloc
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 8624a496ff7c3e4fd69fb7217ff56030111f4460

            gerrit Gerrit Updater added a comment

            >it looks ldiskfs_get_group_desc() and ldiskfs_calculate_overhead() are taking most of CPU cycle a long while during mount.

            The overhead only needs to be calculated once and stored in the super block for later use.

            >46.30% libext2fs.so.2.4 [.] rb_test_bmap
            >32.98% libext2fs.so.2.4 [.] ext2fs_test_generic_bmap

            It's a known problem: the bitmap implementation in e2fsprogs is not well designed for the case where a word has several bits set; replacing it with an IDR (from the kernel) could improve speed dramatically.

            shadow Alexey Lyashkov added a comment
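            The "compute once, cache in the superblock" idea from the comment above is sketched below in a minimal, self-contained form. The field and function names (s_overhead_clusters, sketch_*) are assumptions modelled loosely on ext4's precomputed-overhead handling, for illustration only; this is not the ldiskfs/ext4 code or the actual patch, and endianness conversion is omitted.

            /*
             * Minimal sketch (illustration only): record the filesystem
             * overhead once and cache it in the superblock, instead of
             * rescanning every block group on each mount.
             */
            #include <stdint.h>

            struct sketch_super_block {
                    uint32_t s_overhead_clusters;   /* 0 = not recorded at mkfs time */
            };

            struct sketch_sb_info {
                    struct sketch_super_block *s_es;
                    uint32_t s_overhead;            /* cached overhead, in clusters */
            };

            /* Stand-in for the expensive path that walks every block group;
             * the real per-group scan is what makes the bigalloc mount slow. */
            static uint32_t sketch_calculate_overhead(const struct sketch_sb_info *sbi)
            {
                    (void)sbi;
                    return 0;       /* placeholder */
            }

            /* Mount-time setup: fall back to the full scan only when the
             * superblock does not already carry a precomputed value. */
            void sketch_setup_overhead(struct sketch_sb_info *sbi)
            {
                    if (sbi->s_es->s_overhead_clusters)
                            sbi->s_overhead = sbi->s_es->s_overhead_clusters;
                    else
                            sbi->s_overhead = sketch_calculate_overhead(sbi);
            }

            With a value recorded at format time, the mount path never has to walk every block group, which is exactly the per-group work the profiles in this ticket show.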

            Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/35659
            Subject: LU-12505 libext2fs: optimize ext2fs_convert_subcluster_bitmap()
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 47d5bc9d922585229dfd5da82a1f19ff93bea28e

            gerrit Gerrit Updater added a comment

            It wouldn't be a bad idea to post an email to linux-ext4 with this information. Maybe we can get some input on how to fix it, or Ted will "just know" the best way to fix the problem.

            adilger Andreas Dilger added a comment

            Maybe it would be better to test with a newer kernel to see if the same behavior is reproduced?
            BTW, mke2fs on a bigalloc-enabled OST is also very slow.

            without bigalloc

            # time mkfs.lustre --ost --servicenode=127.0.0.2@tcp --fsname=scratch0 --index=2 --mgsnode=127.0.0.2@tcp --mkfsoptions='-E lazy_itable_init=0,lazy_journal_init=0,stripe_width=512,stride=512 -O meta_bg,^resize_inode -m1 -J size=4096' --reformat --backfstype=ldiskfs /dev/ddn/scratch0_ost0002
            # tune2fs -E mmp_update_interval=5 /dev/ddn/scratch0_ost0002
            
            real    9m11.614s
            user    0m59.894s
            sys     7m10.594s
            

            with bigalloc

            # time mkfs.lustre --ost --servicenode=127.0.0.2@tcp --fsname=scratch0 --index=0 --mgsnode=127.0.0.2@tcp --mkfsoptions='-E lazy_itable_init=0,lazy_journal_init=0,stripe_width=512,stride=512 -O bigalloc -C 131072 -m1 -J size=4096' --reformat --backfstype=ldiskfs /dev/ddn/scratch0_ost0000
            
            real    43m5.349s
            user    24m29.652s
            sys     18m35.058s
            

            Most of the CPU time is consumed in the following functions, which I did not see with mke2fs without '-O bigalloc'.

            Samples: 24K of event 'cycles', Event count (approx.): 14154870804              
            Overhead  Shared Object      Symbol                                             
              46.30%  libext2fs.so.2.4   [.] rb_test_bmap                                   
              32.98%  libext2fs.so.2.4   [.] ext2fs_test_generic_bmap                       
              13.10%  libext2fs.so.2.4   [.] ext2fs_convert_subcluster_bitmap               
               6.96%  libext2fs.so.2.4   [.] ext2fs_test_generic_bmap@plt       
            
            sihara Shuichi Ihara added a comment
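            For context on the hot spots in the profile above: ext2fs_convert_subcluster_bitmap() has to turn the per-block allocation bitmap into a per-cluster one, and with the rbtree-backed bitmaps in libext2fs, testing every single block bit (rb_test_bmap / ext2fs_test_generic_bmap) is expensive. The stand-alone sketch below models that conversion on plain bitmap words and skips all-zero words; it illustrates the general idea only and is not the libext2fs code or the patch referenced earlier.

            /*
             * Simplified model (illustration only): a cluster is "in use" if
             * any of its blocks is in use, so whole zero words of the block
             * bitmap can be skipped instead of testing every block bit.
             */
            #include <stdint.h>

            #define BITS_PER_WORD 64ULL

            /* Mark cluster 'c' in the per-cluster bitmap. */
            static void mark_cluster(uint64_t *cmap, uint64_t c)
            {
                    cmap[c / BITS_PER_WORD] |= 1ULL << (c % BITS_PER_WORD);
            }

            /*
             * bmap:  per-block bitmap, 'nblocks' bits packed into 64-bit words
             * cmap:  per-cluster bitmap (output), nblocks / ratio bits
             * ratio: blocks per cluster (32 here: 128 KiB cluster / 4 KiB block)
             */
            void convert_subcluster_bitmap(const uint64_t *bmap, uint64_t nblocks,
                                           uint64_t *cmap, unsigned int ratio)
            {
                    uint64_t nwords = (nblocks + BITS_PER_WORD - 1) / BITS_PER_WORD;

                    for (uint64_t w = 0; w < nwords; w++) {
                            uint64_t word = bmap[w];

                            if (!word)
                                    continue;       /* skip 64 free blocks at once */
                            for (unsigned int b = 0; b < BITS_PER_WORD; b++) {
                                    uint64_t blk = w * BITS_PER_WORD + b;

                                    if (blk >= nblocks)
                                            break;
                                    if (word & (1ULL << b))
                                            mark_cluster(cmap, blk / ratio);
                            }
                    }
            }

            On a freshly formatted OST most words are zero, so this kind of skipping removes the bulk of the per-bit work; a further win would be to jump straight to the next cluster boundary once a cluster has been marked.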
            adilger Andreas Dilger added a comment - edited

            It looks like the problem is in ext4_calculate_overhead() and count_overhead(): there is a simple calculation for normal filesystems, but a complex one that loads and checks every group in the bigalloc case, and ext4_calculate_overhead() calls count_overhead() for every group as well:

            static int count_overhead(struct super_block *sb, ext4_group_t grp,
                                      char *buf)
            {
                    if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_BIGALLOC))
                            return (ext4_bg_has_super(sb, grp) + ext4_bg_num_gdb(sb, grp) +
                                    sbi->s_itb_per_group + 2);
                    
                    first_block = le32_to_cpu(sbi->s_es->s_first_data_block) +
                            (grp * EXT4_BLOCKS_PER_GROUP(sb));
                    last_block = first_block + EXT4_BLOCKS_PER_GROUP(sb) - 1;
                    for (i = 0; i < ngroups; i++) {
                            gdp = ext4_get_group_desc(sb, i, NULL);
                            :
            
            int ext4_calculate_overhead(struct super_block *sb)
            {
                    /* Compute the overhead (FS structures).  This is constant
                     * for a given filesystem unless the number of block groups
                     * changes so we cache the previous value until it does. */
            
                    /* All of the blocks before first_data_block are overhead */
                    overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
            
                    /* Add the overhead found in each block group */
                    for (i = 0; i < ngroups; i++) {
                            blks = count_overhead(sb, i, buf);
                            overhead += blks;
            

            That means that for a 1024 TiB filesystem (num_groups = 1024 TiB / (32768 clusters/group * 128 KiB/cluster) = 256K groups) it will do 256K * 256K ≈ 68 billion group-descriptor checks, which would be very slow and pointless. I did read somewhere that mke2fs should store this overhead into the superblock at format time, so the kernel can avoid doing this pointless operation, but possibly that isn't in the kernel you are using, or it isn't working properly and nobody noticed for small filesystems?
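            A quick back-of-the-envelope check of those numbers, using the geometry from the dumpe2fs output in the description (4 KiB blocks, 128 KiB clusters, 32768 clusters per group); this is just a worked calculation, not code from the kernel or e2fsprogs:

            #include <stdio.h>
            #include <stdint.h>

            int main(void)
            {
                    uint64_t fs_bytes           = 1024ULL << 40;  /* 1024 TiB */
                    uint64_t cluster_bytes      = 131072;         /* bigalloc -C 131072 */
                    uint64_t clusters_per_group = 32768;

                    /* 32768 clusters * 128 KiB = 4 GiB per block group */
                    uint64_t group_bytes = clusters_per_group * cluster_bytes;
                    uint64_t ngroups     = fs_bytes / group_bytes;   /* 262144, i.e. 256K */
                    uint64_t checks      = ngroups * ngroups;        /* ~68.7 billion */

                    printf("groups: %llu, group-desc lookups: %llu\n",
                           (unsigned long long)ngroups, (unsigned long long)checks);
                    return 0;
            }

            At roughly 10 ns per group-descriptor lookup that comes to on the order of ten minutes of CPU time, in line with the 12-minute mount reported in the description.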

            sihara Shuichi Ihara added a comment - edited

            Uploaded dumpe2fs.out.gz. I've tested without 'meta_bg' before, but it was the same and took a long time.
            There was almost no disk I/O for most of the time; the mount was 100% CPU bound, as shown below.

            Tasks: 237 total,   2 running, 235 sleeping,   0 stopped,   0 zombie
            %Cpu(s):  0.4 us,  6.6 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
            KiB Mem : 15456899+total, 15270185+free,  1553196 used,   313944 buff/cache
            KiB Swap:  5472252 total,  5472252 free,        0 used. 15216664+avail Mem 
            
              PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                            
            13008 root      20   0   19940   1052    868 R 100.0  0.0   0:29.86 mount                                                                              
                1 root      20   0   44604   4908   2552 S   0.0  0.0   0:02.00 systemd         
            

            It looks like ldiskfs_get_group_desc() and ldiskfs_calculate_overhead() are consuming most of the CPU cycles for a long while during the mount.

            Samples: 108K of event 'cycles', Event count (approx.): 26372312997                                                                                     
            Overhead  Shared Object          Symbol                                                                                                                 
              52.20%  [kernel]               [k] ldiskfs_get_group_desc                                                                                             
              45.13%  [kernel]               [k] ldiskfs_calculate_overhead                                                                                         
               0.31%  [kernel]               [k] native_write_msr_safe                                                                                               
               0.23%  [kernel]               [k] crc16                                                                                                               
               0.21%  [kernel]               [k] apic_timer_interrupt                                                                                               
               0.19%  [kernel]               [k] arch_cpu_idle                    
            

            People

              Assignee: dongyang Dongyang Li
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 12
