[LU-12505] mounting bigalloc enabled large OST takes a long time Created: 03/Jul/19  Updated: 10/May/23  Resolved: 10/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: e2fsprogs
Environment:

master


Attachments: File dumpe2fs.out.gz    
Issue Links:
Related
is related to LU-13604 rebase Lustre e2fsprogs onto 1.45.6 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When an OSS mounts a large OST device that has 'bigalloc' enabled, the mount takes a very long time to complete. The same slowdown is seen when the device is mounted directly as ldiskfs, not only when it is mounted as a Lustre OST.

# time mount -t ldiskfs /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000

real    12m32.153s
user    0m0.000s
sys     11m49.887s
# dumpe2fs -h /dev/ddn/scratch0_ost0000
dumpe2fs 1.45.2.wc1 (27-May-2019)
Filesystem volume name:   scratch0-OST0000
Last mounted on:          /
Filesystem UUID:          1ca9dd81-8b70-4805-a430-78b0eafc1c45
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery meta_bg extent 64bit mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota bigalloc
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              1074397184
Block count:              275045679104
Reserved block count:     2750456791
Free blocks:              274909403680
Free inodes:              1074396851
First block:              0
Block size:               4096
Cluster size:             131072
Group descriptor size:    64
Blocks per group:         1048576
Clusters per group:       32768
Inodes per group:         4096
Inode blocks per group:   512
RAID stride:              512
RAID stripe width:        512
Flex block group size:    256
Filesystem created:       Mon Jul  1 00:43:14 2019
Last mount time:          Wed Jul  3 05:55:22 2019
Last write time:          Wed Jul  3 05:55:22 2019
Mount count:              8
Maximum mount count:      -1
Last checked:             Mon Jul  1 00:43:14 2019
Check interval:           0 (<none>)
Lifetime writes:          2693 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               512
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      4eeb2234-062d-4af5-8973-872baabd2e9f
Journal backup:           inode blocks
MMP block number:         131680
MMP update interval:      5
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke journal_64bit
Journal size:             4096M
Journal length:           1048576
Journal sequence:         0x00000494
Journal start:            0
MMP_block:
    mmp_magic: 0x4d4d50
    mmp_check_interval: 10
    mmp_sequence: 0x0000cd
    mmp_update_date: Wed Jul  3 06:00:33 2019
    mmp_update_time: 1562133633
    mmp_node_name: es18k-vm11
    mmp_device_name: sda

Without bigalloc

# time mount -t ldiskfs /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000

real	0m6.484s
user	0m0.000s
sys	0m4.954s


 Comments   
Comment by Andreas Dilger [ 03/Jul/19 ]

Could you please attach the "dumpe2fs" output from the OST (gzipped)? Normally when mount is slow it is because the disk is seeking between thousands or millions of different data structures at ~10ms/seek. Those problems were largely fixed by flex_bg, but it may be that meta_bg has reintroduced this problem. It may also be that with bigalloc we don't need meta_bg anymore, because the number of block groups is reduced?

Comment by Shuichi Ihara [ 03/Jul/19 ]

Uploaded dumpe2fs.out.gz. I have also tested without 'meta_bg' before; the result was the same and the mount still took a long time.
There is almost no disk I/O for most of the mount time; it is 100% CPU bound, as shown below.

Tasks: 237 total,   2 running, 235 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  6.6 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 15456899+total, 15270185+free,  1553196 used,   313944 buff/cache
KiB Swap:  5472252 total,  5472252 free,        0 used. 15216664+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                            
13008 root      20   0   19940   1052    868 R 100.0  0.0   0:29.86 mount                                                                              
    1 root      20   0   44604   4908   2552 S   0.0  0.0   0:02.00 systemd         

It looks like ldiskfs_get_group_desc() and ldiskfs_calculate_overhead() consume most of the CPU cycles for a long while during mount.

Samples: 108K of event 'cycles', Event count (approx.): 26372312997                                                                                     
Overhead  Shared Object          Symbol                                                                                                                 
  52.20%  [kernel]               [k] ldiskfs_get_group_desc                                                                                             
  45.13%  [kernel]               [k] ldiskfs_calculate_overhead                                                                                         
   0.31%  [kernel]               [k] native_write_msr_safe                                                                                               
   0.23%  [kernel]               [k] crc16                                                                                                               
   0.21%  [kernel]               [k] apic_timer_interrupt                                                                                               
   0.19%  [kernel]               [k] arch_cpu_idle                    
Comment by Andreas Dilger [ 03/Jul/19 ]

It looks like the problem is in ext4_calculate_overhead() and count_overhead(): there is a simple calculation for normal filesystems, and a complex one for the bigalloc case that loads and checks every group descriptor, and ext4_calculate_overhead() calls count_overhead() for every group as well:

static int count_overhead(struct super_block *sb, ext4_group_t grp,
                          char *buf)
{
        if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_BIGALLOC))
                return (ext4_bg_has_super(sb, grp) + ext4_bg_num_gdb(sb, grp) +
                        sbi->s_itb_per_group + 2);
        
        first_block = le32_to_cpu(sbi->s_es->s_first_data_block) +
                (grp * EXT4_BLOCKS_PER_GROUP(sb));
        last_block = first_block + EXT4_BLOCKS_PER_GROUP(sb) - 1;
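        /* in the bigalloc case this function itself walks every group
         * descriptor, so the caller below ends up O(ngroups^2) */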
        for (i = 0; i < ngroups; i++) {
                gdp = ext4_get_group_desc(sb, i, NULL);
                :

int ext4_calculate_overhead(struct super_block *sb)
{
        /* Compute the overhead (FS structures).  This is constant
         * for a given filesystem unless the number of block groups
         * changes so we cache the previous value until it does. */

        /* All of the blocks before first_data_block are overhead */
        overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));

        /* Add the overhead found in each block group */
        for (i = 0; i < ngroups; i++) {
                blks = count_overhead(sb, i, buf);
                overhead += blks;

That means for a 1024 TiB filesystem with 128 KiB clusters (ngroups = 1024 TiB / (32768 clusters/group * 128 KiB/cluster) = 256K groups) it will do 256K * 256K ≈ 68 billion group descriptor checks, which is very slow and pointless. I did read somewhere that mke2fs should store this overhead into the superblock at format time, so the kernel can avoid this expensive calculation, but possibly that isn't in the kernel you are using, or it isn't working properly and nobody noticed because the cost is negligible for small filesystems.
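
For reference, upstream ext4 has a superblock field for exactly this purpose (s_overhead_clusters). A minimal sketch of the relevant mount-time logic, assuming a kernel in which that field is honoured, would look roughly like this; when mke2fs records the overhead at format time, the per-group scan is skipped entirely:

        /* sketch of the ext4_fill_super() logic (not Lustre-specific):
         * prefer the overhead stored in the superblock by mke2fs, and only
         * fall back to the O(ngroups^2) calculation when it is missing */
        if (es->s_overhead_clusters)
                sbi->s_overhead = le32_to_cpu(es->s_overhead_clusters);
        else {
                err = ext4_calculate_overhead(sb);
                if (err)
                        goto failed_mount_wq;
        }

A filesystem whose mke2fs filled in that field should show an "Overhead clusters:" line (or "Overhead blocks:" with older e2fsprogs) in the dumpe2fs -h output, which is an easy way to check whether the format-time value is present.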

Comment by Shuichi Ihara [ 03/Jul/19 ]

Maybe it would be better to test with a newer kernel to see if the same behavior is reproduced?
By the way, mke2fs on a bigalloc-enabled OST is also very slow.

without bigalloc

# time mkfs.lustre --ost --servicenode=127.0.0.2@tcp --fsname=scratch0 --index=2 --mgsnode=127.0.0.2@tcp --mkfsoptions='-E lazy_itable_init=0,lazy_journal_init=0,stripe_width=512,stride=512 -O meta_bg,^resize_inode -m1 -J size=4096' --reformat --backfstype=ldiskfs /dev/ddn/scratch0_ost0002
# tune2fs -E mmp_update_interval=5 /dev/ddn/scratch0_ost0002

real    9m11.614s
user    0m59.894s
sys     7m10.594s

with bigalloc

# time mkfs.lustre --ost --servicenode=127.0.0.2@tcp --fsname=scratch0 --index=0 --mgsnode=127.0.0.2@tcp --mkfsoptions='-E lazy_itable_init=0,lazy_journal_init=0,stripe_width=512,stride=512 -O bigalloc -C 131072 -m1 -J size=4096' --reformat --backfstype=ldiskfs /dev/ddn/scratch0_ost0000

real    43m5.349s
user    24m29.652s
sys     18m35.058s

Most of the CPU time is consumed in the following functions, which I did not see when running mke2fs without '-O bigalloc'.

Samples: 24K of event 'cycles', Event count (approx.): 14154870804              
Overhead  Shared Object      Symbol                                             
  46.30%  libext2fs.so.2.4   [.] rb_test_bmap                                   
  32.98%  libext2fs.so.2.4   [.] ext2fs_test_generic_bmap                       
  13.10%  libext2fs.so.2.4   [.] ext2fs_convert_subcluster_bitmap               
   6.96%  libext2fs.so.2.4   [.] ext2fs_test_generic_bmap@plt       
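
The profile points at ext2fs_convert_subcluster_bitmap(), which builds the cluster allocation bitmap from the block bitmap by testing every block bit individually, so the cost scales with the total block count. A minimal sketch of one way to avoid that, assuming the standard libext2fs helpers ext2fs_find_first_set_block_bitmap2() and ext2fs_mark_block_bitmap2() (this is only an illustration, not necessarily what an eventual patch does):

#include <errno.h>
#include <ext2fs/ext2fs.h>

/* Illustration only: mark each cluster that contains at least one used
 * block by jumping straight to the next set bit in the block bitmap,
 * instead of testing every single block bit via rb_test_bmap(). */
static errcode_t mark_used_clusters(ext2_filsys fs,
                                    ext2fs_block_bitmap blk_map,
                                    ext2fs_block_bitmap cluster_map)
{
        blk64_t blk = fs->super->s_first_data_block;
        blk64_t end = ext2fs_blocks_count(fs->super) - 1;
        errcode_t retval;

        while (blk <= end) {
                /* skip directly to the next in-use block */
                retval = ext2fs_find_first_set_block_bitmap2(blk_map, blk,
                                                             end, &blk);
                if (retval == ENOENT)
                        break;          /* no more used blocks */
                if (retval)
                        return retval;

                /* cluster_map has cluster granularity, so this marks the
                 * whole cluster containing 'blk' */
                ext2fs_mark_block_bitmap2(cluster_map, blk);

                /* resume the scan at the start of the next cluster */
                blk = (blk | (blk64_t)EXT2FS_CLUSTER_MASK(fs)) + 1;
        }
        return 0;
}

Skipping over unused ranges instead of testing bit by bit is what removes the rb_test_bmap()/ext2fs_test_generic_bmap() hot spot; the actual optimization patch (see the Gerrit link below) may differ in the details.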
Comment by Andreas Dilger [ 03/Jul/19 ]

It wouldn't be a bad idea to post an email to linux-ext4 with this information. Maybe we can get some input on how to fix it, or Ted will "just know" the best way to fix the problem.

Comment by Gerrit Updater [ 31/Jul/19 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/35659
Subject: LU-12505 libext2fs: optimize ext2fs_convert_subcluster_bitmap()
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 47d5bc9d922585229dfd5da82a1f19ff93bea28e

Comment by Alexey Lyashkov [ 01/Aug/19 ]

>it looks ldiskfs_get_group_desc() and ldiskfs_calculate_overhead() are taking most of CPU cycle a long while during mount.

That calculation only needs to be done once, with the result stored in the superblock for later use.

>46.30% libext2fs.so.2.4 [.] rb_test_bmap
>32.98% libext2fs.so.2.4 [.] ext2fs_test_generic_bmap

It's a known problem. The bitmap implementation in e2fsprogs is not well designed for the case where a word has several bits set; replacing it with an IDR-style structure (as in the kernel) could improve speed dramatically.

Comment by Gerrit Updater [ 13/Aug/19 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/35781
Subject: LU-12505 mke2fs: set overhead in super block for bigalloc
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 8624a496ff7c3e4fd69fb7217ff56030111f4460

Comment by Andreas Dilger [ 01/Nov/19 ]

Dongyang, have these patches been submitted upstream yet?

Comment by Andreas Dilger [ 27/May/20 ]

To answer my own question, the bigalloc patches are on the master branch of the e2fsprogs repo, but not in the maint branch for 1.45.6.

Comment by Andreas Dilger [ 10/May/23 ]

The patch landed upstream for 1.46 via commit 59037c5357d39c6d0f14a0aff70e67dc13eafc84.
