[LU-12505] mounting bigalloc enabled large OST takes a long time Created: 03/Jul/19 Updated: 10/May/23 Resolved: 10/May/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | e2fsprogs |
| Environment: | master |
| Attachments: | dumpe2fs.out.gz |
| Severity: | 3 |
| Description |
|
Mounting a large OST device with 'bigalloc' enabled takes a huge amount of time to complete, not only as a Lustre OST but also as a plain ldiskfs mount:

# time mount -t ldiskfs /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000
real 12m32.153s
user 0m0.000s
sys 11m49.887s

# dumpe2fs -h /dev/ddn/scratch0_ost0000
dumpe2fs 1.45.2.wc1 (27-May-2019)
Filesystem volume name: scratch0-OST0000
Last mounted on: /
Filesystem UUID: 1ca9dd81-8b70-4805-a430-78b0eafc1c45
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery meta_bg extent 64bit mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota bigalloc
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1074397184
Block count: 275045679104
Reserved block count: 2750456791
Free blocks: 274909403680
Free inodes: 1074396851
First block: 0
Block size: 4096
Cluster size: 131072
Group descriptor size: 64
Blocks per group: 1048576
Clusters per group: 32768
Inodes per group: 4096
Inode blocks per group: 512
RAID stride: 512
RAID stripe width: 512
Flex block group size: 256
Filesystem created: Mon Jul 1 00:43:14 2019
Last mount time: Wed Jul 3 05:55:22 2019
Last write time: Wed Jul 3 05:55:22 2019
Mount count: 8
Maximum mount count: -1
Last checked: Mon Jul 1 00:43:14 2019
Check interval: 0 (<none>)
Lifetime writes: 2693 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 4eeb2234-062d-4af5-8973-872baabd2e9f
Journal backup: inode blocks
MMP block number: 131680
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
Journal features: journal_incompat_revoke journal_64bit
Journal size: 4096M
Journal length: 1048576
Journal sequence: 0x00000494
Journal start: 0
MMP_block:
mmp_magic: 0x4d4d50
mmp_check_interval: 10
mmp_sequence: 0x0000cd
mmp_update_date: Wed Jul 3 06:00:33 2019
mmp_update_time: 1562133633
mmp_node_name: es18k-vm11
mmp_device_name: sda
Without bigalloc:

# time mount -t ldiskfs /dev/ddn/scratch0_ost0000 /lustre/scratch0/ost0000
real 0m6.484s
user 0m0.000s
sys 0m4.954s |
| Comments |
| Comment by Andreas Dilger [ 03/Jul/19 ] |
|
Could you please attach the "dumpe2fs" output from the OST (gzipped)? Normally when mount is slow it is because the disk is seeking between thousands/millions of different data structures at 10ms/seek. Those problems were largely fixed by flex_bg, but it may be that meta_bg has reintroduced this problem. It may be that with bigalloc we don't need meta_bg anymore, because the number of block groups is reduced? |
| Comment by Shuichi Ihara [ 03/Jul/19 ] |
|
Uploaded dumpe2fs.out.gz. I've tested without 'meta_bg' before, but it was the same and took a long time.

Tasks: 237 total, 2 running, 235 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 6.6 sy, 0.0 ni, 93.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15456899+total, 15270185+free, 1553196 used, 313944 buff/cache
KiB Swap: 5472252 total, 5472252 free, 0 used. 15216664+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13008 root 20 0 19940 1052 868 R 100.0 0.0 0:29.86 mount
1 root 20 0 44604 4908 2552 S 0.0 0.0 0:02.00 systemd
It looks like ldiskfs_get_group_desc() and ldiskfs_calculate_overhead() are consuming most of the CPU cycles for a long while during mount.

Samples: 108K of event 'cycles', Event count (approx.): 26372312997
Overhead  Shared Object  Symbol
 52.20%   [kernel]       [k] ldiskfs_get_group_desc
 45.13%   [kernel]       [k] ldiskfs_calculate_overhead
  0.31%   [kernel]       [k] native_write_msr_safe
  0.23%   [kernel]       [k] crc16
  0.21%   [kernel]       [k] apic_timer_interrupt
  0.19%   [kernel]       [k] arch_cpu_idle |
| Comment by Andreas Dilger [ 03/Jul/19 ] |
|
It looks like the problem is in ext4_calculate_overhead() and count_overhead(): there is a simple calculation for normal filesystems, and a complex one that loads and checks every group in the bigalloc case, and ext4_calculate_overhead() calls count_overhead() for every group as well:

static int count_overhead(struct super_block *sb, ext4_group_t grp, char *buf)
{
        if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_BIGALLOC))
                return (ext4_bg_has_super(sb, grp) + ext4_bg_num_gdb(sb, grp) +
                        sbi->s_itb_per_group + 2);

        first_block = le32_to_cpu(sbi->s_es->s_first_data_block) +
                (grp * EXT4_BLOCKS_PER_GROUP(sb));
        last_block = first_block + EXT4_BLOCKS_PER_GROUP(sb) - 1;
        for (i = 0; i < ngroups; i++) {
                gdp = ext4_get_group_desc(sb, i, NULL);
                :

int ext4_calculate_overhead(struct super_block *sb)
{
        /* Compute the overhead (FS structures). This is constant
         * for a given filesystem unless the number of block groups
         * changes so we cache the previous value until it does. */

        /* All of the blocks before first_data_block are overhead */
        overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));

        /* Add the overhead found in each block group */
        for (i = 0; i < ngroups; i++) {
                blks = count_overhead(sb, i, buf);
                overhead += blks;

That means for a 1024 TiB filesystem (num_groups = 1024 TiB / (32768 clusters/group * 128 KiB/cluster) = 256K groups) it will do 256K * 256K = 68 billion checks, which would be very slow and pointless. I did read somewhere that mke2fs should store this overhead into the superblock at format time, so the kernel can avoid doing this pointless operation, but possibly that isn't in the kernel you are using, or it isn't working properly and nobody noticed for small filesystems? |
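For reference, mainline ext4 does have a fast path along these lines: ext4_fill_super() uses the s_overhead_clusters value that mke2fs can store at format time, and only falls back to the full scan when that field is zero. A minimal sketch of that logic, assuming kernel context (the ext4_load_overhead() wrapper name is hypothetical; s_overhead_clusters and sbi->s_overhead are real fields):

static int ext4_load_overhead(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        struct ext4_super_block *es = sbi->s_es;

        if (es->s_overhead_clusters) {
                /* Fast path: reuse the overhead that mke2fs computed at
                 * format time and stored in the superblock. */
                sbi->s_overhead = le32_to_cpu(es->s_overhead_clusters);
                return 0;
        }
        /* Slow path: O(ngroups) calls to count_overhead(), each itself
         * O(ngroups) with bigalloc -- the 68-billion-check case above. */
        return ext4_calculate_overhead(sb);
}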
| Comment by Shuichi Ihara [ 03/Jul/19 ] |
|
Maybe it would be better to test with a newer kernel, to see if the same behavior reproduces?

Without bigalloc:

# time mkfs.lustre --ost --servicenode=127.0.0.2@tcp --fsname=scratch0 --index=2 --mgsnode=127.0.0.2@tcp --mkfsoptions='-E lazy_itable_init=0,lazy_journal_init=0,stripe_width=512,stride=512 -O meta_bg,^resize_inode -m1 -J size=4096' --reformat --backfstype=ldiskfs /dev/ddn/scratch0_ost0002
# tune2fs -E mmp_update_interval=5 /dev/ddn/scratch0_ost0002
real 9m11.614s
user 0m59.894s
sys 7m10.594s

With bigalloc:

# time mkfs.lustre --ost --servicenode=127.0.0.2@tcp --fsname=scratch0 --index=0 --mgsnode=127.0.0.2@tcp --mkfsoptions='-E lazy_itable_init=0,lazy_journal_init=0,stripe_width=512,stride=512 -O bigalloc -C 131072 -m1 -J size=4096' --reformat --backfstype=ldiskfs /dev/ddn/scratch0_ost0000
real 43m5.349s
user 24m29.652s
sys 18m35.058s

Most of the CPU time is consumed in the following functions, which I didn't see with mke2fs without '-O bigalloc':

Samples: 24K of event 'cycles', Event count (approx.): 14154870804
Overhead  Shared Object     Symbol
 46.30%   libext2fs.so.2.4  [.] rb_test_bmap
 32.98%   libext2fs.so.2.4  [.] ext2fs_test_generic_bmap
 13.10%   libext2fs.so.2.4  [.] ext2fs_convert_subcluster_bitmap
  6.96%   libext2fs.so.2.4  [.] ext2fs_test_generic_bmap@plt |
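For context on why mke2fs slows down here: ext2fs_convert_subcluster_bitmap() derives the per-cluster allocation bitmap from the per-block one, and with a 128 KiB cluster there are 32 block bits behind every cluster bit. An illustrative sketch in plain C (not the actual libext2fs code; flat byte arrays stand in for its rbtree bitmaps) of the per-bit cost and the obvious shortcut of skipping to the next cluster boundary:

#include <stdint.h>

#define BLOCKS_PER_CLUSTER 32   /* 131072-byte cluster / 4096-byte block */

/* Per-bit conversion: one bitmap lookup per block, which is where the
 * profile above spends its time (rb_test_bmap/ext2fs_test_generic_bmap). */
static void convert_per_bit(const uint8_t *block_map, uint8_t *cluster_map,
                            uint64_t nblocks)
{
        for (uint64_t b = 0; b < nblocks; b++)
                if (block_map[b / 8] & (1 << (b % 8))) {
                        uint64_t c = b / BLOCKS_PER_CLUSTER;
                        cluster_map[c / 8] |= 1 << (c % 8);
                }
}

/* Once a cluster is known to be in use, the remaining 31 block bits in
 * it cannot change the answer, so skip straight to the next cluster. */
static void convert_by_cluster(const uint8_t *block_map, uint8_t *cluster_map,
                               uint64_t nblocks)
{
        for (uint64_t b = 0; b < nblocks; b++)
                if (block_map[b / 8] & (1 << (b % 8))) {
                        uint64_t c = b / BLOCKS_PER_CLUSTER;
                        cluster_map[c / 8] |= 1 << (c % 8);
                        /* the loop's b++ lands on the next cluster */
                        b = (c + 1) * BLOCKS_PER_CLUSTER - 1;
                }
}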
| Comment by Andreas Dilger [ 03/Jul/19 ] |
|
It wouldn't be a bad idea to post an email to linux-ext4 with this information. Maybe we can get some input on how to fix it, or Ted will "just know" the best way to fix the problem. |
| Comment by Gerrit Updater [ 31/Jul/19 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/35659 |
| Comment by Alexey Lyashkov [ 01/Aug/19 ] |
|
> it looks ldiskfs_get_group_desc() and ldiskfs_calculate_overhead() are taking most of CPU cycle a long while during mount.

This should be calculated only once and stored in the superblock for later use.

> 46.30% libext2fs.so.2.4 [.] rb_test_bmap

It's a known problem: the bitmaps in e2fsprogs aren't well designed for the case where a word has several bits set; replacing them with an IDR (from the kernel) could improve speed dramatically. |
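To illustrate the point about the bitmap design: the rbtree bitmaps in e2fsprogs (lib/ext2fs/blkmap64_rb.c) store runs of set bits as extents, so every single-bit test is a tree search rather than a word lookup. A rough sketch of that cost model, using illustrative C over a sorted extent array instead of the actual rbtree:

#include <stdint.h>
#include <stddef.h>

struct extent {
        uint64_t start;         /* first set bit in the run */
        uint64_t len;           /* number of consecutive set bits */
};

/* Membership test over sorted, non-overlapping runs: O(log n) per bit,
 * versus O(1) for a flat bitmap -- and mke2fs issues one such query per
 * block when converting to the subcluster bitmap. */
static int test_bit_in_extents(const struct extent *ext, size_t nr,
                               uint64_t bit)
{
        size_t lo = 0, hi = nr;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (bit < ext[mid].start)
                        hi = mid;
                else if (bit >= ext[mid].start + ext[mid].len)
                        lo = mid + 1;
                else
                        return 1;       /* bit falls inside a run */
        }
        return 0;
}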
| Comment by Gerrit Updater [ 13/Aug/19 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/35781 |
| Comment by Andreas Dilger [ 01/Nov/19 ] |
|
Dongyang, have these patches been submitted upstream yet? |
| Comment by Andreas Dilger [ 27/May/20 ] |
|
To answer my own question, the bigalloc patches are on the master branch of the e2fsprogs repo, but not in the maint branch for 1.45.6. |
| Comment by Andreas Dilger [ 10/May/23 ] |
|
The patch was landed upstream for e2fsprogs 1.46 via commit 59037c5357d39c6d0f14a0aff70e67dc13eafc84