[LU-15319] Weird mballoc behaviour Created: 06/Dec/21  Updated: 25/Sep/23  Resolved: 25/Sep/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Alexander Zarochentsev Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16162 ldiskfs: use low disk tracks for bloc... Open
is related to LU-12970 improve mballoc for huge filesystems Open
is related to LU-14438 backport ldiskfs mballoc patches Open
Rank (Obsolete): 9223372036854775807

 Description   

Weird mballoc behavior: a sudden jump of the STREAM_ALLOC allocator head right after a target mount:

# grep -H "" /proc/fs/ldiskfs/md*/mb_last_group
/proc/fs/ldiskfs/md0/mb_last_group:0
/proc/fs/ldiskfs/md2/mb_last_group:0
# echo > /sys/kernel/debug/tracing/trace
# nobjlo=2 nobjhi=2 thrlo=1024 thrhi=1024 size=393216 rszlo=4096 rszhi=4096 tests_str="write" obdfilter-survey 2>&1 | tee /root/obdfilter-survey.log
Fri Dec  3 12:25:19 UTC 2021 Obdfilter-survey for case=disk from kjlmo1304
ost  2 sz 805306368K rsz 4096K obj    4 thr 2048 write 16552.35 [4580.64, 9382.91] 
/usr/bin/iokit-libecho: line 236: 253095 Killed                  remote_shell $host "vmstat 5 >> $host_vmstatf" &>/dev/null
done!
# grep -H "" /proc/fs/ldiskfs/md*/mb_last_group
/proc/fs/ldiskfs/md0/mb_last_group:114337
/proc/fs/ldiskfs/md2/mb_last_group:130831
#

The streaming allocator head jumped straight to the first uninitialized group, which is now the last initialized group (the target fs is almost empty):

[root@kjlmo1304 ~]# dumpe2fs /dev/md0 | sed '/BLOCK/q' | tail -24
....
Group 114335: (Blocks 3746529280-3746562047) csum 0x1b7a [INODE_UNINIT, ITABLE_ZEROED]
  Block bitmap at 3741319328 (bg #114176 + 160)
  Inode bitmap at 3741319584 (bg #114176 + 416)
  Inode table at 3741322225-3741322240 (bg #114176 + 3057)
  32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 3746529280-3746562047
  Free inodes: 14634881-14635008
Group 114336: (Blocks 3746562048-3746594815) csum 0x37c1 [INODE_UNINIT, ITABLE_ZEROED]
  Block bitmap at 3741319329 (bg #114176 + 161)
  Inode bitmap at 3741319585 (bg #114176 + 417)
  Inode table at 3741322241-3741322256 (bg #114176 + 3073)
  32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 3746562048-3746594815
  Free inodes: 14635009-14635136
Group 114337: (Blocks 3746594816-3746627583) csum 0xbacd [INODE_UNINIT, ITABLE_ZEROED]
  Block bitmap at 3741319330 (bg #114176 + 162)
  Inode bitmap at 3741319586 (bg #114176 + 418)
  Inode table at 3741322257-3741322272 (bg #114176 + 3089)
  32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
  Free blocks: 3746594816-3746627583
  Free inodes: 14635137-14635264
Group 114338: (Blocks 3746627584-3746660351) csum 0xca57 [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]

The jump above is not big enough to cause a performance impact, but the same behavior was observed on another system with 2M block groups initialized: there the mb_last_group jump shifted block allocations on an empty fs past the middle of the disk device, causing approximately a 15% write/read slowdown.

It looks like this is due to the following check in ldiskfs_mb_good_group():

        /* We only do this if the grp has never been initialized */
        if (unlikely(LDISKFS_MB_GRP_NEED_INIT(grp))) {
                int ret;

                /* cr=0/1 is a very optimistic search to find large
                 * good chunks almost for free. if buddy data is
                 * not ready, then this optimization makes no sense */

                if (cr < 2 && !ldiskfs_mb_uninit_on_disk(ac->ac_sb, group))
                        return 0;
                ret = ldiskfs_mb_init_group(ac->ac_sb, group);
                if (ret)
                        return 0;
        }

introduced by

ecb68b8 LU-13291 ldiskfs: mballoc don't skip uninit-on-disk groups
6a7a700 LU-12988 ldiskfs: skip non-loaded groups at cr=0/1 


 Comments   
Comment by Andreas Dilger [ 10/May/23 ]

I suspect that this issue could be resolved with the new mballoc allocator from upstream kernels.

Comment by Andreas Dilger [ 25/Sep/23 ]

The mballoc array-based group selection is almost ready to land in LU-14438, and I think any development in that area should start with backporting the next set of mballoc patches from upstream ext4, which address most of these issues.
