Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical

    Description

  Weird mballoc behavior: a sudden STREAM_ALLOC allocator head jump after a target mount:

      # grep -H "" /proc/fs/ldiskfs/md*/mb_last_group
      /proc/fs/ldiskfs/md0/mb_last_group:0
      /proc/fs/ldiskfs/md2/mb_last_group:0
      # echo > /sys/kernel/debug/tracing/trace
      # nobjlo=2 nobjhi=2 thrlo=1024 thrhi=1024 size=393216 rszlo=4096 rszhi=4096 tests_str="write" obdfilter-survey 2>&1 | tee /root/obdfilter-survey.log
      Fri Dec  3 12:25:19 UTC 2021 Obdfilter-survey for case=disk from kjlmo1304
      ost  2 sz 805306368K rsz 4096K obj    4 thr 2048 write 16552.35 [4580.64, 9382.91] 
      /usr/bin/iokit-libecho: line 236: 253095 Killed                  remote_shell $host "vmstat 5 >> $host_vmstatf" &>/dev/null
      done!
      # grep -H "" /proc/fs/ldiskfs/md*/mb_last_group
      /proc/fs/ldiskfs/md0/mb_last_group:114337
      /proc/fs/ldiskfs/md2/mb_last_group:130831
      #
      

  The streaming allocator head jumped straight to the first non-initialized group, which is now the last initialized group (the target fs is almost empty):

      [root@kjlmo1304 ~]# dumpe2fs /dev/md0 | sed '/BLOCK/q' | tail -24
      ....
      Group 114335: (Blocks 3746529280-3746562047) csum 0x1b7a [INODE_UNINIT, ITABLE_ZEROED]
        Block bitmap at 3741319328 (bg #114176 + 160)
        Inode bitmap at 3741319584 (bg #114176 + 416)
        Inode table at 3741322225-3741322240 (bg #114176 + 3057)
        32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
        Free blocks: 3746529280-3746562047
        Free inodes: 14634881-14635008
      Group 114336: (Blocks 3746562048-3746594815) csum 0x37c1 [INODE_UNINIT, ITABLE_ZEROED]
        Block bitmap at 3741319329 (bg #114176 + 161)
        Inode bitmap at 3741319585 (bg #114176 + 417)
        Inode table at 3741322241-3741322256 (bg #114176 + 3073)
        32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
        Free blocks: 3746562048-3746594815
        Free inodes: 14635009-14635136
      Group 114337: (Blocks 3746594816-3746627583) csum 0xbacd [INODE_UNINIT, ITABLE_ZEROED]
        Block bitmap at 3741319330 (bg #114176 + 162)
        Inode bitmap at 3741319586 (bg #114176 + 418)
        Inode table at 3741322257-3741322272 (bg #114176 + 3089)
        32768 free blocks, 128 free inodes, 0 directories, 128 unused inodes
        Free blocks: 3746594816-3746627583
        Free inodes: 14635137-14635264
      Group 114338: (Blocks 3746627584-3746660351) csum 0xca57 [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
      

  The jump above is not big enough to cause a performance impact here, but the same behavior was observed on another system with 2M block groups initialized: there the mb_last_group jump shifted block allocations on an empty fs past the middle of the disk device, with an approximately 15% write/read slowdown.

  Looks like it was due to the following checks in ldiskfs_mb_good_group():

              /* We only do this if the grp has never been initialized */
              if (unlikely(LDISKFS_MB_GRP_NEED_INIT(grp))) {
                      int ret;
      
                      /* cr=0/1 is a very optimistic search to find large
                       * good chunks almost for free. if buddy data is
                       * not ready, then this optimization makes no sense */
      
                      if (cr < 2 && !ldiskfs_mb_uninit_on_disk(ac->ac_sb, group))
                              return 0;
                      ret = ldiskfs_mb_init_group(ac->ac_sb, group);
                      if (ret)
                              return 0;
              }
      
      

      introduced by

      ecb68b8 LU-13291 ldiskfs: mballoc don't skip uninit-on-disk groups
      6a7a700 LU-12988 ldiskfs: skip non-loaded groups at cr=0/1 
      

      Attachments

        Issue Links

          Activity

            [LU-15319] Weird mballoc behaviour

            adilger Andreas Dilger added a comment - The mballoc array-based group selection is almost ready to land in LU-14438 and I think that any development in that area should first start with backporting the next set of mballoc patches from upstream ext4, which address most of these issues.

            adilger Andreas Dilger added a comment - I suspect that this issue could be resolved with the new mballoc allocator from upstream kernels.

            People

              bzzz Alex Zhuravlev
              zam Alexander Zarochentsev