Bigalloc sub cluster allocation for ldiskfs

    Description

      For large filesystems (over 256TB) meta_bg is always required, as the
      GDT is larger than a single block group.  However, with bigalloc it
      is possible to avoid meta_bg, since the block group size grows in
      proportion to the chunk size as well.  That means a 1PiB filesystem could
      avoid meta_bg if it is using a bigalloc chunk size of 16KB or larger.
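
      The size limits above can be checked with a rough sketch. It assumes 4KB blocks, 64-byte group descriptors, and one bitmap block per group (so 8 * block_size clusters per group), and it ignores the superblock, bitmaps and inode table that also occupy space in the group. The function name needs_meta_bg is hypothetical and is not part of e2fsprogs or ldiskfs.

```python
TIB = 1 << 40
PIB = 1 << 50

def needs_meta_bg(fs_bytes, block_size=4096, cluster_size=4096, desc_size=64):
    """Simplified model: True if the GDT cannot fit inside one block group."""
    # One bitmap block tracks 8 * block_size clusters, which bounds the group size.
    clusters_per_group = 8 * block_size
    group_bytes = clusters_per_group * cluster_size
    blocks_per_group = group_bytes // block_size
    # Integer ceiling division avoids float rounding on large sizes.
    groups = -(-fs_bytes // group_bytes)
    gdt_blocks = -(-(groups * desc_size) // block_size)
    return gdt_blocks > blocks_per_group

if __name__ == "__main__":
    print(needs_meta_bg(257 * TIB))                    # no bigalloc
    print(needs_meta_bg(PIB, cluster_size=16 * 1024))  # 16KB chunks
```

      Under these assumptions the GDT for a 256TiB filesystem without bigalloc exactly fills one 128MiB group, which is consistent with the limit quoted above.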

      A large cluster size is not good for small files, because a whole cluster must be occupied even if only a few blocks are needed.

      There is an interesting project idea called “sub cluster allocation”: “bigalloc” plus a special bitmap inside each bigalloc cluster. This gives less metadata, but still allows small blocks where needed.

      Andreas suggested creating an issue in Jira to start the discussion, and then moving the discussion to the ext4 email list. I created this issue for the initial discussion.

          Activity

            [LU-12976] Bigalloc sub cluster allocation for ldiskfs

            Hi Artem, thanks for filing this ticket. I agree that some sort of solution with bigalloc is needed, since the use of meta_bg is punishing at large scale because each group descriptor read needs a seek. Also, loading the block bitmaps for large filesystems is very slow after mount (LU-12970) and using bigalloc will reduce the number of bitmaps to be loaded.

            As previously discussed on the ext4 concall, one major issue with bigalloc is that it inflates the block allocations for metadata. This is ok for data blocks since files are typically large, and the file size can help to determine if all of the blocks in a cluster are used. Currently, ext4+bigalloc results in only a single metadata block being used per cluster, which is inefficient for space, and also causes extra seeking when reading the metadata from disk, or extra (unused) blocks being read from the cluster.

            I think even without a disk format change it would be possible for multiple blocks per cluster to be used for the metadata of a single file (eg. directory or extent tree). This would involve building an in-memory sub-cluster bitmap by reading all the metadata of the file/directory before "allocating" a block in the cluster. Depending on the usage, this may be acceptable for performance, since ext4 has very dense metadata and the number of blocks to be read will typically be small. The reading of metadata to construct the sub-cluster bitmap could stop if all of the blocks in a cluster are already in use.
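
            As a minimal sketch of that idea (not ldiskfs code), the in-memory sub-cluster bitmap for one cluster could be rebuilt from the list of metadata block numbers belonging to a single inode, stopping early once the cluster is full. The helper names scan_cluster and alloc_block, and the representation of the bitmap as a Python integer mask, are assumptions made for illustration only.

```python
def scan_cluster(metadata_blocks, cluster, blocks_per_cluster):
    """Build the in-use bitmask for one cluster from an inode's metadata blocks."""
    full = (1 << blocks_per_cluster) - 1
    mask = 0
    for blk in metadata_blocks:
        blk_cluster, offset = divmod(blk, blocks_per_cluster)
        if blk_cluster == cluster:
            mask |= 1 << offset
            if mask == full:
                break  # every block in the cluster is in use, stop reading metadata
    return mask

def alloc_block(mask, cluster, blocks_per_cluster):
    """Return (new_mask, block_number), or (mask, None) if the cluster is full."""
    for offset in range(blocks_per_cluster):
        if not mask & (1 << offset):
            return mask | (1 << offset), cluster * blocks_per_cluster + offset
    return mask, None

if __name__ == "__main__":
    # Hypothetical inode whose metadata occupies blocks 64, 65 and 67 of cluster 4.
    mask = scan_cluster([64, 65, 67, 200], cluster=4, blocks_per_cluster=16)
    mask, blk = alloc_block(mask, cluster=4, blocks_per_cluster=16)
    print(bin(mask), blk)
```

            In a real implementation the block list would come from walking the extent tree or directory index of the inode, which is where the read overhead discussed above comes from.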

            Limiting a cluster to use by only a single inode has some benefits and drawbacks. The benefit is that it is easy to create the sub-cluster bitmap in memory by scanning only a single inode. Since ext4 inodes have some in-inode space for metadata already (4 inline extents) this would only need to be for large inodes (index tree) or directory leaf blocks (OSTs have few, but very large directories), so I don't think this would be bad. It would also be possible to share a cluster between inodes in the same itable block, but that makes it harder to know when a cluster might be freed.

            Constructing the sub-cluster bitmap in memory has the benefit that it does not need any format change, and all of the blocks in a cluster could be allocated (ie. no need to waste a whole block just for a few bits), but has more overhead. Reading the file metadata may not be much overhead since this will typically already be done for normal access. For directories, it might make sense to treat indirect/index clusters separately from leaf clusters, since that can speed up building the in-memory bitmap by looking only for a specific class of blocks. However, it also increases space overhead for small directories significantly.

            Having an in-memory sub-cluster bitmap could also be a starting point for allocation of sub-blocks in a new cluster in memory before the bitmap is written to disk. If the bitmap is written to disk, it should have a good checksum (eg. include the inode number, as is typical for ext4, plus the inode version?) to determine if the bitmap is valid on load. Optionally, the bitmap could be overwritten by the last block allocated in the cluster, but this would make it harder to free the cluster until the whole inode is deleted (assuming only a single inode is allocating blocks from the cluster). The sub-cluster bitmap could be written to the last block in the cluster, but that has high space overhead (1/16 or 1/64 of the cluster). It could also be written as an xattr on the inode if the cluster is limited to one inode, but this may become complex if there are multiple clusters in use by a single inode.
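
            The overhead figures can be illustrated with a little arithmetic. This sketch assumes 4KB blocks, so that the 1/16 and 1/64 fractions correspond to 64KB and 256KB clusters; those cluster sizes are an assumption, and bitmap_block_overhead is a hypothetical helper name.

```python
from fractions import Fraction

def bitmap_block_overhead(cluster_size, block_size=4096):
    """Cost of dedicating one whole block per cluster to the sub-cluster bitmap."""
    blocks_per_cluster = cluster_size // block_size
    # Share of the cluster given up to hold the bitmap block.
    space_overhead = Fraction(1, blocks_per_cluster)
    # One bit per block is needed, out of block_size * 8 bits available.
    bits_used = Fraction(blocks_per_cluster, block_size * 8)
    return blocks_per_cluster, space_overhead, bits_used

if __name__ == "__main__":
    for size in (64 * 1024, 256 * 1024):
        print(size, bitmap_block_overhead(size))
```

            Only 16 or 64 bits of the dedicated block would carry information in these cases, which is the "whole block just for a few bits" waste mentioned earlier.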

            I think it makes sense to start with in-memory single-inode sub-cluster allocation, then move on to rebuilding the bitmap in memory from inode metadata (which would be needed for any existing filesystem using bigalloc), and only move to an on-disk bitmap once we know it is needed for performance.

            Comment added by adilger (Andreas Dilger)

            People

              wc-triage WC Triage
              artem_blagodarenko Artem Blagodarenko (Inactive)