Hi Artem, thanks for filing this ticket. I agree that some sort of solution with bigalloc is needed, since the use of meta_bg is punishing at large scale because each group descriptor read needs a seek. Also, loading the block bitmaps for large filesystems is very slow after mount (LU-12970) and using bigalloc will reduce the number of bitmaps to be loaded.
As previously discussed on the ext4 concall, one major issue with bigalloc is that it inflates the block allocations for metadata. This is OK for data blocks since files are typically large, and the file size can help to determine whether all of the blocks in a cluster are used. Currently, ext4+bigalloc results in only a single metadata block being used per cluster, which wastes space and also causes extra seeking when reading the metadata from disk, or extra (unused) blocks being read from the cluster.
I think even without a disk format change it would be possible for multiple blocks per cluster to be used for the metadata of a single file (eg. directory or extent tree). This would involve building an in-memory sub-cluster bitmap by reading all the metadata of the file/directory before "allocating" a block in the cluster. Depending on the usage, this may be acceptable for performance, since ext4 has very dense metadata and the number of blocks to be read will typically be small. The reading of metadata to construct the sub-cluster bitmap could stop if all of the blocks in a cluster are already in use.
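As a rough illustration of the scan described above, here is a minimal Python sketch (not ext4 code). It assumes the inode's metadata blocks are available as an iterable of physical block numbers, and that a cluster is simply a run of `blocks_per_cluster` consecutive blocks. The names `build_subcluster_bitmap` and `allocate_in_cluster` are hypothetical and exist only for this example.

```python
def build_subcluster_bitmap(cluster, metadata_blocks, blocks_per_cluster=16):
    """Scan one inode's metadata blocks and build an in-memory bitmap for `cluster`.

    Returns (bitmap, scanned): bitmap is an int with bit i set when block i of
    the cluster is already used by this inode; scanned counts the blocks that
    were examined before the scan stopped.
    """
    full = (1 << blocks_per_cluster) - 1
    bitmap = 0
    scanned = 0
    for blk in metadata_blocks:
        scanned += 1
        if blk // blocks_per_cluster == cluster:
            bitmap |= 1 << (blk % blocks_per_cluster)
            if bitmap == full:
                break  # every block in the cluster is in use, stop reading
    return bitmap, scanned


def allocate_in_cluster(cluster, bitmap, blocks_per_cluster=16):
    """Pick the first free block in the cluster.

    Returns (block, new_bitmap), or (None, bitmap) when the cluster is full.
    """
    for i in range(blocks_per_cluster):
        if not bitmap & (1 << i):
            return cluster * blocks_per_cluster + i, bitmap | (1 << i)
    return None, bitmap


# Hypothetical inode whose metadata lives in blocks 32, 33, 35 (cluster 2)
# and block 100 (some other cluster).
bitmap, scanned = build_subcluster_bitmap(2, [32, 33, 100, 35])
block, bitmap = allocate_in_cluster(2, bitmap)
```

In this sketch the cost of building the bitmap is bounded by the number of metadata blocks the inode has, which is the reason the approach looks acceptable when the metadata is dense.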
Limiting a cluster to use by a single inode has both benefits and drawbacks. The benefit is that it is easy to create the sub-cluster bitmap in memory by scanning only a single inode. Since ext4 inodes already have some in-inode space for metadata (4 inline extents), sub-cluster allocation would only be needed for large inodes (index tree) or directory leaf blocks (OSTs have few, but very large directories), so I don't think this restriction would be bad. It would also be possible to share a cluster between inodes in the same itable block, but that makes it harder to know when a cluster can be freed.
Constructing the sub-cluster bitmap in memory has the benefit that it does not need any format change, and all of the blocks in a cluster could be allocated (ie. no need to waste a whole block just for a few bits), but has more overhead. Reading the file metadata may not be much overhead since this will typically already be done for normal access. For directories, it might make sense to treat indirect/index clusters separately from leaf clusters, since that can speed up building the in-memory bitmap by looking only for a specific class of blocks. However, it also increases space overhead for small directories significantly.
Having an in-memory sub-cluster bitmap could also be a starting point for allocating sub-blocks in a new cluster in memory before the bitmap is written to disk. If the bitmap is written to disk, it should have a good checksum (eg. include the inode number as is typical for ext4, plus the inode version?) to determine if the bitmap is valid on load. Optionally, the bitmap could be overwritten by the last block allocated in the cluster, but this would make it harder to free the cluster until the whole inode is deleted (assuming only a single inode is allocating blocks from the cluster). The sub-cluster bitmap could be written to the last block in the cluster, but that has high space overhead (1/16 or 1/64 of the cluster). It could also be written as an xattr on the inode if the cluster is limited to one inode, but this may become complex if there are multiple clusters in use by a single inode.
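To make the overhead and checksum points concrete, here is a small Python sketch under stated assumptions. `zlib.crc32` is used only as a stand-in checksum (it is not a claim about which algorithm ext4 would use here), the seed layout (32-bit inode number followed by a 64-bit inode version) is invented for the example, and the function names are hypothetical.

```python
import struct
import zlib


def bitmap_block_overhead(blocks_per_cluster):
    """Fraction of a cluster lost when one whole block holds the bitmap."""
    return 1 / blocks_per_cluster


def bitmap_checksum(bitmap_bytes, inode_number, inode_version):
    """Checksum the bitmap, seeded with the identity of the owning inode.

    Seeding ties the bitmap to one inode, so a stale bitmap left behind by
    a different inode (or an older inode version) fails validation.
    """
    seed = zlib.crc32(struct.pack("<IQ", inode_number, inode_version))
    return zlib.crc32(bitmap_bytes, seed)


def bitmap_is_valid(bitmap_bytes, stored_csum, inode_number, inode_version):
    """Recompute the checksum on load and compare with the stored value."""
    return bitmap_checksum(bitmap_bytes, inode_number, inode_version) == stored_csum


# 16 blocks per cluster -> 2 bytes of bitmap; blocks 0, 1 and 3 in use.
bitmap_bytes = (0b1011).to_bytes(2, "little")
csum = bitmap_checksum(bitmap_bytes, inode_number=12, inode_version=1)
```

With 16 or 64 blocks per cluster, `bitmap_block_overhead` gives the 1/16 and 1/64 figures mentioned above, while the bitmap itself only needs 2 or 8 bytes. That gap is what makes a dedicated block look wasteful compared with an in-memory or xattr-based bitmap.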
I think it makes sense to start with in-memory single-inode sub-cluster allocation, then move on to rebuilding the bitmap in memory from inode metadata (which would be needed for any existing filesystem using bigalloc), and only move to an on-disk bitmap once we know it is needed for performance.