Details
-
Improvement
-
Resolution: Unresolved
-
Major
-
Upstream, Lustre 2.16.1
Description
Lustre directories on ZFS always use the FatZAP format because each directory entry (luz_direntry) stores both the ZFS dnode ID and a Lustre FID, which exceeds the single uint64 value that a MicroZAP entry can hold. The FatZAP leaf block shift is currently hardcoded to 14 (16K blocks) in Lustre, so even a single empty directory consumes a disproportionate amount of space.
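The size argument above can be checked with a small C sketch. The struct definitions are simplified stand-ins modeled on luz_direntry and the Lustre FID, not the actual Lustre headers (the real entry packs the dnode number and file type into bitfields of one 64-bit word). All names ending in _sketch, and both helper functions, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a Lustre FID: 64-bit sequence, 32-bit object ID,
 * 32-bit version, 16 bytes in total. */
struct lu_fid_sketch {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Simplified stand-in for luz_direntry: one 64-bit word for the ZFS dnode
 * (plus type bits in the real layout), followed by the FID. */
struct luz_direntry_sketch {
	uint64_t dnode_and_type;
	struct lu_fid_sketch fid;
};

/* A MicroZAP entry can hold exactly one uint64 value, so anything larger
 * forces the directory into the FatZAP format. */
static int fits_in_microzap(size_t value_bytes)
{
	return value_bytes <= sizeof(uint64_t);
}

/* A block shift is the log2 of the block size: shift 14 means 16384 bytes. */
static uint64_t blockshift_to_bytes(unsigned int shift)
{
	return (uint64_t)1 << shift;
}
```

Under these assumptions the entry value is 24 bytes, three times what a MicroZAP entry can store, and a shift of 14 corresponds to 16384-byte leaf blocks.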
On dRAID or RAIDZ pools, this results in 90-110K dsize per empty directory due to stripe alignment and parity overhead. For a typical MDT with millions of directories, this wastes significant pool space.
As a minimal test, I introduced osd_fzap_blockshift as a module parameter to replace the hardcoded 14. The parameter is exposed via /sys/module/osd_zfs/parameters/osd_fzap_blockshift.
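For checking which value a node is actually running with, a userspace helper along these lines could read the parameter back. This is only a sketch: read_blockshift is a hypothetical name, and it assumes the sysfs file contains a single decimal integer, as is usual for integer module parameters.

```c
#include <stdio.h>

/* Read an integer module parameter from the given sysfs path.
 * Returns the parsed value, or -1 if the file cannot be opened or does not
 * start with a decimal integer. */
static int read_blockshift(const char *path)
{
	FILE *f = fopen(path, "r");
	int shift;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &shift) != 1)
		shift = -1;
	fclose(f);
	return shift;
}
```

Calling read_blockshift("/sys/module/osd_zfs/parameters/osd_fzap_blockshift") on a node with the patched module loaded should return the active shift; on an unpatched node the file is absent and the helper returns -1.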
Before:
# mkdir testdir1 && touch testfile1
# du --si test*
100k testdir1
1.1k testfile1
After:
# mkdir testdir1 && touch testfile1
# du --si test*
67k testdir1
1.1k testfile1
Testing on the dRAID2:9d:12c:1s pool with different leaf_blockshift values showed the following dsize per empty directory:
blockshift=14 (16K): dsize=~100K (current default)
Varying the FatZAP leaf block size:
blockshift=12 (4K): dsize=~67K
blockshift=13 (8K): dsize=~67K
blockshift=14 (16K): dsize=~100K
blockshift=15 (32K): dsize=~100K
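To put the measurements in MDT terms, the following C sketch scales the per-directory dsize figures from the list above to a given number of empty directories. The function names are hypothetical, and the directory count is whatever the caller supplies. The figures are the approximate values measured on this one dRAID pool and will differ on other pool layouts.

```c
#include <stdint.h>

/* Approximate dsize per empty directory in KB, as measured on the
 * dRAID2:9d:12c:1s test pool. Returns 0 for shifts that were not tested. */
static unsigned int measured_dsize_kb(unsigned int shift)
{
	switch (shift) {
	case 12:
	case 13:
		return 67;
	case 14:
	case 15:
		return 100;
	default:
		return 0;
	}
}

/* Total dsize in KB for ndirs empty directories at the given shift. */
static uint64_t total_dsize_kb(uint64_t ndirs, unsigned int shift)
{
	return ndirs * measured_dsize_kb(shift);
}
```

For one million empty directories, this gives about 100 GB at blockshift=14 versus about 67 GB at blockshift=12, a difference of roughly 33 GB per million directories (decimal units, matching du --si).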
I think blockshift values from 12 to 15 are the useful range. ZFS already enforces its own limits on this value, so I'm not adding any explicit checks in Lustre.
Future Work:
The real fix would be to store luz_direntry (dnode + FID) directly in the existing dnode bonus buffer, allowing small directories to avoid FatZAP entirely. Such a TinyZAP implementation would require changes on both the ZFS and Lustre sides. Because no additional block would be allocated, it would reduce empty directory dsize to a very low value.