Lustre / LU-20010

Control FatZAP leaf blocksize

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • Lustre 2.18.0
    • Upstream, Lustre 2.16.1

    Description

      Lustre directories on ZFS always use the FatZAP format because each directory entry (luz_direntry) stores both the ZFS dnode ID and a Lustre FID, which does not fit in the single uint64 value that MicroZAP allows per entry. The FatZAP leaf block shift is hardcoded to 14 (16K blocks) in Lustre, so even a single empty directory takes a lot of space.
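
      For reference, the leaf block size is 2^blockshift bytes, so the hardcoded 14 corresponds to 16384 bytes. A small shell sketch of that conversion follows; the helper name blockshift_to_bytes is made up for illustration and is not part of Lustre or ZFS.

```shell
# Hypothetical helper: convert a ZAP leaf block shift to a block size in bytes.
blockshift_to_bytes() {
    echo $((1 << $1))
}

# Print the block size for the shift values discussed in this ticket.
for shift in 12 13 14 15; do
    printf 'blockshift=%d -> %d bytes\n' "$shift" "$(blockshift_to_bytes "$shift")"
done
```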

      On dRAID or RAIDZ pools, this results in 90-110K dsize per empty directory due to stripe alignment and parity overhead. For a typical MDT with millions of directories, this wastes significant pool space.

      I did a minimal test by introducing osd_fzap_blockshift as a module parameter, replacing the hardcoded 14. The parameter is exposed via /sys/module/osd_zfs/parameters/osd_fzap_blockshift. 
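
      A minimal sketch of checking the parameter from a shell, assuming the test patch is applied and the osd_zfs module is loaded. The sysfs path is the one given above; the function name show_fzap_blockshift is only illustrative.

```shell
# Path taken from the description above; it only exists with the test patch
# applied and the osd_zfs module loaded.
PARAM=/sys/module/osd_zfs/parameters/osd_fzap_blockshift

# Print the current value, or a note if the parameter is not present.
# An alternative path can be passed as the first argument.
show_fzap_blockshift() {
    path=${1:-$PARAM}
    if [ -r "$path" ]; then
        echo "osd_fzap_blockshift=$(cat "$path")"
    else
        echo "osd_fzap_blockshift not available (module not loaded or patch not applied)"
    fi
}

show_fzap_blockshift
```

      If the parameter behaves like other module parameters, it could presumably also be set at load time with a modprobe.d line such as "options osd_zfs osd_fzap_blockshift=12". Whether it can be changed at runtime depends on the permissions the patch gives it.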

      Before:

      # mkdir testdir1 && touch testfile1
      # du --si test*
      100k    testdir1
      1.1k    testfile1

      After:

      # mkdir testdir1 && touch testfile1
      # du --si test*
      67k     testdir1
      1.1k    testfile1

      Testing on a dRAID2:9d:12c:1s pool with different leaf_blockshift values showed the following dsize per empty directory:
        blockshift=12 (4K):  dsize=~67K
        blockshift=13 (8K):  dsize=~67K
        blockshift=14 (16K): dsize=~100K (current default)
        blockshift=15 (32K): dsize=~100K
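
      To put the per-directory difference in perspective, here is a rough estimate of the pool space saved across many empty directories. The per-directory figures are the approximate measurements above; the directory count is a made-up example, and the function name estimate_saved_gb is hypothetical.

```shell
# Rough estimate of pool space saved by a smaller leaf block size.
# Arguments: dsize before (bytes), dsize after (bytes), number of directories.
# Prints the saving in GB (powers of 1000, matching du --si).
estimate_saved_gb() {
    saved_bytes=$(( ($1 - $2) * $3 ))
    echo $(( saved_bytes / 1000000000 ))
}

dsize_before=100000   # ~100K per empty directory at blockshift=14
dsize_after=67000     # ~67K per empty directory at blockshift=12 or 13
ndirs=10000000        # hypothetical: 10 million empty directories

echo "saved: $(estimate_saved_gb "$dsize_before" "$dsize_after" "$ndirs") GB"
```

      This only covers empty directories on this pool layout; populated directories and other RAIDZ/dRAID geometries would need their own measurements.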

      I think blockshift values from 12 to 15 are the sensible range. ZFS already enforces its own limits on this value, so I'm not adding any explicit checks on the Lustre side.

      Future Work:
      The real fix would be a TinyZAP format that stores luz_direntry (dnode + FID) entries in the existing dnode bonus buffer, allowing small directories to avoid FatZAP entirely with no additional block allocation. This would reduce empty directory dsize to a very low value, but it would require changes on both the ZFS and Lustre sides.

          People

            akash-b Akash B