Do you know how many inodes are actually created by mdtest -i 10 -z 10 -n 32768? It would seem to be north of 10^9 = 1B directories (depth=10, fanout=10), but not a large number of regular files, and possibly no regular files at all: my testing shows that files are only created at each level of the tree if the requested file count exceeds the number of directories, i.e. files > sum(fanout^i, i = 0..depth). With ~1B directories and only 32k files requested, it appears no regular files would be created in this test.
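For reference, a quick back-of-the-envelope check of the directory count, assuming a simple tree where every directory has "fanout" subdirectories down to the given depth (illustrative arithmetic only, not mdtest's actual accounting, and the exact count depends on how mdtest treats the root and leaf levels):

/* Back-of-the-envelope check, NOT mdtest's actual accounting: count the
 * directories in a tree where every directory has 'fanout' children,
 * down to 'depth' levels. */
#include <stdio.h>

int main(void)
{
        unsigned long long dirs = 0, at_level = 1;
        int depth = 10, fanout = 10;

        for (int i = 0; i < depth; i++) {
                dirs += at_level;       /* directories created at this level */
                at_level *= fanout;     /* each one gets 'fanout' subdirectories */
        }
        printf("%llu directories\n", dirs);     /* prints 1111111111, i.e. ~1.1B */
        return 0;
}

With depth=fanout=10 that works out to roughly 1.1B directories, which dwarfs the 32768 files requested and matches the observation that no regular files would be created.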
The "df -i" and "lfs df -i" are somewhat inconsistent, showing either 13M or 10M inodes allocated. Similarly, the clients also show fewer blocks allocated on the MDT (421MB vs. 327MB), which makes me think that the client is showing slightly stale statfs data? If anything, that is a different issue I think.
In both cases, however, the ratio of space used per inode is about the same: ~32KB per allocated inode. If the majority of "files" created by mdtest are actually directories (as my local testing shows), then this space usage may not be unusual, since a FAT ZAP is allocated for each directory, which includes an 8KB index block and at least one 8KB leaf block, doubled for the metadata ditto copies, i.e. (8KB + 8KB) x 2 = 32KB.
What does seem confusing, and I believe is the root of your concern, is that while "df" shows the MDT as 27% (or 21%) full at the block level, "df -i" shows it as only 5% (or 3%) full at the inode level. Looking at the "Available" values, 1238155616 KB * 1024 / 309754058 inodes ~= 4096 bytes/inode, which is where the discrepancy comes from. Looking at the osd_statfs code for osd-zfs, that 4096 bytes/inode value comes from the recordsize=4096 dataset parameter.
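To make the discrepancy concrete, here is a minimal sketch (not the actual osd_statfs() code) of what an estimate that assumes one recordsize-sized block per future dnode produces with the numbers above:

/* Minimal sketch, NOT the actual osd-zfs osd_statfs() code: if free dnodes
 * are estimated as available_bytes / recordsize, a 4KB recordsize inflates
 * the total inode count and deflates IUse%. */
#include <stdio.h>

int main(void)
{
        unsigned long long avail_bytes = 1238155616ULL * 1024; /* "Available" KB -> bytes */
        unsigned long long used_inodes = 10000000ULL;          /* ~10M dnodes allocated */
        unsigned long long recordsize  = 4096;

        unsigned long long free_est  = avail_bytes / recordsize;   /* ~309M */
        unsigned long long total_est = used_inodes + free_est;     /* ~320M */

        printf("free=%llu total=%llu IUse%%=%llu\n",
               free_est, total_est, used_inodes * 100 / total_est); /* ~3% */
        return 0;
}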
The statfs calculations should take the actual space used per inode into account when computing the "total" inode count, but this appears to be confused by the low recordsize parameter, since it caps the assumed average dnode space usage at one block (e.g. as if a large number of small files were created). Normally the maximum blocksize is 128KB, and the code will use the actual space usage (32KB per dnode, which is less than one block per dnode) when computing the number of "free" dnodes. It should be computing avg_bytes_per_inode = used_bytes / used_inodes (which is the only information that ZFS actually tracks), then free_inodes = available_bytes / avg_bytes_per_inode and total_inodes = used_inodes + free_inodes. In this case that is 1238155616 KB * 1024 / 32768 bytes/inode ~= 38M free inodes, giving ~48M total and 10M/48M ~= 21% full, as the block usage would suggest.
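And a sketch of the proposed estimate using the observed per-dnode usage instead (again just illustrative arithmetic with rounded values from above, not actual osd-zfs code):

/* Minimal sketch of the proposed estimate: derive bytes-per-dnode from the
 * space already consumed, then size the remaining space in those units. */
#include <stdio.h>

int main(void)
{
        unsigned long long used_bytes  = 327680000000ULL;       /* ~327GB used (~32KB x 10M dnodes) */
        unsigned long long used_inodes = 10000000ULL;           /* ~10M dnodes allocated */
        unsigned long long avail_bytes = 1238155616ULL * 1024;  /* "Available" KB -> bytes */

        unsigned long long avg       = used_bytes / used_inodes; /* ~32KB per dnode */
        unsigned long long free_est  = avail_bytes / avg;        /* ~38M */
        unsigned long long total_est = used_inodes + free_est;   /* ~48M */

        printf("avg=%llu free=%llu total=%llu IUse%%=%llu\n",
               avg, free_est, total_est, used_inodes * 100 / total_est); /* ~21% (prints 20 with integer math) */
        return 0;
}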
For this particular test case, I suspect that removing the recordsize=4096 setting will fix the total dnode count to align with the average 32KB/dnode used by each directory. In the real-world case, where there are 8KB of spill blocks per dnode instead of the estimated 1KB per dnode, fixing this properly will depend on large dnode support, but removing the recordsize parameter would at least allow the code to compute the estimate more accurately.
I am re-running the mdtest on a new Lustre filesystem with the exact same configuration, just omitting the recordsize=4096 dataset option. So far the percentages are tracking nearly identically, currently at 16% and climbing.
[root@n0002 ~]# lfs df && lfs df -i
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre2-MDT0000_UUID  1566225280   246041344  1320181888  16% /lustre[MDT:0]
lustre2-OST0000_UUID 42844636160       15360 42844618752   0% /lustre[OST:0]
lustre2-OST0001_UUID 42844636160       15360 42844618752   0% /lustre[OST:1]
filesystem summary:  85689272320       30720 85689237504   0% /lustre

UUID                    Inodes     IUsed      IFree IUse% Mounted on
lustre2-MDT0000_UUID  48371166   7598884   40772282   16% /lustre[MDT:0]
lustre2-OST0000_UUID 193185425       223  193185202    0% /lustre[OST:0]
lustre2-OST0001_UUID 193185425       223  193185202    0% /lustre[OST:1]
filesystem summary:   48371166   7598884   40772282   16% /lustre
As I read the dnode breakdown in your last comment, I am left with the sense that in the osd-zfs environment, MDT inode utilization and monitoring a high-water mark of available inodes is a bit of black magic compared to an ldiskfs MDT. Depending on the distribution of files and directories, running out of inodes on the MDT is less predictable until you are close to the limit.
Does it make sense to use recordsize=32768 or recordsize=65536 as a way to avoid losing so much potential capacity to the 128K default?