Do you know how many inodes are actually created from mdtest -i 10 -z 10 -n 32768? It would seem to be north of 10^9 = 1B directories (depth=10, fanout=10), but not a large number of regular files (or possibly no regular files at all, since my testing shows that files are created at each level of the tree, but only if files > sum(fanout^i, i = 0..depth)). It appears that with over 1B directories and only 32k files requested, no regular files would be created in this test.
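As a rough sanity check (assuming those parameters really do translate to a tree with fanout=10 and depth=10, which is how I read them), the directory count can be computed directly. This is just illustrative arithmetic, not mdtest's own accounting:

    # Rough estimate of the mdtest namespace, assuming fanout (branching) b=10
    # and depth z=10; illustrative only, not mdtest's actual logic.
    b, z, nfiles = 10, 10, 32768

    # directories at levels 0..z: sum of b^i
    ndirs = sum(b**i for i in range(z + 1))
    print(f"directories: {ndirs:,}")                  # well over 10^9

    # per my testing, files are only created once -n exceeds the directory count
    print("regular files created:", nfiles > ndirs)   # False for these parameters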
The "df -i" and "lfs df -i" are somewhat inconsistent, showing either 13M or 10M inodes allocated. Similarly, the clients also show fewer blocks allocated on the MDT (421MB vs. 327MB), which makes me think that the client is showing slightly stale statfs data? If anything, that is a different issue I think.
In both cases, however, the ratio of space used per inode is about the same - 32KB per allocated inode. If the majority of "files" created by mdtest are actually directories (as my local testing shows) then this space usage may not be unusual since there is a FAT ZAP allocated for each directory, which includes an 8KB index block and at least one 8KB leaf block, x2 for metadata ditto copies.
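As a back-of-the-envelope check of that 32KB figure (assuming the only significant per-directory consumers are the ZAP index and leaf blocks plus their ditto copies):

    # Approximate per-directory space for a FAT ZAP: an 8KB index block plus
    # at least one 8KB leaf block, doubled for metadata ditto copies.
    zap_index = 8 * 1024
    zap_leaf  = 8 * 1024
    ditto     = 2
    per_dir   = (zap_index + zap_leaf) * ditto
    print(per_dir)   # 32768 bytes = 32KB, matching the observed usage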
What does seem confusing, and I believe is the root of your concern, is that while df shows the MDT as 27% (or 21%) full at the block level, df -i shows it as only 5% (or 3%) full at the inode level. Looking at the "Available" values, 1238155616 KB * 1024 / 309754058 inodes ~= 4096 bytes/inode, which is where the discrepancy comes from. Looking at the osd_statfs code for osd-zfs, the 4096 bytes/inode value comes from the recordsize=4096 dataset parameter.
The statfs calculation should be taking the actual space used per inode into account when computing the "total" inode count, but this appears to be confused by the low recordsize parameter, since that constrains the estimated average dnode space usage to at most one block (e.g. as if a large number of small files had been created). Normally, the maximum blocksize is 128KB, and the actual space usage (32KB per dnode, which is less than one block per dnode) would be used when computing the number of "free" dnodes. It should be computing avg_bytes_per_inode = used_bytes / used_inodes (which is the only information that ZFS actually tracks) and then total_inodes = used_inodes + available_bytes / avg_bytes_per_inode; in this case 1238155616 KB * 1024 / 32768 bytes/inode ~= 38M free inodes, and 10M/48M ~= 21% full, as the block usage would suggest.
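To make the arithmetic concrete, here is a small sketch contrasting the current estimate (bytes-per-inode capped by recordsize=4096) with the proposed used_bytes/used_inodes approach. The numbers are taken from the statfs output above, and the capping is a simplified model of what I believe osd-zfs is doing, not the actual code:

    # Numbers from the "lfs df" / "lfs df -i" output above.
    avail_bytes  = 1238155616 * 1024    # "Available" KB on the MDT
    used_inodes  = 10 * 1024 * 1024     # ~10M allocated dnodes
    used_bytes   = used_inodes * 32768  # ~32KB per allocated dnode

    def estimate(bytes_per_inode):
        free_inodes  = avail_bytes // bytes_per_inode
        total_inodes = used_inodes + free_inodes
        return free_inodes, 100.0 * used_inodes / total_inodes

    # Current behaviour: estimate is capped at recordsize=4096 bytes/inode.
    print(estimate(4096))                        # ~310M free inodes -> ~3% "full"

    # Proposed: avg_bytes_per_inode = used_bytes / used_inodes = 32KB.
    print(estimate(used_bytes // used_inodes))   # ~38M free inodes -> ~21% full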
For this particular test case, I suspect removing the recordsize=4096 setting will fix the total dnode count estimate so that it aligns with the average 32KB/dnode used by each directory. In the real-world case, where there are 8KB of spill blocks per dnode instead of the estimated 1KB per dnode, a full fix will depend on large dnode support, but removing the recordsize parameter would at least allow the code to compute the estimate more accurately.
Just to clarify, the recordsize on the MDT affects the statfs calculation by limiting the maximum "bytes per inode" that it will use to estimate the number of inodes that can fit into the remaining free space. It would also potentially affect the blocksize of regular files stored on the MDT (OI files, last_rcvd, changelogs, etc). It probably makes sense to allow using recordsize=4096 on the MDT without affecting the statfs calculation, and I can work on a patch to fix the statfs reporting in this case, but it won't affect the actual space usage. In the meantime, there would be a benefit to increasing the recordsize to 16KB so that the current statfs free inodes estimation can work properly.
I don't think that changing the recordsize (which allows blocks up to 128KB) will actually lose capacity in any way. I think the major reason for higher per-dnode space consumption for regular files is the use of ashift=12 (or underlying 4KB sector size drives) forcing the minimum block allocation to be 4KB (x2 for ditto copies) for data that doesn't fit directly into the dnode. The only solutions I see here are to (unfortunately) reformat without ashift=12 or to use the large dnode patch (which will create new dnodes more efficiently, and eventually the space consumption will decline as old dnodes are recycled).
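As a rough model of that overhead (assuming a ~1KB spill/data block per file that doesn't fit directly into the dnode, rounded up to the minimum allocation of 2^ashift and doubled for ditto copies):

    # Rough per-file overhead model: a ~1KB spill/data block that doesn't fit in
    # the dnode, rounded up to the minimum allocation (2^ashift), x2 ditto copies.
    def per_file_overhead(ashift, payload=1024, copies=2):
        min_alloc = 1 << ashift
        alloc = -(-payload // min_alloc) * min_alloc   # round up to min_alloc
        return alloc * copies

    print(per_file_overhead(12))   # 8192 bytes = 8KB with ashift=12
    print(per_file_overhead(9))    # 2048 bytes = 2KB with ashift=9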
As for the question about backup and restore of the MDT data, currently only zfs send and zfs recv can be used to do this. This will preserve the required ZFS filesystem structure, but allow changes to the underlying zpool configuration (e.g. ashift, VDEVs, etc) as needed. This is documented to some extent in
LUDOC-161, but hasn't made it into the user manual yet.

For the mdtest run, I think you were only creating directories the whole time, since the space usage stayed at 32KB per dnode until the filesystem was full, and the parameters specified didn't allow any regular files to be created. You should probably pick a smaller depth and/or fanout (e.g. -i 1 -n 100000) to create about 100k files in a single directory and see how much space that consumes per inode. I expect about 8KB/inode for ashift=12 and 2KB/inode for ashift=9 (assuming that is possible for your disks).
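For reference, with those per-inode estimates a 100k-file run should land roughly in this range on the MDT (purely illustrative arithmetic based on my 8KB/2KB guesses above):

    # Expected MDT space for ~100k regular files in a single directory,
    # using the estimated per-inode costs above (illustrative only).
    nfiles = 100000
    for label, per_inode in (("ashift=12", 8 * 1024), ("ashift=9", 2 * 1024)):
        print(label, nfiles * per_inode // (1024 * 1024), "MB")
    # ashift=12 ~ 781 MB, ashift=9 ~ 195 MB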