Description
When allocating new objects for a ZFS-based server, the dnode should be sized large enough to store all of the base Lustre extended attributes (trusted.lma, trusted.fid, trusted.version) in order to avoid requiring a spill block. In practice, this means a minimum dnode size of 1K is required to accommodate the packed xattr nvlist in the bonus area.
> zdb -e -p /tmp/ -dddd lustre-ost1/ost1 1500
Dataset lustre-ost2/ost2 [ZPL], ID 391, cr_txg 8, 18.8M, 614 objects, rootbp DVA[0]=<0:16440a00:200> DVA[1]=<0:ada00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=89L/89P fill=614 cksum=dd5e644f1:50403505c9e:f29fa6075ec6:1fd1540cabc5f5
Object lvl iblk dblk dsize dnsize lsize %full type
1500 1 128K 64K 64K 1K 64K 100.00 ZFS plain file
356 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
...
SA xattrs: 204 bytes, 3 entries
trusted.lma = \010\000\000\000\000\000\000\000\000\000\001\000\001\000\000\000\360\000\000\000\000\000\000\000
trusted.fid = \001\004\000\000\002\000\000\000\335\001\000\000\000\000\000\000\000\000\020\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000
trusted.version = \357\000\000\000\001\000\000\000
However, by default 512-byte dnodes are created, which forces a spill block to be allocated for each object. This significantly increases the required storage and reduces performance because of the additional I/O.
It appears the issue is caused by OSD_BASE_EA_IN_BONUS being set incorrectly. This value needs to account not just for the data size (which it does), but also for the xattr key text and the XDR encoding overhead of the packed nvlist. After everything is taken into account, the correct size is 204 bytes according to zdb.
It seems to me the correct fix here is to update OSD_BASE_EA_IN_BONUS accordingly. Making the following change fixes the issue in my testing. It'd be great if someone else could review this change to make sure additional changes aren't needed.
+/*
+ * The base extended attribute SA size including the keys, values,
+ * and XDR encoding overhead as reported by zdb.
+ *
+ * SA xattrs: 204 bytes, 3 entries
+ * trusted.lma = \000\000\000\000...
+ * trusted.fid = \000\000\000\000...
+ * trusted.version = \000\000\000...
+ */
+#define OSD_BASE_EA_IN_BONUS (ZFS_SA_BASE_ATTR_SIZE + 204)
Until a fix is merged, a reasonable workaround is to explicitly set the dnodesize property to 1K for the ZFS datasets on pools containing OSTs, and possibly larger on pools containing MDTs.
zfs set dnodesize=1k pool/ost