[LU-8342] ZFS dnodesize and recordsize should be set at file system creation Created: 28/Jun/16 Updated: 13/Sep/17 Resolved: 13/Sep/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Giuseppe Di Natale (Inactive) | Assignee: | Giuseppe Di Natale (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
| Description |
|
The ZFS dnodesize and recordsize properties should be set to appropriate defaults when the backing datasets are created at file system creation time. We can set dnodesize=auto and recordsize=1M by default if the installed version of ZFS supports those values. |
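As a rough sketch of the intent (pool, dataset, and device names below are hypothetical, and dnodesize=auto requires a ZFS release with the large_dnode feature):

```sh
# Format a ZFS OST; whether mkfs.lustre applies the properties itself
# is what this ticket proposes. Names here are made up.
mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
    ostpool/ost0 mirror /dev/sda /dev/sdb

# Manual equivalent of the proposed defaults, applied after creation
# (they affect only dnodes/blocks written from this point on):
zfs set dnodesize=auto ostpool/ost0
zfs set recordsize=1M ostpool/ost0
```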
| Comments |
| Comment by Andreas Dilger [ 28/Jun/16 ] |
|
See |
| Comment by Giuseppe Di Natale (Inactive) [ 28/Jun/16 ] |
|
Andreas, I have a different way of setting the recordsize and dnodesize properties that avoids the runtime detection. I'll submit it shortly so it's out there and can be commented on. I did have a question: in your version, you set the recordsize property only on OSTs. Why not on MDTs as well? Are there performance concerns? From my understanding, the recordsize property is more of a maximum. I also was looking at object creation in osd-zfs, and ultimately ZFS object allocation is called with a blocksize of 0, which results in a minimum-sized block being allocated. It seems there is no harm in having recordsize set to 1M on the MDTs as well. |
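A quick way to see the "recordsize is a maximum" behavior, sketched with a hypothetical dataset (not part of the patch):

```sh
# recordsize caps the block size; a file smaller than the recordsize
# still gets a single small block, so 1M records cost nothing here.
zfs create -o recordsize=1M ostpool/demo
dd if=/dev/urandom of=/ostpool/demo/small bs=4k count=1
sync
du -h /ostpool/demo/small   # ~4K allocated, not 1M
```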
| Comment by Gerrit Updater [ 28/Jun/16 ] |
|
Giuseppe Di Natale (dinatale2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/21055 |
| Comment by Andreas Dilger [ 28/Jun/16 ] |
|
The reason for not setting the recordsize on the MDT is that there are few reasons to have such large IOs on the MDT. Almost all files on the MDT are modified in small chunks (e.g. log files, config files, and directories, though directories have a separate tunable), so having a large blocksize could potentially hurt metadata performance significantly; I'd rather avoid that complexity and risk. We've always tuned ldiskfs differently for the MDT and OST for exactly this reason, for example not having extent-mapped files on the MDT and having different ratios of space to inodes. |
| Comment by Andreas Dilger [ 29/Jun/16 ] |
|
Joe, are you also going to set ashift=12 for ZFS? I think there are no good reasons to have ashift=9 on the OSTs, but it can have a significant negative performance impact if not auto-detected correctly, and it blocks the ability to upgrade to 4KB drives in the future. |
| Comment by Giuseppe Di Natale (Inactive) [ 29/Jun/16 ] |
|
Are you suggesting that I set ashift=12 by default? I talked with Brian about the ashift property; based on that discussion, ZFS attempts to pick the right ashift based on the hardware. Is there certain hardware you're experiencing issues with? |
| Comment by Giuseppe Di Natale (Inactive) [ 29/Jun/16 ] |
|
Andreas, what is your opinion on setting dnodesize on both OSTs and MDTs as well? |
| Comment by Andreas Dilger [ 29/Jun/16 ] |
|
There will definitely be more xattrs on the MDT, but there are also some on the OST. AFAIK there shouldn't be any harm in dnodesize=auto for the OSTs, even if they only use the minimum dnode size today, and a few more xattrs will be stored on OST objects in the future to improve LFSCK support with composite file layouts. For the MDT, it definitely makes sense to enable at least dnodesize=auto, but it might make sense to reserve more space if the median xattr size is larger. IMHO it wouldn't be terrible to track the size of xattrs stored on a dnode in a histogram of per-CPU counters (to avoid contention), so that enough dnode slots are reserved for each new dnode even if the setxattr doesn't happen atomically with the create. Alex was working on a patch that does this by accumulating the size of xattrs declared during file create, http://review.whamcloud.com/19101, which could use some review. |
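For reference, a hedged sketch of checking and applying this, with hypothetical pool and dataset names; the large_dnode pool feature must be enabled for dnodesize=auto to take effect:

```sh
# Confirm the pool supports large dnodes (ZFS >= 0.7):
zpool get feature@large_dnode mdtpool

# Let ZFS size each new dnode to fit its xattrs:
zfs set dnodesize=auto mdtpool/mdt0   # MDT: many xattrs per object
zfs set dnodesize=auto ostpool/ost0   # OST: fewer xattrs, still safe
```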
| Comment by Giuseppe Di Natale (Inactive) [ 30/Jun/16 ] |
|
Ok, then I will go ahead and set dnodesize=auto at file system creation time. If the dnode size needs to be different, that can be handled after the fact. Brian also mentioned that, in the future, dnodesize=auto could be made more intelligent so it chooses the appropriate dnode size when one isn't specified; currently, I believe auto just results in 1K dnodes. Andreas, can you please comment on the ashift questions above? I just want to make sure I understand the proposed ashift changes. |
| Comment by Andreas Dilger [ 30/Jun/16 ] |
|
Based on past postings on the ZFS mailing lists, users have reported terrible performance when ZFS doesn't auto-detect the 4KB sector size correctly (usually because drives advertise 512-byte sectors for "maximum compatibility" even when they use 4KB sectors internally). Not only does this hurt performance, it can potentially lead to data reliability problems if sectors that do not belong to the current block are being modified. Even if drives correctly report 512-byte sectors, there is a long-term maintenance problem when those drives need to be replaced: all newer/larger drives have 4KB sectors, and it isn't possible to replace a drive in a 512-byte-sector VDEV with a 4096-byte-sector drive without a full backup/restore. That makes maintenance more problematic and prevents VDEV "autoexpand" from working when existing drives are replaced with larger ones. While I understand Brian's reluctance to change the OpenZFS default to ashift=12 (despite repeated requests) because it would increase space usage for some workloads, that is less of a concern for Lustre OSTs. From a support and "best performance out of the box" point of view, I'd prefer setting ashift=12 on OSTs by default. |
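Since ashift is fixed per vdev at creation time, the default would have to be applied when the pool is built. A minimal sketch with hypothetical names:

```sh
# Force 4KB sectors regardless of what the drives advertise:
zpool create -o ashift=12 ostpool mirror /dev/sda /dev/sdb

# Verify the ashift actually recorded for each vdev:
zdb -C ostpool | grep ashift
```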
| Comment by Olaf Faaland [ 08/Jul/16 ] |
|
Even if these pool and dataset settings are correct for all cases now, they may become incorrect with future changes in either Lustre or ZFS. Furthermore, there are likely unusual cases (e.g. testing; see Jinshan's concern about test data often being highly compressible) where one or more of these settings are undesirable. Both http://review.whamcloud.com/#/c/19892/2 and http://review.whamcloud.com/#/c/21055/3 apply the desired settings by hard-coding specific values into mkfs.lustre. How about putting the settings themselves into a configuration file, e.g. /etc/sysconfig/lustre or a distro-specific equivalent, which is parsed by mkfs.lustre? Then they are visible to the user, can easily be changed when appropriate, and the defaults can be changed with a trivial patch that is easy to review. |
| Comment by Olaf Faaland [ 08/Jul/16 ] |
|
I see that lustre/conf/lustre already has ZPOOL_IMPORT_DIR and ZPOOL_IMPORT_ARGS. So perhaps ZPOOL_CREATE_ARGS and ZFS_CREATE_ARGS? |
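A sketch of what that could look like in lustre/conf/lustre; the ZPOOL_CREATE_ARGS and ZFS_CREATE_ARGS names are Olaf's suggestion here, not an existing interface:

```sh
# /etc/sysconfig/lustre (sourced as shell) -- proposed additions.
# Extra arguments mkfs.lustre would pass through when creating the
# pool and dataset, instead of hard-coding defaults in C:
ZPOOL_CREATE_ARGS="-o ashift=12"
ZFS_CREATE_ARGS="-o dnodesize=auto -o recordsize=1M"
```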
| Comment by Olaf Faaland [ 08/Jul/16 ] |
|
Perhaps lustre/conf/lustre is not the right place for such settings; I see there is an /etc/mke2fs.conf with ini-style contents, and some userspace apps use /etc/default/foo. In any case, the basic proposal stands: put these settings in a config file instead of in the code. |
| Comment by Andreas Dilger [ 08/Jul/16 ] |
|
There are already - |
| Comment by Olaf Faaland [ 08/Jul/16 ] |
|
Andreas, |
| Comment by Andreas Dilger [ 12/Jul/16 ] |
|
Olaf, I'm not against that, but it would definitely be more work than the current patch and would likely push the change out into 2.10. |
| Comment by Gu Zheng (Inactive) [ 28/Dec/16 ] |
|
Hi Giuseppe, any update on patch http://review.whamcloud.com/21055? |
| Comment by Giuseppe Di Natale (Inactive) [ 04/Jan/17 ] |
|
Currently no updates to report. I will try to revisit this soon. |
| Comment by Gerrit Updater [ 13/Sep/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21055/ |
| Comment by Peter Jones [ 13/Sep/17 ] |
|
Landed for 2.11 |