[LU-8342] ZFS dnodesize and recordsize should be set at file system creation Created: 28/Jun/16  Updated: 13/Sep/17  Resolved: 13/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Improvement Priority: Minor
Reporter: Giuseppe Di Natale (Inactive) Assignee: Giuseppe Di Natale (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Issue Links:
Duplicate
Related
is related to LU-8042 mkfs.lustre should set ashift=12 reco... Resolved

 Description   

ZFS dnodesize and recordsize should be set to appropriate defaults upon dataset creation at filesystem creation time. We can set dnodesize=auto and recordsize=1M by default if the installed version of zfs supports it.
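As a rough illustration of the behavior requested above, the sketch below builds a `zfs create` command line that adds each property only when it is reported as supported. The helper name `build_zfs_create_cmd`, the `supported` set, and the dataset name `tank/ost0` are hypothetical; this is not how mkfs.lustre implements it, and how support is detected is left out.

```python
def build_zfs_create_cmd(dataset, supported):
    """Return an argv list for `zfs create`.

    `supported` is a hypothetical set of property names that the installed
    ZFS is assumed to accept; detection is outside the scope of this sketch.
    """
    wanted = [("dnodesize", "auto"), ("recordsize", "1M")]
    cmd = ["zfs", "create"]
    for name, value in wanted:
        if name in supported:
            # Each dataset property is passed as `-o name=value`.
            cmd += ["-o", f"{name}={value}"]
    cmd.append(dataset)
    return cmd


if __name__ == "__main__":
    # Only builds the command; nothing is executed.
    print(" ".join(build_zfs_create_cmd("tank/ost0", {"dnodesize", "recordsize"})))
```

Building the argument list separately from running it keeps the "is this property supported" decision easy to test without a ZFS pool.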



 Comments   
Comment by Andreas Dilger [ 28/Jun/16 ]

See LU-8042 and http://review.whamcloud.com/19892 for most of this. I just haven't had time to finish that up, if you wanted to use my patch as a starting point. I don't think it has the dnode_size option yet as that needs some configure and runtime detection.

Comment by Giuseppe Di Natale (Inactive) [ 28/Jun/16 ]

Andreas,

I have a different way of setting the recordsize and dnodesize properties which will avoid the runtime detection. I'll submit it shortly so it's out there and can be commented on.

I did have a question. You set the recordsize property only on OSTs in your version. Why not on MDTs as well? Are there performance concerns? From my understanding, the recordsize property is more of a maximum. I was also looking at object creation in osd-zfs, and ultimately zfs object allocation is called with a blocksize of 0, which results in a minimum-sized block being allocated. It seems like there is no harm in having recordsize set to 1M on the MDTs as well.

Comment by Gerrit Updater [ 28/Jun/16 ]

Giuseppe Di Natale (dinatale2@llnl.gov) uploaded a new patch: http://review.whamcloud.com/21055
Subject: LU-8342 utils: Set dnodesize and recordsize at dataset creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 408424c909552b599340953fa92e31b54948249b

Comment by Andreas Dilger [ 28/Jun/16 ]

The reason for not setting the recordsize on the MDT is that there are few reasons to have such large IOs on the MDT. Almost all files on the MDT (e.g. log files, directories (though they have a separate tunable), config files, etc.) are modified in small chunks, so a large blocksize could potentially hurt metadata performance significantly. I'd rather avoid that complexity and risk.

We've always tuned ldiskfs differently for the MDT and OST for exactly this reason, for example not having extent-mapped files on the MDT, having different ratios of space to inodes, etc.
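The MDT/OST distinction described above could be captured as a small per-target table of defaults. This is only a sketch of that reasoning: the names `DEFAULT_DATASET_PROPS` and `create_options` are hypothetical, and the table reflects the position taken in this comment, not necessarily what any patch does.

```python
# Hypothetical per-target defaults: large records only where large I/O is
# expected (OSTs); the MDT keeps whatever recordsize ZFS uses by default.
DEFAULT_DATASET_PROPS = {
    "ost": {"recordsize": "1M"},
    "mdt": {},
}


def create_options(target_type):
    """Return the `-o name=value` arguments for the given target type."""
    props = DEFAULT_DATASET_PROPS[target_type]
    args = []
    for name, value in sorted(props.items()):
        args += ["-o", f"{name}={value}"]
    return args


if __name__ == "__main__":
    for target in ("mdt", "ost"):
        print(target, create_options(target))
```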

Comment by Andreas Dilger [ 29/Jun/16 ]

Joe, are you also going to set ashift=12 for ZFS? I think there are no good reasons to have ashift=9 on the OSTs, but it can have a significant negative performance impact if the sector size is not auto-detected correctly, and it blocks the ability to update to 4KB drives in the future.

Comment by Giuseppe Di Natale (Inactive) [ 29/Jun/16 ]

Are you suggesting that I set ashift=12 by default? I talked with Brian about the ashift property. Based on that discussion, zfs attempts to pick the right ashift based on the hardware. Is there certain hardware you're experiencing issues with?

Comment by Giuseppe Di Natale (Inactive) [ 29/Jun/16 ]

Andreas, what is your opinion on setting dnodesize on both OSTs and MDTs as well?

Comment by Andreas Dilger [ 29/Jun/16 ]

There will definitely be more xattrs in the MDT, but there are also some on the OST. AFAIK there shouldn't be any harm in setting dnodesize=auto for the OSTs, even if they only use the minimum dnode size today. There are definitely a few more xattrs that will be stored on the OST objects in the future in order to improve LFSCK support with composite file layouts.

For the MDT, it definitely makes sense to enable at least dnodesize=auto, but it might make sense to reserve more space if the median xattr size is larger. IMHO it wouldn't be terrible to track the size of xattrs stored on a dnode in a histogram of percpu counters (to avoid contention) so that there are enough dnode slots reserved for each new dnode even if the setxattr doesn't happen atomically with the create. Alex was working on a patch to do this by accumulating the size of xattrs declared during the file create (http://review.whamcloud.com/19101), which could use some review.
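To make the histogram idea above concrete, here is a toy model of per-CPU counters bucketed by power-of-two xattr size. Everything in it is hypothetical (the class name `XattrHistogram`, the bucket layout, the median rule), and it says nothing about how osd-zfs or the referenced patch actually tracks xattr sizes.

```python
NBUCKETS = 16  # hypothetical: bucket i counts sizes in [2**i, 2**(i+1))


class XattrHistogram:
    """Per-CPU histogram of xattr sizes, summed only when read."""

    def __init__(self, ncpus):
        # One private counter array per CPU, so writers never share a slot.
        self.percpu = [[0] * NBUCKETS for _ in range(ncpus)]

    def record(self, cpu, size):
        # floor(log2(size)), clamped to the last bucket.
        idx = min(max(size, 1).bit_length() - 1, NBUCKETS - 1)
        self.percpu[cpu][idx] += 1

    def totals(self):
        # Column-wise sum across all CPUs.
        return [sum(column) for column in zip(*self.percpu)]

    def median_bucket(self):
        """Return the bucket index containing the median sample, or None."""
        totals = self.totals()
        n = sum(totals)
        if n == 0:
            return None
        seen = 0
        for idx, count in enumerate(totals):
            seen += count
            if 2 * seen >= n:
                return idx
```

A caller could map the median bucket to a dnode size to reserve at create time; that mapping is deliberately left out because it depends on details not covered in this thread.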

Comment by Giuseppe Di Natale (Inactive) [ 30/Jun/16 ]

Ok, then I will go ahead and set dnodesize=auto at file system creation time. If the dnode size needs to be different that can be handled after the fact.

Brian had also mentioned that in the future, dnodesize=auto could be a bit more intelligent and choose the appropriate dnode size if one wasn't specified. But, currently I believe auto just results in dnode sizes of 1K.

Andreas, can you please comment on the ashift questions above? I just want to make sure I understand the ashift changes proposed.

Comment by Andreas Dilger [ 30/Jun/16 ]

Based on past postings on the ZFS mailing lists, users have reported terrible performance when ZFS doesn't auto-detect the 4KB sector size correctly (usually because drives are advertising 512-byte sectors for "maximum compatibility" even when they have 4KB sectors internally). Not only does this hurt performance, it can potentially lead to data reliability problems if sectors are being modified that do not belong to the current block.

Also, even if drives are correctly reporting 512-byte sectors, there is a long-term maintenance problem if those drives need to be replaced by newer drives, because all newer/larger drives have 4KB sectors and it isn't possible to replace any drives in a 512-byte sector VDEV with 4096-byte sector drives without a full backup/restore. That makes maintenance more problematic, and it prevents VDEV "autoexpand" from working if existing drives are replaced with larger drives.

While I understand Brian's concern that changing the OpenZFS default to ashift=12 would increase space usage for some workloads (despite repeated requests to change it), this is less of a concern for Lustre OSTs. From a support and "best performance out of the box" point of view I'd prefer setting ashift=12 on OSTs by default.
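For reference, ashift is the base-2 logarithm of the sector size, so ashift=9 corresponds to 512-byte sectors and ashift=12 to 4096-byte sectors. The sketch below, using hypothetical helper names, shows how a floor of 12 covers drives that report 512-byte sectors while using 4KB sectors internally.

```python
def ashift_for(sector_bytes):
    """Return log2 of a power-of-two sector size (512 -> 9, 4096 -> 12)."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


def choose_ashift(reported_sector_bytes, floor=12):
    """Never go below `floor`, whatever sector size the drive reports."""
    return max(ashift_for(reported_sector_bytes), floor)


if __name__ == "__main__":
    for reported in (512, 4096):
        print(reported, ashift_for(reported), choose_ashift(reported))
```

With the floor in place, a pool built on drives that report 512-byte sectors still gets 4096-byte alignment, which is the property that matters when those drives are later replaced.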

Comment by Olaf Faaland [ 08/Jul/16 ]

Even if these pool and dataset settings are correct for all cases for now, they may become incorrect with future changes in either lustre or zfs. Furthermore, there are likely unusual cases (e.g. testing, see Jinshan's concern about test data often being highly compressible) where one or more of these settings are undesirable.

Both http://review.whamcloud.com/#/c/19892/2 and http://review.whamcloud.com/#/c/21055/3 set the desired settings by hard-coding specific values into mkfs.lustre. How about putting the settings themselves into a configuration file, e.g. /etc/sysconfig/lustre or distro-specific equivalent which is parsed by mkfs.lustre? Then they are visible to the user, can easily be changed when appropriate, and the defaults can be changed with a trivial patch that is easy to review.

Comment by Olaf Faaland [ 08/Jul/16 ]

I see that lustre/conf/lustre already has ZPOOL_IMPORT_DIR and ZPOOL_IMPORT_ARGS. So perhaps ZPOOL_CREATE_ARGS and ZFS_CREATE_ARGS?

Comment by Olaf Faaland [ 08/Jul/16 ]

Perhaps lustre/conf/lustre is not the right place for such settings; I see there is an /etc/mke2fs.conf with ini-style contents, and some userspace apps seem to use /etc/default/foo. Anyway, basic proposal that these settings be put in a config file, instead of in the code, still stands.
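A minimal sketch of what parsing such a file might look like, assuming a shell-style `KEY="value"` format and the variable names floated above. The sample contents and the function `parse_create_args` are hypothetical; no such file or parser is confirmed to exist in Lustre.

```python
import shlex

# Hypothetical file contents; these variables were only proposed in this thread.
SAMPLE = '''\
# pool-level and dataset-level settings kept separate
ZPOOL_CREATE_ARGS="-o ashift=12"
ZFS_CREATE_ARGS="-o dnodesize=auto -o recordsize=1M"
'''


def parse_create_args(text):
    """Parse KEY="args ..." lines into {KEY: [arg, ...]}."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        words = shlex.split(raw)            # strips the surrounding quotes
        value = words[0] if words else ""
        settings[key.strip()] = value.split()
    return settings


if __name__ == "__main__":
    for key, args in parse_create_args(SAMPLE).items():
        print(key, args)
```

Keeping pool arguments and dataset arguments under separate keys mirrors the point that the two kinds of properties are applied by different commands.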

Comment by Andreas Dilger [ 08/Jul/16 ]

There are already --mkfsoptions and --mountfsoptions that can be used to pass extra options to mkfs.lustre and to the internal mount command for the back-end filesystem. They should be able to override the default options specified internally by mkfs.lustre. My goal in specifying these options internally is that the majority of users should get the best performance out of the box if possible, rather than having to specify extra options.
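The override behavior described above amounts to layering user-supplied properties over built-in defaults. The sketch below models that with plain dictionaries; `merge_props` and the example values are hypothetical, and the exact string format accepted by the mkfs.lustre options is not modeled here.

```python
def merge_props(defaults, user):
    """Return defaults with any user-supplied properties taking precedence."""
    merged = dict(defaults)   # copy so the built-in defaults stay untouched
    merged.update(user)       # user values win on conflicts
    return merged


if __name__ == "__main__":
    defaults = {"dnodesize": "auto", "recordsize": "1M"}
    # A user who wants smaller records overrides just that one property.
    print(merge_props(defaults, {"recordsize": "128K"}))
```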

Comment by Olaf Faaland [ 08/Jul/16 ]

Andreas,
I understand. I'm suggesting that if those good defaults are encoded in a config file instead of in code, they (a) are visible to the user and (b) require only trivial code review to change. Also, the existing options do not distinguish between pool properties and dataset properties.

Comment by Andreas Dilger [ 12/Jul/16 ]

Olaf, I'm not against that, but it would definitely be more work than the current patch, and likely push the change out into 2.10.

Comment by Gu Zheng (Inactive) [ 28/Dec/16 ]

Hi Giuseppe,

Any update about the patch http://review.whamcloud.com/21055?

Comment by Giuseppe Di Natale (Inactive) [ 04/Jan/17 ]

Currently no updates to report. I will try to revisit this soon.

Comment by Gerrit Updater [ 13/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21055/
Subject: LU-8342 utils: Set dnodesize/recordsize at zfs dataset create
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1617b8f6b6cdd0f5b74d7bfb8166d74b63cfed81

Comment by Peter Jones [ 13/Sep/17 ]

Landed for 2.11
