LU-8342: ZFS dnodesize and recordsize should be set at file system creation

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0
    • Labels: None

    Description

      ZFS dnodesize and recordsize should be set to appropriate defaults when the backing dataset is created at filesystem creation time. We can set dnodesize=auto and recordsize=1M by default if the installed version of ZFS supports them.
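To make the "if the installed version supports it" check concrete, here is a minimal, purely illustrative sketch (not the actual mkfs.lustre code): a helper that gates the proposed defaults on the installed ZFS version. The function name and version thresholds are assumptions — dnodesize=auto depends on the large_dnode feature (ZFS 0.7.0+) and recordsize=1M on the large_blocks feature (ZFS 0.6.5+).

```shell
# Hypothetical sketch: choose dataset-creation properties based on the
# installed ZFS version string. Not taken from the actual patches.
zfs_dataset_props() {
    ver="$1"                        # e.g. "0.7.9"
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    patch=${rest#*.}
    props=""
    # large_dnode feature (dnodesize=auto) landed in ZFS 0.7.0
    if [ "$major" -gt 0 ] || [ "$minor" -ge 7 ]; then
        props="$props -o dnodesize=auto"
    fi
    # large_blocks feature (recordsize > 128K) landed in ZFS 0.6.5
    if [ "$major" -gt 0 ] || [ "$minor" -ge 7 ] || { [ "$minor" -eq 6 ] && [ "$patch" -ge 5 ]; }; then
        props="$props -o recordsize=1M"
    fi
    echo "${props# }"
}

zfs_dataset_props "0.7.9"   # -> -o dnodesize=auto -o recordsize=1M
```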

          Activity


            cengku9660 Gu Zheng (Inactive) added a comment -

            Hi Giuseppe,

            Any update on the patch http://review.whamcloud.com/21055?

            adilger Andreas Dilger added a comment -

            Olaf, I'm not against that, but it would definitely be more work than the current patch, and would likely push the change out to 2.10.

            ofaaland Olaf Faaland added a comment -

            Andreas,
            I understand. I'm suggesting that if those good defaults are encoded in a config file instead of in code, they (a) are visible to the user and (b) require only trivial code review to change. Also, the existing options do not distinguish between pool properties and dataset properties.

            adilger Andreas Dilger added a comment -

            There are already --mkfsoptions and --mountfsoptions that can be used to pass extra options to mkfs.lustre and to the internal mount command for the back-end filesystem. They should be able to override the default options specified internally by mkfs.lustre. My goal in specifying these options internally is that the majority of users should get the best performance out of the box if possible, rather than having to specify extra options.
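As a concrete illustration of the override path described above, a ZFS-backed OST format might look something like the following. This is a hedged sketch: the device, pool/dataset names, and the exact property spelling accepted inside --mkfsoptions are illustrative assumptions — check the mkfs.lustre man page for your release.

```shell
# Hypothetical example: create a ZFS-backed OST while overriding a
# built-in dataset-property default via --mkfsoptions. All names
# (testfs, ostpool/ost0, /dev/sdb, mgs@tcp) are illustrative.
mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp \
    --mkfsoptions="-o recordsize=128K" \
    ostpool/ost0 /dev/sdb
```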
            ofaaland Olaf Faaland added a comment -

            Perhaps lustre/conf/lustre is not the right place for such settings; I see there is an /etc/mke2fs.conf with INI-style contents, and some userspace apps use /etc/default/foo. In any case, the basic proposal stands: put these settings in a config file instead of in the code.
            ofaaland Olaf Faaland added a comment -

            I see that lustre/conf/lustre already has ZPOOL_IMPORT_DIR and ZPOOL_IMPORT_ARGS. So perhaps ZPOOL_CREATE_ARGS and ZFS_CREATE_ARGS?
            ofaaland Olaf Faaland added a comment -

            Even if these pool and dataset settings are correct for all cases now, they may become incorrect with future changes in either Lustre or ZFS. Furthermore, there are likely unusual cases (e.g. testing; see Jinshan's concern that test data is often highly compressible) where one or more of these settings are undesirable.

            Both http://review.whamcloud.com/#/c/19892/2 and http://review.whamcloud.com/#/c/21055/3 set the desired settings by hard-coding specific values into mkfs.lustre. How about putting the settings themselves into a configuration file, e.g. /etc/sysconfig/lustre or a distro-specific equivalent, which is parsed by mkfs.lustre? Then they are visible to the user, can easily be changed when appropriate, and the defaults can be changed with a trivial patch that is easy to review.
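A sketch of what such a config file might contain, using the variable names proposed earlier in this thread. This is hypothetical: mkfs.lustre does not parse these variables today, and the property values are the defaults discussed in this issue.

```shell
# /etc/sysconfig/lustre (hypothetical sketch; ZPOOL_CREATE_ARGS and
# ZFS_CREATE_ARGS are only proposed in this discussion, not implemented)
ZPOOL_CREATE_ARGS="-o ashift=12"
ZFS_CREATE_ARGS="-o dnodesize=auto -o recordsize=1M"
```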
            adilger Andreas Dilger added a comment - edited

            Based on past postings on the ZFS mailing lists, users have reported terrible performance when ZFS doesn't auto-detect the 4KB sector size correctly (usually because drives advertise 512-byte sectors for "maximum compatibility" even when they have 4KB sectors internally). Not only does this hurt performance, it can potentially lead to data reliability problems if sectors that do not belong to the current block are modified.

            Also, even if drives correctly report 512-byte sectors, there is a long-term maintenance problem when those drives need to be replaced, because all newer/larger drives have 4KB sectors and it isn't possible to replace any drive in a 512-byte sector VDEV with a 4096-byte sector drive without a full backup/restore. That makes maintenance more problematic, and also prevents VDEV "autoexpand" from working if existing drives are replaced with larger ones.

            While I understand Brian's concern about changing the OpenZFS default to ashift=12, since it would increase space usage for some workloads (despite repeated requests to change it), this is less of a concern for Lustre OSTs. From a support and "best performance out of the box" point of view, I'd prefer setting ashift=12 on OSTs by default.
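Forcing the 4KB sector size at pool creation, as suggested above, might look like this sketch. The pool name and device list are illustrative; ashift applies per VDEV and can only be set when the VDEV is created, which is why it matters to get it right up front.

```shell
# Hypothetical sketch: override sector-size auto-detection at pool
# creation (ashift=12 means 2^12 = 4096-byte sectors). Pool name and
# devices are illustrative.
zpool create -o ashift=12 ostpool raidz2 /dev/sd[b-g]

# Check what was actually used; ashift is visible in the cached config:
zdb -C ostpool | grep ashift
```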

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Ok, then I will go ahead and set dnodesize=auto at file system creation time. If the dnode size needs to be different, that can be handled after the fact.

            Brian had also mentioned that in the future, dnodesize=auto could be a bit more intelligent and choose the appropriate dnode size if one wasn't specified. Currently, I believe auto just results in 1K dnodes.

            Andreas, can you please comment on the ashift questions above? I just want to make sure I understand the proposed ashift changes.

            adilger Andreas Dilger added a comment -

            There will definitely be more xattrs on the MDT, but there are also some on the OST. AFAIK there shouldn't be any harm in enabling dnodesize=auto on the OSTs, even if they only use the minimum dnode size today. There are definitely a few more xattrs that will be stored on OST objects in the future in order to improve LFSCK support with composite file layouts.

            For the MDT, it definitely makes sense to enable at least dnodesize=auto, but it might make sense to reserve more space if the median xattr size is larger. IMHO it wouldn't be terrible to track the size of xattrs stored on a dnode in a histogram of per-CPU counters (to avoid contention) so that enough dnode slots are reserved for each new dnode, even if the setxattr doesn't happen atomically with the create. Alex was working on a patch to do this by accumulating the size of xattrs declared during file create, http://review.whamcloud.com/19101, which could use some review.

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Andreas, what is your opinion on setting dnodesize on both OSTs and MDTs as well?

            People

              Assignee: dinatale2 Giuseppe Di Natale (Inactive)
              Reporter: dinatale2 Giuseppe Di Natale (Inactive)
              Votes: 0
              Watchers: 7
