LU-8342: ZFS dnodesize and recordsize should be set at file system creation

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0
    • Labels: None

    Description

      ZFS dnodesize and recordsize should be set to appropriate defaults when the backing dataset is created at filesystem creation time. We can set dnodesize=auto and recordsize=1M by default if the installed version of ZFS supports them.
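To make the "if the installed version supports it" check concrete, here is a minimal, purely illustrative sketch (not the actual mkfs.lustre code): a helper that gates the proposed defaults on the installed ZFS version. The function name and version thresholds are assumptions — dnodesize=auto depends on the large_dnode feature (ZFS 0.7.0+) and recordsize=1M on the large_blocks feature (ZFS 0.6.5+).

```shell
# Hypothetical sketch: choose dataset-creation properties based on the
# installed ZFS version string. Not taken from the actual patches.
zfs_dataset_props() {
    ver="$1"                        # e.g. "0.7.9"
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    patch=${rest#*.}
    props=""
    # large_dnode feature (dnodesize=auto) landed in ZFS 0.7.0
    if [ "$major" -gt 0 ] || [ "$minor" -ge 7 ]; then
        props="$props -o dnodesize=auto"
    fi
    # large_blocks feature (recordsize > 128K) landed in ZFS 0.6.5
    if [ "$major" -gt 0 ] || [ "$minor" -ge 7 ] || { [ "$minor" -eq 6 ] && [ "$patch" -ge 5 ]; }; then
        props="$props -o recordsize=1M"
    fi
    echo "${props# }"
}

zfs_dataset_props "0.7.9"   # -> -o dnodesize=auto -o recordsize=1M
```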

          Activity


            cengku9660 Gu Zheng (Inactive) added a comment -

            Hi Giuseppe,

            Any update on the patch http://review.whamcloud.com/21055?

            adilger Andreas Dilger added a comment -

            Olaf, I'm not against that, but it would definitely be more work than the current patch, and would likely push the change out to 2.10.

            ofaaland Olaf Faaland added a comment -

            Andreas,
            I understand. I'm suggesting that if those good defaults are encoded in a config file instead of in code, they (a) are visible to the user and (b) require only trivial code review to change. Also, the existing options do not distinguish between pool properties and dataset properties.

            adilger Andreas Dilger added a comment -

            There are already --mkfsoptions and --mountfsoptions that can be used to pass extra options to mkfs.lustre and to the internal mount command for the back-end filesystem. They should be able to override the default options specified internally by mkfs.lustre. My goal in specifying these options internally is that the majority of users should get the best performance out of the box if possible, rather than having to specify extra options.
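As a concrete illustration of the override path described above, a ZFS-backed OST format might look something like the following. This is a hedged sketch: the device, pool/dataset names, and the exact property spelling accepted inside --mkfsoptions are illustrative assumptions — check the mkfs.lustre man page for your release.

```shell
# Hypothetical example: create a ZFS-backed OST while overriding a
# built-in dataset-property default via --mkfsoptions. All names
# (testfs, ostpool/ost0, /dev/sdb, mgs@tcp) are illustrative.
mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp \
    --mkfsoptions="-o recordsize=128K" \
    ostpool/ost0 /dev/sdb
```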
            ofaaland Olaf Faaland added a comment -

            Perhaps lustre/conf/lustre is not the right place for such settings; I see there is an /etc/mke2fs.conf with INI-style contents, and some userspace apps use /etc/default/foo. In any case, the basic proposal stands: put these settings in a config file instead of in the code.
            ofaaland Olaf Faaland added a comment -

            I see that lustre/conf/lustre already has ZPOOL_IMPORT_DIR and ZPOOL_IMPORT_ARGS. So perhaps ZPOOL_CREATE_ARGS and ZFS_CREATE_ARGS?
            ofaaland Olaf Faaland added a comment -

            Even if these pool and dataset settings are correct for all cases now, they may become incorrect with future changes in either Lustre or ZFS. Furthermore, there are likely unusual cases (e.g. testing; see Jinshan's concern that test data is often highly compressible) where one or more of these settings are undesirable.

            Both http://review.whamcloud.com/#/c/19892/2 and http://review.whamcloud.com/#/c/21055/3 set the desired settings by hard-coding specific values into mkfs.lustre. How about putting the settings themselves into a configuration file, e.g. /etc/sysconfig/lustre or a distro-specific equivalent, which is parsed by mkfs.lustre? Then they are visible to the user, can easily be changed when appropriate, and the defaults can be changed with a trivial patch that is easy to review.
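A sketch of what such a config file might contain, using the variable names proposed earlier in this thread. This is hypothetical: mkfs.lustre does not parse these variables today, and the property values are the defaults discussed in this issue.

```shell
# /etc/sysconfig/lustre (hypothetical sketch; ZPOOL_CREATE_ARGS and
# ZFS_CREATE_ARGS are only proposed in this discussion, not implemented)
ZPOOL_CREATE_ARGS="-o ashift=12"
ZFS_CREATE_ARGS="-o dnodesize=auto -o recordsize=1M"
```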
            adilger Andreas Dilger added a comment - edited

            Based on past postings on the ZFS mailing lists, users have reported terrible performance when ZFS doesn't auto-detect the 4KB sector size correctly (usually because drives advertise 512-byte sectors for "maximum compatibility" even when they have 4KB sectors internally). Not only does this hurt performance, it can potentially lead to data reliability problems if sectors that do not belong to the current block are modified.

            Also, even if drives correctly report 512-byte sectors, there is a long-term maintenance problem when those drives need to be replaced, because all newer/larger drives have 4KB sectors and it isn't possible to replace any drive in a 512-byte sector VDEV with a 4096-byte sector drive without a full backup/restore. That makes maintenance more problematic, and also prevents VDEV "autoexpand" from working if existing drives are replaced with larger ones.

            While I understand Brian's concern about changing the OpenZFS default to ashift=12, since it would increase space usage for some workloads (despite repeated requests to change it), this is less of a concern for Lustre OSTs. From a support and "best performance out of the box" point of view, I'd prefer setting ashift=12 on OSTs by default.
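Forcing the 4KB sector size at pool creation, as suggested above, might look like this sketch. The pool name and device list are illustrative; ashift applies per VDEV and can only be set when the VDEV is created, which is why it matters to get it right up front.

```shell
# Hypothetical sketch: override sector-size auto-detection at pool
# creation (ashift=12 means 2^12 = 4096-byte sectors). Pool name and
# devices are illustrative.
zpool create -o ashift=12 ostpool raidz2 /dev/sd[b-g]

# Check what was actually used; ashift is visible in the cached config:
zdb -C ostpool | grep ashift
```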

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Ok, then I will go ahead and set dnodesize=auto at file system creation time. If the dnode size needs to be different, that can be handled after the fact.

            Brian had also mentioned that in the future, dnodesize=auto could be a bit more intelligent and choose the appropriate dnode size if one wasn't specified. Currently, I believe auto just results in 1K dnodes.

            Andreas, can you please comment on the ashift questions above? I just want to make sure I understand the proposed ashift changes.

            adilger Andreas Dilger added a comment -

            There will definitely be more xattrs on the MDT, but there are also some on the OST. AFAIK there shouldn't be any harm in enabling dnodesize=auto on the OSTs, even if they only use the minimum dnode size today. There are definitely a few more xattrs that will be stored on OST objects in the future in order to improve LFSCK support with composite file layouts.

            For the MDT, it definitely makes sense to enable at least dnodesize=auto, but it might make sense to reserve more space if the median xattr size is larger. IMHO it wouldn't be terrible to track the size of xattrs stored on a dnode in a histogram of per-CPU counters (to avoid contention) so that enough dnode slots are reserved for each new dnode, even if the setxattr doesn't happen atomically with the create. Alex was working on a patch to do this by accumulating the size of xattrs declared during file create, http://review.whamcloud.com/19101, which could use some review.

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Andreas, what is your opinion on setting dnodesize on both OSTs and MDTs as well?

            People

              Assignee: dinatale2 Giuseppe Di Natale (Inactive)
              Reporter: dinatale2 Giuseppe Di Natale (Inactive)
              Votes: 0
              Watchers: 7
