
LU-8124: MDT zpool capacity consumed at greater rate than inode allocation

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0
    • Environment: CentOS 6.7, Lustre 2.8, ZFS 6.5.3
    • Severity: 3

    Description

      When running mdtest to create zero-byte files on a new LFS to benchmark the MDS, I noticed that after creating 600K zero-byte files only a small percentage of Lustre inodes were used, but the MDT zpool capacity was 65% used. The ratio of inodes used to capacity used not only seems way off, it appears on track to run out of zpool space before Lustre thinks it is out of inodes.

      I'm on a plane and can give specifics later tonight. I know of at least one large production LFS seeing similar behavior.

      In my case the MDT is a zpool of five two-way mirror vdevs, created in the following way:
      zpool create -o ashift=12 -O recordsize=4096 mdt.pool mirror A1 A2 mirror A3 A4 mirror A5 A6 mirror A7 A8 mirror A9 A10

      I have also seen the behavior using the default recordsize.

      Maybe I'm overlooking something, but it seems capacity consumption is outpacing inode allocation. This could be something in the way ZFS reports capacity used when Lustre hooks in below the ZFS POSIX layer, but from the cockpit it looks like my MDT tops out while Lustre thinks there are still inodes available.
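
      For reference, comparing Lustre's view against ZFS's view can be done with commands along these lines (pool name matches the create command above; the mount point is an example):

      lfs df /lustre                              # Lustre's view of space on the MDT and OSTs
      lfs df -i /lustre                           # Lustre's view of inode usage
      zpool list mdt.pool                         # raw pool size/allocation as ZFS sees it
      zfs get recordsize,used,available mdt.pool  # dataset properties relevant to the estimate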

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20123/
            Subject: LU-8124 osd-zfs: fix statfs small blocksize inode estimate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 141037baa112f63f42c0f558ad3eec038712714d


            gerrit Gerrit Updater added a comment -
            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20123
            Subject: LU-8124 osd-zfs: fix statfs small blocksize inode estimate
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc138a92b6deac5e4227e8212e1ff0f275384765


            adilger Andreas Dilger added a comment -
            Reopen for ZFS dnode estimation patch landing.


            adilger Andreas Dilger added a comment -
            Just to clarify, the recordsize on the MDT affects the statfs calculation by limiting the maximum "bytes per inode" that it will use to estimate the number of inodes that can fit into the remaining free space. It would also potentially affect the blocksize of regular files stored on the MDT (OI files, last_rcvd, changelogs, etc). It probably makes sense to allow using recordsize=4096 on the MDT without affecting the statfs calculation, and I can work on a patch to fix the statfs reporting in this case, but it won't affect the actual space usage. In the meantime, there would be a benefit to increasing the recordsize to 16KB so that the current statfs free inodes estimation can work properly.
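
            As a rough back-of-the-envelope sketch of that cap (plain shell arithmetic with made-up numbers, not the actual osd-zfs code):

            free_bytes=$((750 * 1024 * 1024 * 1024))   # hypothetical free space remaining on the MDT
            recordsize=4096                            # cap applied to the bytes-per-inode estimate
            actual_per_inode=8192                      # ~8KB per dnode observed with ashift=12 and ditto copies
            est_per_inode=$(( actual_per_inode < recordsize ? actual_per_inode : recordsize ))
            echo "statfs free-inode estimate:    $(( free_bytes / est_per_inode ))"    # ~196M inodes
            echo "inodes the space really holds: $(( free_bytes / actual_per_inode ))" # ~98M inodes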

            I don't think that changing the recordsize (which allows blocks up to 128KB) will actually lose capacity in any way. I think the major reason for higher per-dnode space consumption for regular files is the use of ashift=12 (or underlying 4KB sector size drives) forcing the minimum block allocation to be 4KB (x2 for ditto copies) for data that doesn't fit directly into the dnode. The only solutions I see here are to (unfortunately) reformat without ashift=12 or to use the large dnode patch (which will create new dnodes more efficiently, and eventually the space consumption will decline as old dnodes are recycled).

            As for the question about backup and restore of the MDT data, currently only zfs send and zfs recv can be used to do this. This will preserve the required ZFS filesystem structure, but allow changes to the underlying zpool configuration (e.g. ashift, VDEVs, etc) as needed. This is documented to some extent in LUDOC-161, but hasn't made it into the user manual yet.
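
            A minimal sketch of that path (pool, dataset, and host names here are examples only):

            zfs snapshot mdt.pool/mdt0@backup
            zfs send mdt.pool/mdt0@backup | ssh backuphost "zfs recv backuppool/mdt0"
            # later, restore into a freshly created pool with whatever ashift/VDEV layout is desired
            ssh backuphost "zfs send backuppool/mdt0@backup" | zfs recv newmdt.pool/mdt0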

            For the mdtest I think you were only creating directories the whole time, since it stayed at 32KB of space per dnode until the filesystem was full, and the parameters specified didn't allow any files to be created. You should probably pick a smaller depth and/or fanout (e.g. -i 1 -n 100000) to create about 100k files in a single directory and see how much space that consumes per inode. I expect 8KB/inode for ashift=12 and 2KB/inode for ashift=9 (assuming that is possible for your disks).
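
            Something along these lines (paths and launcher are placeholders) would exercise file creation only and then show the per-inode cost:

            mpirun -np 1 mdtest -F -C -i 1 -n 100000 -d /lustre/mdtest
            lfs df /lustre | grep MDT      # KB used on the MDT
            lfs df -i /lustre | grep MDT   # inodes used; divide the two to get bytes per inode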


            aeonjeffj Jeff Johnson (Inactive) added a comment -
            I am re-running the mdtest on a new LFS with the exact same configuration, just omitting the recordsize=4096 zpool option. So far the percentages are tracking nearly identically, currently at 16% and climbing.

            [root@n0002 ~]# lfs df && lfs df -i
            UUID 1K-blocks Used Available Use% Mounted on
            lustre2-MDT0000_UUID 1566225280 246041344 1320181888 16% /lustre[MDT:0]
            lustre2-OST0000_UUID 42844636160 15360 42844618752 0% /lustre[OST:0]
            lustre2-OST0001_UUID 42844636160 15360 42844618752 0% /lustre[OST:1]

            filesystem summary: 85689272320 30720 85689237504 0% /lustre

            UUID Inodes IUsed IFree IUse% Mounted on
            lustre2-MDT0000_UUID 48371166 7598884 40772282 16% /lustre[MDT:0]
            lustre2-OST0000_UUID 193185425 223 193185202 0% /lustre[OST:0]
            lustre2-OST0001_UUID 193185425 223 193185202 0% /lustre[OST:1]

            filesystem summary: 48371166 7598884 40772282 16% /lustre
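
            Dividing the two MDT lines above gives the per-inode cost so far:

            echo $(( 246041344 * 1024 / 7598884 ))   # 246041344 KB used / 7598884 inodes used ≈ 33155 bytes, roughly 32KB per inode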

            As I read the dnode breakdown in your last comment, I am left with the sense that in the osd-zfs environment MDT inode utilization, and monitoring a high-water mark of available inodes, is a bit of black magic compared to an ldiskfs MDT. Depending on the distribution of files and directories, running out of inodes on the MDT is hard to predict until you are close to running out.

            Does it make sense to use recordsize=32768 or recordsize=65536 as a way to avoid losing so much potential capacity to the 128K default?
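
            If so, I assume the change would be something like this on the MDT dataset (dataset name is a placeholder), with the understanding that it only affects newly written blocks:

            zfs set recordsize=65536 mdt.pool/mdt0
            zfs get recordsize mdt.pool/mdt0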


            People

              Assignee: Andreas Dilger (adilger)
              Reporter: Jeff Johnson (Inactive) (aeonjeffj)
              Votes: 0
              Watchers: 10
