Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Labels: Orion
    • Bugzilla ID: 23099

    Description

      osd-zfs lacks FIEMAP support. This was originally discussed in Bugzilla 23099. It is not a blocker for the DMU milestone; this task is mostly an improvement.

      sanity.sh test_130* verifies that FIEMAP (file extent map) works properly. FIEMAP allows clients to determine the on-disk block allocation layout for a particular file.
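
      For reference, this is roughly how a tool such as filefrag asks the kernel for the extent map; a minimal userspace sketch using the standard FS_IOC_FIEMAP ioctl (error handling trimmed, 32-extent buffer chosen arbitrarily):

      #include <stdio.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <linux/fs.h>
      #include <linux/fiemap.h>

      int main(int argc, char **argv)
      {
              unsigned int count = 32;        /* extents fetched per call */
              struct fiemap *fm = calloc(1, sizeof(*fm) +
                                         count * sizeof(struct fiemap_extent));
              int fd;

              if (argc < 2 || fm == NULL || (fd = open(argv[1], O_RDONLY)) < 0)
                      return 1;

              fm->fm_start = 0;
              fm->fm_length = FIEMAP_MAX_OFFSET;      /* map the whole file */
              fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush dirty data first */
              fm->fm_extent_count = count;

              if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                      perror("FS_IOC_FIEMAP");
                      return 1;
              }

              for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                      struct fiemap_extent *fe = &fm->fm_extents[i];

                      printf("logical %llu physical %llu length %llu flags %#x\n",
                             (unsigned long long)fe->fe_logical,
                             (unsigned long long)fe->fe_physical,
                             (unsigned long long)fe->fe_length, fe->fe_flags);
              }
              free(fm);
              close(fd);
              return 0;
      }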

      In 1.x and 2.x, FIEMAP is supported for ldiskfs filesystems.

      Once the "fiemap" request is passed through to the OSD it should be trivial to call the ldiskfs ->fiemap() method to fill in the data structure and return it to the caller. For ZFS this will need some code (possibly a new DMU interface?) to walk the file's data blocks and return the block pointer(s?) for each block.

      Open questions include:

      • Which block pointer should be returned in the case of ditto blocks? It is possible to return multiple overlapping extents (one for each DVA), but that may be confusing to some users.
      • While FIEMAP has space for a "device" for each extent, how will we map the different ZFS VDEV devices and Lustre OST devices into the single 32-bit device field?
        • We could use a 16-bit "major:minor" split, with the OST index as "major" and the VDEV as "minor" (see the sketch after this list), but I don't think there is a simple index for the VDEVs.
        • We could use the low 16 bits of the VDEV UUID (assuming it is largely unique) so that users can identify it fairly easily from "zfs" output if needed.
        • We could try to map the VDEV to the underlying Linux block device major/minor, though that is a major layering violation.
      • Should/can the extents be returned to the user in some "device" (VDEV) order so that it is clearer whether the extents are contiguous on disk, or will we get $((filesize * ditto / 128k)) extents returned to the client, possibly millions for large (128GB) files?
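
      As an illustration of the "major:minor" option above (the packing helpers and the 16/16 split are assumptions for discussion, not an agreed-on format):

      #include <stdint.h>
      #include <stdio.h>

      /*
       * Hypothetical packing of the 32-bit per-extent device field: OST index
       * in the high 16 bits, a small VDEV identifier in the low 16 bits.
       */
      static inline uint32_t fiemap_pack_device(uint16_t ost_index, uint16_t vdev_id)
      {
              return ((uint32_t)ost_index << 16) | vdev_id;
      }

      static inline uint16_t fiemap_device_ost(uint32_t dev)  { return dev >> 16; }
      static inline uint16_t fiemap_device_vdev(uint32_t dev) { return dev & 0xffff; }

      int main(void)
      {
              uint32_t dev = fiemap_pack_device(3, 1);        /* OST index 3, vdev 1 */

              printf("ost %u vdev %u\n", fiemap_device_ost(dev),
                     fiemap_device_vdev(dev));
              return 0;
      }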

      Even for local ZFS filesystem mounts, FIEMAP (via filefrag) output would provide useful insight into the on-disk allocation of files and would be needed to improve the ZFS allocation policies.

          Activity

            [LU-1941] ZFS FIEMAP support

            There is some work restarting in the upstream kernel to add compressed file support to FIEMAP:
            https://marc.info/?l=linux-doc&m=170992090817490&w=2

            It isn't quite ready, but once the fields and flags are committed upstream then we might be able to backport this to Lustre on older kernels...

            adilger Andreas Dilger added a comment

            The ZFS issue for FIEMAP is tracked at https://github.com/zfsonlinux/zfs/issues/264

            adilger Andreas Dilger added a comment
            adilger Andreas Dilger added a comment - edited

            Upstream patch prototype for FIEMAP_FLAG_DATA_COMPRESSED/fe_phys_length and discussion on how the patches should be fixed for upstream kernel acceptance: https://lore.kernel.org/linux-fsdevel/cover.1406739708.git.dsterba@suse.cz/

            The patch series was discussed and some improvements were requested, but it was never updated after the last time it was pushed.
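
            For illustration only, this is how a compressed extent could be reported if the proposed fields were accepted; fe_phys_length and the compressed-data flag come from the unmerged patch series above and are not in mainline linux/fiemap.h, so they are defined locally here with placeholder values:

            #include <stdint.h>
            #include <stdio.h>

            /*
             * Local stand-in definitions: fe_phys_length and the COMPRESSED flag
             * are NOT in mainline linux/fiemap.h; the flag value is a placeholder.
             */
            #define FIEMAP_EXTENT_DATA_COMPRESSED 0x00000040

            struct fiemap_extent_compressed {
                    uint64_t fe_logical;     /* offset in the file */
                    uint64_t fe_physical;    /* offset on disk */
                    uint64_t fe_length;      /* logical (uncompressed) length */
                    uint64_t fe_phys_length; /* allocated (compressed) length */
                    uint32_t fe_flags;
            };

            int main(void)
            {
                    /* e.g. a 128K ZFS record that compressed down to 32K on disk */
                    struct fiemap_extent_compressed ext = {
                            .fe_logical     = 0,
                            .fe_physical    = 0x40000000ULL,
                            .fe_length      = 128 * 1024,
                            .fe_phys_length = 32 * 1024,
                            .fe_flags       = FIEMAP_EXTENT_DATA_COMPRESSED,
                    };

                    printf("%llu logical bytes stored in %llu bytes on disk\n",
                           (unsigned long long)ext.fe_length,
                           (unsigned long long)ext.fe_phys_length);
                    return 0;
            }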

            Selected comments from Ricardo:

            I don't think it's really possible to retrieve the list of "extents" sorted by device order, at least not with the current on-disk format and not if you care about performance in any way.

            Currently, to achieve that you'd need to do a global sort of all the DVAs/block pointers in the entire file, which for large files could require a huge amount of I/O, memory resources and/or time.

            There is still a problem, though: we're only thinking about the top-level vdevs.

            If you have a RAID-Z vdev with 10 disks, then a single block can be split across the 10 disks.

            So the way we're thinking, for each block you'll only get 1 extent, where the offset is the "logical" offset of the RAID-Z vdev. But in this way you won't get the actual per-disk offsets.

            In theory you could return N FIEMAP extents per DMU block (where N is the number of disks you have in the RAID-Z vdev), but this won't be as simple as looking at the DVAs (you'd need to do some calculations), and I'd suspect the output would get a bit too verbose...

            So maybe for now I'd suggest to only return 1 extent per block, with the logical offset, because if the logical offsets are contiguous, then the per-disk offsets will also be contiguous.

            Another question is - does filefrag understand that an allocated extent size may not correspond to a logical extent size?

            Because I was thinking that we need to return the actual "allocated" on-disk size, not the logical block size, otherwise we won't know whether the blocks are actually allocated contiguously or if they have holes between them (e.g. a RAID-Zed 128K block actually has an allocated size of 128K+parity, so it would seem that there are holes between RAID-Z blocks).

            Another problem is that if we'd report the logical block size instead of the allocated size, it'd get confusing if you have compression (it would look like some extents would be overlapping...).

            But even then, I'm not sure if we can only get a per-block allocated size or if it's possible to get a per-DVA allocated size...

            adilger Andreas Dilger added a comment
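
            For concreteness, the RAID-Z "128K+parity" overhead mentioned above can be estimated with a userspace sketch of the same formula ZFS uses in vdev_raidz_asize(); the 5-disk RAID-Z1 layout with 4K sectors is just an example configuration:

            #include <stdint.h>
            #include <stdio.h>

            /*
             * Userspace sketch of the allocated-size formula RAID-Z uses (see
             * vdev_raidz_asize() in the ZFS source): data sectors plus nparity
             * parity sectors per row of (ndisks - nparity) data sectors, rounded
             * up to a multiple of (nparity + 1) sectors.
             */
            static uint64_t raidz_asize(uint64_t psize, unsigned int ndisks,
                                        unsigned int nparity, unsigned int ashift)
            {
                    uint64_t sectors = ((psize - 1) >> ashift) + 1;
                    uint64_t asize = sectors + nparity *
                            ((sectors + ndisks - nparity - 1) / (ndisks - nparity));
                    uint64_t round = nparity + 1;

                    asize = ((asize + round - 1) / round) * round;
                    return asize << ashift;
            }

            int main(void)
            {
                    /* 128K logical block, 5-disk RAID-Z1, 4K sectors: 160K on disk */
                    printf("allocated size: %llu bytes\n",
                           (unsigned long long)raidz_asize(128 * 1024, 5, 1, 12));
                    return 0;
            }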

            Patch to disable filefrag tests on master: http://review.whamcloud.com/3998

            adilger Andreas Dilger added a comment

            People

              Assignee: wc-triage WC Triage
              Reporter: tappro Mikhail Pershin
              Votes: 1
              Watchers: 6
