[LU-1941] ZFS FIEMAP support Created: 08/Oct/11  Updated: 29/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Mikhail Pershin Assignee: WC Triage
Resolution: Unresolved Votes: 1
Labels: always_except, prz, zfs

Issue Links:
Related
is related to LU-12336 Update ZFS Version to 0.8.2 Resolved
is related to LU-6007 FIEMAP fails xfstests's fiemap-tester Open
is related to LU-10810 SEEK_HOLE and SEEK_DATA support for l... Resolved
Epic/Theme: ORI-12
Story Points: 3
Severity: 3
Bugzilla ID: 23099
Project: Orion
Rank (Obsolete): 2188

 Description   

osd-zfs lacks FIEMAP support. This was originally discussed in bugzilla 23099. It is not a blocker for the DMU milestone; this task is mostly an improvement.

In sanity.sh, test_130* verifies that FIEMAP (file extent map) is working properly. This allows clients to determine the disk block allocation layout for a particular file.

In 1.x and 2.x, FIEMAP is supported for ldiskfs filesystems.

Once the "fiemap" request is passed through to the OSD, it should be trivial to call the ldiskfs ->fiemap() method to fill in the data structure and return it to the caller. For ZFS this will need some code (possibly a new DMU interface?) to walk the file's data blocks and return the block pointer(s?) for each block; a rough sketch follows.
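
As an illustration only, here is a minimal sketch (not actual osd-zfs code) of how an OSD-side FIEMAP handler could walk a ZFS object's data regions using the existing dmu_offset_next() data/hole interface. It can report logical extents but not the on-disk DVAs, so a new DMU interface would still be needed to return real physical addresses; the function name and arguments are hypothetical.

#include <sys/dmu.h>
#include <linux/fiemap.h>

/* Sketch only: walk the data regions of a ZFS object and fill in FIEMAP
 * extents.  Uses only the existing dmu_offset_next() data/hole interface,
 * so physical (DVA) addresses are not available and every extent is
 * marked FIEMAP_EXTENT_UNKNOWN. */
static int osd_zfs_fiemap_sketch(objset_t *os, uint64_t object,
                                 uint64_t filesize,
                                 struct fiemap_extent_info *fieinfo)
{
        uint64_t data = 0;
        int rc = 0;

        while (data < filesize) {
                uint64_t hole;

                /* find the start of the next data region */
                rc = dmu_offset_next(os, object, B_FALSE, &data);
                if (rc == ESRCH) {              /* no more data regions */
                        rc = 0;
                        break;
                }
                if (rc != 0 || data >= filesize)
                        break;

                /* find where that data region ends (the next hole) */
                hole = data;
                if (dmu_offset_next(os, object, B_TRUE, &hole) != 0 ||
                    hole > filesize)
                        hole = filesize;

                /* physical address unknown without walking block pointers */
                rc = fiemap_fill_next_extent(fieinfo, data, 0, hole - data,
                                             FIEMAP_EXTENT_UNKNOWN);
                if (rc != 0) {                  /* 1 = array full, <0 = error */
                        if (rc == 1)
                                rc = 0;
                        break;
                }

                data = hole;
        }

        return rc;
}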

Open questions include:

  • which block pointer should be returned in the case of ditto blocks? It is possible to return multiple overlapping extents (one for each DVA), but this may be confusing to some users
  • while FIEMAP has space for a "device" for each extent, how will we map different ZFS VDEV devices and Lustre OST devices into the single 32-bit device field?
    • We could use a 16-bit "major:minor" encoding with the OST index being "major" and the VDEV being "minor" (see the sketch after this list), but I don't think there is a simple index for the VDEVs.
    • We could use the low 16-bit value of the VDEV UUID (assuming it is largely unique) so that users can identify this fairly easily from "zfs" output if needed.
    • We could try to map the VDEV to the underlying Linux block device major/minor, though that would be a major layering violation.
  • should/can the extents be returned to the user in some "device" (VDEV) order so that it is clearer whether the extents are contiguous on disk, or will we get $((filesize * ditto / 128k)) extents returned to the client, possibly millions for large (128GB) files?
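
As an illustration of the "major:minor" option above, a minimal sketch of packing the OST index and a VDEV identifier into a single 32-bit device value; the helper names are hypothetical, and a stable 16-bit VDEV index may not actually exist.

#include <linux/types.h>

/* Hypothetical encoding of the 32-bit FIEMAP "device" value: the OST
 * index in the high 16 bits ("major") and a ZFS VDEV identifier in the
 * low 16 bits ("minor").  Helper names are illustrative only. */
static inline __u32 fiemap_pack_device(__u16 ost_index, __u16 vdev_id)
{
        return ((__u32)ost_index << 16) | vdev_id;
}

static inline __u16 fiemap_device_ost(__u32 dev)
{
        return dev >> 16;
}

static inline __u16 fiemap_device_vdev(__u32 dev)
{
        return dev & 0xffff;
}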

Even for local ZFS filesystem mounts, FIEMAP (via filefrag) output would provide useful insight into the on-disk allocation of files and would be useful for improving the ZFS allocation policies.



 Comments   
Comment by Andreas Dilger [ 14/Sep/12 ]

Patch to disable filefrag tests on master: http://review.whamcloud.com/3998

Comment by Andreas Dilger [ 16/Sep/12 ]

Selected comments from Ricardo:

I don't think it's really possible to retrieve the list of "extents" sorted by device order, at
least not with the current on-disk format and not if you care about performance in any way.

Currently, to achieve that you'd need to do a global sort of all the DVAs/block pointers in the
entire file, which for large files could require a huge amount of I/O, memory resources and/or
time.

There is still a problem, though: we're only thinking about the top-level vdevs.

If you have a RAID-Z vdev with 10 disks, then a single block can be split across the 10 disks.

So the way we're thinking, for each block you'll only get 1 extent, where the offset is the
"logical" offset of the RAID-Z vdev. But in this way you won't get the actual per-disk offsets.

In theory you could return N FIEMAP extents per DMU block (where N is the number of disks you have
in the RAID-Z vdev), but this won't be as simple as looking at the DVAs (you'd need to do some
calculations), and I'd suspect the output would get a bit too verbose...

So maybe for now I'd suggest to only return 1 extent per block, with the logical offset, because if
the logical offsets are contiguous, then the per-disk offsets will also be contiguous.

Another question is - does filefrag understand that an allocated extent size may not correspond to
a logical extent size?

Because I was thinking that we need to return the actual "allocated" on-disk size, not the logical
block size, otherwise we won't know whether the blocks are actually allocated contiguously or if
they have holes between them (e.g. a RAID-Zed 128K block actually has an allocated size of
128K+parity, so it would seem that there are holes between RAID-Z blocks).

Another problem is that if we'd report the logical block size instead of the allocated size, it'd
get confusing if you have compression (it would look like some extents would be overlapping...).

But even then, I'm not sure if we can only get a per-block allocated size or if it's possible to
get a per-DVA allocated size...
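
For reference, the per-block vs. per-DVA size distinction maps onto existing ZFS block pointer macros. A minimal sketch, assuming the caller already has a resolved blkptr_t for a file block (obtaining one per block is exactly the DMU interface that is missing today):

#include <sys/spa.h>

/* Sketch only: read the logical, physical (compressed) and per-DVA
 * allocated sizes from a ZFS block pointer. */
static void blkptr_sizes_sketch(const blkptr_t *bp)
{
        uint64_t lsize, psize;
        int d;

        if (BP_IS_HOLE(bp))
                return;

        lsize = BP_GET_LSIZE(bp);       /* logical (uncompressed) size */
        psize = BP_GET_PSIZE(bp);       /* physical (compressed) size */

        /* each DVA (ditto copy) has its own vdev, offset and allocated
         * size; on RAID-Z the allocated size includes parity */
        for (d = 0; d < BP_GET_NDVAS(bp); d++) {
                const dva_t *dva = &bp->blk_dva[d];
                uint64_t vdev   = DVA_GET_VDEV(dva);
                uint64_t offset = DVA_GET_OFFSET(dva);
                uint64_t asize  = DVA_GET_ASIZE(dva);

                /* one candidate FIEMAP extent per DVA:
                 * device = vdev, physical = offset, length = asize,
                 * reported against the block's logical offset and lsize */
                (void)vdev; (void)offset; (void)asize;
        }

        (void)lsize; (void)psize;
}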

Comment by Andreas Dilger [ 17/Jul/14 ]

Upstream patch prototype for FIEMAP_FLAG_DATA_COMPRESSED/fe_phys_length and discussion on how the patches should be fixed for upstream kernel acceptance: https://lore.kernel.org/linux-fsdevel/cover.1406739708.git.dsterba@suse.cz/

The patch series was discussed and some improvements were requested, but it was never updated after it was last pushed.

Comment by Andreas Dilger [ 21/Aug/15 ]

ZFS issue for FIEMAP is tracked at https://github.com/zfsonlinux/zfs/issues/264
