Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.6, Lustre 2.15.0
    • Severity: 3

    Description

      Creating a DoM PFL file and then running "filefrag -v" on it (ioctl(FIEMAP)) does not work:

      # lfs setstripe -E 1M -L mdt -E 16m -c 4 -E eof -c -1 /mnt/testfs/dom-pfl2
      # dd if=/dev/zero of=/mnt/testfs/dom-pfl2 bs=1M count=1
      # filefrag -v /mnt/testfs/dom-pfl2
      Filesystem type is: bd00bd0
      File size of /mnt/testfs/dom-pfl2 is 1048576 (1024 blocks of 1024 bytes)
      /mnt/testfs/dom-pfl2: FIBMAP unsupported
      

      The "FIBMAP unsupported" message is a bit misleading: filefrag tries FIEMAP first and only falls back to FIBMAP if FIEMAP fails, so the underlying failure here is in FIEMAP.
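      To look at the FIEMAP result directly, without the FIBMAP fallback hiding it, the ioctl can be issued by hand. The following is a minimal Linux-only sketch, not how filefrag itself is implemented. The function name `fiemap_extents` is hypothetical; the request number and structure layouts are taken from the upstream `linux/fs.h` and `linux/fiemap.h` headers.

```python
import fcntl
import struct

# _IOWR('f', 11, struct fiemap) on Linux.
FS_IOC_FIEMAP = 0xC020660B
FIEMAP_FLAG_SYNC = 0x1

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved.
FIEMAP_HDR = struct.Struct("=QQLLLL")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3].
FIEMAP_EXT = struct.Struct("=QQQ2QL3L")


def fiemap_extents(path, max_extents=16):
    """Return a list of (logical, physical, length, flags) tuples in bytes,
    or None if the file cannot be opened or FIEMAP is not supported."""
    buf = bytearray(FIEMAP_HDR.size + max_extents * FIEMAP_EXT.size)
    # Map the whole file: start 0, length "everything".
    FIEMAP_HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF,
                         FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    try:
        with open(path, "rb") as f:
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
    except OSError:
        # This is the point where filefrag would fall back to FIBMAP.
        return None

    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]
    extents = []
    for i in range(min(mapped, max_extents)):
        fields = FIEMAP_EXT.unpack_from(buf, FIEMAP_HDR.size + i * FIEMAP_EXT.size)
        logical, physical, length = fields[0], fields[1], fields[2]
        flags = fields[5]
        extents.append((logical, physical, length, flags))
    return extents
```

      A `None` result for a file such as /mnt/testfs/dom-pfl2 would correspond to the failure reported here; the sketch deliberately does not distinguish between error codes.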

          Activity

            [LU-14510] FIEMAP does not work on DoM files
            pjones Peter Jones added a comment -

            Merged for 2.17


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55221/
            Subject: LU-14510 dom: fiemap support for DoM files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5921e1571f2ddcc8ebc7cee481f75fe8fc458b45

            adilger Andreas Dilger added a comment -

            The "4 extents" might be counting the number of physically disjoint allocations on disk. The example you show has contiguous allocations on disk, even though they are logically discontiguous.

            The MDT does not need to be formatted with the "extent" feature for FIEMAP to work. The code logically stitches contiguous block allocations together and adds the "FIEMAP_EXTENT_MERGED" flag in this case, as can be seen in the output you posted.

            For the device numbering, as you can see we are already using the high bits to indicate the PFL component number, so we will need to decide how to indicate an MDT number. The real question is how important it is to indicate that this is a DoM component rather than an OST component. Since the PFL component number is part of the device number, it will be clear that these are separate objects, so it really depends on what the user expects. An existing flag that is "close" to a DoM file component is FIEMAP_EXTENT_DATA_INLINE, which is intended for data stored inside the inode; from a Lustre point of view this is nearly true.

            I was thinking it might be possible to get a "FIEMAP_EXTENT_NONROT" flag accepted into the upstream FIEMAP code, but this wouldn't help us in the increasingly common case of all-flash OSTs. Using a different FIEMAP flag that is not upstream definitely carries the risk of that bit being used for something else.
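            The difference between logically and physically disjoint extents can be illustrated with the six rows visible in the posted filefrag output. This is only a sketch of the counting idea suggested in the comment; it is not taken from the filefrag source, and the helper names are hypothetical. The dev column is treated as an opaque string.

```python
# (logical_start, physical_start, length, dev) in blocks, copied from the
# first and last three rows of the filefrag output in this ticket.
ROWS = [
    (0, 132244, 4, "10000"),
    (8, 132248, 4, "10000"),
    (16, 132252, 4, "10000"),
    (1512, 380476, 4, "40001"),
    (1520, 380480, 4, "40001"),
    (1528, 380484, 4, "40001"),
]


def count_runs(extents, column):
    """Count runs that are contiguous in the given column (0 = logical,
    1 = physical). A new run starts when the device changes or when the
    extent does not begin where the previous one ended."""
    runs = 0
    prev_dev = None
    prev_end = None
    for ext in extents:
        start, length, dev = ext[column], ext[2], ext[3]
        if dev != prev_dev or start != prev_end:
            runs += 1
        prev_dev = dev
        prev_end = start + length
    return runs


physical_runs = count_runs(ROWS, 1)  # 2: one contiguous run per device
logical_runs = count_runs(ROWS, 0)   # 6: every row is logically separate
```

            On these rows the physical count is far smaller than the number of listed rows, which is consistent with the guess that the "N extents found" summary reflects physical rather than logical contiguity.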
            tappro Mikhail Pershin added a comment - edited

            I've just pushed an initial patch for fiemap support to start with. The device number question mostly concerns the device number shown in the fiemap output: after the PFL changes that device number is not used anywhere, and the absolute stripe number is used instead. So technically we could always use 0 here, add some flag to indicate that it is an MDT index, or extend the bitfield by another bit, as is done in the patch. The only question is the limit for the absolute stripe number, which was 65536 and is now 32768. I see that LOV_MAX_STRIPE_COUNT is set to 2000, but for PFL I assume that is a per-component limit, so the total is (max components) * (max stripes per component), which gives us about 16 components with 2000 stripes each. I am not sure how realistic this is, but the previous number was 32, which I'd say is about the same margin.
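            The arithmetic behind the "about 16" and "32" figures can be checked directly. This is a small sketch using only the numbers quoted in the comment; the function name is hypothetical.

```python
LOV_MAX_STRIPE_COUNT = 2000  # per-component stripe limit quoted above


def max_full_components(stripe_number_limit, stripes_per_component=LOV_MAX_STRIPE_COUNT):
    """How many maximally striped components fit under the absolute
    stripe number limit."""
    return stripe_number_limit // stripes_per_component


before = max_full_components(65536)  # 16-bit stripe number
after = max_full_components(32768)   # one bit given up, 15 bits remain
```

            Here `before` evaluates to 32 and `after` to 16, matching the figures in the comment.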

            Another question I have, about fiemap for Lustre in general, concerns the reported extents:

             File size of /mnt/lustre/f130h.sanity is 3141632 (3068 blocks of 1024 bytes)
             ext:     device_logical:        physical_offset: length:  dev: flags:
               0:        0..       3:     132244..    132247:      4: 10000: merged,net
               1:        8..      11:     132248..    132251:      4: 10000: merged,net
               2:       16..      19:     132252..    132255:      4: 10000: merged,net
            ...
             381:     1512..    1515:     380476..    380479:      4: 40001: net
             382:     1520..    1523:     380480..    380483:      4: 40001: net
             383:     1528..    1531:     380484..    380487:      4: 40001: last,net
            /mnt/lustre/f130h.sanity: 4 extents found
            

            That is not a regression from this patch but old behavior, and I wonder what "N extents found" really means there, as there are 384 real fragments and 3 devices: the MDT and 2 OSTs.

            Another problem, unrelated to the patch but revealed by the fiemap output, is MDTs formatted without the 'extents' option. I wonder how practical that is if we support DoM files. Alex mentioned that technically we can always enable extents for files on the MDT and keep directories dependent on the 'extents' option.


            gerrit Gerrit Updater added a comment -

            "Mikhail Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55221
            Subject: LU-14510 dom: fiemap support for DoM files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 53f2849a3dfc5d77c14f50cd3d8b5e4142a26474

            adilger Andreas Dilger added a comment -

            One issue that we need to decide is what the "device number" should be for MDT components of a file. Currently, we use device "0..nnnn" to represent OST0000..OSTxxxx, so we would have to use some other numbering for the MDT. While this is a 32-bit number and we only allow up to 65536 OSTs, unfortunately the high 16 bits were already consumed by storing the component number for LU-11848 to handle PFL files.

            Another option would be to use an extent flag like FIEMAP_EXTENT_DATA_INLINE to indicate that the data is stored on the MDS. Strictly speaking, the FIEMAP_EXTENT_DATA_INLINE flag indicates that the data is stored inside the inode, and at some point we may want to allow this with the inline_data feature, but it might be a convenient short-term hack. A better solution would be a new flag like FIEMAP_EXTENT_METADATA=0x40000000, but it has some chance of conflicting with other flags in the future unless we can add some upstream functionality that also uses this flag.
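            The two options in this comment can be sketched side by side. The 16/16 split below follows the description above (component number in the high 16 bits, OST index in the low 16 bits) and is an assumption, not a copy of the Lustre source; the sample device value is made up for illustration. FIEMAP_EXTENT_LAST, FIEMAP_EXTENT_DATA_INLINE and FIEMAP_EXTENT_MERGED use their upstream `linux/fiemap.h` values, while FIEMAP_EXTENT_METADATA is only the value proposed in the comment. The helper names are hypothetical.

```python
# Option 1: pack more information into the 32-bit device number.
COMPONENT_SHIFT = 16   # assumed: component number lives in the high 16 bits
INDEX_MASK = 0xFFFF    # assumed: OST index lives in the low 16 bits


def split_device(dev):
    """Split a device number into (component number, target index)."""
    return dev >> COMPONENT_SHIFT, dev & INDEX_MASK


# Option 2: mark MDT-resident data with an extent flag.
FIEMAP_EXTENT_LAST = 0x00000001         # upstream
FIEMAP_EXTENT_DATA_INLINE = 0x00000200  # upstream
FIEMAP_EXTENT_MERGED = 0x00001000       # upstream
FIEMAP_EXTENT_METADATA = 0x40000000     # proposed in the comment, not upstream


def is_mdt_extent(flags):
    """True if either candidate marker for MDT-resident data is set."""
    return bool(flags & (FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_METADATA))


component, index = split_device(0x00020003)  # illustrative value only
```

            With every bit of the device number already assigned, the first option needs either a narrower index field or a reserved value, whereas the second leaves the device number untouched at the cost of overloading or inventing a flag.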

            People

              tappro Mikhail Pershin
              adilger Andreas Dilger