Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13062

OST offset defaults to 0 when coying a PFL via xattrs

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.17.0
    • Lustre 2.12.3
    • 3
    • 9223372036854775807

    Description

       I have a small tool (attached) that duplicates Lustre files using raw xattrs (i.e standard mknod + setxattr).

      With this method, PFL components that were initialized on the original file seem to always be created on OST index 0 on the duplicated file.

      For example, let's say that we have 4 OSTs and we create a PFL file comp_file with 2 components and explicit OST offsets 2 and 3.

        $ lfs setstripe -E 1M -c 2 -i 2 -E -1 -c -1 -i 3 comp_file
        $ lfs getstripe comp_file
        comp_file
        lcm_layout_gen:    2
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             1
          lcme_mirror_id:      0
          lcme_flags:          init
          lcme_extent.e_start: 0
          lcme_extent.e_end:   1048576
            lmm_stripe_count:  2
            lmm_stripe_size:   65536
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 2
            lmm_objects:
            - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x7c:0x0] }
            - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x53:0x0] }
      
          lcme_id:             2
          lcme_mirror_id:      0
          lcme_flags:          0
          lcme_extent.e_start: 1048576
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  -1
            lmm_stripe_size:   65536
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 3
      

      Calling the small tool to duplicate this layout into a new file comp_file_dup sets the first lmm_stripe_offset to 0.

      $ ./simple_lustre_dup comp_file comp_file_dup
      $ lfs getstripe comp_file_dup
       comp_file_dup
       lcm_layout_gen: 2
       lcm_mirror_count: 1
       lcm_entry_count: 2
       lcme_id: 1
       lcme_mirror_id: 0
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: 1048576
       lmm_stripe_count: 2
       lmm_stripe_size: 65536
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 0
       lmm_objects:
       - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x357:0x0] }
       - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x367:0x0] }
      lcme_id: 2
       lcme_mirror_id: 0
       lcme_flags: 0
       lcme_extent.e_start: 1048576
       lcme_extent.e_end: EOF
       lmm_stripe_count: -1
       lmm_stripe_size: 65536
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 3
      

       

      As was explained to me by Andreas, it makes sense to ignore the offset on duplication but -1 would be a better default value, letting the MDS choose the most appropriate OST.

      According to Andreas this might be related to

      LU-2809 llite: Do not return layout_gen for getxattr

      LU-9484 llite: eat -EEXIST on setting trusted.lov

      Attachments

        Issue Links

          Activity

            [LU-13062] OST offset defaults to 0 when coying a PFL via xattrs
            pjones Peter Jones added a comment -

            Merged for 2.17

            pjones Peter Jones added a comment - Merged for 2.17

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45252/
            Subject: LU-13062 llite: return stripe_offset -1 in trusted.lov
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 25240a4146e9d57b22cee0c3f117a6ab46c5bd71

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45252/ Subject: LU-13062 llite: return stripe_offset -1 in trusted.lov Project: fs/lustre-release Branch: master Current Patch Set: Commit: 25240a4146e9d57b22cee0c3f117a6ab46c5bd71

            I've returned back check for !is_composite , so composite layout entries are taking stripe_offset as is, but instead, I dropping offset to -1 for entry when it is taken by getxattr. So current logic is that setxattr doesn't interfere as it is unknown is it from lfs valid call or 3rd-side tool direct setxattr. But getxattr drops entries offset to -1 always, so it will be restored with that value. So far test_46 from sanity-flr.sh fails due to lfs setstripe --copy found copied layout is not the same - due to offset of course. There is existing $SKIP_INDEX test option to skip stripe_offset check, so test can be changed in that way, if we agree that lfs setstripe --copy can produce different start offset

            tappro Mikhail Pershin added a comment - I've returned back check for !is_composite , so composite layout entries are taking stripe_offset as is, but instead, I dropping offset to -1 for entry when it is taken by getxattr. So current logic is that setxattr doesn't interfere as it is unknown is it from lfs valid call or 3rd-side tool direct setxattr. But getxattr drops entries offset to -1 always, so it will be restored with that value. So far test_46 from sanity-flr.sh fails due to lfs setstripe --copy found copied layout is not the same - due to offset of course. There is existing $SKIP_INDEX test option to skip stripe_offset check, so test can be changed in that way, if we agree that lfs setstripe --copy can produce different start offset

            I am refreshing patch in context of DDN-5322 and wonder about removal of "!is_composite" check, quick test shows that it affects also normal "lfs setstripe -E -c1 -i0 .." as well. So literally for composite layout we can't set mandatory first OST as initial, but can any other. That make option -i just not working

            I am not sure if there is way to distinguish lfs setattr from direct setxattr from userspace, but probably we might need special flag passing from lfs and insisting on particular ost start index

            tappro Mikhail Pershin added a comment - I am refreshing patch in context of DDN-5322 and wonder about removal of " !is_composite " check, quick test shows that it affects also normal " lfs setstripe -E -c1 -i0 .." as well. So literally for composite layout we can't set mandatory first OST as initial, but can any other. That make option -i just not working I am not sure if there is way to distinguish lfs setattr from direct setxattr from userspace, but probably we might need special flag passing from lfs and insisting on particular ost start index
            adilger Andreas Dilger added a comment - - edited

            This issue came up again in the context of MPFileUtils dsync utility, so it could impact anyone using this fairly popular utility. https://github.com/hpc/mpifileutils/issues/584

            It might make sense to also handle this on setxattr for trusted.lov to change incoming PFL layouts that have offset=0 for all of the initialized components and replace them with -1.

            adilger Andreas Dilger added a comment - - edited This issue came up again in the context of MPFileUtils dsync utility, so it could impact anyone using this fairly popular utility. https://github.com/hpc/mpifileutils/issues/584 It might make sense to also handle this on setxattr for trusted.lov to change incoming PFL layouts that have offset=0 for all of the initialized components and replace them with -1.

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45252
            Subject: LU-13062 llite: return stripe_offset -1 in trusted.lov
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0a422e8706764f7bfa8e2123e09b4a21ff0c0acf

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45252 Subject: LU-13062 llite: return stripe_offset -1 in trusted.lov Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 0a422e8706764f7bfa8e2123e09b4a21ff0c0acf

            One way to fix this would be to also set lmm_stripe_index=-1 for composite layouts. That would do the right thing for most use cases (ie. normal user backup/restore). The one problematic case would be when trying to explicitly create a file on OST0000, which would no longer work with this interface.

            Disallowing explicit PFL file creation on OST0000 would not be a problem for regular users, but a lot of test scripts explicitly create files on OST0000 so that the test knows which OST to inject a fault on, or which one has consumed space. It is worthwhile trying a simple patch to remove the "!is_composite" check and seeing how many tests break, but I suspect it will be a fair number since OST0000 is used preferentially because we know it always exists, unlike higher-numbered OSTs.

            It would be possible to fix these tests to use $((OSTCOUNT - 1)), and in the case that only a single OST is in the test config and this evaluates to 0, then it doesn't matter because that is the only OST the MDS can allocate on anyway.

            Another option would be to set lmm_layout_gen=-1 instead of 0 on getxattr. That has the dual benefit of actually expressing the correct intent (allocate the file on any OST on restore), and does not break the setstripe functionality.

            As a workaround, it seems possible to set lmm_stripe_offset=-1 in userspace before restoring the file to avoid this bug in older versions of Lustre.

            adilger Andreas Dilger added a comment - One way to fix this would be to also set lmm_stripe_index=-1 for composite layouts. That would do the right thing for most use cases (ie. normal user backup/restore). The one problematic case would be when trying to explicitly create a file on OST0000, which would no longer work with this interface. Disallowing explicit PFL file creation on OST0000 would not be a problem for regular users, but a lot of test scripts explicitly create files on OST0000 so that the test knows which OST to inject a fault on, or which one has consumed space. It is worthwhile trying a simple patch to remove the " !is_composite " check and seeing how many tests break, but I suspect it will be a fair number since OST0000 is used preferentially because we know it always exists, unlike higher-numbered OSTs. It would be possible to fix these tests to use $((OSTCOUNT - 1)), and in the case that only a single OST is in the test config and this evaluates to 0, then it doesn't matter because that is the only OST the MDS can allocate on anyway. Another option would be to set lmm_layout_gen=-1 instead of 0 on getxattr. That has the dual benefit of actually expressing the correct intent (allocate the file on any OST on restore), and does not break the setstripe functionality. As a workaround, it seems possible to set lmm_stripe_offset=-1 in userspace before restoring the file to avoid this bug in older versions of Lustre.

            It appears that the code to fix up lmm_stripe_index from setxattr is only enabled for plain layout files:

                            /* Attributes that are saved via getxattr will always
                             * have the stripe_offset as 0.  Instead, the MDS
                             * should be allowed to pick the starting OST index.
                             * b=17846 */
                            if (!is_composite && v1->lmm_stripe_offset == 0)
                                    v1->lmm_stripe_offset = -1;
            

            This is probably because the new llapi_layout_ interface is also using setxattr() and depends on the value passed for lmm_stripe_offset, but getxattr is always clearing it:

                            /* Do not return layout gen for getxattr() since
                             * otherwise it would confuse tar --xattr by
                             * recognizing layout gen as stripe offset when the
                             * file is restored. See LU-2809. */
            

            adilger Andreas Dilger added a comment - It appears that the code to fix up lmm_stripe_index from setxattr is only enabled for plain layout files: /* Attributes that are saved via getxattr will always * have the stripe_offset as 0. Instead, the MDS * should be allowed to pick the starting OST index. * b=17846 */ if (!is_composite && v1->lmm_stripe_offset == 0) v1->lmm_stripe_offset = -1; This is probably because the new llapi_layout_ interface is also using setxattr() and depends on the value passed for lmm_stripe_offset, but getxattr is always clearing it: /* Do not return layout gen for getxattr() since * otherwise it would confuse tar --xattr by * recognizing layout gen as stripe offset when the * file is restored. See LU-2809. */

            In addition : using an OST pool that does not contain the OST 0 will (predictably) result in EINVAL when copying the extended attributes.

            barthelemy Clément Barthelemy (Inactive) added a comment - In addition : using an OST pool that does not contain the OST 0 will (predictably) result in EINVAL when copying the extended attributes.

            People

              tappro Mikhail Pershin
              barthelemy Clément Barthelemy (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: