[LU-13062] OST offset defaults to 0 when copying a PFL via xattrs Created: 11/Dec/19  Updated: 15/Oct/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Clément Barthelemy (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File simple_lustre_dup.c    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have a small tool (attached) that duplicates Lustre files using raw xattrs (i.e. standard mknod + setxattr).

With this method, PFL components that were initialized on the original file seem to always be created on OST index 0 on the duplicated file.

For example, let's say that we have 4 OSTs and we create a PFL file comp_file with 2 components and explicit OST offsets 2 and 3.

  $ lfs setstripe -E 1M -c 2 -i 2 -E -1 -c -1 -i 3 comp_file
  $ lfs getstripe comp_file
  comp_file
  lcm_layout_gen:    2
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  2
      lmm_stripe_size:   65536
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:
      - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x7c:0x0] }
      - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x53:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   65536
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 3

Calling the small tool to duplicate this layout into a new file comp_file_dup sets the first lmm_stripe_offset to 0.

  $ ./simple_lustre_dup comp_file comp_file_dup
  $ lfs getstripe comp_file_dup
  comp_file_dup
  lcm_layout_gen:    2
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  2
      lmm_stripe_size:   65536
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x357:0x0] }
      - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x367:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  -1
      lmm_stripe_size:   65536
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 3

As Andreas explained to me, it makes sense to ignore the original offset on duplication, but -1 would be a better default value than 0, since it lets the MDS choose the most appropriate OST.

According to Andreas, this might be related to:

LU-2809 llite: Do not return layout_gen for getxattr

LU-9484 llite: eat -EEXIST on setting trusted.lov



 Comments   
Comment by Clément Barthelemy (Inactive) [ 11/Dec/19 ]

In addition, using an OST pool that does not contain OST 0 will (predictably) result in EINVAL when copying the extended attributes.

Comment by Andreas Dilger [ 11/Dec/19 ]

It appears that the code to fix up lmm_stripe_offset from setxattr is only enabled for plain layout files:

                /* Attributes that are saved via getxattr will always
                 * have the stripe_offset as 0.  Instead, the MDS
                 * should be allowed to pick the starting OST index.
                 * b=17846 */
                if (!is_composite && v1->lmm_stripe_offset == 0)
                        v1->lmm_stripe_offset = -1;

This is probably because the new llapi_layout_ interface is also using setxattr() and depends on the value passed for lmm_stripe_offset, but getxattr is always clearing it:

                /* Do not return layout gen for getxattr() since
                 * otherwise it would confuse tar --xattr by
                 * recognizing layout gen as stripe offset when the
                 * file is restored. See LU-2809. */
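
The interaction between the two snippets above can be modelled in a few lines of C. This is only a sketch, not the Lustre code: struct layout_model, model_getxattr() and model_setxattr_fixup() are hypothetical names, and the model keeps just the one field under discussion. It assumes, as described above, that getxattr returns 0 in the offset field of an initialized component and that the setxattr fixup is skipped for composite layouts.

```c
#include <stdbool.h>
#include <stdint.h>

/* "Let the MDS pick the starting OST", i.e. -1 in a 16-bit field. */
#define OFFSET_ANY ((uint16_t)-1)

/* Hypothetical, simplified stand-in for one layout or one PFL component. */
struct layout_model {
	bool     is_composite;   /* true when the layout is a PFL component */
	uint16_t stripe_offset;  /* starting OST index, or OFFSET_ANY */
};

/* Model of the read side: the field shared with the layout generation
 * is cleared, so the original starting OST index is lost. */
static struct layout_model model_getxattr(struct layout_model on_disk)
{
	on_disk.stripe_offset = 0;
	return on_disk;
}

/* Model of the write side: the fixup quoted above, which only applies
 * to plain (non-composite) layouts. */
static struct layout_model model_setxattr_fixup(struct layout_model in)
{
	if (!in.is_composite && in.stripe_offset == 0)
		in.stripe_offset = OFFSET_ANY;
	return in;
}
```

Passing a plain layout with stripe_offset 2 through model_getxattr() and then model_setxattr_fixup() yields OFFSET_ANY, while a composite one keeps the 0 written by the read side, which matches the lmm_stripe_offset: 0 seen on comp_file_dup.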

Comment by Andreas Dilger [ 11/Dec/19 ]

One way to fix this would be to also set lmm_stripe_offset=-1 for composite layouts. That would do the right thing for most use cases (i.e. normal user backup/restore). The one problematic case would be when trying to explicitly create a file on OST0000, which would no longer work with this interface.

Disallowing explicit PFL file creation on OST0000 would not be a problem for regular users, but a lot of test scripts explicitly create files on OST0000 so that the test knows which OST to inject a fault on, or which one has consumed space. It is worthwhile trying a simple patch to remove the "!is_composite" check and seeing how many tests break, but I suspect it will be a fair number since OST0000 is used preferentially because we know it always exists, unlike higher-numbered OSTs.

It would be possible to fix these tests to use $((OSTCOUNT - 1)), and in the case that only a single OST is in the test config and this evaluates to 0, then it doesn't matter because that is the only OST the MDS can allocate on anyway.

Another option would be to set lmm_layout_gen=-1 instead of 0 on getxattr. That has the dual benefit of actually expressing the correct intent (allocate the file on any OST on restore) and of not breaking the setstripe functionality.

As a workaround, it seems possible to set lmm_stripe_offset=-1 in userspace before restoring the file to avoid this bug in older versions of Lustre.
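
A rough sketch of that userspace workaround, operating on the raw layout buffer between getxattr() on the source and setxattr() on the copy. The helper name lov_blob_offset_any() is hypothetical. The magic values and byte offsets are assumptions transcribed from the packed structs in lustre_user.h (lov_comp_md_v1, lov_comp_md_entry_v1, lov_user_md_v1) and should be checked against the headers of the Lustre version in use. The buffer is assumed to be in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: values below mirror lustre_user.h; verify before use. */
#define MAGIC_V1          0x0BD10BD0u  /* LOV_USER_MAGIC_V1 */
#define MAGIC_V3          0x0BD30BD0u  /* LOV_USER_MAGIC_V3 */
#define MAGIC_COMP_V1     0x0BD60BD0u  /* LOV_USER_MAGIC_COMP_V1 */
#define COMP_HDR_SIZE     32  /* sizeof(struct lov_comp_md_v1) */
#define COMP_ENTRY_SIZE   48  /* sizeof(struct lov_comp_md_entry_v1) */
#define LCM_ENTRY_COUNT   14  /* byte offset of lcm_entry_count */
#define LCME_OFFSET       24  /* byte offset of lcme_offset in an entry */
#define LMM_STRIPE_OFFSET 30  /* byte offset of lmm_stripe_offset */
#define LMM_MIN_SIZE      32  /* fixed part of lov_user_md_v1 */

static uint16_t rd16(const unsigned char *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

static uint32_t rd32(const unsigned char *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Patch one plain layout blob: 1 if changed, 0 if untouched, -1 if bad. */
static int patch_plain(unsigned char *lmm, size_t len)
{
	const uint16_t any = 0xFFFF;  /* -1: let the MDS choose the OST */
	uint32_t magic;

	if (len < LMM_MIN_SIZE)
		return -1;
	magic = rd32(lmm);
	if (magic != MAGIC_V1 && magic != MAGIC_V3)
		return -1;
	if (rd16(lmm + LMM_STRIPE_OFFSET) != 0)
		return 0;
	memcpy(lmm + LMM_STRIPE_OFFSET, &any, sizeof(any));
	return 1;
}

/* Replace every stripe offset of 0 with -1, in a plain layout or in each
 * component of a composite one. Returns the number of offsets changed,
 * or -1 if the buffer does not look like a layout this sketch handles. */
static int lov_blob_offset_any(unsigned char *buf, size_t len)
{
	int patched = 0;
	uint16_t count, i;

	if (len < sizeof(uint32_t))
		return -1;
	if (rd32(buf) != MAGIC_COMP_V1)
		return patch_plain(buf, len);
	if (len < COMP_HDR_SIZE)
		return -1;
	count = rd16(buf + LCM_ENTRY_COUNT);
	if ((len - COMP_HDR_SIZE) / COMP_ENTRY_SIZE < count)
		return -1;
	for (i = 0; i < count; i++) {
		const unsigned char *entry =
			buf + COMP_HDR_SIZE + (size_t)i * COMP_ENTRY_SIZE;
		uint32_t blob = rd32(entry + LCME_OFFSET);
		int rc;

		if (blob > len)
			return -1;
		rc = patch_plain(buf + blob, len - blob);
		if (rc < 0)
			return -1;
		patched += rc;
	}
	return patched;
}
```

Like the kernel-side fixup for plain layouts, this cannot tell a cleared offset from an explicit request for OST0000, so it carries the same trade-off discussed above.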

Comment by Gerrit Updater [ 15/Oct/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45252
Subject: LU-13062 llite: return stripe_offset -1 in trusted.lov
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0a422e8706764f7bfa8e2123e09b4a21ff0c0acf
