[LU-15834] "lfs mirror extend" should take current OSTs into account Created: 09/May/22  Updated: 17/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9007 Improved object allocator for FLR com... Resolved
is related to LU-15841 sanity-flr test 47 is failing with 'c... Resolved
is related to LU-10158 FLR: Define a replica choosing policy... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When running "lfs mirror extend" to add a mirror to an existing file, the OST(s) used by the original layout do not appear to be taken into account unless OST pools or indices are explicitly specified:

$ lfs getstripe /myth/tmp/flr
/myth/tmp/flr
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 1
	obdidx		 objid		 objid		 group
	     1	       2352362	     0x23e4ea	             0
$ lfs mirror extend -N -c 1 /myth/tmp/flr
$ lfs getstripe /myth/tmp/flr
/myth/tmp/flr
  lcm_layout_gen:    1
  lcm_mirror_count:  2
  lcm_entry_count:   2
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x23e4ea:0x0] }

    lcme_id:             131073
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x23e4eb:0x0] }

In this case, both the original and the mirror copy of the file are allocated on OST0001 because it has the most free space, which does not help the availability or performance of this file.
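
A check for this condition can be sketched in Python. This is only an illustration: it parses the text of "lfs getstripe" for a composite layout and reports OSTs used by more than one mirror. The function names and the SAMPLE string are hypothetical, and the check is deliberately simplified in that it ignores component extents and flags any OST shared between two mirrors.

```python
import re
from collections import defaultdict

# Trimmed copy of the composite-layout output shown above.
SAMPLE = """\
    lcme_id:             65537
    lcme_mirror_id:      1
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x23e4ea:0x0] }

    lcme_id:             131073
    lcme_mirror_id:      2
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x23e4eb:0x0] }
"""


def osts_by_mirror(getstripe_text):
    """Map lcme_mirror_id -> set of OST indices used by that mirror."""
    mirrors = defaultdict(set)
    current = None
    for line in getstripe_text.splitlines():
        m = re.search(r"lcme_mirror_id:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            continue
        m = re.search(r"l_ost_idx:\s*(\d+)", line)
        if m and current is not None:
            mirrors[current].add(int(m.group(1)))
    return dict(mirrors)


def shared_osts(mirrors):
    """Return {ost_idx: [mirror ids]} for OSTs used by more than one mirror."""
    users = defaultdict(set)
    for mirror_id, osts in mirrors.items():
        for ost in osts:
            users[ost].add(mirror_id)
    return {ost: sorted(ids) for ost, ids in users.items() if len(ids) > 1}


if __name__ == "__main__":
    print(shared_osts(osts_by_mirror(SAMPLE)))  # {1: [1, 2]}
```

For the layout above, the result shows OST index 1 shared by mirrors 1 and 2.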

A previous patch https://review.whamcloud.com/32404 "LU-9007 lod: improve obj alloc for FLR file" added lod_should_avoid_ost() to handle new OST object allocations for different components of a single mirror, but it does not appear to handle allocations across FLR mirrors.

Requirements for OST selection on components with overlapping extents should be, in order of decreasing priority:

  1. Objects of overlapping components must not share the same OST. This implies mirror_count <= ost_count / component_stripe_count. In theory this could be relaxed if all replicas have the same stripe_count and stripe_size: it would then only be required that the same OST not appear at the same stripe_index of different components, in which case the maximum replica count equals the OST count, but this is more difficult to control.
  2. Objects of overlapping components should not share OSTs on the same OSS node (by NID from imp->imp_connection->c_peer.nid, as qos_add_tgt() does), to avoid the shared node failure domain.
  3. Objects of overlapping components should not share OSTs on the same OSS failover pair (by failover NID from imp->imp_conn_list.oic_conn->c_peer.nid, as lprocfs_import_seq_show() does), to avoid the shared storage enclosure/controller failure domain. There may be other OSS nodes that share the same storage enclosure/controller, but there isn't any way for the client to determine this automatically.
  4. Objects of overlapping components should not be on OSTs on the same network switch, power supply, rack, etc., but this depends on external information that is not currently available to Lustre. That could optionally be added via a separate configuration file/options, but the above cases will automatically cover the most important failure scenarios.
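
The first three requirements can be modelled as a ranking over candidate OSTs. The Python sketch below is only an illustration of that ordering, not the allocator code in lod. The Ost record, its field names, the NID strings, and the failover_group label are hypothetical, and breaking ties by free space is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ost:
    index: int
    oss_nid: str         # hypothetical: primary NID of the serving OSS
    failover_group: str  # hypothetical label for the OSS failover pair
    free_kb: int


def max_mirror_count(ost_count, stripe_count):
    """Upper bound from requirement 1: mirrors cannot share any OST."""
    return ost_count // stripe_count


def pick_ost(candidates, in_use):
    """Pick one OST for a new object, given OSTs of overlapping components.

    Requirement 1 is a hard constraint (a used OST is never returned).
    Requirements 2 and 3 are preferences, applied in that order, with
    the most free space as the final tie-break.
    """
    used_idx = {o.index for o in in_use}
    used_oss = {o.oss_nid for o in in_use}
    used_grp = {o.failover_group for o in in_use}

    eligible = [o for o in candidates if o.index not in used_idx]
    if not eligible:
        return None  # requirement 1 cannot be met

    def rank(o):
        # False sorts before True, so unshared domains are preferred.
        return (o.oss_nid in used_oss,
                o.failover_group in used_grp,
                -o.free_kb)

    return min(eligible, key=rank)


OSTS = [
    Ost(0, "10.0.0.1@tcp", "A", 500),
    Ost(1, "10.0.0.1@tcp", "A", 900),
    Ost(2, "10.0.0.2@tcp", "A", 100),
    Ost(3, "10.0.0.3@tcp", "B", 50),
]
```

With OST 1 already in use, pick_ost(OSTS, [OSTS[1]]) returns OST 3 even though it has the least free space, because it is the only candidate outside both the OSS node and the failover pair of OST 1. Requirement 4 is omitted because it needs external topology information.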


 Comments   
Comment by Andreas Dilger [ 13/May/22 ]

Is this also fixed by the LU-15841 patch?

Comment by Zhenyu Xu [ 13/May/22 ]

I think not. When mirror extend is run without specifying a victim_file, mirror_extend_layout() just creates a volatile file using the basic stripe_size/stripe_count from the source file, and the object allocation for the volatile file does not consider that of the source file. The mirror merge then just appends the layout component of the volatile file to the source file as a new mirror.

Comment by Gerrit Updater [ 13/Apr/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50622
Subject: LU-15834 lfs: mirror extend take current OSTs into account
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0c1dad7e0e9b533516637e3aa457ad9d263c30dc

Generated at Sat Feb 10 03:21:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.