[LU-9007] Improved object allocator for FLR composite files Created: 11/Jan/17  Updated: 13/May/22  Resolved: 09/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0

Type: Improvement Priority: Minor
Reporter: Joseph Gmitter (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: FLR2

Issue Links:
Related
is related to LU-8998 Progressive File Layout (PFL) Resolved
is related to LU-10158 FLR: Define a replica choosing policy... Open
is related to LU-15834 "lfs mirror extend" should take curre... Open
is related to LU-15841 sanity-flr test 47 is failing with 'c... Resolved
is related to LU-11238 sanity-flr test 47 fails with “compon... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The current MDS object allocator is designed only to allocate objects for one file at the time the file is first created. For progressive file layouts, at a minimum the allocator will need to be enhanced in order to avoid allocating objects on OSTs that are already part of a file's other components. If files have multiple objects allocated to the same OSTs before objects are allocated from unused OSTs, there may be a significant performance loss due to oversubscribing the bandwidth on that OST compared to the other OSTs. The only exception may be for a fully-striped component at the end of the file (see Example Progressive Layouts for more detail), where it would be acceptable to allocate objects across all of the available OSTs to maximize the bandwidth available for the file.
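
As a rough illustration of the behaviour described above, the following Python sketch prefers OSTs that no other component of the file uses and falls back to reuse only when the unused OSTs run out. It is not the LOD allocator code: the function name `pick_osts` and its parameters are hypothetical, and the real allocator also weighs free space and OSS distribution, which are ignored here.

```python
def pick_osts(all_osts, in_use, stripe_count):
    """Pick stripe_count OST indices for a new component (hypothetical helper).

    all_osts: iterable of available OST indices
    in_use:   set of OST indices already used by the file's other components
    """
    all_osts = list(all_osts)
    unused = [ost for ost in all_osts if ost not in in_use]
    reused = [ost for ost in all_osts if ost in in_use]
    # Unused OSTs come first; already-used OSTs are a best-effort fallback,
    # e.g. for a fully-striped component at the end of the file.
    candidates = unused + reused
    if stripe_count > len(candidates):
        raise ValueError("stripe_count exceeds the number of available OSTs")
    return candidates[:stripe_count]
```

With six OSTs and OSTs 0 and 1 already holding the first component, a two-stripe component lands on OSTs 2 and 3, while a component striped over all six OSTs still succeeds by reusing 0 and 1 last.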



 Comments   
Comment by Andreas Dilger [ 01/Mar/17 ]

This work is also a pre-requisite for FLR-related improvements to the MDS object allocator. While PFL requires that the objects are preferably not on the same OSTs between components, this is not a hard requirement. At worst this impacts performance, and in some cases (e.g. widely striped last component) it may even be desirable to re-use the same OSTs in order to maximize the bandwidth of large files.

FLR has similar, but more specific requirements for OST selection on components with overlapping extents, in order of decreasing priority:

  1. objects with overlapping components must not share the same OST.  Implies max replica count == OST count/component stripe count.  In theory this could be relaxed if all replicas have the same stripe_count and stripe_size: it would then only require that the same OST not be at the same stripe_index of different components, in which case max replica count == OST count.
  2. objects with overlapping components should not share OSTs on the same OSS node (by NID from imp->imp_connection->c_peer.nid, as qos_add_tgt() does) to avoid the shared node failure domain.
  3. objects with overlapping components should not share OSTs on the same OSS failover pair (by failover NID from imp->imp_conn_list.oic_conn->c_peer.nid, as lprocfs_import_seq_show() does) to avoid the shared storage enclosure/controller failure domain.  There may be other OSS nodes that share the same storage enclosure/controller, but there isn't any way for the client to determine this automatically.
  4. objects with overlapping components should not be on OSTs on the same network switch, power supply, rack, etc. but this depends on external information that is not currently available to Lustre. That could optionally be added via a separate configuration file/options, but the above cases will automatically cover
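
The first three rules above can be sketched in Python as one hard constraint plus a lexicographic penalty. This is an illustrative model only, not the LOD implementation: the `Ost` class, its fields, and the helper names are hypothetical, and each OST is assumed to have exactly one primary and one failover NID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ost:
    index: int
    oss_nid: str       # NID of the OSS currently serving the OST
    failover_nid: str  # NID of its failover partner (assumed: exactly one)

    def nids(self):
        return {self.oss_nid, self.failover_nid}


def max_replicas(ost_count, stripe_count):
    # Rule 1: overlapping components share no OST, so each replica
    # needs its own stripe_count OSTs.
    return ost_count // stripe_count


def penalty(candidate, used):
    """Return None if rule 1 forbids the candidate, else a tuple
    (same_oss, same_pair) where lower is better."""
    same_oss = same_pair = 0
    for u in used:
        if candidate.index == u.index:
            return None                      # rule 1: hard requirement
        if candidate.oss_nid == u.oss_nid:
            same_oss += 1                    # rule 2: same OSS node
        elif candidate.nids() & u.nids():
            same_pair += 1                   # rule 3: same failover pair
    return (same_oss, same_pair)


def choose_osts(osts, used, stripe_count):
    """Pick OSTs for a component that overlaps the components in `used`."""
    ranked = []
    for ost in osts:
        p = penalty(ost, used)
        if p is not None:
            ranked.append((p, ost.index, ost))
    # Tuple ordering makes rule 2 strictly dominate rule 3.
    ranked.sort(key=lambda item: item[:2])
    if len(ranked) < stripe_count:
        raise ValueError("not enough OSTs outside the overlapping components")
    return [ost for _, _, ost in ranked[:stripe_count]]
```

The sketch only compares candidates against OSTs of the other components; spreading the stripes within the new component itself is left out for brevity.
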
Comment by Jinshan Xiong (Inactive) [ 02/Mar/17 ]

Do we actually distinguish whether the components are for generic PFL or FLR? It seems like a bad idea to me to know that information at the LOD layer. I would like to make this allocation policy as generic and best-effort as possible.

Sparse OST indices have been supported for a long time. What do you think about partitioning OST indices based on distance? The distance is defined by servers, racks, and switches. In any case, the more information the allocation policy can get, the better decisions it can make.

Comment by Andreas Dilger [ 02/Mar/17 ]

I think encoding anything into the OST index is a non-starter. This would totally break for existing filesystems, and administrators would have a hard time getting it right, and then it would break if they needed to move nodes around for some reason. We already have OSS and OSS failover information in the LOD, so we may as well use it. In fact, the QOS RR allocator already spreads stripes across OSS nodes to avoid contention if possible. We can add in other rack/switch/power information later if we actually need it.

I don't think that understanding "PFL" vs. "FLR" in LOD is quite the right thing, but rather it will understand layout components and whether they are sequential or overlapping, and select the best OSTs in that case.

Comment by Andreas Dilger [ 03/Apr/18 ]

One simple proposal is to check the NID of each OST to automatically separate OSTs into fault domains based on which OSS they are located on. This is not perfect, but is simple and works for all existing systems without additional input from the administrator. The existing QOS RR allocator will already prefer to distribute allocations across OSS nodes if possible, but this can be extended to make this a requirement for all allocations.

Secondly, a simple integer domain value can be assigned to each OST by the administrator, and the LOD can use this to separate OSTs into independent groups. OSTs with the same domain number should not be used for redundancy for other components.
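
A minimal Python sketch of the two proposals, assuming a mapping from OST index to OSS NID and an optional administrator-supplied mapping from OST index to an integer domain. The function names, the dictionary shapes, and the rule that an administrator value overrides the NID-based grouping are all assumptions made for illustration, not part of Lustre.

```python
from collections import defaultdict


def fault_domains(ost_nids, admin_domains=None):
    """Group OST indices into fault domains.

    ost_nids:      {ost_index: oss_nid}
    admin_domains: optional {ost_index: integer domain}; when present for an
                   OST it replaces the NID-based grouping (an assumption)
    """
    admin_domains = admin_domains or {}
    domains = defaultdict(list)
    for idx, nid in sorted(ost_nids.items()):
        if idx in admin_domains:
            key = ("admin", admin_domains[idx])
        else:
            key = ("nid", nid)
        domains[key].append(idx)
    return dict(domains)


def domains_for_mirror(domains, used_osts):
    """Return only the domains that hold no OST used by another component."""
    used = set(used_osts)
    return {key: osts for key, osts in domains.items() if used.isdisjoint(osts)}
```

For example, if OSTs 0 and 1 sit on one OSS and the administrator assigns OSTs 2 and 3 to domain 7, a mirror of a component stored on OST 0 would be restricted to the domain containing OSTs 2 and 3.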

Comment by Gerrit Updater [ 15/May/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32404
Subject: LU-9007 lod: improve obj alloc for FLR file
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1a0c093c55d034ba6013f05f7c2a68664d3d0901

Comment by Gerrit Updater [ 12/Jul/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32813
Subject: LU-9007 lod: get rid of comp ost in use array
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 73c65d7e3402e2465a0b1042eb7ccaf185730b87

Comment by Gerrit Updater [ 24/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32404/
Subject: LU-9007 lod: improve obj alloc for FLR file
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fabf3fe7ac06d916d8c433a99f1f4a4bd3632638

Comment by Gerrit Updater [ 09/Aug/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32813/
Subject: LU-9007 lod: get rid of comp ost in use array
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a277952c65d4aad1abb9ac9f759af16a43902068

Comment by Peter Jones [ 09/Aug/18 ]

Landed for 2.12

Generated at Sat Feb 10 02:22:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.