[LU-9435] DNE2 - object placement QoS policy Created: 02/May/17  Updated: 06/Mar/19  Resolved: 04/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Jinshan Xiong (Inactive) Assignee: Lai Siyao
Resolution: Won't Fix Votes: 0
Labels: dne3

Issue Links:
Related
is related to LU-10277 DNE3: allow 'lfs mkdir' to create dir... Resolved
is related to LU-11213 DNE3: remote mkdir() in ROOT/ by default Resolved
is related to LU-7827 DNE3: automatically select MDT for lf... Resolved
is related to LU-10784 DNE3: mkdir() automatically create re... Resolved

 Description   

In the current implementation, when a file is created, its inode must reside on the same MDT as its name entry. It would be more desirable to allocate MDT objects using QoS policies, as we already do for OST objects.



 Comments   
Comment by Andreas Dilger [ 02/May/17 ]

I don't think it is as straightforward as always creating the name on one MDT and allocating the inode on another arbitrary MDT. One of the major problems that would arise is that having remote directory entries for every file would hurt file creation performance, as well as every lookup or unlink of that file in the future. With a remote entry, the client first has to do name->FID lookup in the parent directory, and then separately do FID->MDT lookup in the FID Location Database (FLDB, typically very fast since it is compact and cached on the client), and then fetch attributes/layout/xattrs for the FID from the second MDT. This would double the number of RPCs needed to access the majority of files. Keeping the directory entries and inodes on the same MDT is far more efficient for creation, lookup, and deletion.
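
To make the RPC argument concrete, here is a minimal sketch of the counting model described above. The `Entry` type and `rpcs_to_access` function are hypothetical names for illustration, not Lustre code, and the model assumes that a lookup on the parent MDT also returns the attributes when the inode is local.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    parent_mdt: int  # MDT holding the directory entry (name -> FID)
    inode_mdt: int   # MDT holding the inode (attributes, layout, xattrs)


def rpcs_to_access(entry: Entry, fldb_cached: bool = True) -> int:
    """Count RPCs needed to access one file in this simplified model."""
    # name -> FID lookup on the parent MDT; assumed to also return the
    # attributes when the inode lives on the same MDT.
    rpcs = 1
    if entry.inode_mdt != entry.parent_mdt:
        if not fldb_cached:
            rpcs += 1  # FID -> MDT lookup in the FLDB (normally cached)
        rpcs += 1      # fetch attributes/layout/xattrs from the second MDT
    return rpcs


local = Entry("file", parent_mdt=0, inode_mdt=0)
remote = Entry("file", parent_mdt=0, inode_mdt=1)
print(rpcs_to_access(local), rpcs_to_access(remote))
```

Under these assumptions a remote entry costs twice as many RPCs as a local one, and more if the FLDB mapping is not yet cached on the client.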

Instead, there are several mechanisms that can be used to distribute metadata loads/allocations across MDTs while keeping names/inodes mostly local to a single MDT.

  • automatic MDT selection for striped directories: if the shards of a DNE2 directory are load balanced across MDTs then the names created in those shards will also be balanced. Currently (AFAIK) the shards are allocated sequentially from the master MDT index unless otherwise specified, which is not ideal:
    lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
    mdtidx           FID[seq:oid:ver]
         0           [0x200000400:0x2:0x0]
         1           [0x240000401:0x2:0x0]
    lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
    mdtidx           FID[seq:oid:ver]
         0           [0x200000400:0x4:0x0]
         1           [0x240000401:0x4:0x0]
    lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
    mdtidx           FID[seq:oid:ver]
         0           [0x200000400:0x6:0x0]
         1           [0x240000401:0x6:0x0]

    This should be relatively easy to implement when striped directories are explicitly created, since all of this is decided on the MDS, and it can do MDS_STATFS RPCs to the other MDTs (as we already do with OSTs) to select MDTs based on free space, if the number of stripes is less than the number of MDTs.

  • automatic MDT selection for remote directories: this is a bit trickier, since the client specifies the FID for the remote directory, but one possibility is to have "lfs mkdir" get the MDT space usage on the client to decide which MDT to use, if it is not specified by the user. Another alternative would be for the MDS to just ignore the FID supplied by the client, and allocate its own remote directory and return the new FID to the client (this is already handled by clients, in case the file/directory already exists).
  • automatic remote MDT selection for new directories: once the above MDT selection mechanism exists, it would be possible to automatically create some subset of new directories on remote MDTs in order to balance the load across MDS nodes.
  • automatic restriping of large directories: this is related to LU-4684 "allow migrating DNE striped directory". Basically, when a directory grows too large (e.g. over 5000 entries), the LMV layout is changed to a striped directory so that it is automatically load balanced across MDS nodes. This could use either a PFL-like layout that keeps existing entries in the "master" directory and inserts new entries into the shards (lower overhead at split time, higher overhead during later lookups), migration of existing entries to the new shards (higher overhead at split, lower overhead during later lookups), or a combination of both (delayed migration from master to shards at some arbitrary time after the split). The benefit of automatically sharding large directories is that any subdirectories will also be distributed, and space used by Data-on-MDT objects will also be balanced naturally.

All of these options allow the majority of entries to remain local to the MDT where the inode is created, while distributing load across MDTs more evenly without user interaction.
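
As a rough illustration of the first option (selecting MDTs for the shards of a striped directory based on free space), the sketch below picks stripes by weighted random choice without replacement. The `select_mdts` function, its input format, and the weighting scheme are assumptions for illustration only. The free-space figures stand in for what the MDS would gather via MDS_STATFS, and nothing here reflects the actual OST QoS allocator.

```python
import random
from typing import Dict, List, Optional


def select_mdts(free_bytes: Dict[int, int], stripe_count: int,
                rng: Optional[random.Random] = None) -> List[int]:
    """Pick MDT indices for a striped directory, favouring free space.

    free_bytes maps MDT index -> free space (hypothetical statfs result).
    """
    rng = rng or random.Random()
    # Full MDTs are never candidates.
    candidates = {idx: free for idx, free in free_bytes.items() if free > 0}
    # Selection only matters when there are fewer stripes than MDTs.
    if stripe_count >= len(candidates):
        return sorted(candidates)
    chosen: List[int] = []
    for _ in range(stripe_count):
        indices = list(candidates)
        weights = [candidates[i] for i in indices]
        # Weighted pick: an MDT with twice the free space is twice as likely.
        pick = rng.choices(indices, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]  # each MDT holds at most one shard
    return chosen


# Example: MDT0 is nearly full, so it is rarely chosen for a 2-stripe directory.
print(select_mdts({0: 1 << 20, 1: 500 << 30, 2: 400 << 30, 3: 450 << 30}, 2))
```

The same routine could in principle pick a single MDT for a remote directory by passing `stripe_count=1`, which is one reason to keep the selection logic in one place.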

Comment by Di Wang [ 03/May/17 ]

One of the major problems that would arise is that having remote directory entries for every file would hurt file creation performance, as well as every lookup or unlink of that file in the future. With a remote entry, the client first has to do name->FID lookup in the parent directory, and then separately do FID->MDT lookup in the FID Location Database (FLDB, typically very fast since it is compact and cached on the client), and then fetch attributes/layout/xattrs for the FID from the second MDT. This would double the number of RPCs needed to access the majority of files.

Indeed, so we only split the name entry and the object for directories (remote directories), which probably means we only do QoS for directory creation.

This should be relatively easy to implement when striped directories are explicitly created, since all of this is decided on the MDS, and it can do MDS_STATFS RPCs to the other MDTs (as we already do with OSTs) to select MDTs based on free space, if the number of stripes is less than the number of MDTs.

Another alternative would be for the MDS to just ignore the FID supplied by the client, and allocate its own remote directory and return the new FID to the client (this is already handled by clients, in case the file/directory already exists).

This really makes sense to me, and it probably also means we only need to put MD QoS into LOD (no need for it in LMV), which will allow us to easily share the MDT/OST QoS code.

automatic restriping of large directories: this is related to LU-4684 "allow migrating DNE striped directory".

Even the current migrate tool (which rebalances objects across MDTs) will cover a lot of QoS needs, though it is not automatic. Btw: we also need a ticket for migrating Data-on-MDT objects.
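
For reference, the automatic restriping idea quoted above, with its PFL-like variant and delayed migration, can be sketched as follows. Everything here is a toy model under stated assumptions: the `Directory` class and its methods are hypothetical, directories are plain dicts, and the name-to-shard mapping simply takes an FNV-1a 64-bit hash modulo the shard count, which may differ from the real fnv_1a_64 LMV hash placement.

```python
from typing import Dict, List, Optional

SPLIT_THRESHOLD = 5000  # "e.g. over 5000 entries" from the discussion


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a hash."""
    h = 0xCBF29CE484222325
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


class Directory:
    def __init__(self, shard_count: int, threshold: int = SPLIT_THRESHOLD):
        self.master: Dict[str, str] = {}        # entries created before the split
        self.shards: List[Dict[str, str]] = []  # empty until the directory splits
        self.shard_count = shard_count
        self.threshold = threshold

    def _shard_for(self, name: str) -> Dict[str, str]:
        return self.shards[fnv1a_64(name.encode()) % len(self.shards)]

    def insert(self, name: str, fid: str) -> None:
        if name in self.master:  # existing entry: update it in place
            self.master[name] = fid
            return
        if not self.shards and len(self.master) >= self.threshold:
            # Split: old entries stay in master, so the split itself is cheap.
            self.shards = [{} for _ in range(self.shard_count)]
        target = self._shard_for(name) if self.shards else self.master
        target[name] = fid

    def lookup(self, name: str) -> Optional[str]:
        if self.shards:
            fid = self._shard_for(name).get(name)
            if fid is not None:
                return fid
        # Second probe in master: the extra lookup cost of the PFL-like layout.
        return self.master.get(name)

    def migrate_some(self, batch: int) -> int:
        """Delayed migration: move up to `batch` entries from master to shards."""
        if not self.shards:
            return 0
        names = list(self.master)[:batch]
        for name in names:
            self._shard_for(name)[name] = self.master.pop(name)
        return len(names)
```

Once `migrate_some` has drained the master, every lookup is satisfied by the first probe, which matches the "higher overhead at split, lower overhead during later lookups" trade-off.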

Comment by Andreas Dilger [ 04/Aug/18 ]

This is probably best handled at the directory level, rather than making every file remote by default.

Comment by Andreas Dilger [ 04/Aug/18 ]

This will be handled via other tickets and per-directory balancing.
