Details

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Minor

    Description

      In the current implementation, when a file is created, the file's inode must be on the same MDT where its name entry is located. It would be more desirable to allocate MDT objects using QoS policies, as we already do for OST objects.
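As a rough illustration of the kind of QoS policy used for OST objects (and proposed here for MDT objects), allocation can be weighted by free space. This is only a sketch; the function name, inputs, and numbers below are hypothetical, not Lustre APIs.

```python
import random

# Hypothetical sketch of QoS-style MDT selection weighted by free space,
# analogous in spirit to the OST QoS allocator.
def pick_mdt(free_kb_per_mdt, rng=random.Random(42)):
    """Pick an MDT index with probability proportional to its free space.

    The seeded RNG default is shared across calls, so repeated picks vary
    but the overall sequence is reproducible.
    """
    total = sum(free_kb_per_mdt)
    point = rng.uniform(0, total)
    acc = 0
    for idx, free in enumerate(free_kb_per_mdt):
        acc += free
        if point <= acc:
            return idx
    return len(free_kb_per_mdt) - 1

# An MDT with more free space is chosen proportionally more often:
free = [1000, 9000]  # MDT0 nearly full, MDT1 mostly empty
picks = [pick_mdt(free) for _ in range(10000)]
print(picks.count(1) / len(picks))  # ~0.9
```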


          Activity

            [LU-9435] DNE2 - object placement QoS policy

            adilger Andreas Dilger added a comment - This will be handled via other tickets and per-directory balancing.

            adilger Andreas Dilger added a comment - This is probably best handled at the directory level, rather than making every file remote by default.
            di.wang Di Wang added a comment - edited

            One of the major problems that would arise is that having remote directory entries for every file would hurt file creation performance, as well as every lookup or unlink of that file in the future. With a remote entry, the client first has to do name->FID lookup in the parent directory, and then separately do FID->MDT lookup in the FID Location Database (FLDB, typically very fast since it is compact and cached on the client), and then fetch attributes/layout/xattrs for the FID from the second MDT. This would double the number of RPCs needed to access the majority of files.

            Indeed, so we split the name entry and object only for directories (remote directories); that probably means we only apply the QoS policy at directory creation.

            This should be relatively easy to implement when striped directories are explicitly created, since all of this is decided on the MDS, and it can do MDS_STATFS RPCs to the other MDTs (as we already do with OSTs) to select MDTs based on free space, if the number of stripes is less than the number of MDTs.

            Another alternative would be for the MDS to just ignore the FID supplied by the client, and allocate its own remote directory and return the new FID to the client (this is already handled by clients, in case the file/directory already exists).

            This really makes sense to me, and it probably also means we only need to put the MD QoS code into LOD (not LMV), which will let us easily share the MDT/OST QoS code.

            automatic restriping of large directories: this is related to LU-4684 "allow migrating DNE striped directory".

            Even the current migration tool (rebalancing objects over MDTs) will meet many QoS needs, though it is not automatic. By the way, we also need a ticket for migrating Data-on-MDT objects.


            I don't think it is as straightforward as always creating the name on one MDT and allocating the inode on another arbitrary MDT. One of the major problems that would arise is that having remote directory entries for every file would hurt file creation performance, as well as every lookup or unlink of that file in the future. With a remote entry, the client first has to do name->FID lookup in the parent directory, and then separately do FID->MDT lookup in the FID Location Database (FLDB, typically very fast since it is compact and cached on the client), and then fetch attributes/layout/xattrs for the FID from the second MDT. This would double the number of RPCs needed to access the majority of files. Keeping the directory entries and inodes on the same MDT is far more efficient for creation, lookup, and deletion.
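The extra round trips described above can be sketched with a toy model. The dictionaries and helper below are hypothetical stand-ins, not Lustre data structures; they only show why a remote entry needs a second RPC where a local entry can combine lookup and getattr in one intent.

```python
# Toy model of the two-step remote-entry resolution described above.
parent_dir = {"fileA": "0x240000401:0x2:0x0"}      # name -> FID on parent's MDT
fldb = {"0x240000401": 1, "0x200000400": 0}        # FID seq -> MDT index (client cache)
mdt_attrs = {1: {"0x240000401:0x2:0x0": {"size": 4096}}}  # per-MDT attribute store

def resolve(name):
    rpcs = 0
    fid = parent_dir[name]; rpcs += 1              # RPC 1: lookup on parent's MDT
    mdt = fldb[fid.split(":")[0]]                  # FLDB hit from client cache, no RPC
    attrs = mdt_attrs[mdt][fid]; rpcs += 1         # RPC 2: getattr on the remote MDT
    return attrs, rpcs

attrs, rpcs = resolve("fileA")
print(rpcs)  # 2 RPCs, vs. 1 combined intent RPC for a local entry
```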

            Instead, there are several mechanisms that can be used to distribute metadata loads/allocations across MDTs while keeping names/inodes mostly local to a single MDT.

            • automatic MDT selection for striped directories: if the shards of a DNE2 directory are load balanced across MDTs then the names created in those shards will also be balanced. Currently (AFAIK) the shards are allocated sequentially from the master MDT index unless otherwise specified, which is not ideal:
              lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
              mdtidx           FID[seq:oid:ver]
                   0           [0x200000400:0x2:0x0]
                   1           [0x240000401:0x2:0x0]
              lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
              mdtidx           FID[seq:oid:ver]
                   0           [0x200000400:0x4:0x0]
                   1           [0x240000401:0x4:0x0]
              lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
              mdtidx           FID[seq:oid:ver]
                   0           [0x200000400:0x6:0x0]
                   1           [0x240000401:0x6:0x0]

              This should be relatively easy to implement when striped directories are explicitly created, since all of this is decided on the MDS, and it can do MDS_STATFS RPCs to the other MDTs (as we already do with OSTs) to select MDTs based on free space, if the number of stripes is less than the number of MDTs.

            • automatic MDT selection for remote directories: this is a bit trickier, since the client specifies the FID for the remote directory, but one possibility is to have "lfs mkdir" get the MDT space usage on the client to decide which MDT to use, if one is not specified by the user. Another alternative would be for the MDS to just ignore the FID supplied by the client, allocate its own remote directory, and return the new FID to the client (this is already handled by clients, in case the file/directory already exists).
            • automatic remote MDT selection for new directories: once the above MDT selection mechanism exists, it would be possible to automatically create some subset of new directories on remote MDTs in order to balance the load across MDS nodes.
            • automatic restriping of large directories: this is related to LU-4684 "allow migrating DNE striped directory". Basically, when a directory grows too large (e.g. over 5000 entries), the LMV layout is changed to a striped directory so that it is automatically load balanced across MDS nodes. This could either use a PFL-like layout that keeps existing entries in the "master" directory while new entries are inserted into the shards (lower overhead at split time, higher overhead during later lookups), migrate existing entries to the new shards (higher overhead at split, lower overhead during later lookups), or combine both (delayed migration from master to shards some arbitrary time after the split). The benefit of automatically sharding large directories is that any subdirectories will also be distributed, and space used by Data-on-MDT objects will also be balanced naturally.
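The split decision in the last bullet might look roughly like this. Only the 5000-entry figure comes from the comment; the function, threshold name, and mode strings are assumptions for illustration.

```python
# Hedged sketch of an auto-restriping policy for growing directories.
SPLIT_THRESHOLD = 5000  # entries, per the "e.g. over 5000 entries" example above

def split_plan(entry_count, keep_master=True):
    """Decide whether and how to restripe a directory that is growing."""
    if entry_count <= SPLIT_THRESHOLD:
        return "no split"
    if keep_master:
        # PFL-like: existing entries stay in the master directory, new
        # entries go to shards (cheap split, extra lookup overhead later).
        return "split, keep master; migrate lazily"
    # Migrate existing entries at split time (costly split, cheap lookups).
    return "split and migrate now"

print(split_plan(4000))   # no split
print(split_plan(6000))   # split, keep master; migrate lazily
```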

            All of these options allow the majority of entries to remain local to the MDT where the inode is created, while distributing load across MDTs more evenly without user interaction.
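For reference, the fnv_1a_64 hash type shown in the getdirstripe output above is the standard FNV-1a 64-bit hash, sketched below; Lustre's exact name-to-shard mapping may differ from this simple modulo.

```python
# FNV-1a 64-bit hash (standard offset basis and prime), as used for the
# lmv_hash_type fnv_1a_64 name distribution in striped directories.
FNV_OFFSET = 14695981039346656037  # 0xcbf29ce484222325
FNV_PRIME = 1099511628211          # 0x100000001b3

def fnv_1a_64(name: bytes) -> int:
    h = FNV_OFFSET
    for b in name:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return h

def shard_index(name: str, stripe_count: int) -> int:
    """Map a file name to a shard of a striped directory (illustrative)."""
    return fnv_1a_64(name.encode()) % stripe_count

# Different names spread across the shards of a 2-stripe directory:
print([shard_index(n, 2) for n in ("fileA", "fileB", "fileC")])
```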

            adilger Andreas Dilger added a comment -

            People

              laisiyao Lai Siyao
              jay Jinshan Xiong (Inactive)
              Votes: 0
              Watchers: 7
