
Improved object allocator for FLR composite files

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0

    Description

      The current MDS object allocator is designed only to allocate objects for one file at the time the file is first created. For progressive file layouts, the allocator needs, at a minimum, to avoid allocating objects on OSTs that are already part of a file's other components. If a file has multiple objects allocated on the same OSTs before objects are allocated from unused OSTs, there may be a significant performance loss from oversubscribing the bandwidth of those OSTs compared to the other OSTs. The only exception may be a fully-striped component at the end of the file (see Example Progressive Layouts for more detail), where it would be acceptable to allocate objects across all of the available OSTs to maximize the bandwidth available for the file.
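
As a rough illustration of the behaviour requested above (not the code that landed in LOD), here is a minimal Python sketch. OSTs are modelled as plain integers, and `choose_osts`, `used_osts` and `fully_striped` are hypothetical names introduced only for this example.

```python
def choose_osts(all_osts, used_osts, stripe_count, fully_striped=False):
    """Pick OSTs for a new component of a composite file (illustrative only).

    all_osts      -- available OST indices, in allocator preference order
    used_osts     -- OST indices already holding objects of other components
    stripe_count  -- number of objects wanted for the new component
    fully_striped -- True for a fully-striped last component
    """
    if fully_striped:
        # Exception from the description: spread across every available OST
        # to maximize bandwidth, even if some are already in use.
        return list(all_osts)

    unused = [o for o in all_osts if o not in used_osts]
    reused = [o for o in all_osts if o in used_osts]
    # Prefer unused OSTs; fall back to already-used ones only when there
    # are not enough unused OSTs to satisfy the stripe count.
    return (unused + reused)[:stripe_count]
```

The fallback to already-used OSTs reflects the best-effort nature of the request for PFL files; a stricter policy could fail the allocation instead.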

          Activity

            [LU-9007] Improved object allocator for FLR composite files
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32813/
            Subject: LU-9007 lod: get rid of comp ost in use array
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a277952c65d4aad1abb9ac9f759af16a43902068

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32404/
            Subject: LU-9007 lod: improve obj alloc for FLR file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fabf3fe7ac06d916d8c433a99f1f4a4bd3632638

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32813
            Subject: LU-9007 lod: get rid of comp ost in use array
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 73c65d7e3402e2465a0b1042eb7ccaf185730b87

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32404
            Subject: LU-9007 lod: improve obj alloc for FLR file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1a0c093c55d034ba6013f05f7c2a68664d3d0901

            adilger Andreas Dilger added a comment -

            One simple proposal is to check the NID of each OST to automatically separate OSTs into fault domains based on which OSS they are located on. This is not perfect, but it is simple and works for all existing systems without additional input from the administrator. The existing QOS RR allocator will already prefer to distribute allocations across OSS nodes if possible, but this can be extended to make it a requirement for all allocations.

            Secondly, a simple integer domain value can be assigned to each OST by the administrator, and the LOD can use this to separate OSTs into independent groups. OSTs with the same domain number should not be used for redundancy for other components.
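
To make the two proposals in this comment concrete, here is a small Python sketch (an illustration only, not the LOD implementation). It assumes each OST's OSS NID is available as a string, and treats the administrator-assigned integer as an optional override; `fault_domains`, `ost_nids` and `admin_domains` are hypothetical names.

```python
from collections import defaultdict


def fault_domains(ost_nids, admin_domains=None):
    """Group OST indices into fault domains (illustrative only).

    ost_nids      -- dict: OST index -> NID string of the OSS serving it
    admin_domains -- optional dict: OST index -> integer domain assigned by
                     the administrator; takes precedence over the NID
    """
    admin_domains = admin_domains or {}
    groups = defaultdict(list)
    for ost, nid in sorted(ost_nids.items()):
        if ost in admin_domains:
            key = ("admin", admin_domains[ost])
        else:
            # Default proposal: OSTs on the same OSS share a fault domain.
            key = ("nid", nid)
        groups[key].append(ost)
    return dict(groups)
```

Under this model, OSTs that end up in the same group would not be chosen to hold redundant copies of the same data.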

            adilger Andreas Dilger added a comment -

            I think encoding anything into the OST index is a non-starter. This would totally break for existing filesystems, and administrators would have a hard time getting it right, and then it would break if they needed to move nodes around for some reason. We already have OSS and OSS failover information in the LOD, so we may as well use it. In fact, the QOS RR allocator already spreads stripes across OSS nodes to avoid contention if possible. We can add in other rack/switch/power information later if we actually need it.

            I don't think that understanding "PFL" vs. "FLR" in LOD is quite the right thing, but rather it will understand layout components and whether they are sequential or overlapping, and select the best OSTs in that case.
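
The "sequential or overlapping" distinction mentioned in this comment can be sketched as a simple extent check. This is a hedged Python illustration, assuming each component covers a half-open byte range `[start, end)` with `None` standing for end-of-file; `extents_overlap` is a hypothetical helper, not a Lustre function.

```python
def extents_overlap(a, b):
    """Return True if two component extents share any byte (illustrative).

    Each extent is a (start, end) tuple covering [start, end); end=None
    means the component extends to end-of-file.
    """
    a_start, a_end = a
    b_start, b_end = b
    a_end = float("inf") if a_end is None else a_end
    b_end = float("inf") if b_end is None else b_end
    # Two half-open ranges intersect when each starts before the other ends.
    return a_start < b_end and b_start < a_end
```

Sequential components, such as `(0, 1 << 20)` followed by `(1 << 20, None)`, do not overlap, while two components that both cover `(0, None)` do; an allocator could apply the stricter OST separation only in the overlapping case.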

            jay Jinshan Xiong (Inactive) added a comment -

            Do we actually distinguish whether the components are for generic PFL or FLR? It seems like a bad idea to me for the LOD layer to know that information. I would like to make this allocation policy as generic and best-effort as possible.

            Sparse OST indices have been supported for a long time. What do you think about partitioning OST indices based on distance? The distance is defined by servers, racks, and switches. In any case, the more information the allocation policy can get, the better the decisions it can make.

            adilger Andreas Dilger added a comment -

            This work is also a prerequisite for FLR-related improvements to the MDS object allocator. While PFL prefers that objects in different components are not on the same OSTs, this is not a hard requirement. At worst this impacts performance, and in some cases (e.g. a widely striped last component) it may even be desirable to re-use the same OSTs in order to maximize the bandwidth of large files.

            FLR has similar, but more specific requirements for OST selection on components with overlapping extents, in order of decreasing priority:

            1. objects with overlapping components must not share the same OST. This implies max replica count == OST count/component stripe count. In theory this could be relaxed if all replicas have the same stripe_count and stripe_size; it would then only require that the same OST cannot be at the same stripe_index of different components, in which case max replica count == OST count.
            2. objects with overlapping components should not share OSTs on the same OSS node (by NID from imp->imp_connection->c_peer.nid, as qos_add_tgt() does) to avoid the shared node failure domain.
            3. objects with overlapping components should not share OSTs on the same OSS failover pair (by failover NID from imp->imp_conn_list.oic_conn->c_peer.nid, as lprocfs_import_seq_show() does) to avoid the shared storage enclosure/controller failure domain. There may be other OSS nodes that share the same storage enclosure/controller, but there isn't any way for the client to determine this automatically.
            4. objects with overlapping components should not be on OSTs on the same network switch, power supply, rack, etc. but this depends on external information that is not currently available to Lustre. That could optionally be added via a separate configuration file/options, but the above cases will automatically cover
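
The first three priorities in the list above can be sketched as a ranking function. This is a minimal Python illustration under stated assumptions, not the LOD code: the primary and failover NIDs are passed in as plain dicts, two OSTs are treated as sharing a failover pair when their {primary, failover} NID sets are equal, and `max_replicas`, `penalty` and `pick_ost` are hypothetical names.

```python
def max_replicas(ost_count, stripe_count, same_striping=False):
    """Upper bound on replicas implied by rule 1 (illustrative)."""
    # Relaxed case: identical stripe_count/stripe_size across replicas only
    # forbids the same OST at the same stripe_index.
    return ost_count if same_striping else ost_count // stripe_count


def penalty(cand, used, nid, failover):
    """Lower is better: 0 = independent, 1 = same failover pair, 2 = same OSS."""
    if any(nid[cand] == nid[u] for u in used):
        return 2  # rule 2: shares an OSS node with an overlapping component
    pair = frozenset((nid[cand], failover[cand]))
    if any(pair == frozenset((nid[u], failover[u])) for u in used):
        return 1  # rule 3: shares an OSS failover pair (assumed model)
    return 0


def pick_ost(candidates, used, nid, failover):
    """Pick the best OST for a replica, or None if rule 1 rules out all."""
    # Rule 1 is a hard requirement: never reuse an OST already holding
    # an object of an overlapping component.
    allowed = [c for c in candidates if c not in used]
    if not allowed:
        return None
    return min(allowed, key=lambda c: (penalty(c, used, nid, failover), c))
```

Rule 4 (switch, power supply, rack) is left out because, as noted above, that information is not currently available to Lustre.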

            People

              bobijam Zhenyu Xu
              jgmitter Joseph Gmitter (Inactive)