LU-9961

FLR2: Relocating individual objects to a new OST

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0, Lustre 2.9.0

    Description

      This ticket is twofold: 1) how do we efficiently migrate data off a set of OSTs with current or quick-hack tools? 2) what features and tools can be developed in the future to improve usability?

      We have an older file system from which we are trying to remove a set of OSTs on a single server. To do this, we need to drain data off of the targets and onto the rest of the same file system. Initially we used the "lctl deactivate" method, but then switched to "lctl set_param osp.<OST>.max_create_count=0" due to the issues discussed in LU-4825 (and others). Unfortunately, users tend to keep data on this file system for a long time, so normal turnover is not significantly decreasing the usage of the OSTs.
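      For reference, the two approaches look roughly like the following, run on the MDS; the target names are placeholders for our actual OSTs (the device number from "lctl dl" can also be used with --device):

        # Original approach: deactivate the OSP/OSC device for the OST on the MDS
        # (abandoned due to the issues discussed in LU-4825):
        lctl --device testfs-OST0004-osc-MDT0000 deactivate

        # Current approach: keep the OST active for reads/unlinks, but stop new
        # object creation on it so it slowly drains as files turn over:
        lctl set_param osp.testfs-OST0004-osc-MDT0000.max_create_count=0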

      Our understanding is that (with current versions of Lustre) to move objects off of the desired targets all parts of the file would have to be read and rewritten by a client, even if only one of many stripes is on the target. This is unacceptably inefficient, as the existence of files striped with a count of more than one results in a huge increase in the total data to be rewritten. In our case it would be well over half of the data on the file system. We are able to gather file size and stripe data (and the lists of files) using e2scan, Lester, and wrapper scripts.

      1) What tools and methods are available to drain the OSTs more efficiently? Preferably it would involve a background process that copies just the individual objects from the OSS being drained. We are then open to taking a moderate downtime to update the MDT layout info to point to the new object locations.

      2) Can this use case be supported in the future by a tool like lfs_migrate, with options or auto-detection, to only move the parts of the file that need to be moved to satisfy the new target requirements? As an interim step, a server-side administrator command to initiate the object relocation would be fine.

      Thanks!

       


          Activity


            adilger Andreas Dilger added a comment -

            tappro, is it possible today to migrate only from a DoM component to the first (overlapping) OST object of the first component, without migrating (copying) the rest of the file? Essentially, I'm wondering if there is some efficient way to "de-DoM" a file layout on large PFL files that have a DoM component (and use up MDT space) but don't really need it.

            IIRC, when you implemented DoM file migration this was implemented as "mirror MDT inode data to/from first OST object", but I'm not sure if there is an interface to run only that part of the migration. If not, how hard would it be to implement that? Presumably something like "lfs migrate -L mdt --SOMETHING FILE" and it would do this mirror over to the OST object and drop the DoM component from the file.
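            For context, the only way I know of to do this today is a full-file migrate to a layout without a DoM component, which rewrites all of the data rather than just the DoM extent. A rough sketch with placeholder layout values, assuming a release where "lfs migrate" accepts composite layout options:

              lfs getstripe /mnt/testfs/bigfile                      # shows the DoM (mdt) first component
              lfs migrate -E 1M -c 1 -E -1 -c 4 /mnt/testfs/bigfile  # rewrite into an OST-only PFL layout
              lfs getstripe /mnt/testfs/bigfile                      # DoM component gone, but all data was copied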


            adilger Andreas Dilger added a comment -

            For implementing single-object data migration, I think there are a few low-level interfaces that are needed (see the sketch after this list):

            • being able to "open()" a single object of a file for read/write, similar to how "lfs mirror read/write" and "lfs mirror resync" are able to read/write a specific mirror copy of a file. Such file descriptors need to be opened with O_DIRECT so that they do not pollute the page cache with pages at the wrong offsets
            • ability to call SEEK_HOLE and SEEK_DATA on the single-object file descriptor, so that they make an "exact" copy of the object, and do not fill in sparse parts of the object during data transfer. This may "just work" once RPCs can be sent directly to the OST holding the object.
            • ability to perform a layout swap of the one OST object in the volatile "victim" file to the correct component/stripe of the source file, with proper data version checks to ensure that the target object has not been modified since the migration started. This cannot be a whole layout swap with a full copy of the source layout, because the victim file is an "open-unlinked" file and when it is closed all of the objects in it will be deleted/destroyed.
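            For reference, a sketch of the existing mirror-level analogs in FLR (Lustre 2.11+) that a per-object interface could parallel; these operate on whole mirror copies today, not on a single OST object, and the file/path names are placeholders:

              # Read the data of one mirror copy of a file into a separate file:
              lfs mirror read --mirror-id 2 -o /tmp/copy /mnt/testfs/file
              # Write data back into a specific (e.g. stale) mirror copy:
              lfs mirror write --mirror-id 2 -i /tmp/copy /mnt/testfs/file
              # Resynchronize stale mirrors, with data version checks against concurrent writes:
              lfs mirror resync /mnt/testfs/file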
            charr Cameron Harr added a comment - - edited

            Andreas, it was with MDT-related migration, but I had to do some digging to remember the migrate issue: LU-11481.

            We'll look forward to the enhanced version - especially one that is parallelized.

             


            adilger Andreas Dilger added a comment -

            Cameron, when you reference "lfs migrate", do you mean OST or MDT object migration? AFAIK, there have not been any issues with OST object migration for a long time. The MDT object migration (directories and inodes) is indeed more complex; it was rewritten in 2.12 and is being further enhanced in 2.14 to allow easier space balancing of directories and such.
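            For example, directory migration between MDTs is done with "lfs migrate -m"; the MDT index and paths below are placeholders:

              lfs getdirstripe /mnt/testfs/somedir    # show which MDT(s) hold the directory
              lfs migrate -m 1 /mnt/testfs/somedir    # move the directory and its metadata to MDT0001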


            charr Cameron Harr added a comment -

            Thanks Andreas. I understand that lfs migrate isn't stable in 2.10, but likely we'll be >= 2.12 when we start adding new hardware. For now, my comments can be taken as an "up vote" for these functionalities.


            adilger Andreas Dilger added a comment -

            Cameron,
            thanks for your input. It is definitely useful to get feedback on what functionality is useful in the field.

            There are definitely already a number of ways to do transparent data movement with newer Lustre versions. In particular, "lfs migrate" and "lfs_migrate" work together to rebalance/evacuate OSTs, or "lfs mirror" can be used to handle storing files on multiple tiers of storage. Some of the issues described here (migrate only a single OST object) are not yet implemented, so there is still work to be done. I think we also need to be able to deactivate old OSTs and remove them from the filesystem a bit more easily than today.
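            For example, evacuating one OST with the existing tools looks roughly like this (run on the MDS and a client respectively; filesystem and target names are placeholders):

              # On the MDS: stop new object allocation on the OST being drained:
              lctl set_param osp.testfs-OST0004-osc-MDT0000.max_create_count=0
              # On a client: migrate all files with objects on that OST to the remaining OSTs:
              lfs find /mnt/testfs --obd testfs-OST0004_UUID | lfs_migrate -y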


            charr Cameron Harr added a comment -

            The features mentioned above (evacuating data from OSTs, shrinking the file system, and filesystem-wide rebalancing behind the scenes) are all ones I've used on many occasions with a competing product. Another use case for shrinking the f/s that I've used multiple times is to reallocate disks (OSTs) from one file system to another. (Of course, with ZFS, you can create multiple datasets in the same pool, but some may want to dedicate OSSs to particular file systems.)

            I would love to see those functionalities available, especially as we look at doing in-place upgrades in the coming years: add in racks of new OST or MDT hardware, let Lustre rebalance across all OSTs, or evacuate the data of particular OSTs/MDTs so they can be retired.

            Bob.C Bob Ciotti (Inactive) added a comment - - edited

            The operational need for these becomes a little more important as filesystems/equipment age. Some context might be useful. We grow filesystems opportunistically, that is, we grow an existing filesystem or create a new one based on the need to improve performance, budget, and technology readiness. If we simply add OSTs, that's the primary source of imbalance we are concerned with. Of course the existing OSTs are full and the new ones are at 0, so how do we get them balanced in reasonable time/effort? Copying is straightforward, but it's not clear that the resulting layout is really optimal.

            Example: Our last expansion added 60 10TB drives into enclosures that had 120 8TB drives that were basically full (let's say 90%). So that's adding 420TB (usable) to 672TB (67TB free), i.e. ~1.2PB/OSS x 20 OSS == 22PB (10PB free). Our desire is to eventually have access spread evenly over all the OSTs/HDAs. Imagine 100 users, each with piles of files, all at different levels of expertise WRT Lustre striping. We can identify some outliers and move those for the user; putting this on the user is hit or miss, as they have better things to do. One thing we want to avoid, though, is all the moved files ending up exclusively on the new OSTs, since that is likely to limit how work gets spread across all the HDAs. Both approaches suffer from administrator overhead and coordination with the user. In any event, a manual restripe is likely a months-long process and fairly disruptive, and probably requires moving large portions of the data twice: once to level the OSTs so that the round-robin allocator works as desired (engaging all OSTs/HDAs) for both newly created files and the files targeted for re-striping, and once to re-stripe the files that were moved first and all landed on the new OSTs. Keep in mind that coordinating the re-striping with users is going to disrupt them, some of whom are working to very tight deadlines on highly visible programs like the Orion launch system.

            Now to shrinking. Because we typically have multiple generations of ageing equipment, I have to manage depreciation and maintenance as maintenance costs and failure rates ramp up. Think of the frog in slowly boiling water. The lifetime of the underlying HW differs by component: drives, DIMMs, servers, HCAs, etc. In addition to HW lifetimes, there are also budget cycles that can be very dynamic, and technology cycles. We are stuck between paying large maintenance bills or shutting down the entire filesystem, keeping in mind that shutting down a filesystem is something like a year-long process, and we need to provision new equipment before that. But if I can spare out HW components as I need them, that gives me a way to better time both the addition of new equipment and the end-of-life of the old, and to maintain tighter control of my budget.

            There is also a third element of what we might call 'object shuffling' that has nothing to do with growing/shrinking filesystems, and that's re-arranging layouts to either optimize access (e.g. a user didn't get it right the first time) or to isolate a user (or users) to a subset of the underlying HW because they are disruptive to others and can't change their code.

            So, we are going to take a just-in-time approach to gradually and efficiently ramp up and ramp down our various capabilities, and of course we want to do this with minimal impact to our user community and in bite-size chunks to keep our costs/budgeting under control.


            adilger Andreas Dilger added a comment -

            To be honest, outside of draining an OST for replacement, and migrating data between different tiers of storage, I think OST imbalance indicates a shortcoming with the object allocator. Today, OST imbalance will typically only happen if someone is writing a huge file onto a single OST, or if the OSTs are significantly different in size. The best way to avoid the need for data migration is to allocate fewer files to full OSTs on an ongoing basis. PFL will avoid this problem to a large extent, because files will be spread across multiple OSTs as they grow larger. Also, the weighted round-robin allocator (LU-9, no, not a typo) would help avoid significant OST imbalance from happening in the first place.
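            For illustration, a PFL default layout of the sort that spreads files over more OSTs as they grow; the component boundaries and stripe counts below are only placeholders:

              # Small files stay on 1 OST, medium files use 4, large files stripe over all OSTs:
              lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 /mnt/testfs
              lfs getstripe -d /mnt/testfs      # show the resulting default directory layout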

            I'm not trying to say that there isn't a need for efficient object migration in some cases, but IMHO it would be better to avoid moving large amounts of data around on an ongoing basis.


            ndauchy Nathan Dauchy (Inactive) added a comment -

            Andreas, thank you for your comments... good context and related issues.

            For our case, I should clarify that we were not really looking for "client-side interfaces".  Ideally this object relocation is done between OSSs for efficiency.

            I agree that there is commonality with the File Level Redundancy project.  We could use that framework if it handles the case of adding a redundant copy for just one extent in a progressive file layout, whereby it would copy the data to a new OST, and then we could remove the redundancy explicitly stating which copy to keep.  Maybe?
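            For what it's worth, the closest FLR-based sequence I can see today mirrors the whole file rather than a single extent, roughly as follows (pool and file names are placeholders, and this assumes the original copy is mirror ID 1):

              lfs mirror extend -N -p new_ost_pool /mnt/testfs/file   # add a mirror on the new OSTs
              lfs mirror resync /mnt/testfs/file                      # ensure the new copy is in sync (may be a no-op)
              lfs mirror split --mirror-id 1 -d /mnt/testfs/file      # drop and destroy the original copy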

            Another use case for the functionality we seek is automatic/background OST capacity rebalancing.  The general process would be to copy objects from full OSTs, update the layout on the MDT, remove the original object.  (Our case is just that we want to specify that operation from an OST even if it is not full.)  Capacity rebalancing is definitely something to be concerned with when a filesystem is growing.


            adilger Andreas Dilger added a comment -

            Nathan,
            the current approach for file migration is, as you note, whole file migration rather than migrating individual objects. There are several reasons for this, including the ability to change the file layout during migration and to avoid exposing low-level layout details to userspace (which is growing more complex with PFL and FLR), but most importantly to minimize the risk of data corruption during the migration. If clients are able to arbitrarily change the file layout (rather than the current mechanism of doing a "layout swap" between two files) then it is possible to have multiple files using the same objects, leading to data corruption/loss if one of the files is deleted, or for users to potentially access data in objects where they do not have POSIX file access permission. I agree that the current mechanism is sub-optimal in some cases, but as yet there isn't any mechanism to do single object migration.

            In the case of whole-OST (or MDT) replacement, the most efficient process is to do a whole-device copy from the old storage to the new storage. If using ldiskfs + LVM/DM, it is possible to create a mirror of the OST which will be resynched in the background while the OST is in use, and then the old mirror would be removed. Alternately, it would be possible to use "pvmove" to migrate the LV incrementally. The mirror approach has the benefit that the original data will remain intact during the migration, but this will double IO bandwidth for writes until the mirror is removed, while the pvmove approach will have the OST split across two storage pools for some time (several days perhaps). If using ZFS, it is possible to do incremental zfs send/recv to do the bulk of the data migration online, and then only take a small downtime to do a final offline resync and swap to the new storage.
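            A rough sketch of both approaches, with placeholder device/pool names:

              # ldiskfs OST on LVM: move the OST's extents to the new storage while it stays online:
              vgextend ost_vg /dev/new_pv
              pvmove /dev/old_pv /dev/new_pv
              vgreduce ost_vg /dev/old_pv

              # ZFS OST: bulk copy online with send/recv, then a short outage for the final increment:
              zfs snapshot ostpool/ost0@bulk
              zfs send ostpool/ost0@bulk | zfs recv newpool/ost0
              # ... later, after stopping the OST:
              zfs snapshot ostpool/ost0@final
              zfs send -i @bulk ostpool/ost0@final | zfs recv newpool/ost0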

            If you are changing wide-stripe files from N stripes to N-1 stripes because OSTs are being removed, then the whole file needs to be rewritten in any case, since there isn't a 1:1 mapping of old objects to new objects. I agree that this migration may take a long time, but probably not longer than waiting for an OST to drain of its own accord.

            While we've previously discussed the desire to do single-object migration, this hasn't yet been a priority for anyone to implement.

            Bob,
            in terms of filesystem shrinking, this is generally a much lower priority than filesystem growing. It is possible to remove OSTs by migrating off all files contained thereon, but I agree that the interfaces for doing so are not as well polished as they could be. In 2.10 the handling of deactivated OSTs is slightly improved in that they are not listed in "lfs df" output.

            In terms of splitting a filesystem in half for QOS reasons, I'd rather address this by using QOS and OST pools within a single filesystem to provide maximum flexibility. The NRS TBF policy is being improved by DDN in each release, to allow server-side QOS policies based on client node, MPI job, RPC type, and soon UID/GID. Also, Uni Mainz and DDN are working on a global QOS scheduler that integrates with SLURM. In 2.10 the handling of OST pools has improved to allow specifying a global default OST pool, as well as improving how pools are inherited from the parent directory.
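            For example, a subset of OSTs can already be carved out with an OST pool inside one filesystem (filesystem/pool names and OST indices are placeholders):

              # On the MGS: create a pool and add the desired OSTs to it:
              lctl pool_new testfs.fastpool
              lctl pool_add testfs.fastpool testfs-OST0004
              lctl pool_add testfs.fastpool testfs-OST0005
              # On a client: direct new files under a directory to that pool:
              lfs setstripe -p fastpool /mnt/testfs/projectX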

            In terms of what might be done in the short term to address individual object migration, I don't think there are any client-side interfaces which might be subverted to allow this to happen (intentionally so). In theory, if you know that the files in question are immutable, you could locally mount the OST filesystem(s) as type ldiskfs and copy the individual objects (which isn't something we ever test), or use debugfs -c -R "dump -p O/0/d$((objid %32))/$objid victim" /dev/OSTnnnn to copy each object directly from the source OST to a new single-stripe Lustre victim file (on another OST, assuming OSTnnnn is blocked from creating new objects). Copying the object directly from the OST avoids the need to understand the file layout. You would then need to rewrite the old file layout directly in the MDT xattr to reference the new OST object and OST index (or FID if using DNE), and also rewrite the victim layout to reference the old OST object so that when it is deleted it will not also destroy the newly-copied OST object. Alternately, you could delete the victim files directly from the MDT ldiskfs mount so that Lustre does not try to remove the associated OST object.
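            A rough, untested sketch of that manual procedure (object ID, OST device, and paths are placeholders, and as noted it is only viable if the file is known to be immutable):

              objid=123456
              victim=/mnt/testfs/.victim.$objid
              lfs setstripe -c 1 -i 12 "$victim"     # pre-create a single-stripe victim file on the target OST
              # Copy the raw object data from the source OST directly into the victim file:
              debugfs -c -R "dump -p O/0/d$((objid % 32))/$objid $victim" /dev/OSTnnnn
              # The old file's layout xattr on the MDT must then be rewritten to reference the new
              # object, and the victim's layout (or the victim itself, removed directly on the MDT)
              # fixed up so that deleting it does not destroy the newly-copied object, as described above.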

            For the long term, we would probably want an interface similar to "layout swap" that was "object swap" to allow a single object in a file layout to be replaced by another object from a victim file. This would be useful in the case of 1:1 file migration, as well as resync of mirrored files (objects) that have become stale for the File Level Redundancy project. However, we don't have any plans to implement this currently.


            People

              pjones Peter Jones
              ndauchy Nathan Dauchy (Inactive)
              Votes: 1
              Watchers: 15