Lustre / LU-15211

lfs migrate metadata performance test plan

Details

    • Type: Question/Request
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Environment:
      client:
      toss 3.7-14.1
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      lustre 2.12.7_2.llnl

      server:
      toss 4.1-5
      4.18.0-240.22.1.1toss.t4.x86_64
      zfs 2.0.52_2llnl-1
      lustre 2.14.0_5.llnl

    Description

      lfs-migrate Metadata Performance Testing

      While trying to use lfs-migrate for metadata migration, we found that lfs-migrate performance does not scale well with additional processes. Even when using many processes and nodes, sustained performance was around 400 items/second, which is too slow to be practical for migrations of large numbers of files and directories.

      This test plan describes additional tests to determine whether the above results are in fact at, or near, the limit of lfs-migrate's performance.

      Overview

      The performance to be measured is the rate at which items (files and directories) can be migrated. These items will be in a tree (or trees) and migrated by many processes running lfs-migrate in parallel.

      The 3 basic parts of the test are:

      • create the trees
      • migrate the trees
      • analyze the data generated during the migration

      Create the Trees

      A single tree can be created using mdtest. mdtest has the ability to make trees of files and directories, and can parameterize those trees in most of the ways necessary for this test.

      The major shortcoming of mdtest is that it doesn't set the striping and directory striping of the trees it creates. This can be overcome by pre-creating directories, setting their striping and directory striping, and then having mdtest create trees within these directories so that each tree inherits these settings from its respective parent directory.

      The commands used to create the trees need to be saved. This includes the per-directory mdtest command as well as the commands that create the directories and set their striping and directory striping. Also, mdtest will be run with srun, so the whole srun command needs to be saved, because the srun parameters affect the size and shape of the tree.
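
      As a rough illustration of this per-directory setup (a sketch only; the directory names, stripe counts, MDT index, task counts, and mdtest flags below are placeholders to be chosen per test case), the tree creation might look like:

      # Pre-create one parent directory per tree and set its striping and
      # directory striping; trees created inside will inherit these settings.
      NTREES=8
      for i in $(seq 0 $((NTREES - 1))); do
          dir=/mnt/lustre/migtest/tree.$i
          lfs mkdir -c 1 -i 0 "$dir"    # directory striping: 1 stripe on MDT0000 (placeholder)
          lfs setstripe -c 1 "$dir"     # file striping inherited by files created below
      done

      # Populate each directory with an mdtest tree (zero-length files only,
      # create phase only). Save this exact srun command with the run data.
      for i in $(seq 0 $((NTREES - 1))); do
          srun -N 1 --ntasks-per-node=8 \
              mdtest -C -F -n 1024 -u -d /mnt/lustre/migtest/tree.$i
      done

      The -F flag restricts mdtest to files; the directory-only cases would use mdtest's directories-only mode instead.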

      Migrate the Trees

      The migration is done in parallel by many processes, each running lfs-migrate on one of the directories that contains a tree created by mdtest. The many processes are created and spread across multiple client nodes using srun.

      Data needs to be collected during the run. Process 0 will record run-wide data such as total items migrated, and each process will write its own performance data. This will generate one file per process, plus one more file for the run-wide data. Some of the collected data could be inferred from other data (or from the slurm database), but recording it simplifies post-processing. A sketch of a per-process wrapper that records this data follows the lists below.

      Data to Collect Per Run

      • total items migrated
      • total data migrated
      • the mdtest command and the striping/dirstriping commands
      • slurm jobid
      • the srun command that does the migration

      Data to Collect Per Process

      • start time (of lfs-migrate)
      • end time (of lfs-migrate)
      • source MDTs
      • destination MDTs
      • the lfs-migrate command
      • lfs getdirstripe output for the root of the tree the process will migrate
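
      A minimal sketch of a per-process wrapper that records the items above (the paths, data directory, and destination MDT index are placeholders; the source and destination MDTs can be recovered from the saved getdirstripe output and the -m argument):

      #!/bin/bash
      # migrate_one.sh - hypothetical per-process wrapper, launched once per
      # task by srun; SLURM_PROCID selects which tree this process migrates.
      tree=/mnt/lustre/migtest/tree.$SLURM_PROCID
      out=$DATADIR/proc.$SLURM_PROCID
      dest_mdt=1                                              # placeholder destination MDT index

      lfs getdirstripe "$tree" > "$out.getdirstripe"          # dir striping of the tree root
      echo "lfs migrate -m $dest_mdt $tree" > "$out.cmd"      # the exact command being run
      date +%s.%N > "$out.start"                              # start time
      lfs migrate -m "$dest_mdt" "$tree"
      date +%s.%N > "$out.end"                                # end time

      The migration itself would then be launched with something like "srun -N <nodes> --ntasks-per-node=<ppn> ./migrate_one.sh", and that srun command saved with the run-wide data.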

      Potential Parameters to Vary between Runs

      • total number of processes, nodes*ppn
        • the number of processes per node (2,8,16)
        • the number of nodes (1,8,32)
      • the kind of items that are migrated (files,directories)
      • how many items per process are migrated (1K, 8K, 64K configured with mdtest command)
      • file size = 0, fixed
      • DoM or not DoM
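
      For the DoM cases, one possible arrangement (a sketch; the 64 KiB component size is a placeholder) is to put a composite layout on the pre-created parent directories so that files created by mdtest inherit it:

      # Placeholder DoM layout: first 64 KiB of each file on the MDT,
      # remainder (empty here, since the test files are zero-length) striped normally.
      lfs setstripe -E 64K -L mdt -E -1 -c 1 /mnt/lustre/migtest/tree.$i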

      Initial Runs Planned

      Note that the above is still probably a larger parameter space than is necessary to find first-order bottlenecks (3*3*2*3*1*2 == 108 tests). To reduce the number of tests and the expected total run time, only the following tests will be run initially. More complete testing of the parameter space will be performed as needed once developers are engaged.

      1. Find the values of nodes and ppn that maximize overall lfs-migrate rate for files only, 8K per process, without DoM (9 tests)
      2. Using those values for nodes and ppn, test each of the items-per-process values listed above, and record the value of items/process (ipp) that maximizes the overall lfs-migrate rate for files only, without DoM (3 tests)
      3. Using those values for nodes, ppn, and ipp, test with files with DoM and files without DoM (2 tests)

      Data Analysis

      The data recorded for each run will all go into a single directory, along with the tree(s) creation data. A script will read the run metadata and per-process performance data and calculate the rate at which items were migrated. The important input parameters and corresponding results for all runs will be output as a CSV.
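
      A rough sketch of the rate calculation, assuming the hypothetical per-process start/end files written by the wrapper above and a placeholder run-wide file holding the total item count:

      # Overall rate = total items migrated / (latest end time - earliest start time).
      start=$(cat "$DATADIR"/proc.*.start | sort -n | head -1)
      end=$(cat "$DATADIR"/proc.*.end | sort -n | tail -1)
      items=$(cat "$DATADIR"/run.total_items)      # placeholder run-wide metadata file
      rate=$(echo "$items / ($end - $start)" | bc -l)
      # JOBID, NODES, and PPN would come from the saved run metadata.
      echo "jobid,nodes,ppn,items,rate" > results.csv
      echo "$JOBID,$NODES,$PPN,$items,$rate" >> results.csv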

      Performance Comparison

      For comparison, other performance metrics with the same file system and clients will be gathered:

      • mdtest will be run with the same node and ppn combinations and enough objects per process to make each mdtest stage (e.g. create, unlink, etc.) take at least 10 minutes.
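
      For example (a sketch; the -n value is a placeholder to be tuned so that each phase takes at least 10 minutes for the given nodes/ppn combination):

      srun -N 32 --ntasks-per-node=16 \
          mdtest -F -n 65536 -u -d /mnt/lustre/mdtest-compare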

    Attachments

    Issue Links

    Activity

            ofaaland Olaf Faaland added a comment -

            Improved lfs migrate performance would be very useful for us, but we have worked out migration methods that are performant enough based on dsync(1) from mpifileutils. Removing topllnl.


            adilger Andreas Dilger added a comment -

            For directory/inode migration, this is mostly done on the MDS, and is only triggered by the client, because the whole operation has to be handled within a filesystem transaction on the MDT, so having the client involved would not improve things. I think that using 1-level migrations may improve parallelism, but the MDS may also throttle the amount of work that is being done to avoid consuming a large number of MDS service threads, since this can take a long time.

            There are almost certainly improvements to be had in this operation, since it has not been a focus for improvement in the past.


            defazio Gian-Carlo Defazio added a comment -

            Hi Andreas,

            Yes, this ticket is specific for inode/directory migration and uses the "lfs migrate -m" command. I was referring to "lfs migrate" as "lfs-migrate". The shell script for object/data migration has an underscore ("lfs_migrate"), but I see why that could be confusing. This issue came up when we were exploring ways to do a full file system migration to new hardware. It was to be part of a process that involves moving the data from the old to new hardware with "zfs send/receive", which was to be used because our tests showed that it's very fast. However, once the data is on the new hardware there's more to do, and one of those steps involves moving meta/object data around within the new hardware. We initially considered "lfs migrate" for this, but it seemed slow. The other utility we considered is "dsync", but that had an "xattr" issue, and I see that you've reviewed Olaf's patch for that.

            The goal is ultimately for both MDT and OST migrations. The purpose of these migrations is potentially as part of the plan I mentioned above, although I don't think we'll be using "lfs migrate" for the migrations we're doing in the near term, so really it's to see if "lfs migrate" is a viable option in the more distant future. It's also for the hypothetical cases of balancing and evacuating hardware, but I see you've said there are likely better ways to deal with (or prevent) those problems.

            As for how this is being called: trees are being made specifically for the test, and we are intentionally migrating the whole tree, and not expecting to just migrate the files at depth=1 as proposed in "DNE3: directory migration in non-recursive mode". The individual "lfs migrate" calls are on non-overlapping trees. As for your comment "inadvertently doing multiple migrations and hurting performance", we are intentionally doing multiple migrations in the hopes that it will help performance, so it seems we might be confused about what helps vs hurts performance.

            One of the major questions I have about the whole process is how the data moves. Does it use the client nodes as intermediaries, or is the migration mostly happening just between the MDSs? My attempts to increase parallelism have been to use more clients with more processes per client.


            adilger Andreas Dilger added a comment -

            Hi Olaf, Gian-Carlo,
            just to clarify the topic of this ticket, this issue is strictly related to inode/directory migration between MDTs, and not OST object/data migration? The main source of confusion is that "lfs-migrate" is a shell script that is used only for OST object/data migration (using "lfs migrate" internally, or "rsync" when wanting both inode and data migration), while "lfs migrate -m" is the command that drives MDT inode migration.

            Secondly, what is the goal of the MDT migration? Is that for manual MDT space balancing, or is it for replacement of the underlying MDT storage hardware, or some other reason? Definitely, the series of MDT space balancing changes in LU-11213, LU-13440, LU-14792, LU-15216, etc. have significantly reduced the need for manual MDT space management. For MDT storage replacement, IMHO it is likely more efficient to do this at the storage level (e.g. LVM migrate or ZFS resilvering) than at the MDT level, and AFAIK LLNL has done that in the past to migrate MDTs from HDDs to SSDs.

            That isn't to say we shouldn't be looking at improving the migration performance itself, but understanding what the goals are would help shape where optimizations should be done, and also what parameters should be measured during the testing. I also have the feeling that a significant part of the performance limitation that you are seeing may relate to ZFS transaction commit performance, because the migrate process is very transaction intensive in order to ensure it is atomic and recoverable in the face of an MDS crash.

            Assuming we are discussing "lfs migrate -m" performance here, then it is also important to determine how this is being called. Currently, it is only possible to do recursive (whole-tree) directory migration, and this is handled internally on the MDS, so it may be that trying to migrate a directory tree is inadvertently doing multiple migrations and hurting performance? Before we go extensively into testing directory migration performance, we should also look at LU-14975 "DNE3: directory migration in non-recursive mode" to see whether this allows more parallelism during migration.

            pjones Peter Jones added a comment -

            Andreas

            Could you please advise?

            Thanks

            Peter


            ofaaland Olaf Faaland added a comment -

            Peter & Co., we would like your feedback on this test plan. Once we arrive at a test plan you agree with, Gian will perform actual tests, compile the rates, and create a bug type issue to find and fix the bottlenecks. He can help work on the investigation and fixes, but he doesn't have the knowledge to be the main person working the issue.


            People

              Assignee: adilger Andreas Dilger
              Reporter: defazio Gian-Carlo Defazio
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated: