<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:16:23 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15211] lfs migrate metadata performance test plan</title>
                <link>https://jira.whamcloud.com/browse/LU-15211</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;h1&gt;&lt;a name=&quot;lfsmigrateMetadataPerformanceTesting&quot;&gt;&lt;/a&gt;lfs-migrate Metadata Performance Testing&lt;/h1&gt;

&lt;p&gt;While trying to use &lt;tt&gt;lfs-migrate&lt;/tt&gt; for metadata migration, we found that &lt;tt&gt;lfs-migrate&lt;/tt&gt; performance does not scale well with additional processes. Even when using many processes and nodes, sustained performance was around 400 items/second, which is too slow to be practical for migrating large numbers of files and directories.&lt;/p&gt;

&lt;p&gt;This test plan describes additional tests to determine whether the above results are in fact at, or near, the limit of &lt;tt&gt;lfs-migrate&lt;/tt&gt;&apos;s performance.&lt;/p&gt;
&lt;h2&gt;&lt;a name=&quot;Overview&quot;&gt;&lt;/a&gt;Overview&lt;/h2&gt;

&lt;p&gt;The performance to be measured is the rate at which items (files and directories) can be migrated. These items will be in a tree (or trees) and migrated by many processes running &lt;tt&gt;lfs-migrate&lt;/tt&gt; in parallel.&lt;/p&gt;

&lt;p&gt;The 3 basic parts of the test are:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;create the trees&lt;/li&gt;
	&lt;li&gt;migrate the trees&lt;/li&gt;
	&lt;li&gt;analyze the data generated during the migration&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;&lt;a name=&quot;CreatetheTrees&quot;&gt;&lt;/a&gt;Create the Trees&lt;/h3&gt;

&lt;p&gt;A single tree can be created using &lt;tt&gt;mdtest&lt;/tt&gt;. &lt;tt&gt;mdtest&lt;/tt&gt; has the ability to make trees of files and directories, and can parameterize those trees in most of the ways necessary for this test.&lt;/p&gt;

&lt;p&gt;The major shortcoming of &lt;tt&gt;mdtest&lt;/tt&gt; is that it doesn&apos;t set the striping and directory striping of the trees it creates. This can be overcome by pre-creating directories, setting their striping and directory striping, and then having &lt;tt&gt;mdtest&lt;/tt&gt; create trees within these directories so that each tree inherits these settings from its parent directory.&lt;/p&gt;

&lt;p&gt;The commands used to create the trees need to be saved. This includes both the per-directory &lt;tt&gt;mdtest&lt;/tt&gt; command and the commands that make the directories and set their striping and directory striping. Also, &lt;tt&gt;mdtest&lt;/tt&gt; will be run with &lt;tt&gt;srun&lt;/tt&gt;, so the whole &lt;tt&gt;srun&lt;/tt&gt; command needs to be saved because the &lt;tt&gt;srun&lt;/tt&gt; parameters will affect the size and shape of the tree.&lt;/p&gt;
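&lt;p&gt;As a minimal sketch of this pre-create-then-populate step with command logging (the directory paths, stripe counts, and &lt;tt&gt;mdtest&lt;/tt&gt;/&lt;tt&gt;srun&lt;/tt&gt; flags below are illustrative placeholders, not values fixed by this plan):&lt;/p&gt;

```python
# Sketch only: paths, stripe counts, and mdtest/srun flags are hypothetical
# placeholders, not values taken from the test plan.
from pathlib import Path

def build_create_commands(base_dir, n_trees, mdt_stripe_count, items_per_tree):
    """Build the shell commands that pre-create striped parent directories
    and then populate each one with an mdtest tree."""
    commands = []
    for i in range(n_trees):
        tree = f"{base_dir}/tree{i:04d}"
        # Pre-create the parent with the desired directory striping so the
        # mdtest tree inherits it (mdtest itself cannot set striping).
        commands.append(f"lfs mkdir -c {mdt_stripe_count} {tree}")
        # The srun wrapper is saved too, since its parameters shape the tree.
        commands.append(f"srun -N 1 -n 1 mdtest -n {items_per_tree} -u -d {tree}")
    return commands

def save_run_log(commands, log_path):
    # Record every command verbatim, as the plan requires.
    Path(log_path).write_text("\n".join(commands) + "\n")
```

&lt;p&gt;Generating and logging the commands in one place keeps the saved record from drifting away from what was actually run.&lt;/p&gt;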
&lt;h3&gt;&lt;a name=&quot;MigratetheTrees&quot;&gt;&lt;/a&gt;Migrate the Trees&lt;/h3&gt;

&lt;p&gt;The migration is done in parallel by many processes, each running &lt;tt&gt;lfs-migrate&lt;/tt&gt; on one of the directories that contains a tree created by &lt;tt&gt;mdtest&lt;/tt&gt;. The many processes are created and spread across multiple client nodes using &lt;tt&gt;srun&lt;/tt&gt;.&lt;/p&gt;
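&lt;p&gt;A minimal sketch of the launch, assuming one migration command per tree fanned out with &lt;tt&gt;srun&lt;/tt&gt; (the tree naming, destination MDT index, and wrapper-script name are assumptions for illustration):&lt;/p&gt;

```python
# Sketch only: the tree layout, destination MDT index, and wrapper-script
# name are hypothetical; each srun task would run one of these commands.
def build_migrate_commands(base_dir, n_trees, dest_mdt):
    """One 'lfs migrate -m' invocation per pre-created tree directory."""
    return [
        f"lfs migrate -m {dest_mdt} {base_dir}/tree{i:04d}"
        for i in range(n_trees)
    ]

def build_srun_launcher(nodes, ppn, wrapper_script):
    # Each task selects its own tree inside the wrapper, e.g. by SLURM_PROCID.
    return f"srun -N {nodes} --ntasks-per-node={ppn} {wrapper_script}"
```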

&lt;p&gt;Data needs to be collected during the run. Process 0 will record run-wide data, such as total items migrated, and each process will write its own performance data. This will generate one file per process, plus one more for the run-wide data. Some of the collected data could be inferred from other data (or from the Slurm database), but recording it directly simplifies post-processing.&lt;/p&gt;
&lt;h4&gt;&lt;a name=&quot;DatatoCollectPerRun&quot;&gt;&lt;/a&gt;Data to Collect Per Run&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;total items migrated&lt;/li&gt;
	&lt;li&gt;total data migrated&lt;/li&gt;
	&lt;li&gt;the mdtest command and the striping/dirstriping commands&lt;/li&gt;
	&lt;li&gt;slurm jobid&lt;/li&gt;
	&lt;li&gt;the srun command that does the migration&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;&lt;a name=&quot;DatatoCollectPerProcess&quot;&gt;&lt;/a&gt;Data to Collect Per Process&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;start time (of lfs-migrate)&lt;/li&gt;
	&lt;li&gt;end time (of lfs-migrate)&lt;/li&gt;
	&lt;li&gt;source MDTs&lt;/li&gt;
	&lt;li&gt;destination MDTs&lt;/li&gt;
	&lt;li&gt;the lfs-migrate command&lt;/li&gt;
	&lt;li&gt;lfs getdirstripe output for the root of the tree the process will migrate&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;&lt;a name=&quot;PotentialParameterstoVarybetweenRuns&quot;&gt;&lt;/a&gt;Potential Parameters to Vary between Runs&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;total number of processes, nodes*ppn
	&lt;ul&gt;
		&lt;li&gt;the number of processes per node (2, 8, 16)&lt;/li&gt;
		&lt;li&gt;the number of nodes (1, 8, 32)&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;the kind of items that are migrated (files, directories)&lt;/li&gt;
	&lt;li&gt;how many items per process are migrated (1K, 8K, 64K, configured with the mdtest command)&lt;/li&gt;
	&lt;li&gt;file size = 0, fixed&lt;/li&gt;
	&lt;li&gt;DoM or not DoM&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;&lt;a name=&quot;InitialRunsPlanned&quot;&gt;&lt;/a&gt;Initial Runs Planned&lt;/h4&gt;

&lt;p&gt;Note that the above is still probably a larger parameter space than is necessary to find first-order bottlenecks (3*3*3*3*1*2 == 162 tests). To reduce the number of tests, and the expected total run time, only the following tests will be run initially.&#160; More complete testing of the parameter space will be performed as needed after developers are engaged.&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Find the values of nodes and ppn that maximize overall lfs-migrate rate for files only, 8K per process, without DoM (9 tests)&lt;/li&gt;
	&lt;li&gt;Using those values for nodes and ppn, test each of the items-per-process values above.&#160; Record the value of items/process (ipp) that maximizes the overall lfs-migrate rate for files only, without DoM (3 tests)&lt;/li&gt;
	&lt;li&gt;Using those values for nodes, ppn, and ipp, test with files with DoM and files without DoM (2 tests)&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;&lt;a name=&quot;DataAnalysis&quot;&gt;&lt;/a&gt;Data Analysis&lt;/h3&gt;

&lt;p&gt;The data recorded for each run will all go into a single directory, along with the tree(s) creation data. A script will read the metadata and per-process performance data and calculate the rate at which items were migrated. The important input parameters and corresponding results for all runs will be output as a CSV.&lt;/p&gt;
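&lt;p&gt;The core of that script is a single aggregate rate: total items migrated divided by the wall-clock span from the earliest process start to the latest process end. A minimal sketch, assuming each per-process record carries a start time, an end time, and an item count (the field names are placeholders, not the actual log format):&lt;/p&gt;

```python
# Sketch only: record field names ("start", "end", "items") are assumed,
# not taken from the actual per-process log format.
import csv

def migration_rate(process_records):
    """Aggregate migration rate in items/second over the whole run."""
    total_items = sum(r["items"] for r in process_records)
    earliest_start = min(r["start"] for r in process_records)
    latest_end = max(r["end"] for r in process_records)
    return total_items / (latest_end - earliest_start)

def write_summary_csv(rows, path):
    # One row per run: input parameters plus the computed rate.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

&lt;p&gt;Using the run-wide wall-clock span (rather than summing per-process durations) is what makes the result an overall rate comparable across node/ppn combinations.&lt;/p&gt;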
&lt;h3&gt;&lt;a name=&quot;PerformanceComparison&quot;&gt;&lt;/a&gt;Performance Comparison&lt;/h3&gt;

&lt;p&gt;For comparison, other performance metrics with the same file system and clients will be gathered:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;mdtest will be run with the same node and ppn combinations and enough objects per process to make each mdtest stage (e.g. create, unlink, etc.) take at least 10 minutes.&lt;/li&gt;
&lt;/ul&gt;
</description>
                <environment>client:&lt;br/&gt;
toss 3.7-14.1&lt;br/&gt;
3.10.0-1160.45.1.1chaos.ch6.x86_64&lt;br/&gt;
lustre 2.12.7_2.llnl&lt;br/&gt;
&lt;br/&gt;
server:&lt;br/&gt;
toss 4.1-5&lt;br/&gt;
4.18.0-240.22.1.1toss.t4.x86_64&lt;br/&gt;
zfs 2.0.52_2llnl-1&lt;br/&gt;
lustre 2.14.0_5.llnl&lt;br/&gt;
&lt;br/&gt;
</environment>
        <key id="67141">LU-15211</key>
            <summary>lfs migrate metadata performance test plan</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="adilger">Andreas Dilger</assignee>
                                    <reporter username="defazio">Gian-Carlo Defazio</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Thu, 11 Nov 2021 23:00:41 +0000</created>
                <updated>Tue, 29 Aug 2023 13:36:45 +0000</updated>
                                            <version>Lustre 2.14.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="318029" author="ofaaland" created="Thu, 11 Nov 2021 23:09:14 +0000"  >&lt;p&gt;Peter &amp;amp; Co., we would like your feedback on this test plan.  Once we arrive at a test plan you agree with, Gian will perform actual tests, compile the rates, and create a bug type issue to find and fix the bottlenecks.  He can help work on the investigation and fixes, but he doesn&apos;t have the knowledge to be the main person working the issue.&lt;/p&gt;</comment>
                            <comment id="318137" author="pjones" created="Fri, 12 Nov 2021 19:15:53 +0000"  >&lt;p&gt;Andreas&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="318145" author="adilger" created="Fri, 12 Nov 2021 20:25:35 +0000"  >&lt;p&gt;Hi Olaf, Gian-Carlo,&lt;br/&gt;
just to clarify the topic of this ticket, this issue is strictly related to inode/directory migration between MDTs, and &lt;b&gt;not&lt;/b&gt; OST object/data migration?  The main source of confusion is that &quot;&lt;tt&gt;lfs-migrate&lt;/tt&gt;&quot; is a shell script that is used &lt;b&gt;only&lt;/b&gt; for OST object/data migration (using &quot;&lt;tt&gt;lfs migrate&lt;/tt&gt;&quot; internally, or &quot;&lt;tt&gt;rsync&lt;/tt&gt;&quot; when wanting &lt;b&gt;both&lt;/b&gt; inode and data migration), while &quot;&lt;tt&gt;lfs migrate -m&lt;/tt&gt;&quot; is the command that drives MDT inode migration.&lt;/p&gt;

&lt;p&gt;Secondly, what is the goal of the MDT migration?  Is that for manual MDT space balancing, or is it for replacement of the underlying MDT storage hardware, or some other reason?  Definitely, the series of MDT space balancing changes in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11213&quot; title=&quot;DNE3: remote mkdir() in ROOT/ by default&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11213&quot;&gt;&lt;del&gt;LU-11213&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13440&quot; title=&quot;DNE3: limit directory default layout inheritance&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13440&quot;&gt;&lt;del&gt;LU-13440&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14792&quot; title=&quot;DNE3: enable filesystem-wide default LMV&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14792&quot;&gt;&lt;del&gt;LU-14792&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15216&quot; title=&quot;improve MDT QOS space balance&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15216&quot;&gt;&lt;del&gt;LU-15216&lt;/del&gt;&lt;/a&gt;, etc. have significantly reduced the need for manual MDT space management.  For MDT storage replacement, IMHO it is likely more efficient to do this at the storage level (e.g. LVM migrate or ZFS resilvering) than at the MDT level, and AFAIK LLNL has done that in the past to migrate MDTs from HDDs to SSDs.&lt;/p&gt;

&lt;p&gt;That isn&apos;t to say we shouldn&apos;t be looking at improving the migration performance itself, but understanding what the goals are would help shape where optimizations should be done, and also what parameters should be measured during the testing.  I also have the feeling that a significant part of the performance limitation that you are seeing may relate to ZFS transaction commit performance, because the migrate process is very transaction intensive in order to ensure it is atomic and recoverable in the face of an MDS crash.&lt;/p&gt;

&lt;p&gt;Assuming we are discussing &quot;&lt;tt&gt;lfs migrate -m&lt;/tt&gt;&quot; performance here, then it is also important to determine how this is being called.  Currently, it is &lt;b&gt;only&lt;/b&gt; possible to do recursive (whole-tree) directory migration, and this is handled internally on the MDS, so it may be that trying to migrate a directory tree is inadvertently doing multiple migrations and hurting performance?  Before we go extensively into testing directory migration performance, we should also look at &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14975&quot; title=&quot;DNE3: directory migration in non-recursive mode&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14975&quot;&gt;&lt;del&gt;LU-14975&lt;/del&gt;&lt;/a&gt; &quot;&lt;tt&gt;DNE3: directory migration in non-recursive mode&lt;/tt&gt;&quot; to see whether this allows more parallelism during migration.&lt;/p&gt;</comment>
                            <comment id="318154" author="defazio" created="Fri, 12 Nov 2021 22:02:09 +0000"  >&lt;p&gt;Hi Andreas,&lt;/p&gt;

&lt;p&gt;Yes, this ticket is specific for inode/directory migration and uses the &quot;lfs migrate -m&quot; command. I was referring to &quot;lfs migrate&quot; as &quot;lfs-migrate&quot;. The shell script for object/data migration has an underscore (&quot;lfs_migrate&quot;), but I see why that could be confusing. This issue came up when we were exploring ways to do a full file system migration to new hardware. It was to be part of a process that involves moving the data from the old to new hardware with &quot;zfs send/receive&quot;, which was to be used because our tests showed that it&apos;s very fast. However, once the data is on the new hardware there&apos;s more to do, and one of those steps involves moving meta/object data around within the new hardware. We initially considered &quot;lfs migrate&quot; for this, but it seemed slow. The other utility we considered is &quot;dsync&quot;, but that had an &quot;xattr&quot; issue, and I see that you&apos;ve reviewed Olaf&apos;s patch for that.&lt;/p&gt;

&lt;p&gt;The goal is ultimately for both MDT and OST migrations. The purpose of these migrations is potentially as part of the plan I mentioned above, although I don&apos;t think we&apos;ll be using &quot;lfs migrate&quot; for the migrations we&apos;re doing in the near term, so really it&apos;s to see if &quot;lfs migrate&quot; is a viable option in the more distant future. It&apos;s also for the hypothetical cases of balancing and evacuating hardware, but I see you&apos;ve said there are likely better ways to deal with (or prevent) those problems.&lt;/p&gt;

&lt;p&gt;As for how this is being called: trees are being made specifically for the test, and we are intentionally migrating the whole tree, and not expecting to just migrate the files at depth=1 as proposed in &quot;&lt;tt&gt;DNE3: directory migration in non-recursive mode&lt;/tt&gt;&quot;. The individual &quot;lfs migrate&quot; calls are on non-overlapping trees. As for your comment &quot;inadvertently doing multiple migrations and hurting performance&quot;, we are intentionally doing multiple migrations in the hopes that it will help performance, so it seems we might be confused about what helps vs hurts performance.&lt;/p&gt;

&lt;p&gt;One of the major questions I have about the whole process is how the data moves. Does it use the client nodes as intermediaries, or is the migration mostly happening just between the MDSs? My attempts to increase parallelism have been to use more clients with more processes per client.&lt;/p&gt;</comment>
                            <comment id="318161" author="adilger" created="Fri, 12 Nov 2021 23:04:09 +0000"  >&lt;p&gt;For directory/inode migration, this is mostly done on the MDS, and is only triggered by the client, because the whole operation has to be handled within a filesystem transaction on the MDT, so having the client involved would not improve things. I &lt;em&gt;think&lt;/em&gt; that using 1-level migrations &lt;em&gt;may&lt;/em&gt; improve parallelism, but the MDS may also throttle the amount of work that is being done to avoid consuming a large number of MDS service threads, since this can take a long time.&lt;/p&gt;

&lt;p&gt;There are almost certainly improvements to be had in this operation, since it has not been a focus for improvement in the past. &lt;/p&gt;</comment>
                            <comment id="340092" author="ofaaland" created="Mon, 11 Jul 2022 18:59:26 +0000"  >&lt;p&gt;Improved lfs migrate performance would be very useful for us, but we have worked out migration methods that are performant enough based on dsync(1) from mpifileutils.  Removing topllnl.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="65865">LU-14975</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Performance</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i029pz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>