<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:27:51 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9627] Bad small-file behaviour even when local-only and on RAM-FS</title>
                <link>https://jira.whamcloud.com/browse/LU-9627</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi everyone, I have noticed a curiously bad small-file creation behaviour on Lustre 2.9.55.&lt;/p&gt;

&lt;p&gt;I know that Lustre is inefficient when handling large numbers of small files and benefits from running the Metadata Servers on SSDs &#8211;&#160;but while exploring just how bad this is, I found something curious.&lt;/p&gt;

&lt;p&gt;My use case is simple: create 50,000 40-byte files in a single directory. The &quot;test.py&quot; script below will do just that.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
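&lt;p&gt;For reference, here is a minimal sketch of what &quot;test.py&quot; does &#8211; the attached script is authoritative; the directory name, file-name pattern and exact log format below are a reconstruction, not the original:&lt;/p&gt;

```python
#!/usr/bin/env python3
# Minimal sketch of the attached test.py (reconstruction; directory name,
# file-name pattern and log format are assumptions, not the original script).
import os
import sys
import time
import logging

logging.basicConfig(format="%(asctime)s [%(levelname)-5s] %(message)s",
                    level=logging.INFO)

def run(base, count=50000, payload=b"x" * 40):
    d = os.path.join(base, "smallfile-test")
    os.makedirs(d, exist_ok=True)
    names = [os.path.join(d, "f%06d" % i) for i in range(count)]

    logging.info("Creating %dk files in one directory...", count // 1000)
    t0 = time.time()
    for name in names:
        with open(name, "wb") as f:
            f.write(payload)

    logging.info("Reading %dk files...", count // 1000)
    t1 = time.time()
    for name in names:
        with open(name, "rb") as f:
            f.read()

    logging.info("Deleting %dk files...", count // 1000)
    t2 = time.time()
    for name in names:
        os.unlink(name)
    t3 = time.time()
    os.rmdir(d)

    logging.info("Creation took: %.2f seconds", t1 - t0)
    logging.info("Reading took: %.2f seconds", t2 - t1)
    logging.info("Deleting took: %.2f seconds", t3 - t2)

if __name__ == "__main__":
    run(sys.argv[1])
```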

&lt;p&gt;Since I wanted to find the theoretical speed of Lustre, I used the following setup:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;A single server played the role of MGS, MDT, OST and Client.&lt;/li&gt;
	&lt;li&gt;All data storage happens via ldiskfs on a ramdisk
	&lt;ul&gt;
		&lt;li&gt;16GB Metadata&lt;/li&gt;
		&lt;li&gt;48GB Object Data&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;All network accesses happen via TCP loopback&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The final Lustre FS looks like this:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[-bash-4.3]$ lfs df -h
UUID                  bytes   Used  Available  Use%  Mounted on
ram-MDT0000_UUID       8.9G  46.1M       8.0G    1%  /mnt/ram/client[MDT:0]
ram-OST0000_UUID      46.9G  53.0M      44.4G    0%  /mnt/ram/client[OST:0]
filesystem_summary:   46.9G  53.0M      44.4G    0%  /mnt/ram/client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Unfortunately, when running the test script (which takes ~5 seconds on a local disk), I instead get these abysmal speeds:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[-bash-4.3]$ ./test.py /mnt/ram/client
2017-06-09 18:49:56,518 [INFO ] Creating 50k files in one directory...
2017-06-09 18:50:50,437 [INFO ] Reading 50k files...
2017-06-09 18:51:09,310 [INFO ] Deleting 50k files...
2017-06-09 18:51:20,604 [INFO ] Creation took: 53.92 seconds
2017-06-09 18:51:20,604 [INFO ] Reading took: 18.87 seconds
2017-06-09 18:51:20,604 [INFO ] Deleting took: 11.29 seconds

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This tells me that there is a rather fundamental performance issue within Lustre &#8211; and that it has nothing to do with disk or network latency.&lt;/p&gt;

&lt;p&gt;Either that, or my test script is broken &#8211; but I do not think it is.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;If you&apos;re curious, here&apos;s how I set up the test scenario:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;mkdir -p /mnt/ram/disk
mount -t tmpfs -o size=64G tmpfs /mnt/ram/disk
dd if=/dev/zero of=/mnt/ram/disk/mdt.img bs=1M count=16K
dd if=/dev/zero of=/mnt/ram/disk/odt.img bs=1M count=48K
losetup /dev/loop0 /mnt/ram/disk/mdt.img
losetup /dev/loop1 /mnt/ram/disk/odt.img
mkfs.lustre --mgs --mdt --fsname=ram --backfstype=ldiskfs --index=0 /dev/loop0
mkfs.lustre --ost --fsname=ram --backfstype=ldiskfs --index=0 --mgsnode=127.0.0.1@tcp0 /dev/loop1

mkdir -p /mnt/ram/mdt
mount -t lustre -o defaults,noatime /dev/loop0 /mnt/ram/mdt
mkdir -p /mnt/ram/ost
mount -t lustre -o defaults,noatime /dev/loop1 /mnt/ram/ost

mkdir -p /mnt/ram/client
mount -t lustre 127.0.0.1@tcp0:/ram /mnt/ram/client
chmod 1777 /mnt/ram/client

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment></environment>
        <key id="46605">LU-9627</key>
            <summary>Bad small-file behaviour even when local-only and on RAM-FS</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="mhschroe">Martin Schr&#246;der</reporter>
                        <labels>
                    </labels>
                <created>Fri, 9 Jun 2017 17:13:28 +0000</created>
                <updated>Fri, 21 Jan 2022 01:02:25 +0000</updated>
                                            <version>Lustre 2.9.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="198870" author="adilger" created="Mon, 12 Jun 2017 08:09:50 +0000"  >&lt;p&gt;We are working on a feature for 2.11 to improve small file performance - Data-on-MDT in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3825&quot; title=&quot;mdt_hsm_release() clobbers ma_valid&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3825&quot;&gt;&lt;del&gt;LU-3825&lt;/del&gt;&lt;/a&gt;. If you are interested to test this new feature (still under development), the last patch in the series is &lt;a href=&quot;https://review.whamcloud.com/#/c/23010/24&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/23010/24&lt;/a&gt; currently. &lt;/p&gt;</comment>
                            <comment id="198872" author="mhschroe" created="Mon, 12 Jun 2017 08:31:59 +0000"  >&lt;p&gt;Hi Andreas.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Yes, I am aware of that planned feature. The thing is: I do not believe it will actually improve the situation I have created here.&lt;/p&gt;

&lt;p&gt;In my test, all network traffic is local loopback only &#8211; so the round-trip time for any network packet sent is in the microseconds.&lt;/p&gt;

&lt;p&gt;Additionally, all data is kept in memory, so all accesses should happen with a latency of nanoseconds (and a data rate of GB/s &#8211; not that that matters with 40-byte files).&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;So I&apos;d expect this test to run in no time at all. I did a test on the raw ramdisk, and the test script passes in a bit over 2 seconds:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[-bash-4.3]$ mount | grep ram
tmpfs on /mnt/ram/disk type tmpfs (rw,size=1G)

[-bash-4.3]$ ./test.py /mnt/ram/disk/
2017-06-12 10:25:12,260 [INFO ] Creating 50k files in one directory...
2017-06-12 10:25:13,489 [INFO ] Reading 50k files...
2017-06-12 10:25:14,349 [INFO ] Deleting 50k files...
2017-06-12 10:25:14,678 [INFO ] Creation took: 1.23 seconds
2017-06-12 10:25:14,678 [INFO ] Reading took: 0.86 seconds
2017-06-12 10:25:14,678 [INFO ] Deleting took: 0.33 seconds

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As far as I can tell, all that the Data-on-MDT feature does is remove exactly one network connection to the OST per file creation. I fail to see how this could improve the time by more than a factor of 2 (because 2 connections get turned into 1).&lt;/p&gt;

&lt;p&gt;So I&apos;d expect the timing to fall from 85 seconds to ~40 seconds &#8211; which would still be 20x slower than raw access.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;But well, just for completeness&apos; sake, I&apos;ll give it a try today and post the results.&lt;/p&gt;</comment>
                            <comment id="198904" author="mhschroe" created="Mon, 12 Jun 2017 14:38:08 +0000"  >&lt;p&gt;Hi everyone.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I have now built and deployed the &quot;Data-on-MDT&quot; feature, and &#8211; as expected &#8211; it indeed improves the timing by about 50%.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[-bash-4.3]$ ./test.py /mnt/ram/client
[...]
2017-06-12 16:25:22,025 [INFO ] Creation took: 31.36 seconds
2017-06-12 16:25:22,025 [INFO ] Reading took: 12.36 seconds
2017-06-12 16:25:22,025 [INFO ] Deleting took: 8.38 seconds

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;While this is good news, it still means that something in the code is producing a slow-down of a factor of 20.&lt;/p&gt;

&lt;p&gt;As mentioned before, that is weird since the two main suspects &#8211; disk speed (6GByte/s) and network latency (0.01ms) &#8211; have been removed as much as possible.&lt;/p&gt;

&lt;p&gt;If we assume that the network RTT is the main slow-down compared to direct disk access, that would only account for 500&#160;ms (50k &#215; 0.01&#160;ms) of delay. So even with a factor of 10, I&apos;d only expect ~5 seconds of delay &#8211; but instead we see 30 seconds.&lt;/p&gt;
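&lt;p&gt;Spelled out, the arithmetic behind that estimate:&lt;/p&gt;

```python
# Back-of-the-envelope check of the latency argument above
# (numbers taken from this thread: 50k creates, ~0.01 ms loopback RTT,
# 31.36 s measured Data-on-MDT creation time).
n_files = 50000
rtt_s = 0.01 / 1000.0          # 0.01 ms round trip, in seconds
network_s = n_files * rtt_s    # total pure-network latency: ~0.5 s
generous_s = 10 * network_s    # ~5 s even with a 10x safety factor
observed_s = 31.36             # measured creation time from the run above
print(network_s, generous_s, observed_s - generous_s)
```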

&lt;p&gt;Curious.&lt;/p&gt;</comment>
                            <comment id="199513" author="adilger" created="Fri, 16 Jun 2017 20:49:31 +0000"  >&lt;p&gt;Martin, thank you for your continued investigation of this issue.  One note is that &lt;tt&gt;tmpfs&lt;/tt&gt; provides the best conceivable performance for such a workload, since there is virtually no overhead in this filesystem.  A more useful comparison would be formatting a ram-backed ldiskfs filesystem to see how its performance compares to the &lt;tt&gt;tmpfs&lt;/tt&gt; filesystem.  That would expose how much of the overhead is in ldiskfs (locking, write amplification from 40-&amp;gt;4096 byte blocks, journaling, etc), compared to how much is in the client+ptlrpc+MDS.  &lt;/p&gt;

&lt;p&gt;With ldiskfs there is a relatively new option called &quot;&lt;tt&gt;inline_data&lt;/tt&gt;&quot; that allows storing the data of extremely small files directly in the inode.  While Lustre doesn&apos;t directly support this feature today, it may be useful for real-world usage with DoM to minimize space usage on the MDT as well as avoiding the extra IOPS/write amplification caused by using a full filesystem block for small files.  In Lustre 2.10 the default inode size has increased to 1024 bytes (from 512 bytes previously), which may also be a contributing factor in this benchmark, but will allow files up to ~768 bytes to be stored directly in the inode.&lt;/p&gt;</comment>
                            <comment id="199581" author="mhschroe" created="Mon, 19 Jun 2017 08:44:11 +0000"  >&lt;p&gt;Hi Andreas.&lt;/p&gt;

&lt;p&gt;Thanks for the reply.&lt;/p&gt;

&lt;p&gt;Please note that I am indeed using &lt;em&gt;ldiskfs&lt;/em&gt; already. The flow is:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;I create a &lt;em&gt;tmpfs&lt;/em&gt; file system and mount it under &quot;/mnt/ram/disk&quot;&lt;/li&gt;
	&lt;li&gt;I create two zero-filled files under that path: &lt;b&gt;mdt.img&lt;/b&gt; and &lt;b&gt;odt.img&lt;/b&gt;&lt;/li&gt;
	&lt;li&gt;These two files are &lt;em&gt;loop&lt;/em&gt;-mounted into /dev/loop[0,1]&lt;/li&gt;
	&lt;li&gt;Each loop-mount is then formatted with &lt;em&gt;ldiskfs&lt;/em&gt; and used by Lustre as either metadata or data storage target.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So the effect is that each I/O operation goes like this:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;ldiskfs --&amp;gt; loopmount --&amp;gt; tmpfs --&amp;gt; RAM&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Since the overhead of the loop mount and tmpfs is virtually negligible &#8211; and the machine has 196 GB of RAM, so it does no swapping &#8211; the only bottleneck can be ldiskfs or Lustre.&lt;/p&gt;

&lt;p&gt;Just for comparison&apos;s sake, I have created the same tmpfs/loop-device setup, but formatted it with a plain EXT4 file system &#8211; using the same settings as Lustre does.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-none&quot;&gt;[bash-4.3]# mount -t tmpfs -o size=64G tmpfs /mnt/ram/disk
[bash-4.3]# dd if=/dev/zero of=/mnt/ram/disk/odt.img bs=1M count=48K
49152+0 records in
49152+0 records out
51539607552 bytes (52 GB) copied, 20.0891 s, 2.6 GB/s

[bash-4.3]# losetup /dev/loop0 /mnt/ram/disk/odt.img
[bash-4.3]# mke2fs -j -b 4096 -L ram:OST0000 -J size=400 -I 256 -i 69905 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize=&quot;4290772992&quot;,lazy_journal_init -F /dev/loop0

[bash-4.3]# mount -t ext4 -o rw,noatime /dev/loop0 /mnt/ram/ost
[bash-4.3]# df -h /mnt/ram/ost
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/loop0       48G   52M    45G    1%  /mnt/ram/ost
[bash-4.3]# chmod 1777 /mnt/ram/ost
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
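&lt;p&gt;(As an aside: the &lt;tt&gt;inline_data&lt;/tt&gt; option Andreas mentioned could be tried on such a ram-backed image roughly as follows &#8211; hypothetical and untested here; the device path, label and inode size are illustrative.)&lt;/p&gt;

```shell
# Hypothetical: format a ram-backed image with ext4 inline_data so that
# very small files are stored directly in the (enlarged) inode.
# Requires e2fsprogs 1.43+ and a kernel with ext4 inline_data support.
# Device path, label and inode size are illustrative, not from this ticket.
mkfs.ext4 -I 1024 -O inline_data -L ram-inline /dev/loop2
dumpe2fs -h /dev/loop2 | grep -i inline   # confirm the feature is enabled
```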


&lt;p&gt;Then, I ran the performance test again:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-none&quot;&gt;[bash-4.3]$ ./test.py /mnt/ram/ost
[...]
2017-06-19 10:37:52,651 [INFO ] Creation took: 2.11 seconds
2017-06-19 10:37:52,651 [INFO ] Reading took: 0.86 seconds
2017-06-19 10:37:52,651 [INFO ] Deleting took: 0.80 seconds
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, EXT4 adds about 1 second to the file creation time, compared to &quot;raw&quot; tmpfs (2.11 sec vs. 1.23 sec).&lt;br/&gt;
 Therefore, the write amplification from 40 bytes to 4096 bytes and other EXT4 overheads are present, but negligible.&lt;br/&gt;
&#160;&lt;br/&gt;
 The drastic slow-down has to come from something inside Lustre &#8211; some kind of internal latency that gets added to every single read and write. It&#160;&lt;em&gt;&lt;b&gt;could&lt;/b&gt;&lt;/em&gt; be the LNET network layer, but since the packets never leave the machine, I cannot imagine that this alone leads to a 10-20x slowdown.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="49050">LU-10176</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="26471">LU-5603</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="50617">LU-10619</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="45743">LU-9409</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="26951" name="test.py" size="1233" author="mhschroe" created="Fri, 9 Jun 2017 17:01:32 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzenj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>