<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:26:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16355] batch dirty buffered write of small files</title>
                <link>https://jira.whamcloud.com/browse/LU-16355</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In buffered I/O mode, data can be cached on the client side until a flush is needed.&lt;br/&gt;
Small I/O is not well supported in current Lustre.&lt;br/&gt;
To improve small I/O performance, Lustre already implements a short I/O feature in which the data is transferred using the inline buffer of an I/O RPC request.  However, the performance improvement is limited.&lt;/p&gt;

&lt;p&gt;Now that batched RPCs have been introduced, the OSC layer can batch the dirty pages of many small files into one large RPC and transfer the I/O in bulk I/O mode.&lt;br/&gt;
The maximum amount of dirty pages allowed per OSC can reach 2G. Thus a client can cache a large amount of dirty data from OSC objects before hitting the max dirty limit or exhausting the space grants, at which point the data must be written out.&lt;br/&gt;
The OSC layer can scan the dirty objects, batch the dirty pages of these objects, and send the I/O requests in a batched way.&lt;/p&gt;

&lt;p&gt;This feature is expected to benefit workloads that write many small files and call sync() at the end of the writes (i.e. mdtest-hard-write).&lt;/p&gt;

&lt;p&gt;There are two design choices here:&lt;br/&gt;
1. Use the existing short I/O mechanism to store the data in the batched RPC.&lt;br/&gt;
The advantages are:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;It can store very small file data (i.e. less than 1024 bytes) much more efficiently and does not need a whole page to hold the data for small files.&lt;/li&gt;
	&lt;li&gt;It integrates better with the batched RPC.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The disadvantage is that the data movement is not zero-copy. The dirty pages need to be copied into the inline buffer of the RPC on the client side, and the inline data still needs to be copied into the prepared page buffer to do I/O to the backend filesystem on the server side.&lt;br/&gt;
2. Use the RDMA mechanism: bind the dirty page IOV from multiple objects directly to the bulk I/O on the client side, and transfer the data into the prepared page IOV on the server side.&lt;br/&gt;
The advantage of this mechanism is that all data movement is zero-copy from client to server.&lt;br/&gt;
The disadvantages are:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;The implementation may be complex. The bulk IOV contains I/O pages from multiple objects, which may change the I/O logic on the server side considerably.&lt;/li&gt;
	&lt;li&gt;The minimum I/O per object is the page size, which is not very efficient for small objects holding just a few bytes.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Any suggestions and comments are welcome!&lt;/p&gt;
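
The batching idea described above can be sketched as follows. This is a minimal illustration only, not the actual Lustre OSC code; the 4 MiB batch cap and all names are hypothetical:

```python
# Hypothetical simplification of the batching idea: gather per-object
# dirty byte counts and pack them into large batched RPCs, starting a
# new batch whenever the next object would overflow the batch cap.

RPC_CAP = 4 * 1024 * 1024  # assumed 4 MiB batch size, for illustration only

def batch_dirty_objects(dirty_objects):
    """dirty_objects: list of (object_id, dirty_bytes) pairs.
    Returns a list of batches; each batch is a list of pairs whose
    total dirty bytes stays at or under RPC_CAP."""
    batches, current, used = [], [], 0
    for obj_id, nbytes in dirty_objects:
        # start a new batch when this object would overflow the cap
        if current and used + nbytes > RPC_CAP:
            batches.append(current)
            current, used = [], 0
        current.append((obj_id, nbytes))
        used += nbytes
    if current:
        batches.append(current)
    return batches
```

With many 4 KiB files this packs hundreds of objects into one RPC, while a few multi-megabyte objects fall back to roughly one RPC each.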

</description>
                <environment></environment>
        <key id="73446">LU-16355</key>
            <summary>batch dirty buffered write of small files</summary>
                <type id="2" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11311&amp;avatarType=issuetype">New Feature</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="qian_wc">Qian Yingjin</assignee>
                                    <reporter username="qian_wc">Qian Yingjin</reporter>
                        <labels>
                    </labels>
                <created>Wed, 30 Nov 2022 09:11:39 +0000</created>
                <updated>Thu, 31 Aug 2023 13:41:10 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="354660" author="adilger" created="Wed, 30 Nov 2022 13:01:50 +0000"  >&lt;blockquote&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;the min I/O per object is page size, not much efficient for small objects just with several bytes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;If we agreed that zero-copy IO was not needed/possible for very small writes (smaller than 4KB), then it would be possible to pack the dirty data from multiple files into a single RDMA transfer, and then copy the data out of the pages on the server into the server side inode pages again.  Even with the extra &lt;tt&gt;memcpy()&lt;/tt&gt; it would likely still be faster than sending separate RPCs for each file.  This also would fit very well with WBC since it could create DoM layouts directly for small files, and skip the DoM component for larger files that will store data on OSTs.&lt;/p&gt;
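
The pack-and-copy scheme suggested here can be sketched as follows. This is a hypothetical illustration, not Lustre code; the function names and the dict-based index are invented for clarity:

```python
# Hypothetical sketch of the pack-and-copy idea: small writes from several
# files are packed into one contiguous buffer for a single bulk transfer,
# then copied back out per file on the server (the extra memcpy step).

def pack_small_writes(writes):
    """writes: dict mapping file_id to its payload bytes (each small).
    Returns (buffer, index) where index maps file_id to (offset, length)."""
    buf, index, off = bytearray(), {}, 0
    for fid, data in writes.items():
        index[fid] = (off, len(data))
        buf.extend(data)
        off += len(data)
    return bytes(buf), index

def unpack_small_writes(buf, index):
    """Server side: copy each file's bytes back out of the bulk buffer."""
    return {fid: buf[off:off + length] for fid, (off, length) in index.items()}
```

The point of the sketch is that one transfer carries all the payloads, at the cost of one extra copy on each side, which the comment argues is still cheaper than one RPC per file.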

&lt;p&gt;It isn&apos;t very clear if packing the &lt;b&gt;pages&lt;/b&gt; would help with &lt;tt&gt;mdtest-hard-write&lt;/tt&gt; since those files are 3901 bytes, so only 195 bytes smaller than 4096-byte pages (4%).  However, packing multiple &lt;b&gt;objects&lt;/b&gt; into a single RPC should hopefully improve performance.&lt;/p&gt;

&lt;p&gt;If it is helpful, there was at one time support in the &lt;tt&gt;OBD_BRW_WRITE&lt;/tt&gt; RPC for handling multiple objects, since there could be an array of &lt;tt&gt;struct obd_ioobj&lt;/tt&gt; in the request, but I think much of this support was removed, because there could only be a single &lt;tt&gt;struct obdo&lt;/tt&gt; with file attributes per RPC (timestamps, UID, GID, PRJID, etc), so it didn&apos;t make sense to have writes to multiple objects.  However, if the writes are (commonly) all from the same UID/GID/PRJID then this might be possible.&lt;/p&gt;

&lt;p&gt;Alternately, having the batching at the RPC level is likely still helpful, and this would also allow mballoc to do a better job to aggregate the blocks of many small file writes together in the filesystem (e.g. 256x4KB writes into a single 1MB group allocation on disk). &lt;/p&gt;

&lt;p&gt;For very small files (&amp;lt; 512 bytes) it would be possible to use the &quot;inline data&quot; feature of ldiskfs (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5603&quot; title=&quot;Enable inline_data feature for Lustre&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5603&quot;&gt;LU-5603&lt;/a&gt;) to store data directly in the inode.  This could be used with DoM files to store them most efficiently, but may need some added testing/fixing in ldiskfs to work correctly with other ldiskfs features. &lt;/p&gt;</comment>
                            <comment id="355645" author="gerrit" created="Thu, 8 Dec 2022 04:48:50 +0000"  >&lt;p&gt;&quot;Qian Yingjin &amp;lt;qian@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/49342&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/49342&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16355&quot; title=&quot;batch dirty buffered write of small files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16355&quot;&gt;LU-16355&lt;/a&gt; osc: batch dirty buffered write of small files&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 11e13575d9370c66742ce44cffa0ce0cfedc1f63&lt;/p&gt;</comment>
                            <comment id="377959" author="adilger" created="Fri, 7 Jul 2023 19:29:24 +0000"  >&lt;p&gt;Have you looked at implementing batched &lt;b&gt;read&lt;/b&gt; support, if the reads can be generated asynchronously (e.g. via AIO, io_uring, or statahead for mdtest-easy/hard-read)?&lt;/p&gt;</comment>
                            <comment id="379131" author="qian_wc" created="Tue, 18 Jul 2023 15:29:11 +0000"  >&lt;p&gt;I have some thoughts about ahead operations (batched open+read-ahead for DoM files); they are not implemented yet, but we do have an ahead-operations framework.&lt;br/&gt;
However, it only works for batching of small DoM-only files.&lt;br/&gt;
For files with data on OSTs, we can open-ahead the file, but the current batch RPC does not support batched extent DLM locking. We can instead use asynchronous RPCs (one RPC per file read, not a batched RPC), similar to the asynchronous RPCs used for stat-ahead with AGL after the files are opened.&lt;/p&gt;</comment>
                            <comment id="383886" author="gerrit" created="Mon, 28 Aug 2023 09:50:34 +0000"  >&lt;p&gt;&quot;Qian Yingjin &amp;lt;qian@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/52129&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/52129&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16355&quot; title=&quot;batch dirty buffered write of small files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16355&quot;&gt;LU-16355&lt;/a&gt; osc: add tunable for batching small writes&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: bc7e9d919b3ad5cfc2f8339da108eae513478e0b&lt;/p&gt;</comment>
                            <comment id="384389" author="gerrit" created="Thu, 31 Aug 2023 13:41:10 +0000"  >&lt;p&gt;&quot;Qian Yingjin &amp;lt;qian@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/52200&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/52200&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16355&quot; title=&quot;batch dirty buffered write of small files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16355&quot;&gt;LU-16355&lt;/a&gt; osc: batch small writes based on small object count&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: caf3701792f394b702bf2f260d8ab850304b94ba&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="26471">LU-5603</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="61685">LU-14139</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i036uf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>