<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:22:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2160] Implement ZFS dmu_tx_hold_append() declarations for llog </title>
                <link>https://jira.whamcloud.com/browse/LU-2160</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;llog records are written in append fashion and holes in llog files are not allowed. Lustre does not know until late in the transaction life cycle whether it will need to write an llog record for a given operation, at what offset, or to which llog file. In other words, at &lt;tt&gt;dmu_tx_assign()&lt;/tt&gt; time Lustre knows only the size of the potential llog record it might write during the transaction (and the maximum llog file size), but not the exact start offset or object id for the write into the llog file. The &lt;tt&gt;dmu_tx_*&lt;/tt&gt; APIs, however, assume that the caller knows the precise offset of a future write before &lt;tt&gt;dmu_tx_assign()&lt;/tt&gt;. This is needed to calculate the precise amount of space a given write will consume, and that amount may differ for the same write size at different offsets (a second reason is that in debug mode &lt;tt&gt;dmu_tx_dirty_buf()&lt;/tt&gt; verifies that all writes go only to offsets declared with &lt;tt&gt;dmu_tx_hold_write()&lt;/tt&gt;). Changing the Lustre code to calculate the precise offset in the llog before the &lt;tt&gt;dmu_tx_assign()&lt;/tt&gt; step does not appear to yield an efficient solution, so we propose adding two new &lt;tt&gt;dmu_tx&lt;/tt&gt; APIs to accommodate Lustre&apos;s requirements.&lt;/p&gt;

&lt;p&gt;1)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;void
dmu_tx_hold_append(dmu_tx_t *tx, uint64_t object, uint64_t startoff,
                   uint64_t obj_maxsize, uint64_t len, int bs, int ibs)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This API reserves the maximum possible amount of space needed to append len bytes to any plain file within the &lt;tt&gt;tx-&amp;gt;tx_objset&lt;/tt&gt; objset. The caller guarantees that the size of this file is at most obj_maxsize, but the actual start offset for the write may be anywhere from startoff to obj_maxsize - len (assert(len &amp;lt;= obj_maxsize)). The API is passed an object id because the object may be unknown when it is called. bs and ibs specify the direct and indirect block sizes used for the object to be written (this avoids assuming the worst case for block sizes); if bs or ibs is 0, the worst case for that block size should be assumed.&lt;/p&gt;

&lt;p&gt;dmu_tx_hold_append() should pick the worst-case range for a write of len bytes. This is the range whose start and end bytes are covered by different indirect blocks at every level of the tree (i.e. it appears to be [obj_maxsize/2 - len/2, obj_maxsize/2 + len/2)). dmu_tx_hold_append() can assume that all writes to data blocks create only new data and metadata blocks, even when there is no snapshot. Therefore only the txh_space_towrite field needs to be calculated for the txh structure created by dmu_tx_hold_append() (space for the metadnode update should be added just as in dmu_tx_hold_write() when dn is NULL).&lt;/p&gt;
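The worst-case range choice described above can be sketched as follows (illustrative Python, not the actual DMU code):

```python
# Illustrative sketch: the proposed worst-case range for appending
# `length` bytes to a file of at most `obj_maxsize` bytes, centred on
# the middle of the file so that the start and end bytes fall under
# different indirect blocks at every level of the tree.
def worst_case_append_range(obj_maxsize, length):
    # The caller guarantees the write fits in the file.
    assert obj_maxsize >= length
    start = obj_maxsize // 2 - length // 2
    end = start + length        # half-open interval [start, end)
    return (start, end)
```

For the 512MB/40-byte scenario discussed below, this picks a 40-byte window centred on the 256MB mark.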

&lt;p&gt;While dmu_tx_hold_append() may overestimate the net space consumed on disk by always assuming there is a snapshot and that all writes create new metadata (a correct assumption for data, since we write to the llog only in append fashion), it avoids the over-reservation done by dmu_tx_hold_write() in other respects. Specifically, dmu_tx_count_write() (called by dmu_tx_hold_write()) assumes the object size can be up to 2^64 - 1 when either the dnode is not specified or the declared write extends beyond the current dn_maxblkid. In that case dmu_tx_count_write() accounts for the maximum number of indirect blocks (and for the first write, when maxblkid is 0, it assumes the worst possible indirect-block overhead: it computes the number of indirect blocks from the smallest indirect block size but uses the largest indirect block size as the block size).&lt;/p&gt;

&lt;p&gt;dmu_tx_hold_append() can avoid that because it knows the maximum file size and the actual indirect/direct block sizes the caller guarantees to use. For example, with 4K direct and 16K indirect block sizes and a 512MB obj_maxsize, dmu_tx_hold_append() for a 40-byte write will reserve 5 indirect blocks and 2 direct blocks, i.e. 88KB, excluding the metadnode write.&lt;/p&gt;
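The 88KB figure can be checked with a small worked calculation (a Python sketch assuming a 128-byte blkptr_t, which is the ZFS block pointer size; this is an illustration, not the actual DMU code):

```python
SPA_BLKPTR_SIZE = 128   # bytes per ZFS block pointer

def hold_append_reservation(bs, ibs, obj_maxsize, length):
    """Worst-case space needed to append `length` bytes anywhere in a
    file of at most `obj_maxsize` bytes, given data block size `bs`
    and indirect block size `ibs`."""
    nblkptr = ibs // SPA_BLKPTR_SIZE    # pointers per indirect block
    ndatablks = obj_maxsize // bs       # max data blocks in the file

    # An unaligned write can straddle one data-block boundary, so it
    # touches at most this many data blocks.
    data_blocks = (length - 1) // bs + 2

    # At every tree level below the top, the two ends of the write may
    # sit under different indirect blocks (2 each); the top level is a
    # single shared block.
    indirect_blocks = 1
    span = nblkptr                      # data blocks under one L1 block
    while ndatablks > span:
        indirect_blocks += 2
        span *= nblkptr

    return data_blocks * bs + indirect_blocks * ibs
```

For bs = 4096, ibs = 16384, obj_maxsize = 512 MiB and length = 40 this gives 2 data blocks plus 5 indirect blocks, i.e. 90112 bytes (88 KiB), matching the figure above.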

&lt;p&gt;In the same case dmu_tx_hold_write() will reserve, according to this code (used when we extend the object into a new block):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;	/*
	 * &apos;end&apos; is the last thing we will access, not one past.
	 * This way we won&apos;t overflow when accessing the last byte.
	 */
	start = P2ALIGN(off, 1ULL &amp;lt;&amp;lt; max_bs);
	end = P2ROUNDUP(off + len, 1ULL &amp;lt;&amp;lt; max_bs) - 1;
	txh-&amp;gt;txh_space_towrite += end - start + 1;

	start &amp;gt;&amp;gt;= min_bs;
	end &amp;gt;&amp;gt;= min_bs;

	epbs = min_ibs - SPA_BLKPTRSHIFT;
	/*
	 * The object contains at most 2^(64 - min_bs) blocks,
	 * and each indirect level maps 2^epbs.
	 */
	for (bits = 64 - min_bs; bits &amp;gt;= 0; bits -= epbs) {
		start &amp;gt;&amp;gt;= epbs;
		end &amp;gt;&amp;gt;= epbs;
		ASSERT3U(end, &amp;gt;=, start);
		txh-&amp;gt;txh_space_towrite += (end - start + 1) &amp;lt;&amp;lt; max_ibs;
		if (start != 0) {
			/*
			 * We also need a new blkid=0 indirect block
			 * to reference any existing file data.
			 */
			txh-&amp;gt;txh_space_towrite += 1ULL &amp;lt;&amp;lt; max_ibs;
		}
	}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;at least (52/7 + 1) * 16K + 16K + 4K = 148KB (the extra 16K accounts for all writes whose start offset is beyond 64MB).&lt;/p&gt;
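For comparison, the arithmetic of the quoted dmu_tx_count_write() fragment can be reproduced with a small sketch (Python is used for the arithmetic only; P2ALIGN and P2ROUNDUP are re-expressed with integer division, and this illustrates the quoted loop rather than reproducing the ZFS source):

```python
SPA_BLKPTRSHIFT = 7   # log2 of the 128-byte ZFS block pointer

def count_write_reservation(off, length, bs_shift=12, ibs_shift=14):
    """Space reserved by the quoted dmu_tx_count_write() loop for a
    write of `length` bytes at offset `off`, with min_bs = max_bs =
    bs_shift and min_ibs = max_ibs = ibs_shift."""
    blksz = 2 ** bs_shift
    ibsz = 2 ** ibs_shift

    # 'end' is the last byte accessed, not one past it.
    start = (off // blksz) * blksz                          # P2ALIGN
    end = ((off + length + blksz - 1) // blksz) * blksz - 1 # P2ROUNDUP - 1
    towrite = end - start + 1

    start = start >> bs_shift
    end = end >> bs_shift
    epbs = ibs_shift - SPA_BLKPTRSHIFT

    # The object contains at most 2^(64 - min_bs) blocks, and each
    # indirect level maps 2^epbs of them.
    bits = 64 - bs_shift
    while bits >= 0:
        start = start >> epbs
        end = end >> epbs
        towrite += (end - start + 1) * ibsz
        if start != 0:
            # extra blkid=0 indirect block for existing file data
            towrite += ibsz
        bits -= epbs
    return towrite
```

For a 40-byte write at a 128 MiB offset this yields 167936 bytes (164 KiB), consistent with the lower-bound estimate of 148KB above.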

&lt;p&gt;That&apos;s greater than the 88KB reserved by dmu_tx_hold_append() excluding the metadnode reservation (the metadnode reservation would be the same if we do not know the object id upfront). And when the dnode id is unknown (the actual case for llog writes), dmu_tx_count_write() would reserve much more, since it uses 3 rather than 7 for epbs above.&lt;/p&gt;</description>
                <environment></environment>
        <key id="12251">LU-2160</key>
            <summary>Implement ZFS dmu_tx_hold_append() declarations for llog </summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bzzz">Alex Zhuravlev</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>zfs</label>
                    </labels>
                <created>Wed, 26 Oct 2011 01:56:33 +0000</created>
                <updated>Wed, 3 May 2023 15:27:53 +0000</updated>
                                                                                <due></due>
                            <votes>1</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="134604" author="adilger" created="Thu, 26 Nov 2015 02:53:49 +0000"  >&lt;p&gt;Alex, how hard would it be to actually implement this in the DMU?  My recollection was that it is mostly a matter of adding a new API, and the actual code complexity is low?&lt;/p&gt;</comment>
                            <comment id="135331" author="bzzz" created="Sat, 5 Dec 2015 05:28:36 +0000"  >&lt;p&gt;Andreas, reservation itself isn&apos;t a big issue. and I think we could just declare at high offset to reserve enough credits. the issue is debugging code which doesn&apos;t like when a write is going out of declared ranges.&lt;/p&gt;</comment>
                            <comment id="371247" author="behlendorf" created="Wed, 3 May 2023 15:09:03 +0000"  >&lt;p&gt;Alex, I&apos;ve opened PR &lt;a href=&quot;https://github.com/openzfs/zfs/pull/14819&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/openzfs/zfs/pull/14819&lt;/a&gt; which adds something close to the interface you suggested above.&#160; I was able to simplify your original proposal because in 2017 we relaxed the strict space accounting which was being done, &lt;a href=&quot;https://github.com/openzfs/zfs/commit/3ec3bc2167352df525c10c99cf24cb24952c2786&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/openzfs/zfs/commit/3ec3bc2167352df525c10c99cf24cb24952c2786&lt;/a&gt;. With the following interface you just need to specify a minimum starting offset and a length for the write.&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;void dmu_tx_hold_append(dmu_tx_t *tx, uint64_t object, uint64_t off, int len);&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;void dmu_tx_hold_append_by_dnode(dmu_tx_t *tx, dnode_t *dn, uint64_t off, int len)&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;As proposed this interface will still prefetch the L0 block at the specified offset to catch I/O errors early and to help make sure the block is in the ARC (for partial writes).&#160; That said, any offset beyond what&apos;s provided is still allowed but may result in some unnecessary reads so the more accurate you can be the better.&lt;/p&gt;

&lt;p&gt;I haven&apos;t had a chance to test this new interface yet but wanted to open up the PR to get your feedback.&#160; I&apos;d love to finally make some progress on this so we can get one step closer to enabling debugging on the OpenZFS side.&lt;/p&gt;</comment>
                            <comment id="371248" author="bzzz" created="Wed, 3 May 2023 15:27:53 +0000"  >&lt;p&gt;great, Brian.. will try that ASAP. thanks a lot.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="33047">LU-7409</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="25269">LU-5242</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="55745">LU-12336</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzuvef:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2749</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10002" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Story Points</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>