<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:16:58 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15281] inode size disparity on ZFS MDTs</title>
                <link>https://jira.whamcloud.com/browse/LU-15281</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have two clusters running, echo and lima.&#160; Before I go further, we are comparing apples and oranges a bit here as:&lt;/p&gt;

&lt;p&gt;The MDT pool on echo is composed of two vdevs that are hardware RAIDs (legacy hardware) so no ZFS mirroring.&#160; The MDT pool on lima is composed of 4 NVMe cards in 2 mirrors.&lt;/p&gt;

&lt;p&gt;The MDS on echo keeps getting very close to filling and we can&apos;t work out why.&lt;/p&gt;

&lt;p&gt;The two clusters are both used to do backups with heavy use of hard-linking (using dirvish/rsync).&lt;/p&gt;

&lt;p&gt;I know this is an oversimplification, since it&apos;s not just inodes on the MDT, but running df -k and df -i to get kB and inodes used, then dividing one by the other, yields ~14kB/inode on echo and ~3kB/inode on lima.&lt;/p&gt;

&lt;p&gt;Are there any particular diagnostic tools/commands we could use to find what&apos;s using all the space on the ZFS MDT?&lt;/p&gt;

&lt;p&gt;echo&apos;s MDT is currently using 4.8TB for 350M inodes&lt;/p&gt;

&lt;p&gt;lima&apos;s MDT is currently using 2.8TB for 946M inodes&lt;/p&gt;

&lt;p&gt;Happy to provide any other info/params that might be useful.&lt;/p&gt;</description>
                <environment>CentOS 7.6</environment>
        <key id="67345">LU-15281</key>
            <summary>inode size disparity on ZFS MDTs</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="dneg">Dneg</reporter>
                        <labels>
                    </labels>
                <created>Fri, 26 Nov 2021 22:31:03 +0000</created>
                <updated>Mon, 29 Nov 2021 21:01:48 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="319279" author="dneg" created="Fri, 26 Nov 2021 23:43:41 +0000"  >&lt;p&gt;echo:&#160; logicalused: 1.74TB, used: 4.73TB&lt;/p&gt;

&lt;p&gt;lima:&#160; logicalused: 1.38TB, used: 2.74TB&lt;/p&gt;

&lt;p&gt;So lima is still way smaller per inode but the used/logicalused disparity is huge too.&lt;/p&gt;

&lt;p&gt;echo has physical block size of 4096, logical of 512&lt;/p&gt;

&lt;p&gt;lima has physical and logical of 512&lt;/p&gt;

&lt;p&gt;Both use 128K recordsize&lt;/p&gt;</comment>
                            <comment id="319286" author="adilger" created="Sat, 27 Nov 2021 01:25:57 +0000"  >&lt;p&gt;The first thing to check here would be whether &lt;tt&gt;ashift&lt;/tt&gt; is different on the two pools.  The lima filesystem may be using ashift=9 (512-byte sectors) and echo using ashift=12 (4096-byte sectors).  ZFS will normally automatically select ashift based on the hardware sector size, even if the device claims a smaller sector size.&lt;/p&gt;

&lt;p&gt;Using ashift=9 on 4096-byte sector devices can dramatically hurt performance, and can potentially cause data errors, because the internal read-modify-write of a sector may modify in-use blocks that are not part of the ZFS transaction, causing errors in the &quot;other&quot; sub-sectors.  This would &lt;em&gt;mostly&lt;/em&gt; not be fatal, because redundancy would likely allow those errors to be recovered, but the data would be at risk if a device failed.&lt;/p&gt;</comment>
                            <comment id="319352" author="dneg" created="Mon, 29 Nov 2021 17:11:17 +0000"  >&lt;p&gt;zdb -U /etc/zfs/zpool.cache on both does show what you suggest in that lima&apos;s devices have an ashift of 9 and echo&apos;s are 12.&lt;/p&gt;

&lt;p&gt;Given the physical/logical block sizes reported by /sys/block this seems appropriate right?&lt;/p&gt;

&lt;p&gt;Without knowing deep internals of ZFS, it seems to me there are two things at play here (although I&apos;ll freely admit these are educated guesses):&lt;/p&gt;

&lt;p&gt;1) The logicalused/used disparity on echo - Is this explained by the 4096-byte physical/512-byte logical sectors and ashift=12?&lt;/p&gt;

&lt;p&gt;2) The logicalused disparity between echo and lima - ~1.7TB for 350M files on echo vs. ~1.4TB for 946M files on lima.&lt;/p&gt;</comment>
                            <comment id="319373" author="adilger" created="Mon, 29 Nov 2021 19:24:54 +0000"  >&lt;p&gt;The ashift value represents the smallest possible on-disk unit of allocation, so having a larger ashift is definitely going to inflate space usage. This can become significant with RAID-Z2, but is less so with mirrors.&lt;/p&gt;

&lt;p&gt;Since ZFS always compresses metadata, and dnode allocation is done in larger chunks (64KB), the ratio of MDT space usage is 13.71KB/inode vs. 2.96KB/inode - only 4.6x, not the 8x that would be expected from the ashift ratio alone. &lt;/p&gt;

&lt;p&gt;Unfortunately, this is a property of how ZFS is implemented.&lt;/p&gt;

&lt;p&gt;The only other possible cause for excessive space usage on the MDT would be if &quot;echo&quot; has an old Changelog user registered that is not consuming the records. This can be checked on the MDS with &quot;&lt;tt&gt;lctl get_param mdd.&amp;#42;.changelog_size&lt;/tt&gt;&quot;. &lt;/p&gt;</comment>
                            <comment id="319376" author="dneg" created="Mon, 29 Nov 2021 19:38:58 +0000"  >&lt;p&gt;Output of that command:&lt;/p&gt;

&lt;p&gt;mdd.echo-MDT0000.changelog_size=0&lt;/p&gt;</comment>
                            <comment id="319391" author="adilger" created="Mon, 29 Nov 2021 20:37:09 +0000"  >&lt;p&gt;OK, that means there is no stray changelog usage, so it can&apos;t be the cause of the difference in space usage; I think it is only the ashift.  Unfortunately, the only way to change ashift is to reformat the whole pool, and even then ashift=9 isn&apos;t recommended if the underlying storage uses 4KB sectors with 512-byte emulation, and will not work at all if the device has only a 4KB sector size. &lt;/p&gt;</comment>
                            <comment id="319395" author="dneg" created="Mon, 29 Nov 2021 21:01:48 +0000"  >&lt;p&gt;Ok thanks.&lt;/p&gt;

&lt;p&gt;Given the prevalence of 4k block devices now, I guess we&apos;re either gonna have to just grab more of those 512 NVMe cards, drastically increase our expected MDT size, or switch MDS back to ldiskfs.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="41531" name="echo-get-all" size="5479" author="dneg" created="Fri, 26 Nov 2021 23:50:22 +0000"/>
                            <attachment id="41532" name="lima-get-all" size="5053" author="dneg" created="Fri, 26 Nov 2021 23:50:22 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02az3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>