<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:02:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13588] sigbus sent to mmap writer that is a long way below quota</title>
                <link>https://jira.whamcloud.com/browse/LU-13588</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;we&apos;ve been seeing SIGBUS from a tensorflow build, and possibly other builds and codes, since moving to 2.12.4 on servers. we moved to centos 7.8 on servers and clients at the same time. our previous Lustre version on servers was 2.10.5 (plus many patches) and zfs 0.7.9. the old server lustre versions had no issues with SIGBUS that we know of. we have been running 2.10.8 on clients for about 6 months and that is unchanged. &lt;/p&gt;

&lt;p&gt;after a week or so narrowing down the issue, we have found a reproducer in a tensorflow build ld step that will reliably SIGBUS, and have also found that this is related to group block quotas.&lt;/p&gt;

&lt;p&gt;the .so file that ld (ld.gold, collect2) is writing into is initially nulls and sparse, is about 210M in size, is mmap&apos;d, and probably receives a lot (&amp;gt;600k) of small memcpy/memsets into the file before ld gets a SIGBUS.&lt;/p&gt;

&lt;p&gt;a strace -f -t snippet is&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;62275 16:15:47 mmap(NULL, 258627360, PROT_READ|PROT_WRITE, MAP_SHARED, 996, 0) = 0x2b3b15905000
62275 16:15:48 --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x2b3b23a7cd23} ---
62275 16:15:48 +++ killed by SIGBUS +++
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;if that value of si_addr is correct, then it is 0xe177d23 (236420387) bytes past the start of the 258627360-byte mapping, so it&apos;s within the mapped range and it doesn&apos;t look like ld is writing in the wrong place. ltrace also shows no addresses out of bounds.&lt;/p&gt;

&lt;p&gt;it gets interesting if we change the group quota limit on the account. if there is less than ~9TB of group quota free in the account, then we reliably get a SIGBUS.&lt;br/&gt;
ie. anywhere in the range -&amp;gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lfs setquota  -g oz997 -B 6000000000 /fred
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;to&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lfs setquota  -g oz997 -B 14000000000 /fred
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;where only about 5TB is actually used in the account -&amp;gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lfs quota  -g oz997 /fred
Disk quotas for grp oz997 (gid 10273):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /fred 5166964968       0 6000000000       -  936838       0 2000000       -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;then we get a sigbus -&amp;gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; $ /apps/skylake/software/core/gcccore/6.4.0/bin/gcc @bazel-out/k8-py2-opt/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
collect2: fatal error: ld terminated with signal 7 [Bus error]
compilation terminated.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but when there is ~9TB free quota, or more -&amp;gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lfs setquota  -g oz997 -B 14000000000 /fred
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;then we do not see a SIGBUS and the ld step completes ok.&lt;/p&gt;

&lt;p&gt;other things to mention&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;we have tried with various different (much newer) gcc&apos;s and they see the same thing.&lt;/li&gt;
	&lt;li&gt;as far as I can tell from strace and ltrace output of the memcpy/memsets, all of the addresses it is writing to are well within the bounds of the file and so should not be getting SIGBUS. ie. it&apos;s probably not a bug in ld.gold.&lt;/li&gt;
	&lt;li&gt;ld.gold is the default linker. if we pick ld.bfd instead, then ld.bfd does ordinary (not mmap&apos;d) i/o to the output .so and that succeeds with the smallest quota above, so this just seems to affect mmap&apos;d i/o.&lt;/li&gt;
	&lt;li&gt;we&apos;ve tried a couple of different user and group accounts and the pattern is similar, so I don&apos;t think it&apos;s anything odd in an account&apos;s limits or settings.&lt;/li&gt;
	&lt;li&gt;another user with a much larger quota is also seeing SIGBUS on a build, but that group is within 30T of a 2P quota, so it is &quot;close&quot; to being over by some measure. I haven&apos;t dug into that report yet, but I suspect it&apos;s the same issue as this one.&lt;/li&gt;
	&lt;li&gt;builds to XFS work ok. I haven&apos;t tried XFS with a quota.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;on the surface this seems similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13228&quot; title=&quot;write access to an mmapped file over soft quota limit causes sigbus&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13228&quot;&gt;&lt;del&gt;LU-13228&lt;/del&gt;&lt;/a&gt; but we do not set any soft quotas, and the accounts are many TB away from being over quota. also our only recent lustre changes have been on the server side, and AFAICT that ticket is a client side fix.&lt;/p&gt;

&lt;p&gt;as we have a reproducer and a test methodology, we could probably build a 2.12.4 client image and try that if you would find that useful. we weren&apos;t planning to move to 2.12.x client in production just yet, but we could try it as an experiment.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</description>
                <environment>centos 7.8, zfs 0.8.3, lustre 2.12.4 on servers&lt;br/&gt;
zfs compression is enabled on OSTs.&lt;br/&gt;
centos 7.8, lustre 2.10.8 on clients&lt;br/&gt;
all x86_64&lt;br/&gt;
group block and inode quotas set and enforcing</environment>
        <key id="59248">LU-13588</key>
            <summary>sigbus sent to mmap writer that is a long way below quota</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="scadmin">SC Admin</reporter>
                        <labels>
                    </labels>
                <created>Tue, 19 May 2020 16:08:20 +0000</created>
                <updated>Sat, 27 Jun 2020 14:41:11 +0000</updated>
                            <resolved>Thu, 25 Jun 2020 12:34:07 +0000</resolved>
                                    <version>Lustre 2.10.8</version>
                    <version>Lustre 2.12.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="270607" author="green" created="Tue, 19 May 2020 21:45:06 +0000"  >&lt;p&gt;any chance you can give this patch a try still? &lt;a href=&quot;https://review.whamcloud.com/38292&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/38292&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have seen it fail even in the total absence of quotas, based purely on grant dynamics, which is the other way that codepath can be triggered&lt;/p&gt;</comment>
                            <comment id="270820" author="scadmin" created="Thu, 21 May 2020 09:26:02 +0000"  >&lt;p&gt;Hi Oleg,&lt;/p&gt;

&lt;p&gt;I tried 2.12.4 client (no patches except a build patch for rhel7.8) instead of 2.10.8, and the sigbus issue is still there.&lt;/p&gt;

&lt;p&gt;do I need to apply the patch in &lt;a href=&quot;https://review.whamcloud.com/38292&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/38292&lt;/a&gt; to the servers as well as clients?&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="270857" author="green" created="Thu, 21 May 2020 18:00:37 +0000"  >&lt;p&gt;no, it&apos;s a client only patch&lt;/p&gt;</comment>
                            <comment id="270939" author="scadmin" created="Fri, 22 May 2020 16:32:33 +0000"  >&lt;p&gt;Hi Oleg,&lt;/p&gt;

&lt;p&gt;2.12.4 + the patch in &lt;a href=&quot;https://review.whamcloud.com/38292&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/38292&lt;/a&gt; seems to have fixed it. thanks!&lt;/p&gt;

&lt;p&gt;BTW any idea if 2.12.5 is out soon?&lt;br/&gt;
it would be good to have all those fixes as well as this one before we make the jump to 2.12 clients.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="270944" author="pjones" created="Fri, 22 May 2020 17:36:25 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=scadmin&quot; class=&quot;user-hover&quot; rel=&quot;scadmin&quot;&gt;scadmin&lt;/a&gt; yes 2.12.5 should be out soon - we&apos;re aiming to have an RC next week&lt;/p&gt;</comment>
                            <comment id="272145" author="pjones" created="Sat, 6 Jun 2020 14:15:54 +0000"  >&lt;p&gt;Robin&lt;/p&gt;

&lt;p&gt;We&apos;re at an advanced stage of release testing on RC1 and so far so good &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="272831" author="pjones" created="Sat, 13 Jun 2020 14:59:47 +0000"  >&lt;p&gt;Robin&lt;/p&gt;

&lt;p&gt;2.12.5 is now GA&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="273716" author="scadmin" created="Thu, 25 Jun 2020 07:42:11 +0000"  >&lt;p&gt;Hi Oleg and Peter,&lt;/p&gt;

&lt;p&gt;we have all clients at 2.12.5 now, and no sign of SIGBUS.&lt;/p&gt;

&lt;p&gt;please close this ticket.&lt;br/&gt;
thanks!&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="273724" author="pjones" created="Thu, 25 Jun 2020 12:34:07 +0000"  >&lt;p&gt;Good news! Thanks&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Quota</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i010rb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>