<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:09:31 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7510] (vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28</title>
                <link>https://jira.whamcloud.com/browse/LU-7510</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have some production apps and rsync processes failing writes with ENOSPC errors on the ZFS-backed FS only. It is currently at ~79% full. There are no server-side errors; -28 errors like the one above appear in the client logs.&lt;/p&gt;

&lt;p&gt;I see that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3522&quot; title=&quot;sanity-benchmark test_iozone: &amp;quot;no space left on device&amp;quot; on ZFS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3522&quot;&gt;&lt;del&gt;LU-3522&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2049&quot; title=&quot;add support for OBD_CONNECT_GRANT_PARAM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2049&quot;&gt;&lt;del&gt;LU-2049&lt;/del&gt;&lt;/a&gt; may have a bearing on this issue; is there a 2.5 backport or equivalent fix available?&lt;/p&gt;</description>
                <environment>Servers and clients: 2.5.4-11chaos-11chaos--PRISTINE-2.6.32-573.7.1.1chaos.ch5.4.x86_64&lt;br/&gt;
ZFS back end </environment>
        <key id="33409">LU-7510</key>
            <summary>(vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                <statusCategory id="3" key="done" colorName="success"/>
                <resolution id="10000">Done</resolution>
                <assignee username="utopiabound">Nathaniel Clark</assignee>
                <reporter username="ruth.klundt@gmail.com">Ruth Klundt</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 1 Dec 2015 21:13:15 +0000</created>
                <updated>Wed, 8 Jun 2016 22:23:43 +0000</updated>
                <resolved>Wed, 8 Jun 2016 22:23:43 +0000</resolved>
                <version>Lustre 2.5.3</version>
                <due></due>
                <votes>0</votes>
                <watches>7</watches>
                <comments>
                            <comment id="135022" author="adilger" created="Wed, 2 Dec 2015 19:56:00 +0000"  >&lt;p&gt;Ruth, could you please post the output of &quot;&lt;tt&gt;lfs df&lt;/tt&gt;&quot; and &quot;&lt;tt&gt;lfs df -i&lt;/tt&gt;&quot; on your filesystem(s).  On the OSS nodes, could you please collect &quot;&lt;tt&gt;lctl get_param obdfilter.*.tot_granted&lt;/tt&gt;&quot; to see if this is the actual cause of the ENOSPC errors.  Also, how many clients are connected to the filesystem?&lt;/p&gt;

&lt;p&gt;One potential workaround is to release some of the grant from the clients using &quot;&lt;tt&gt;lctl set_param osc.&amp;#42;.cur_grant_bytes=2M&lt;/tt&gt;&quot; and then check &quot;&lt;tt&gt;lctl get_param obdfilter.&amp;#42;.tot_granted&lt;/tt&gt;&quot; on the OSS nodes again to see if the total grant space has been reduced.&lt;/p&gt;

&lt;p&gt;Have you modified the client&apos;s maximum RPC size (via &lt;tt&gt;lctl set_param osc.&amp;#42;.max_pages_per_rpc=4M&lt;/tt&gt;, e.g. to have 4MB RPC size), or the ZFS maximum blocksize (via &lt;tt&gt;zfs set recordsize=1048576 lustre/lustre-OST0000&lt;/tt&gt; or similar)?  That will aggravate this problem until the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3522&quot; title=&quot;sanity-benchmark test_iozone: &amp;quot;no space left on device&amp;quot; on ZFS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3522&quot;&gt;&lt;del&gt;LU-3522&lt;/del&gt;&lt;/a&gt; is landed.&lt;/p&gt;</comment>
                            <comment id="135038" author="ruth.klundt@gmail.com" created="Wed, 2 Dec 2015 21:06:55 +0000"  >&lt;p&gt;The max_pages_per_rpc value is the default of 256, and the ZFS recordsize is 128K. We have 3 OSTs on each OSS rather than just one. We have ~6500 clients mounting the file system.&lt;/p&gt;

&lt;p&gt;We requested that some of the heavy users clean up, so the FS is at 75% now. Also moved a couple of affected users to the other (ldiskfs) file system. &lt;/p&gt;

&lt;p&gt;No messages so far today. I will go ahead and release some grant if you think it&apos;s still necessary or beneficial. &lt;/p&gt;

&lt;p&gt;I was guessing a combination of the bug + heavy user activity + high fs usage may have triggered this. Our FS usage tends to run high around here.&lt;/p&gt;</comment>
                            <comment id="135142" author="adilger" created="Thu, 3 Dec 2015 19:11:51 +0000"  >&lt;p&gt;Ruth, the &lt;tt&gt;cur_grant_bytes&lt;/tt&gt; command to release grants is something that you can try as a workaround if the &lt;tt&gt;-ENOSPC&lt;/tt&gt; errors are being hit again. It doesn&apos;t hurt to run this now, or occasionally, though it may cause a very brief hiccup in IO performance as the grant is released. The main reason this isn&apos;t useful to do (much) in advance of the problem is that this command asks clients to try to return their grant to the server, but if the server isn&apos;t low on space it will just return the grant back to the client.&lt;/p&gt;

&lt;p&gt;The real fix for this problem is indeed the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2049&quot; title=&quot;add support for OBD_CONNECT_GRANT_PARAM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2049&quot;&gt;&lt;del&gt;LU-2049&lt;/del&gt;&lt;/a&gt; patch. The reason you see this problem when LLNL does not is that you have many more clients connected directly to the filesystem (6500 vs 768) and their OSTs are 72TB vs 30TB so they wouldn&apos;t hit this until they reach 99% full.&lt;/p&gt;

&lt;p&gt;We are working to get the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2049&quot; title=&quot;add support for OBD_CONNECT_GRANT_PARAM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2049&quot;&gt;&lt;del&gt;LU-2049&lt;/del&gt;&lt;/a&gt; patch landed to resolve this issue permanently. &lt;/p&gt;</comment>
                            <comment id="135170" author="ruth.klundt@gmail.com" created="Thu, 3 Dec 2015 22:33:55 +0000"  >&lt;p&gt;thanks, much appreciated. I&apos;ll keep an eye on those messages, they have not resumed since the usage went down. &lt;/p&gt;</comment>
                            <comment id="136738" author="jfc" created="Thu, 17 Dec 2015 17:52:08 +0000"  >&lt;p&gt;We are resolving this as a duplicate.&lt;/p&gt;

&lt;p&gt;Ruth &amp;#8211; if the problem recurs and you need more help please either ask us to reopen this ticket, or open a new one, as you prefer.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="148197" author="charr" created="Thu, 7 Apr 2016 23:42:32 +0000"  >&lt;p&gt;John,&lt;br/&gt;
FYI...We&apos;ve been hitting this at LLNL the last week or so. I&apos;ll note it on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2049&quot; title=&quot;add support for OBD_CONNECT_GRANT_PARAM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2049&quot;&gt;&lt;del&gt;LU-2049&lt;/del&gt;&lt;/a&gt; as well.&lt;/p&gt;</comment>
                            <comment id="148242" author="ruth.klundt@gmail.com" created="Fri, 8 Apr 2016 16:17:33 +0000"  >&lt;p&gt;We are running the workaround to release grant on the clients from the epilog (i.e. after each job), just to proactively keep that under control. We have not seen ENOSPC errors in the logs again, and usage has hit the 80% mark several times since.&lt;/p&gt;
                            <comment id="148436" author="charr" created="Mon, 11 Apr 2016 17:03:09 +0000"  >&lt;p&gt;Ruth,&lt;br/&gt;
Thanks for letting us know you went down the grant release route. I had noticed that 32 of our 80 OSTs were ~90% full (the others ~ 65%), so I deactivated those 32 fuller OSTs and that seems to have resolved the problem for now.&lt;/p&gt;</comment>
                            <comment id="149102" author="ruth.klundt@gmail.com" created="Fri, 15 Apr 2016 15:39:42 +0000"  >&lt;p&gt;Yesterday we had a server go down with this LBUG:&lt;/p&gt;

&lt;p&gt;LustreError: 8117:0:(ofd_grant.c:352:ofd_grant_incoming()) fscratch-OST001d: cli 8c1795e2-8806-4e65-5865-4e42489eac9b/ffff8807dee68400 dirty 33554432 pend 0 grant -54657024&lt;br/&gt;
LustreError: 8117:0:(ofd_grant.c:354:ofd_grant_incoming()) LBUG&lt;/p&gt;

&lt;p&gt;The client-side grant release doesn&apos;t seem to be taking effect on that node. The values of tot_granted on the server side increased to ~5x10^12 on that node only; all the other nodes had values of ~2x10^12.&lt;/p&gt;

&lt;p&gt;This node has gone down several times in the last few weeks; this is the first time we got any log messages before it died. It seems we should also deactivate those OSTs, and increase the priority of upgrading the servers - although it appears a 2.5 upgrade path is not yet available (looking at &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2049&quot; title=&quot;add support for OBD_CONNECT_GRANT_PARAM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2049&quot;&gt;&lt;del&gt;LU-2049&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But I&apos;m puzzled as to why the grant release doesn&apos;t work for just that node. None of the others have been rebooted during this time period since we started the workaround.&lt;/p&gt;</comment>
                            <comment id="149673" author="ruth.klundt@gmail.com" created="Thu, 21 Apr 2016 15:00:00 +0000"  >&lt;p&gt;It turns out that the grant release does work, even on the problem node, once the grant on the problem server reaches ~4.9T. It decreased to ~4.0T over the course of a day before the LBUG on 2 different targets. The other servers respond to grant release at levels as low as 1.4T. The usage levels are similar; all OSTs are between 75-80% full. The only difference I can find is that the last item in the other zpools&apos; history is the activation of compression back in March. So this one server was rebooted after that compression activation, and all the rest were not. Wondering if the size computation is affected by whether compression is on or off? All zpools are reporting 1.03-1.05 ratios.&lt;/p&gt;</comment>
                            <comment id="150311" author="utopiabound" created="Tue, 26 Apr 2016 21:46:21 +0000"  >&lt;p&gt;FYI: I don&apos;t have an exact version for 2.5.4-11chaos (12chaos and 4chaos are tagged in our system, so I have a good idea).&lt;/p&gt;

&lt;p&gt;Do you have any logging leading up to the LBUG, by any chance?&lt;/p&gt;</comment>
                            <comment id="150386" author="ruth.klundt@gmail.com" created="Wed, 27 Apr 2016 16:36:02 +0000"  >&lt;p&gt;There is nothing prior to the LBUG. Here are the traces.&lt;/p&gt;

&lt;p&gt;The ofd code at least does not differ between the 11chaos and 12chaos versions, as far as I can see.&lt;/p&gt;</comment>
                            <comment id="150387" author="ruth.klundt@gmail.com" created="Wed, 27 Apr 2016 16:38:09 +0000"  >&lt;p&gt;After deactivating the OSTs on that node, the rate of increase is slower, but it is still much larger than on all the others and so far, at about ~3.7T, is not decreasing.&lt;/p&gt;</comment>
                            <comment id="150764" author="ruth.klundt@gmail.com" created="Mon, 2 May 2016 20:53:18 +0000"  >&lt;p&gt;Each of the OSTs has shown a couple of decreases, in the 3.8-3.9 T range.&lt;/p&gt;</comment>
                            <comment id="152212" author="ruth.klundt@gmail.com" created="Fri, 13 May 2016 14:29:15 +0000"  >&lt;p&gt;Nearly all OSS nodes on this file system became inaccessible yesterday; 3 of them showed the LBUG at ofd_grant.c:352:ofd_grant_incoming with negative grant values. I disabled the automated grant release workaround in case it is related to this occurrence. The OSTs are 77-79% full at the moment. After that, another OSS went down with the same LBUG.&lt;/p&gt;

&lt;p&gt;This coincides with the addition of a new cluster, but we haven&apos;t done any I/O from it so far, just mounting. Any advice/thoughts? &lt;/p&gt;
</comment>
                            <comment id="152213" author="ruth.klundt@gmail.com" created="Fri, 13 May 2016 14:30:33 +0000"  >&lt;p&gt;And a specific question, is the LBUG likely addressed by changes upstream or should this be a separate ticket? &lt;/p&gt;</comment>
                            <comment id="153740" author="utopiabound" created="Thu, 26 May 2016 20:11:48 +0000"  >&lt;p&gt;The LBUG in question hasn&apos;t been changed, though the grant code has been reworked (a la &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2049&quot; title=&quot;add support for OBD_CONNECT_GRANT_PARAM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2049&quot;&gt;&lt;del&gt;LU-2049&lt;/del&gt;&lt;/a&gt;) upstream.  The negative grant resulting in the LBUG should be a separate bug, though it&apos;s probably 2.5-only.&lt;/p&gt;</comment>
                            <comment id="155150" author="ruth.klundt@gmail.com" created="Wed, 8 Jun 2016 20:14:06 +0000"  >&lt;p&gt;The file system usage has been reduced to ~70%, and we haven&apos;t seen -28 issues or LBUGs since then. &lt;/p&gt;

&lt;p&gt;You can close this one; we&apos;ll consider the fix for the -28 issues to be an upgrade to Lustre 2.8 on the servers at some point in the future.&lt;/p&gt;

&lt;p&gt;If the LBUG re-occurs I&apos;ll open a new ticket. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Ruth&lt;/p&gt;</comment>
                            <comment id="155173" author="jfc" created="Wed, 8 Jun 2016 22:23:43 +0000"  >&lt;p&gt;Thanks Ruth.&lt;/p&gt;

&lt;p&gt;~ jfc.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="36063">LU-8007</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="16178">LU-2049</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="21303" name="lu-7510-lbug.txt" size="14079" author="ruth.klundt@gmail.com" created="Wed, 27 Apr 2016 16:36:02 +0000"/>
                            <attachment id="19788" name="zfs.lfs-out.12.02" size="10160" author="ruth.klundt@gmail.com" created="Wed, 2 Dec 2015 21:06:55 +0000"/>
                            <attachment id="19789" name="zfs.tot_granted.12.02" size="3180" author="ruth.klundt@gmail.com" created="Wed, 2 Dec 2015 21:06:55 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 26 May 2016 21:13:15 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxuon:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 1 Dec 2015 21:13:15 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>