<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:04:02 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13766] tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0</title>
                <link>https://jira.whamcloud.com/browse/LU-13766</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Many thousands of console log messages like this one on the lustre OSS nodes after servers were rebooted while clients stayed up:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0010: cli ac60c141-9de9-1a2e-5d0d-fd1e525ff506 claims 1703936 GRANT, real grant 0
Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 237 previous similar messages
Jun 25 03:47:35 brass10 kernel: LustreError: 20031:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0005: cli f6897b82-71ad-5bc7-b60d-554c4cbbcdf7 claims 1703936 GRANT, real grant 0
Jun 25 03:47:35 brass10 kernel: LustreError: 20031:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 433 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This server cluster has 4 MDTs and 18 OSTs.  &lt;/p&gt;

&lt;p&gt;The number of these messages dropped significantly over time.  Roughly, in thousands, counts per day for all of brass were:&lt;/p&gt;

&lt;p&gt;2020-06-24  469&lt;br/&gt;
2020-06-25  417&lt;br/&gt;
2020-06-26   39&lt;br/&gt;
2020-06-27   27&lt;br/&gt;
2020-06-28  16&lt;br/&gt;
2020-06-29   19&lt;/p&gt;
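
&lt;p&gt;A minimal sketch of the arithmetic on these counts, assuming the 18 OSTs and 967 exports reported for this cluster and assuming every export is connected to every OST. The variable names are illustrative only and are not taken from any Lustre tool.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
# Daily message counts for all of brass, in thousands (copied from the list above).
daily_counts_thousands = {
    "2020-06-24": 469,
    "2020-06-25": 417,
    "2020-06-26": 39,
    "2020-06-27": 27,
    "2020-06-28": 16,
    "2020-06-29": 19,
}

# Total number of messages over the six days.
total_messages = sum(daily_counts_thousands.values()) * 1000

# Assumption: each of the 967 exports is a client connected to each of the 18 OSTs.
ost_client_pairs = 18 * 967

# Average number of messages per OST/client pair.
messages_per_pair = total_messages / ost_client_pairs

print(f"total messages: {total_messages}")
print(f"OST/client pairs: {ost_client_pairs}")
print(f"messages per pair: {messages_per_pair:.1f}")
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;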

&lt;p&gt;From what I can see, under Lustre 2.12.4 (at least) the clients all have some notion of their allocated grant, and when the server is restarted, the server loses all record of what grant it allocated.  They then appear to sync up as clients issue new writes using grant they were given, but that the server does not know about.  Eventually they would use up that &quot;old grant&quot; and be back in sync again.&lt;/p&gt;

&lt;p&gt;The pattern above seems consistent with that.  But why is the number of such messages so large?&lt;/p&gt;

&lt;p&gt;There are 18 OSTs, and they report 967 exports, so that works out to about (987,000 messages / 18,000 OST_client combinations) = about 55 such messages per OST_client combination.  It seems strange it would take 55 writes for the grant to be synced up between an OST and a client after some disturbance like a reboot.&lt;/p&gt;</description>
                <environment>brass&lt;br/&gt;
zfs-0.7.11-9.4llnl.ch6.x86_64&lt;br/&gt;
lustre-2.12.4_6.chaos-1.ch6.x86_64&lt;br/&gt;
(other lustre clusters as well including those at lustre 2.10.8)</environment>
        <key id="59896">LU-13766</key>
            <summary>tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Thu, 9 Jul 2020 01:37:08 +0000</created>
                <updated>Fri, 6 Nov 2020 20:02:34 +0000</updated>
                            <resolved>Fri, 16 Oct 2020 22:06:32 +0000</resolved>
                                    <version>Lustre 2.12.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="274824" author="ofaaland" created="Thu, 9 Jul 2020 01:37:33 +0000"  >&lt;p&gt;For my records, my local issue is TOSS4826&lt;/p&gt;</comment>
                            <comment id="274825" author="gerrit" created="Thu, 9 Jul 2020 01:39:37 +0000"  >&lt;p&gt;Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/39324&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39324&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13766&quot; title=&quot;tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13766&quot;&gt;&lt;del&gt;LU-13766&lt;/del&gt;&lt;/a&gt; obdclass: add grant fields to export procfile&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 46411bc3510234fa7608821f8147244ccecca1d0&lt;/p&gt;</comment>
                            <comment id="274830" author="ofaaland" created="Thu, 9 Jul 2020 01:55:04 +0000"  >&lt;p&gt;The patch is for troubleshooting on my systems.  If you think it would be a useful change generally, I&apos;ll rebase it on master and push it for test and review.&lt;/p&gt;</comment>
                            <comment id="274887" author="pjones" created="Thu, 9 Jul 2020 15:06:06 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;What do you think about this proposal?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="274965" author="tappro" created="Fri, 10 Jul 2020 13:04:40 +0000"  >&lt;p&gt;Usually the server gets grant info from clients when they reconnect, so it is synced once the clients are connected. In our case it looks like, before the reboot, the server gave more grant to clients than it has, so after the reboot each client reports its own grant and the sum of them leaves no grant remaining on the server. I tend to think this can be caused by the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12687&quot; title=&quot;Fast ENOSPC on direct I/O&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12687&quot;&gt;&lt;del&gt;LU-12687&lt;/del&gt;&lt;/a&gt; issue, which is about exactly this behavior: clients may have more grant than the server can handle. &lt;/p&gt;</comment>
                            <comment id="275316" author="ofaaland" created="Tue, 14 Jul 2020 05:23:02 +0000"  >&lt;p&gt;One of the connected clients, rzgenie28, had what appears to be a corrupt (maybe underflowed?) value of cur_grant_bytes for OST0000; that node is also involved in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13763&quot; title=&quot;ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13763&quot;&gt;&lt;del&gt;LU-13763&lt;/del&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rzgenie28:~]# lctl get_param -n osc.*OST0000*.cur_grant_bytes
18446744073707847680
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
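
&lt;p&gt;One way to read this value, as a hedged sketch: if cur_grant_bytes is a counter that went negative and was then printed as an unsigned 64-bit number (an assumption, not something confirmed in this ticket), reinterpreting it as a signed value gives the size of the underflow. The helper name as_signed64 is illustrative.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return int.from_bytes(value.to_bytes(8, "little"), "little", signed=True)


# Value reported by lctl get_param on rzgenie28 for OST0000.
cur_grant_bytes = 18446744073707847680

signed_value = as_signed64(cur_grant_bytes)
print(signed_value)
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Under that assumption the value is -1703936, the same magnitude as the 1703936 bytes of grant claimed in the tgt_grant_check() messages.&lt;/p&gt;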

&lt;p&gt;Total grant reported by OST0000:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@brass5:~]# lctl get_param obdfilter.*OST0000.tot_granted
obdfilter.lsrza-OST0000.tot_granted=1551928458425
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Total cur_grant_bytes summed over all clients without rzgenie28, the outlier:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Brass&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Clients - rzgenie28&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Clients/Brass&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1549998554295&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;18446745561354235904 -&#160;18446744073707847680 =&#160;&lt;br/&gt;
 1487646388224&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;clients/brass = .959772&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
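
&lt;p&gt;The figures in the table can be rechecked with a short sketch like the following. The numbers are copied from the table; the variable names are illustrative.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
# "Brass" column: total grant as seen by the server for OST0000.
brass_total = 1549998554295

# Sum of cur_grant_bytes over all clients, including the outlier rzgenie28.
clients_sum_with_outlier = 18446745561354235904

# cur_grant_bytes reported by rzgenie28.
outlier = 18446744073707847680

# Client-side total with the outlier removed.
clients_sum = clients_sum_with_outlier - outlier

# Fraction of the server-side total accounted for by the remaining clients.
ratio = clients_sum / brass_total

print(f"clients - rzgenie28: {clients_sum}")
print(f"clients/brass: {ratio:.6f}")
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;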
</comment>
                            <comment id="275406" author="ofaaland" created="Wed, 15 Jul 2020 00:53:25 +0000"  >&lt;p&gt;I confirmed that I can reproduce &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12687&quot; title=&quot;Fast ENOSPC on direct I/O&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12687&quot;&gt;&lt;del&gt;LU-12687&lt;/del&gt;&lt;/a&gt; under Lustre 2.12.15 (probably no surprise to you).&lt;/p&gt;</comment>
                            <comment id="275462" author="tappro" created="Wed, 15 Jul 2020 09:12:48 +0000"  >&lt;p&gt;Olaf, I&apos;ve ported it to b2_12 if needed: &lt;a href=&quot;https://review.whamcloud.com/39386&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39386&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="275484" author="ofaaland" created="Wed, 15 Jul 2020 15:26:16 +0000"  >&lt;p&gt;Thanks, Mikhail.  I ported that grant patch for direct io also, and in my local test (using FSTYPE=zfs llmount.sh, and dd oflag=direct) it did not work.  Unfortunately, I just got that far yesterday before I had to stop, so I don&apos;t know yet why.  Our backports look the same to me.  Did you test it successfully, or are you waiting for auto testing results for that?&lt;/p&gt;</comment>
                            <comment id="275486" author="tappro" created="Wed, 15 Jul 2020 15:39:23 +0000"  >&lt;p&gt;I ran the new patch tests 64e/f locally and they are working; let&apos;s see the Maloo test results.&lt;/p&gt;</comment>
                            <comment id="275487" author="ofaaland" created="Wed, 15 Jul 2020 15:42:21 +0000"  >&lt;p&gt;OK, good.  Thanks.&lt;/p&gt;</comment>
                            <comment id="276583" author="vsaveliev" created="Mon, 3 Aug 2020 09:18:40 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0010: cli ac60c141-9de9-1a2e-5d0d-fd1e525ff506 claims 1703936 GRANT, real grant 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Were the OSTs mentioned in such messages, by any chance, running out of space at that time?&lt;/p&gt;</comment>
                            <comment id="276614" author="ofaaland" created="Mon, 3 Aug 2020 18:30:37 +0000"  >&lt;p&gt;Vladimir,&lt;br/&gt;
No, those OSTs had &amp;gt;350T free each.&lt;/p&gt;</comment>
                            <comment id="278091" author="ofaaland" created="Wed, 26 Aug 2020 00:19:58 +0000"  >&lt;p&gt;Hi Mike,&lt;/p&gt;

&lt;p&gt;Are you able to look at the test failures on &lt;a href=&quot;https://review.whamcloud.com/#/c/39386/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/39386/&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="278101" author="tappro" created="Wed, 26 Aug 2020 07:14:36 +0000"  >&lt;p&gt;Olaf, I am working on that right now. It seems that just taking one patch from master was not enough; some other related changes are needed.&lt;/p&gt;</comment>
                            <comment id="278353" author="tappro" created="Sun, 30 Aug 2020 22:24:25 +0000"  >&lt;p&gt;Olaf, I&apos;ve found the reason for the failures; the patch should work now.&lt;/p&gt;</comment>
                            <comment id="278406" author="ofaaland" created="Mon, 31 Aug 2020 18:36:16 +0000"  >&lt;p&gt;Thanks Mike&lt;/p&gt;</comment>
                            <comment id="280032" author="pjones" created="Sat, 19 Sep 2020 12:54:44 +0000"  >&lt;p&gt;The port has landed to b2_12 but I&apos;m holding off closing out the ticket because we&apos;ve been seeing some test failures that coincide with this landing that we should investigate.&lt;/p&gt;</comment>
                            <comment id="281340" author="pjones" created="Fri, 2 Oct 2020 14:29:31 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt; Mike has identified that &lt;a href=&quot;https://review.whamcloud.com/#/c/39518/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/39518/&lt;/a&gt; is the missing fix needed. We&apos;ll be including it in 2.12.6, and you could pick it up earlier if you wish to use the previously mentioned fixes sooner.&lt;/p&gt;</comment>
                            <comment id="281403" author="ofaaland" created="Sat, 3 Oct 2020 16:46:48 +0000"  >&lt;p&gt;Thanks Peter, Thanks Mike&lt;/p&gt;</comment>
                            <comment id="282479" author="pjones" created="Fri, 16 Oct 2020 22:06:32 +0000"  >&lt;p&gt;I believe that everything is landed for this now.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="56732">LU-12687</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="61579">LU-14125</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="58790">LU-13457</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i014qn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>