<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:06:56 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-428] Lustre: 16290:0:(quota_interface.c:460:quota_chk_acq_common()) still haven&apos;t managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)</title>
                <link>https://jira.whamcloud.com/browse/LU-428</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&apos;ve deployed a new filesystem recently and enabled quotas.  We&apos;ve gotten over 1200 of these messages in the couple of weeks we&apos;ve been in production:&lt;/p&gt;

&lt;p&gt;Lustre: 16290:0:(quota_interface.c:460:quota_chk_acq_common()) still haven&apos;t managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)&lt;/p&gt;

&lt;p&gt;Some days we get none, or very few, and some days we might get 50-100.  The MDS has very little load on it.  We&apos;re not aware of an operational problem associated with the above messages - no one has complained to us about I/O or quota problems.  But we&apos;d like to solve whatever issue is causing these messages.&lt;/p&gt;

&lt;p&gt;One strange thing is that when we get one of the above messages, it is always on the 10th retry, and err is always zero and rc is always zero in that case - it seems funny to me that the 10th call to acquire() is &lt;b&gt;always&lt;/b&gt; successful even if it failed 9 times in a row prior to this.&lt;/p&gt;</description>
                <environment>x86_64, CentOS5, 2.6.18-194.17.1.el5_lustre.1.8.5, OFED 1.5.2, 4 OSS nodes, 4 8TB OSTs/OSS, 700 clients (some o2ib, some tcp)&lt;br/&gt;
</environment>
        <key id="11188">LU-428</key>
            <summary>Lustre: 16290:0:(quota_interface.c:460:quota_chk_acq_common()) still haven&apos;t managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="prescott@hpc.ufl.edu">Craig Prescott</reporter>
                        <labels>
                    </labels>
                <created>Fri, 17 Jun 2011 12:05:31 +0000</created>
                <updated>Sun, 14 Aug 2011 21:53:57 +0000</updated>
                            <resolved>Sun, 14 Aug 2011 21:53:56 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="16546" author="pjones" created="Fri, 17 Jun 2011 12:34:08 +0000"  >&lt;p&gt;Craig&lt;/p&gt;

&lt;p&gt;Which Lustre version are you running and are any patches applied?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16548" author="prescott@hpc.ufl.edu" created="Fri, 17 Jun 2011 12:52:48 +0000"  >&lt;p&gt;Hi Peter - We are running Lustre 1.8.5 on the filesystem I opened the bug about.  We built Lustre against OFED 1.5.2 and the 2.6.18-194.17.1.el5 kernel.  It has a few patches applied from SCST 2.0.0.1.&lt;/p&gt;

&lt;p&gt;We enabled quotas on a Lustre 2.0-based filesystem we have as well - it is logging the same messages.  The base kernel version used for this filesystem is 2.6.18-164.11.1.el5.  These servers don&apos;t have the SCST patches mentioned above.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Craig&lt;/p&gt;</comment>
                            <comment id="16573" author="pjones" created="Mon, 20 Jun 2011 02:22:46 +0000"  >&lt;p&gt;This does sound strange and I have not heard of any problems of this nature elsewhere. Niu, do you have something to suggest here?&lt;/p&gt;</comment>
                            <comment id="16629" author="niu" created="Mon, 20 Jun 2011 10:15:29 +0000"  >&lt;p&gt;Hi, Craig&lt;/p&gt;

&lt;p&gt;quota_chk_acq_common() only prints the warning message when (retry_count % 10 == 0); that&apos;s why you always see it on the 10th retry. And &quot;rc/err == 0&quot; doesn&apos;t indicate a successful quota acquire; I think the acquire() in quota_chk_acq_common() must be returning zero for some reason (a successful acquire() should return 1).&lt;/p&gt;

&lt;p&gt;When do the messages show up? On server start, or during some special operation (a quota check, for instance)? Does &apos;lfs quota&apos; work properly on your system? I suspect quota on the server isn&apos;t initialized correctly.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Niu&lt;/p&gt;</comment>
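<!--
A minimal shell sketch of the logging pattern Niu describes: the warning fires
only when the retry counter is a multiple of 10, so a message never appears for
retries 1 through 9. This is an illustration, not the actual quota_interface.c
logic; try_acquire and tries_needed are hypothetical stand-ins (note that shell
success is exit status 0, whereas the C acquire() returns 1 on success):

    tries_needed=25                # pretend the quota master grants space on the 25th try
    try_acquire() { [ "$retry" -ge "$tries_needed" ]; }   # hypothetical stand-in for acquire()
    retry=0
    while ! try_acquire; do
        retry=$((retry + 1))
        if [ $((retry % 10)) -eq 0 ]; then
            # only every 10th retry logs, hence the syslog line always reads "after 10 retries"
            echo "still haven't managed to acquire quota space after ${retry} retries"
        fi
    done

Running this prints the warning at retries 10 and 20 and then succeeds quietly,
matching the behavior Craig observed.
-->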
                            <comment id="16633" author="prescott@hpc.ufl.edu" created="Mon, 20 Jun 2011 10:56:57 +0000"  >&lt;p&gt;Thanks, Niu.  Sorry, I assumed a return value of zero from acquire() meant success.&lt;/p&gt;

&lt;p&gt;We set block hard and soft limits for each user.  We don&apos;t have any inode-based quotas.  &apos;lfs quota&apos; works properly, and the block-hardlimit quota works.  When the soft limit is violated, we see an asterisk in the &apos;lfs quota&apos; output.  &lt;/p&gt;

&lt;p&gt;The quota_chk_acq_common() messages on the OSS nodes aren&apos;t triggered by an &apos;lfs quota&apos; command - they just happen periodically during the day.  There are no corresponding messages on the MDS.&lt;/p&gt;

&lt;p&gt;Is there a way to enable additional logging to see more information about the source of these messages?&lt;/p&gt;</comment>
                            <comment id="16648" author="niu" created="Mon, 20 Jun 2011 22:59:00 +0000"  >&lt;p&gt;Hi, Craig&lt;/p&gt;

&lt;p&gt;I think you can collect more information in the following way:&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;enable D_TRACE &amp;amp; D_QUOTA on the OSS by &apos;echo +trace &amp;gt; /proc/sys/lnet/debug&apos; &amp;amp; &apos;echo +quota &amp;gt; /proc/sys/lnet/debug&apos;&lt;/li&gt;
	&lt;li&gt;clear current debug log buffer on the OSS by &apos;lctl clear&apos;&lt;/li&gt;
	&lt;li&gt;start debug daemon on the OSS by &apos;lctl debug_daemon start debuglog 100&apos;&lt;/li&gt;
	&lt;li&gt;trigger the messages by some operation, for instance, writing to a file to exceed the user or group quota limit (on client).&lt;/li&gt;
	&lt;li&gt;stop the debug daemon on the OSS by &apos;lctl debug_daemon stop&apos;&lt;/li&gt;
	&lt;li&gt;convert the binary debug log to ascii file by &apos;lctl debug_file debuglog debugfile&apos;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Hopefully we can find the source of these messages in the &apos;debugfile&apos;.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Niu&lt;/p&gt;</comment>
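<!--
Niu's collection steps, consolidated into a shell sequence for convenience; the
log name 'debuglog' and the 100 MB size follow his example. Run on the OSS:

    echo +trace > /proc/sys/lnet/debug    # enable D_TRACE
    echo +quota > /proc/sys/lnet/debug    # enable D_QUOTA
    lctl clear                            # clear the current debug buffer
    lctl debug_daemon start debuglog 100  # start dumping the buffer to 'debuglog'
    # ... trigger the messages, e.g. write past a user/group quota limit on a client ...
    lctl debug_daemon stop                # stop the debug daemon
    lctl debug_file debuglog debugfile    # convert the binary log to ASCII 'debugfile'
-->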
                            <comment id="17120" author="prescott@hpc.ufl.edu" created="Tue, 28 Jun 2011 15:59:02 +0000"  >&lt;p&gt;Our quota_chk_acq_common() messages seem to have stopped for the time being - none since Saturday.  &lt;/p&gt;

&lt;p&gt;Before they stopped, I did some tracing with the Lustre debug_daemon, as you suggested.  Since we can&apos;t trigger the problem, I&apos;d have to wait for the quota_chk_acq_common() messages to happen &quot;naturally&quot;.  I turned on the debug_daemon to start tracing, and when I saw the quota_chk_acq_common() message in syslog, I stopped the debug_daemon and looked at the decoded debug file.  A 10GB debug file captured anywhere from 15 seconds to a couple of minutes of output.&lt;/p&gt;

&lt;p&gt;Is there a timing issue between when something is logged in the syslog and when the corresponding debug info gets logged by the Lustre debug daemon?  Even though I&apos;d sit and watch the syslog, I never saw a trace message that corresponded to the syslog &quot;10 retries&quot; messages.  But I did see traces that had several retries.&lt;/p&gt;

&lt;p&gt;I could not really learn anything new by looking at the debug_daemon output, though.  Is there something you can suggest I should look for?&lt;/p&gt;</comment>
                            <comment id="17155" author="niu" created="Wed, 29 Jun 2011 00:15:58 +0000"  >&lt;p&gt;Hi, Craig&lt;br/&gt;
What file size did you specify to &apos;lctl debug_daemon&apos;? I think it might be because there are too many messages with D_TRACE turned on, so the useful messages had already been overwritten by the time the log was processed.&lt;/p&gt;

&lt;p&gt;Given that there isn&apos;t any way to trigger the messages, I suggest you turn on D_QUOTA only this time. Thank you.&lt;/p&gt;</comment>
                            <comment id="17160" author="prescott@hpc.ufl.edu" created="Wed, 29 Jun 2011 10:28:21 +0000"  >&lt;p&gt;I used 10000 (10GB) for the size - i.e., &apos;lctl debug_daemon start /var/tmp/debug.log 10000&apos;.  Using smaller sizes, I found that the amount of time that would be logged just wasn&apos;t very much (probably due to the D_TRACE producing so much output).  I&apos;ll turn off D_TRACE and wait for the messages to reappear.&lt;/p&gt;</comment>
                            <comment id="19071" author="prescott@hpc.ufl.edu" created="Thu, 11 Aug 2011 10:24:43 +0000"  >&lt;p&gt;I think the sync_acq_req times in the OST stats are probably the relevant timing stats for this issue - is that correct?  I&apos;ve been watching the average sync_acq_req times, and they have remained constant over the last 6 weeks or so - ~40ms on average, though one OST is averaging a 20ms sync_acq_req time.  Are these times reasonable?&lt;/p&gt;

&lt;p&gt;In any case, since we can&apos;t seem to trigger these messages, they don&apos;t happen (much) anymore, and they don&apos;t seem to cause any problems anyway - I think we can just close this report.&lt;/p&gt;</comment>
                            <comment id="19220" author="niu" created="Sun, 14 Aug 2011 21:52:13 +0000"  >&lt;p&gt;Yes, the sync_acq_req is the stats for this issue, and the 40ms/20ms looks reasonable to me.&lt;/p&gt;</comment>
                            <comment id="19221" author="niu" created="Sun, 14 Aug 2011 21:53:57 +0000"  >&lt;p&gt;Close it as per customer&apos;s suggestion.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw0dr:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10141</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>