<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:16:41 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1444]  @@@ processing error (-16)</title>
                <link>https://jira.whamcloud.com/browse/LU-1444</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have a large cluster installation where the /home directories are Lustre mounts.  Occasionally (roughly once a week or so), we see the OSS lock up - this usually manifests itself as a hang doing &apos;df -h&apos; or the like.&lt;/p&gt;

&lt;p&gt;Before the crash, this is what we see on the OSS:&lt;/p&gt;

&lt;p&gt;    May 24 21:29:11 oss0 kernel: LustreError: 5460:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810311398c00 x1396618176669269/t0 o8-&amp;gt;2bf3e0b9-b782-cd24-0e06-8141a03901ff@NET_0x500000a370002_UUID:0/0 lens 368/264 e 0 to 0 dl 1337909450 ref 1 fl Interpret:/0/0 rc -16/0&lt;br/&gt;
    May 24 21:29:11 oss0 kernel: Lustre: 5475:0:(service.c:1434:ptlrpc_server_handle_request()) @@@ Request x1396618176668701 took longer than estimated (756+139s); client may timeout.  req@ffff810322b3a000 x1396618176668701/t0 o101-&amp;gt;2bf3e0b9-b782-cd24-0e06-8141a03901ff@NET_0x500000a370002_UUID:0/0 lens 296/352 e 1 to 0 dl 1337909211 ref 1 fl Complete:/0/0 rc 0/0&lt;br/&gt;
    May 24 21:29:11 oss0 kernel: Lustre: lustre-OST0004: slow parent lock 892s due to heavy IO load&lt;br/&gt;
    May 24 21:29:11 oss0 kernel: Lustre: Skipped 2 previous similar messages&lt;br/&gt;
    May 24 21:29:11 oss0 kernel: Lustre: Service thread pid 5475 completed after 895.08s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).&lt;br/&gt;
    May 24 21:29:11 oss0 kernel: Lustre: lustre-OST0004: slow preprw_write setup 892s due to heavy IO load&lt;/p&gt;


&lt;p&gt;The -16 error seems to show up every time this happens.  The OSS itself is using an Adaptec 6805.  The arrays all show as optimal.&lt;/p&gt;

&lt;p&gt;The only fix we&apos;ve found thus far is to reboot the OSS, mount the OSS volumes (we don&apos;t have them set to automount on boot), and then wait about 10 minutes for Lustre to recover.  In the past, we&apos;ve been able to pinpoint the problematic OSS (we have 2) by looking for unusual load via &apos;uptime&apos;, and then confirming the -16 errors in /var/log/messages.&lt;/p&gt;

&lt;p&gt;Any assistance would be extremely helpful.  Please let me know if you need any further information.  Thanks.  &lt;/p&gt;</description>
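                <!--
                  A minimal shell sketch of the triage flow described above: compare load with
                  'uptime', then confirm the -16 (EBUSY) processing errors in the syslog. oss0
                  appears in the log excerpt; the second hostname is hypothetical.

                      # Compare load on the two OSS nodes; the overloaded one is usually the culprit.
                      for host in oss0 oss1; do ssh "$host" uptime; done

                      # Confirm the -16 processing errors on the suspect OSS.
                      ssh oss0 "grep 'processing error (-16)' /var/log/messages | tail -n 5"
                -->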
                <environment>CentOS 5.8; kernel 2.6.18-238.12.1.el5_lustre.g266a955</environment>
        <key id="14614">LU-1444</key>
            <summary> @@@ processing error (-16)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="acontois">adam contois</reporter>
                        <labels>
                    </labels>
                <created>Tue, 29 May 2012 19:02:59 +0000</created>
                <updated>Tue, 19 Jun 2012 02:03:42 +0000</updated>
                            <resolved>Tue, 19 Jun 2012 02:03:42 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="39561" author="pjones" created="Wed, 30 May 2012 01:57:37 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Could you please comment on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="39568" author="niu" created="Wed, 30 May 2012 08:01:54 +0000"  >&lt;p&gt;Hi, adam&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before the crash, this is what we see on the OSS&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Is the OSS locked up or crashed? Is there any other message dumped on crash?&lt;/p&gt;

&lt;p&gt;There are quite a few tickets for slow IO issues: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15&quot; title=&quot;strange slow IO messages and bad performance &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15&quot;&gt;LU-15&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;LU-874&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-410&quot; title=&quot;Performance concern with Shrink file_max_cache_size to alleviate the memory pressure of OST patch for LU-15 &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-410&quot;&gt;LU-410&lt;/a&gt;. Disabling the read cache &amp;amp; writethrough cache on the OSS could alleviate the situation. Are the read cache &amp;amp; writethrough cache on the OSS disabled? (You can verify/change them in /proc/fs/$FSNAME/obdfilter/$OSTNAME/read_cache_enable and /proc/fs/$FSNAME/obdfilter/$OSTNAME/writethrough_cache_enable.)&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
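                <!--
                  A minimal sketch of checking the cache settings referenced above, assuming the
                  Lustre 1.8 proc layout /proc/fs/lustre/obdfilter/<OST name>/...; the OST name
                  lustre-OST0004 is taken from the log excerpt.

                      # Show the read cache and writethrough cache state for every OST on this
                      # OSS (1 = enabled, 0 = disabled).
                      lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable

                      # Equivalent direct reads of the proc files for one OST:
                      cat /proc/fs/lustre/obdfilter/lustre-OST0004/read_cache_enable
                      cat /proc/fs/lustre/obdfilter/lustre-OST0004/writethrough_cache_enable
                -->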
                            <comment id="39687" author="acontois" created="Wed, 30 May 2012 20:59:17 +0000"  >&lt;p&gt;Hello Niu,&lt;/p&gt;

&lt;p&gt;I&apos;m checking on what exactly happens on the OSS, so I&apos;ll get back to you on that.&lt;/p&gt;

&lt;p&gt;read_cache_enable and writethrough_cache_enable both return true.&lt;/p&gt;</comment>
                            <comment id="39688" author="acontois" created="Wed, 30 May 2012 21:05:15 +0000"  >&lt;p&gt;Hi again Niu,&lt;/p&gt;

&lt;p&gt;When this has occurred, only Lustre becomes unavailable (filesystem commands hang) and the load on the OSS systems goes up.&lt;/p&gt;

&lt;p&gt;We did have one occurrence of the OSS being unreachable, but I&apos;m not sure if it crashed or was under high load and unresponsive - we had to reboot it quickly to get users up and running again.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="39692" author="niu" created="Wed, 30 May 2012 22:00:12 +0000"  >&lt;p&gt;Thanks, Adam. Could you try to disable the read only cache &amp;amp; write through cache on OSS to see if it can still be reproduced? Is the quota enabled? (I want to know if it&apos;s &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-952&quot; title=&quot;Hung thread with HIGH OSS load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-952&quot;&gt;&lt;del&gt;LU-952&lt;/del&gt;&lt;/a&gt;)&lt;/p&gt;</comment>
                            <comment id="39756" author="acontois" created="Thu, 31 May 2012 15:22:40 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;We would be happy to disable the read and writethrough cache; however, this is a production system with important data, so we need specific instructions.  Do we just echo 0 to those /proc files on the OSSes only?  Will this take effect immediately, or does it require a reboot?  Furthermore, what are the potential negative ramifications of doing this?  And (last question), can we try this on a single OSS, or would they all need to be changed?  Thanks!&lt;/p&gt;</comment>
                            <comment id="39785" author="niu" created="Fri, 1 Jun 2012 00:17:17 +0000"  >&lt;p&gt;You can either disable it by config log or write proc file directly, both take effect immediately, no reboot required.&lt;/p&gt;

&lt;p&gt;Run the following commands on the MGS to disable them permanently for a given OST:&lt;/p&gt;

&lt;p&gt;lctl conf_param $OSTNAME.ost.read_cache_enable=0&lt;br/&gt;
lctl conf_param $OSTNAME.ost.writethrough_cache_enable=0&lt;/p&gt;

&lt;p&gt;You can try this on only the problematic OSS, and there should be no visible impact on the customer application.&lt;/p&gt;
</comment>
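                <!--
                  A minimal sketch of the non-persistent alternative mentioned above (writing the
                  proc files on the OSS directly instead of conf_param on the MGS); it takes effect
                  immediately but does not survive a server restart. The OST name is hypothetical;
                  repeat for each OST served by the node.

                      echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0004/read_cache_enable
                      echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0004/writethrough_cache_enable

                      # Or, equivalently, with lctl:
                      lctl set_param obdfilter.lustre-OST0004.read_cache_enable=0
                      lctl set_param obdfilter.lustre-OST0004.writethrough_cache_enable=0
                -->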
                            <comment id="40762" author="acontois" created="Mon, 18 Jun 2012 11:44:22 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;We have disabled the read and writethrough cache.  So far, we haven&apos;t had further issues.  You can close out the ticket, and I&apos;ll go ahead and reopen it if necessary.  Thanks for your help.&lt;/p&gt;</comment>
                            <comment id="40828" author="niu" created="Tue, 19 Jun 2012 02:03:42 +0000"  >&lt;p&gt;Thanks, adam.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvgx3:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6390</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>