<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:30:41 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3067] ASSERTION(!(aa-&gt;aa_oa-&gt;o_valid &amp; OBD_MD_FLHANDLE))</title>
                <link>https://jira.whamcloud.com/browse/LU-3067</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;One of our OSSs had problems writing to disk (due to a RAID card problem).&lt;/p&gt;

&lt;p&gt;Several clients hit an LBUG and haven&apos;t recovered after the OSS reboot.&lt;br/&gt;
The error is:&lt;/p&gt;

&lt;p&gt;Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE)) failed&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG&lt;/p&gt;

&lt;p&gt;I attach the associated log file, and reproduce some lines of context from /var/log/messages:&lt;/p&gt;


&lt;p&gt;Mar 29 05:57:03 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.12&lt;br/&gt;
0@tcp.&lt;br/&gt;
Mar 29 05:57:03 cn492 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1430341259304767 sent from lustre_0-OST0027-osc-ffff81021c041800 to NID 10.1.4.120@tcp 756s ago has timed out (756s prior to deadline).&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel:   req@ffff8101145e6800 x1430341259304767/t0 o3-&amp;gt;lustre_0-OST0027_UUID@10.1.4.120@tcp:6/4 lens 448/592 e 1 to 1 dl 1364537379 ref 2 fl Rpc:/2/0 rc 0/0&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 1 previous similar message&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection to service lustre_0-OST0027 via nid 10.1.4.120@tcp was lost; in progress operations using this service will wait for recovery to complete.&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.120@tcp.&lt;br/&gt;
Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE)) failed&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: Pid: 3004, comm: ptlrpcd&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: &lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: Call Trace:&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff885786a1&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x51/0x60 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff88578bda&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x7a/0xd0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff88580fc0&amp;gt;&amp;#93;&lt;/span&gt; tracefile_init+0x0/0x110 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8879c7e8&amp;gt;&amp;#93;&lt;/span&gt; brw_interpret+0x8e8/0xdb0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff886d36ac&amp;gt;&amp;#93;&lt;/span&gt; after_reply+0xcac/0xe30 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff886d4b0b&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_check_set+0x12db/0x15a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8004b396&amp;gt;&amp;#93;&lt;/span&gt; try_to_del_timer_sync+0x7f/0x88&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff887095ad&amp;gt;&amp;#93;&lt;/span&gt; ptlrpcd_check+0xdd/0x1f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8009a98c&amp;gt;&amp;#93;&lt;/span&gt; process_timeout+0x0/0x5&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff88709ef1&amp;gt;&amp;#93;&lt;/span&gt; ptlrpcd+0x1b1/0x259 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8008f3ad&amp;gt;&amp;#93;&lt;/span&gt; default_wake_function+0x0/0xe&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8005dfc1&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x11&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff88709d40&amp;gt;&amp;#93;&lt;/span&gt; ptlrpcd+0x0/0x259 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel:  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8005dfb7&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0x0/0x11&lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: &lt;br/&gt;
Mar 29 06:20:10 cn492 kernel: LustreError: dumping log to /tmp/lustre-log.1364538010.3004&lt;/p&gt;</description>
                <environment>Scientific Linux [&lt;a href=&apos;mailto:walker@fe02&apos;&gt;walker@fe02&lt;/a&gt; ~]$ uname -r&lt;br/&gt;
2.6.18-348.3.1.el5&lt;br/&gt;
&lt;br/&gt;
Patchless client:&lt;br/&gt;
&lt;br/&gt;
lustre-client-modules-1.8.9-wc1_2.6.18_348.3.1.el5&lt;br/&gt;
lustre-client-1.8.9-wc1_2.6.18_348.3.1.el5&lt;br/&gt;
&lt;br/&gt;
Servers are all running:&lt;br/&gt;
[&lt;a href=&apos;mailto:root@sn20&apos;&gt;root@sn20&lt;/a&gt; ~]# rpm -qa | grep ^lustre&lt;br/&gt;
lustre-modules-1.8.9-wc1_2.6.18_348.1.1.el5_lustre&lt;br/&gt;
lustre-1.8.9-wc1_2.6.18_348.1.1.el5_lustre&lt;br/&gt;
lustre-ldiskfs-3.1.53-wc1_2.6.18_348.1.1.el5_lustre&lt;br/&gt;
&lt;br/&gt;
</environment>
        <key id="18162">LU-3067</key>
            <summary>ASSERTION(!(aa-&gt;aa_oa-&gt;o_valid &amp; OBD_MD_FLHANDLE))</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="walker">Christopher J. Walker</reporter>
                        <labels>
                            <label>mn8</label>
                    </labels>
                <created>Fri, 29 Mar 2013 11:29:21 +0000</created>
                <updated>Sat, 2 Jul 2016 15:09:38 +0000</updated>
                            <resolved>Sat, 2 Jul 2016 15:09:38 +0000</resolved>
                                    <version>Lustre 1.8.9</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>14</watches>
                                                                            <comments>
                            <comment id="55374" author="gshilamkar" created="Wed, 3 Apr 2013 11:03:20 +0000"  >&lt;p&gt;We have seen this problem with our customer, and&lt;br/&gt;
while investigating it I stumbled upon the recent changes made to this code:&lt;br/&gt;
&lt;a href=&quot;http://jira.whamcloud.com/browse/LU-2703&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;http://jira.whamcloud.com/browse/LU-2703&lt;/a&gt;&lt;br/&gt;
That soft lockup issue was fixed in 1.8.9 by moving obd_cancel() (which cancels an ldlm lock held for brw) out of *_ap_completion. This was done because obd_cancel() is a heavy operation and should not be called while holding a spinlock (cl_loi_list_lock).&lt;/p&gt;

&lt;p&gt;However, this fix added the LASSERT:&lt;/p&gt;

&lt;p&gt;        list_for_each_entry_safe(oap, tmp, &amp;amp;aa-&amp;gt;aa_oaps, oap_rpc_item) {&lt;br/&gt;
            list_del_init(&amp;amp;oap-&amp;gt;oap_rpc_item);&lt;br/&gt;
&lt;br/&gt;
            aa-&amp;gt;aa_oa-&amp;gt;o_flags &amp;amp;= ~OBD_FL_HAVE_LOCK;&lt;br/&gt;
            osc_ap_completion(cli, aa-&amp;gt;aa_oa, oap, 1, rc);&lt;br/&gt;
            if (aa-&amp;gt;aa_oa-&amp;gt;o_flags &amp;amp; OBD_FL_HAVE_LOCK) {&lt;br/&gt;
                LASSERT(!(aa-&amp;gt;aa_oa-&amp;gt;o_valid &amp;amp; OBD_MD_FLHANDLE));&lt;br/&gt;
                LASSERT(index &amp;lt; aa-&amp;gt;aa_handle_count);&lt;br/&gt;
                aa-&amp;gt;aa_handle[index++] = aa-&amp;gt;aa_oa-&amp;gt;o_handle;&lt;br/&gt;
            }&lt;/p&gt;

&lt;p&gt;Shouldn&apos;t OBD_MD_FLHANDLE be set, since it indicates the presence of a valid lock handle?&lt;/p&gt;
</comment>
                            <comment id="55602" author="kitwestneat" created="Fri, 5 Apr 2013 13:21:50 +0000"  >&lt;p&gt;FYI the customer is Sanger. &lt;/p&gt;</comment>
                            <comment id="55617" author="walker" created="Fri, 5 Apr 2013 16:55:08 +0000"  >&lt;p&gt;The problem also occurs when the network connection to an OSS fails (e.g. today, when a colleague managed to take out both of the resilient core switches).&lt;/p&gt;
</comment>
                            <comment id="55647" author="pjones" created="Fri, 5 Apr 2013 20:42:44 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please comment?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="55711" author="hongchao.zhang" created="Mon, 8 Apr 2013 08:49:33 +0000"  >&lt;p&gt;Hi Girish,&lt;br/&gt;
Thanks for your investigation! The OBD_MD_FLHANDLE could be used on the client side (OSC)!&lt;/p&gt;

&lt;p&gt;the patch is tracked at &lt;a href=&quot;http://review.whamcloud.com/#change,5971&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,5971&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="57445" author="kitwestneat" created="Wed, 1 May 2013 16:21:10 +0000"  >&lt;p&gt;Hi, I was wondering what the status of this issue is. We are running into this bug regularly at NOAA as well. Thanks.&lt;/p&gt;</comment>
                            <comment id="64012" author="alex.ku" created="Fri, 9 Aug 2013 20:25:54 +0000"  >&lt;p&gt;We ran into the same issue on a 1.8.9 client at FNAL. The trace dump is the same. It happened after a client communication error with an OSS (the router had issues). Is it going to be fixed in 1.8.10 (or 1.8.9.1)? Thanks, Alex.&lt;/p&gt;</comment>
                            <comment id="74626" author="prescott@hpc.ufl.edu" created="Thu, 9 Jan 2014 00:36:46 +0000"  >&lt;p&gt;We occasionally run into this issue.  I see the patch set has been rebased at &lt;a href=&quot;http://review.whamcloud.com/#change,5971&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,5971&lt;/a&gt; - is it only waiting approval at this point?&lt;/p&gt;</comment>
                            <comment id="74662" author="efocht" created="Thu, 9 Jan 2014 17:18:52 +0000"  >&lt;p&gt;We also see this issue at a customer site where we deployed servers with Lustre 2.5.0; their most stable setup seems to be with 1.8.9 clients, apart from this bug!&lt;/p&gt;</comment>
                            <comment id="78208" author="ferner" created="Mon, 3 Mar 2014 13:03:50 +0000"  >&lt;p&gt;Looks like we ran into this bug as well. As far as I can tell, we hit this on a (1.8) client after it had some network issues of unknown type... Is the patch OK to use as it is?&lt;/p&gt;</comment>
                            <comment id="78220" author="walker" created="Mon, 3 Mar 2014 14:52:36 +0000"  >&lt;p&gt;We are using it in production and haven&apos;t noticed problems.&lt;/p&gt;</comment>
                            <comment id="157585" author="pjones" created="Sat, 2 Jul 2016 15:09:38 +0000"  >&lt;p&gt;No plans for further 1.8.x releases&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="22660">LU-4452</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="12448" name="lustre-log.1364538010.3004.gz" size="192134" author="walker" created="Fri, 29 Mar 2013 11:29:21 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvmnz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7466</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>