<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:18:00 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1592] ASSERTION(cfs_atomic_read(&amp;imp-&gt;imp_refcount) == 0) failed: value: -1</title>
                <link>https://jira.whamcloud.com/browse/LU-1592</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We had three MDS/MGS crashes within an hour.&lt;br/&gt;
Service110 crashed in class_import_destroy() while service150 and service170 crashed in class_import_put(). The class_import_destroy() failed on an ASSERT because the refcount was -1. The other two cases look like ORI-710.&lt;/p&gt;

&lt;p&gt;Unfortunately, we were not able to get a vmcore. Kdump crashed.&lt;/p&gt;


&lt;p&gt;Service110:&lt;br/&gt;
Lustre: nbp4-MDT0000: Export ffff8806560aa400 already connecting from 10.151.5.8@o2ib^M&lt;br/&gt;
Lustre: nbp4-MDT0000: denying duplicate export for 81811d25-ee59-5ea0-fbaf-31ee49f5aeb7, -114^M&lt;br/&gt;
Lustre: Skipped 1 previous similar message^M&lt;br/&gt;
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.46.203@o2ib rejected: o2iblnd fatal error^M&lt;br/&gt;
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 7 previous similar messages^M&lt;br/&gt;
LustreError: 4621:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114^M&lt;br/&gt;
Lustre: nbp4-MDT0000: denying duplicate export for 000566b2-2e24-9e9d-b38c-24016bb34ecd, -114^M&lt;br/&gt;
Lustre: nbp3-MDT0000: Export ffff880756260c00 already connecting from 10.151.4.136@o2ib^M&lt;br/&gt;
Lustre: Skipped 3 previous similar messages^M&lt;br/&gt;
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.17.141@o2ib rejected: o2iblnd fatal error^M&lt;br/&gt;
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 6 previous similar messages^M&lt;br/&gt;
LustreError: 4650:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114^M&lt;br/&gt;
LustreError: 3541:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&amp;amp;imp-&amp;gt;imp_refcount) == 0) failed: value: -1^M&lt;br/&gt;
Lustre: nbp4-MDT0000: denying duplicate export for b7f2dcde-c1be-3701-2f14-fec0e7c1b513, -114^M&lt;br/&gt;
LustreError: 3541:0:(genops.c:930:class_import_destroy()) LBUG^M&lt;br/&gt;
Pid: 3541, comm: obd_zombid^M&lt;/p&gt;

&lt;p&gt;We were not able to produce a vmcore. Kdump crashed.&lt;/p&gt;

&lt;p&gt;Service150 (and also service170):&lt;br/&gt;
(The crash on both MDS looks like ORI-710)&lt;br/&gt;
Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1404442415660130 sent from nbp5-OST002c-osc-MDT0000 to NID 10.151.25.241@o2ib has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1341262190&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;real_sent 0&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;current 1341262295&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;deadline 105s&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;delay 0s&amp;#93;&lt;/span&gt;  req@ffff880549250800 x1404442415660130/t0(0) o-1-&amp;gt;nbp5-OST002c_UUID@10.151.25.241@o2ib:28/4 lens 368/512 e 0 to 1 dl 1341262295 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1^M&lt;br/&gt;
Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 599 previous similar messages^M&lt;br/&gt;
Lustre: nbp5-MDT0000: haven&apos;t heard from client b04675d7-083f-bc2b-0fa5-2863afc271db (at 10.151.41.133@o2ib) in 279 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880bfaff3800, cur 1341262407 expire 1341262257 last 1341262128^M&lt;br/&gt;
Lustre: Skipped 43832 previous similar messages^M&lt;br/&gt;
LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) 10.151.13.180@o2ib rejected: o2iblnd fatal error^M&lt;br/&gt;
LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) Skipped 3013 previous similar messages^M&lt;br/&gt;
LustreError: 3694:0:(genops.c:934:class_import_put()) ASSERTION(__v &amp;gt; 0 &amp;amp;&amp;amp; __v &amp;lt; ((int)0x5a5a5a5a5a5a5a5a)) failed: value: 0^M&lt;br/&gt;
LustreError: 12396:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 &amp;gt; 6^M&lt;br/&gt;
LustreError: 12389:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 &amp;gt; 7^M&lt;br/&gt;
LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=250 rc=-114^M&lt;br/&gt;
LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) Skipped 1 previous similar message^M&lt;br/&gt;
LustreError: 3694:0:(genops.c:934:class_import_put()) LBUG^M&lt;br/&gt;
Pid: 3694, comm: ll_mgs_01^M&lt;/p&gt;
</description>
                <environment>2.1.2 servers and clients.</environment>
        <key id="15113">LU-1592</key>
            <summary>ASSERTION(cfs_atomic_read(&amp;imp-&gt;imp_refcount) == 0) failed: value: -1</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="jaylan">Jay Lan</reporter>
                        <labels>
                    </labels>
                <created>Mon, 2 Jul 2012 18:02:30 +0000</created>
                <updated>Sat, 22 Dec 2012 10:37:52 +0000</updated>
                            <resolved>Fri, 31 Aug 2012 20:02:53 +0000</resolved>
                                    <version>Lustre 2.3.0</version>
                    <version>Lustre 2.1.2</version>
                                    <fixVersion>Lustre 2.3.0</fixVersion>
                    <fixVersion>Lustre 2.4.0</fixVersion>
                    <fixVersion>Lustre 2.1.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="41382" author="jaylan" created="Mon, 2 Jul 2012 18:23:41 +0000"  >&lt;p&gt;S110 runs 2.1.2 but s150 and s170 run 2.1.1.&lt;/p&gt;

&lt;p&gt;We actually had a vmcore on s150. Let me know what crash command output you want me to provide.&lt;/p&gt;</comment>
                            <comment id="41383" author="jaylan" created="Mon, 2 Jul 2012 18:27:49 +0000"  >&lt;p&gt;Here is the stack trace of the running process (on service150):&lt;/p&gt;

&lt;p&gt;PID: 3694   TASK: ffff880c1ecda0c0  CPU: 0   COMMAND: &quot;ll_mgs_01&quot;&lt;br/&gt;
 #0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e7836b8&amp;#93;&lt;/span&gt; machine_kexec at ffffffff8103204b&lt;br/&gt;
 #1 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783718&amp;#93;&lt;/span&gt; crash_kexec at ffffffff810b8472&lt;br/&gt;
 #2 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e7837e8&amp;#93;&lt;/span&gt; kdb_kdump_check at ffffffff8127a53f&lt;br/&gt;
 #3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e7837f8&amp;#93;&lt;/span&gt; kdb_main_loop at ffffffff8127d757&lt;br/&gt;
 #4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783908&amp;#93;&lt;/span&gt; kdb_save_running at ffffffff81277aae&lt;br/&gt;
 #5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783918&amp;#93;&lt;/span&gt; kdba_main_loop at ffffffff8144a538&lt;br/&gt;
 #6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783958&amp;#93;&lt;/span&gt; kdb at ffffffff8127aa57&lt;br/&gt;
 #7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e7839c8&amp;#93;&lt;/span&gt; panic at ffffffff81520c97&lt;br/&gt;
 #8 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783a48&amp;#93;&lt;/span&gt; lbug_with_loc at ffffffffa05e3eeb&lt;br/&gt;
 #9 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783a98&amp;#93;&lt;/span&gt; class_import_put at ffffffffa06afb48&lt;br/&gt;
#10 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783ae8&amp;#93;&lt;/span&gt; client_destroy_import at ffffffffa079e92e&lt;br/&gt;
#11 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783b08&amp;#93;&lt;/span&gt; target_handle_connect at ffffffffa07a058e&lt;br/&gt;
#12 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783cc8&amp;#93;&lt;/span&gt; mgs_handle at ffffffffa0b94cbd&lt;br/&gt;
#13 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783d88&amp;#93;&lt;/span&gt; ptlrpc_main at ffffffffa07e842e&lt;br/&gt;
#14 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff880c1e783f48&amp;#93;&lt;/span&gt; kernel_thread at ffffffff8100c14a&lt;/p&gt;</comment>
                            <comment id="41384" author="jaylan" created="Mon, 2 Jul 2012 20:46:50 +0000"  >&lt;p&gt;Service170 crashed two more times. One is like the trace above in class_import_put(). The other is another variant.&lt;/p&gt;

&lt;p&gt;So far we have seen three different variants, but all were triggered by&lt;br/&gt;
Lustre: MGS: denying duplicate export for 63509613-c870-1438-aa35-858e9606c728, -114&lt;br/&gt;
Lustre: MGS: Export ffff88079183c400 already connecting from 10.151.26.31@o2ib&lt;/p&gt;

&lt;p&gt;and eventually would crash.&lt;/p&gt;

&lt;p&gt;Here is the third variant:&lt;br/&gt;
Lustre: MGS: denying duplicate export for 63509613-c870-1438-aa35-858e9606c728, -114&lt;br/&gt;
Lustre: MGS: Export ffff88079183c400 already connecting from 10.151.26.31@o2ib&lt;br/&gt;
Lustre: Skipped 4 previous similar messages&lt;br/&gt;
LustreError: 4756:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114&lt;br/&gt;
LustreError: 4761:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114&lt;br/&gt;
Lustre: MGS: Client 63509613-c870-1438-aa35-858e9606c728 (at 10.151.26.31@o2ib) reconnecting&lt;br/&gt;
LustreError: 4490:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114&lt;br/&gt;
LustreError: 4761:0:(obd_class.h:501:obd_set_info_async()) obd_set_info_async: dev 0 no operation&lt;br/&gt;
LustreError: 4761:0:(obd_class.h:501:obd_set_info_async()) Skipped 5 previous similar messages&lt;br/&gt;
LustreError: 4761:0:(genops.c:1586:obd_zombie_import_add()) ASSERTION(imp-&amp;gt;imp_sec == NULL) failed&lt;br/&gt;
LustreError: 4761:0:(genops.c:1586:obd_zombie_import_add()) LBUG&lt;br/&gt;
Pid: 4761, comm: ll_mgs_07&lt;/p&gt;

&lt;p&gt;Call Trace:&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05e0855&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x55/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05e0e95&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x75/0xe0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Lustre: MGS: Export ffff88079183c400 already connecting from 10.151.26.31@o2ib&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa05ebda6&amp;gt;&amp;#93;&lt;/span&gt; libcfs_assertion_failed+0x66/0x70 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06b560b&amp;gt;&amp;#93;&lt;/span&gt; class_import_put+0x2cb/0x300 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07a6787&amp;gt;&amp;#93;&lt;/span&gt; target_handle_connect+0x10a7/0x3070 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0b9fdcd&amp;gt;&amp;#93;&lt;/span&gt; mgs_handle+0x3fd/0x19b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mgs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06e9f0f&amp;gt;&amp;#93;&lt;/span&gt; ? keys_fill+0x6f/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07ddce4&amp;gt;&amp;#93;&lt;/span&gt; ? lustre_msg_get_opc+0x94/0x100 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07ee7be&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_main+0xb7e/0x18f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07edc40&amp;gt;&amp;#93;&lt;/span&gt; ? ptlrpc_main+0x0/0x18f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c14a&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07edc40&amp;gt;&amp;#93;&lt;/span&gt; ? ptlrpc_main+0x0/0x18f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07edc40&amp;gt;&amp;#93;&lt;/span&gt; ? ptlrpc_main+0x0/0x18f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c140&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20&lt;/p&gt;

&lt;p&gt;Kernel panic - not syncing: LBUG&lt;br/&gt;
Pid: 4761, comm: ll_mgs_07 Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;We had a vmcore of this third variant.&lt;/p&gt;


&lt;p&gt;The git source for service170 and service150 is at:&lt;br/&gt;
&lt;a href=&quot;https://github.com/jlan/lustre-nas/commits/nas-2.1.1&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/jlan/lustre-nas/commits/nas-2.1.1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Service170&apos;s tag 2.1.1-2nasS is on this commit (June 18, 2012)&lt;br/&gt;
&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-685&quot; title=&quot;Wide busy lock in kiblnd_pool_alloc_node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-685&quot;&gt;&lt;del&gt;LU-685&lt;/del&gt;&lt;/a&gt; obdclass: lu_object reclamation is inefficient&quot;&lt;/p&gt;

&lt;p&gt;The git source for service160 is at:&lt;br/&gt;
&lt;a href=&quot;https://github.com/jlan/lustre-nas/commits/nas-2.1.2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/jlan/lustre-nas/commits/nas-2.1.2&lt;/a&gt;&lt;br/&gt;
Service160&apos;s tag 2.1.2-1nasS is on this commit (June 19, 2012)&lt;br/&gt;
&quot;NAS: Provide meanful lustre version information to procfs&quot;&lt;/p&gt;</comment>
                            <comment id="41387" author="jay" created="Tue, 3 Jul 2012 00:02:22 +0000"  >&lt;p&gt;I took a look at this issue; from the backtrace, it seems one extra refcount of exp_imp_reverse was dropped, and all of the variants you have seen are related to this.&lt;/p&gt;

&lt;p&gt;I guess this issue is due to a defect in reconnect handling; we need to investigate more.&lt;/p&gt;</comment>
                            <comment id="41388" author="pjones" created="Tue, 3 Jul 2012 01:10:34 +0000"  >&lt;p&gt;Bobijam is looking at this one&lt;/p&gt;</comment>
                            <comment id="41397" author="bobijam" created="Tue, 3 Jul 2012 05:01:01 +0000"  >&lt;p&gt;this looks like an obd cleanup and target_handle_connect race, like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1432&quot; title=&quot;Race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats() causes LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1432&quot;&gt;&lt;del&gt;LU-1432&lt;/del&gt;&lt;/a&gt;; &lt;a href=&quot;http://review.whamcloud.com/#change,3244&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,3244&lt;/a&gt; should avoid this kind of race.&lt;/p&gt;</comment>
                            <comment id="41410" author="jaylan" created="Tue, 3 Jul 2012 13:28:34 +0000"  >&lt;p&gt;Of the five mgs crashes yesterday, 3 were on S170, 1 on S110, and 1 on S150. S110 runs 2.1.2, but S170 and S150 run two slightly different versions of 2.1.1. Of the three filesystems, nobackupp1 (i.e. S17*) is the most heavily used, which probably explains why it crashed 3 times yesterday.&lt;/p&gt;

&lt;p&gt;The other very heavily used filesystem is nobackupp2, which survived yesterday. nobackupp2 runs 2.1.2 (the same version as S110). We have a planned upgrade of nobackupp1 to 2.1.2 today and will proceed as planned.&lt;/p&gt;

&lt;p&gt;I want to document that here so that we all know that S170 will upgrade to 2.1.2 today, not the same code that crashed 3 times yesterday. However, since S110 also crashed yesterday, I believe the problem still exists in 2.1.2.&lt;/p&gt;</comment>
                            <comment id="41533" author="jaylan" created="Thu, 5 Jul 2012 17:49:00 +0000"  >&lt;p&gt;I cherry-picked the patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1432&quot; title=&quot;Race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats() causes LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1432&quot;&gt;&lt;del&gt;LU-1432&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1428&quot; title=&quot;MDT servrice threads spinning in cfs_hash_for_each_relax()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1428&quot;&gt;&lt;del&gt;LU-1428&lt;/del&gt;&lt;/a&gt; and built a new set of images. The new images will take effect the next time our Lustre servers LBUG or have to reboot for whatever reason.&lt;/p&gt;</comment>
                            <comment id="41557" author="dmoreno" created="Fri, 6 Jul 2012 09:45:38 +0000"  >&lt;p&gt;At Bull we also hit this issue.&lt;/p&gt;

&lt;p&gt;We&apos;re going to wait for patch approval before installing it.&lt;/p&gt;</comment>
                            <comment id="42525" author="ian" created="Tue, 31 Jul 2012 19:04:22 +0000"  >&lt;p&gt;Also observed at LLNL on Orion&lt;/p&gt;</comment>
                            <comment id="43099" author="pjones" created="Mon, 13 Aug 2012 10:27:44 +0000"  >&lt;p&gt;Closing as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1432&quot; title=&quot;Race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats() causes LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1432&quot;&gt;&lt;del&gt;LU-1432&lt;/del&gt;&lt;/a&gt; which landed for 2.3 on July 19th. Please reopen if this issue is encountered with that code in place&lt;/p&gt;</comment>
                            <comment id="43127" author="jaylan" created="Mon, 13 Aug 2012 13:30:44 +0000"  >&lt;p&gt;Please reopen this ticket. Last Friday 2 MDS nodes in our production systems crashed on this bug. Both systems ran 2.1.2-2nasS, which contains the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1432&quot; title=&quot;Race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats() causes LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1432&quot;&gt;&lt;del&gt;LU-1432&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
&lt;a href=&quot;https://github.com/jlan/lustre-nas/commits/nas-2.1.2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/jlan/lustre-nas/commits/nas-2.1.2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The console showed the messages:&lt;/p&gt;

&lt;p&gt;LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&amp;amp;imp-&amp;gt;imp_refcount) == 0) failed: value: -1^M&lt;br/&gt;
LustreError: 3669:0:(genops.c:930:class_import_destroy()) LBUG^M&lt;br/&gt;
Pid: 3669, comm: obd_zombid^M&lt;br/&gt;
^M&lt;br/&gt;
Call Trace:^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0598855&amp;gt;&amp;#93;&lt;/span&gt; libcfs_debug_dumpstack+0x55/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0598e95&amp;gt;&amp;#93;&lt;/span&gt; lbug_with_loc+0x75/0xe0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
7 out of 8 cpus in kdb, waiting for the rest, timeout in 10 second(s)^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0663a06&amp;gt;&amp;#93;&lt;/span&gt; class_import_destroy+0x3a6/0x3b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06678ba&amp;gt;&amp;#93;&lt;/span&gt; obd_zombie_impexp_cull+0xda/0x5a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810903ac&amp;gt;&amp;#93;&lt;/span&gt; ? remove_wait_queue+0x3c/0x50^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667e85&amp;gt;&amp;#93;&lt;/span&gt; obd_zombie_impexp_thread+0x105/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8105fff0&amp;gt;&amp;#93;&lt;/span&gt; ? default_wake_function+0x0/0x20^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667d80&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c14a&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667d80&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667d80&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c140&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20^M&lt;br/&gt;
^M&lt;br/&gt;
Kernel panic - not syncing: LBUG^M&lt;br/&gt;
Pid: 3669, comm: obd_zombid Tainted: G          I----------------   2.6.32-220.4.1.el6.20120607.x86_64.lustre212 #1^M&lt;br/&gt;
Call Trace:^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81520c56&amp;gt;&amp;#93;&lt;/span&gt; ? panic+0x78/0x164^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0598eeb&amp;gt;&amp;#93;&lt;/span&gt; ? lbug_with_loc+0xcb/0xe0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0663a06&amp;gt;&amp;#93;&lt;/span&gt; ? class_import_destroy+0x3a6/0x3b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06678ba&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_cull+0xda/0x5a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810903ac&amp;gt;&amp;#93;&lt;/span&gt; ? remove_wait_queue+0x3c/0x50^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667e85&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x105/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8105fff0&amp;gt;&amp;#93;&lt;/span&gt; ? default_wake_function+0x0/0x20^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667d80&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c14a&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0xa/0x20^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667d80&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0667d80&amp;gt;&amp;#93;&lt;/span&gt; ? obd_zombie_impexp_thread+0x0/0x270 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;^M&lt;br/&gt;
 &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c140&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20^M&lt;/p&gt;
</comment>
                            <comment id="43155" author="bobijam" created="Mon, 13 Aug 2012 22:01:29 +0000"  >&lt;p&gt;From the descriptions, somewhere a class_import_put() does not match its get operation, and the two threads calling class_import_put() should be in a race condition, since at the beginning of class_import_put() there are two assertions to make sure no additional put is called after the last refcount reaches 0:&lt;/p&gt;


&lt;p&gt;        LASSERT(cfs_list_empty(&amp;amp;imp-&amp;gt;imp_zombie_chain));&lt;br/&gt;
        LASSERT_ATOMIC_GT_LT(&amp;amp;imp-&amp;gt;imp_refcount, 0, LI_POISON);&lt;/p&gt;

&lt;p&gt;still investigating...&lt;/p&gt;</comment>
                            <comment id="43237" author="bobijam" created="Wed, 15 Aug 2012 02:18:25 +0000"  >&lt;p&gt;Jay Lan,&lt;/p&gt;

&lt;p&gt;Can you grab and upload all thread stacks when this happens, so that we can see what is racing with the import destroy?&lt;/p&gt;</comment>
                            <comment id="43300" author="jaylan" created="Wed, 15 Aug 2012 19:24:19 +0000"  >&lt;p&gt;The output of &quot;bt -a&quot; command from crash.&lt;/p&gt;</comment>
                            <comment id="43315" author="bobijam" created="Thu, 16 Aug 2012 03:53:46 +0000"  >&lt;p&gt;patch tracking at &lt;a href=&quot;http://review.whamcloud.com/3684&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3684&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;patch description&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LU-1592 ldlm: protect obd_export:exp_imp_reverse&apos;s change

* Protect obd_export::exp_imp_reverse from reconnect and destroy race.
* Add an assertion in class_import_put() to catch race in the first
  place.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="44069" author="pjones" created="Fri, 31 Aug 2012 20:02:53 +0000"  >&lt;p&gt;Landed for 2.3 and 2.4&lt;/p&gt;</comment>
                            <comment id="44157" author="jaylan" created="Tue, 4 Sep 2012 15:08:29 +0000"  >&lt;p&gt;Could you please land this patch to b2_1 branch? Thanks!&lt;/p&gt;</comment>
                            <comment id="44181" author="bobijam" created="Tue, 4 Sep 2012 21:13:44 +0000"  >&lt;p&gt;b2_1 patch port tracking at  &lt;a href=&quot;http://review.whamcloud.com/3869&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3869&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11779" name="bt-a.out" size="14494" author="jaylan" created="Wed, 15 Aug 2012 19:24:19 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv5rj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4468</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>