<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:13:21 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1085] ASSERTION(cfs_atomic_read(&amp;exp-&gt;exp_refcount) == 0) failed</title>
                <link>https://jira.whamcloud.com/browse/LU-1085</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery.  This is on our classified Lustre cluster which was updated from 1.8 to 2.1 on Tuesday.  The summary is one observed symptom.  We have also seen these assertions appearing together.&lt;/p&gt;

&lt;p&gt;ASSERTION(cfs_list_empty(&amp;amp;imp-&amp;gt;imp_zombie_chain)) failed&lt;br/&gt;
ASSERTION(cfs_atomic_read(&amp;amp;exp-&amp;gt;exp_refcount) == 0) failed&lt;/p&gt;

&lt;p&gt;We don&apos;t have backtraces for the assertions because STONITH kicked in before the crash dump completed.&lt;/p&gt;

&lt;p&gt;Other OSS nodes are crashing in kernel string handling functions with stacks like&lt;/p&gt;

&lt;p&gt;machine_kexec&lt;br/&gt;
crash_kexec&lt;br/&gt;
oops_end&lt;br/&gt;
die&lt;br/&gt;
do_general_protection&lt;br/&gt;
general_protection&lt;br/&gt;
(exception RIP: strlen+9)&lt;br/&gt;
strlen&lt;br/&gt;
string&lt;br/&gt;
vsnprintf&lt;br/&gt;
libcfs_debug_vmsg2&lt;br/&gt;
_debug_req&lt;br/&gt;
target_send_replay_msg&lt;br/&gt;
target_send_reply&lt;br/&gt;
ost_handle&lt;br/&gt;
ptlrpc_main&lt;/p&gt;

&lt;p&gt;So it appears we are passing a bad value in a debug message.  &lt;/p&gt;

&lt;p&gt;Another stack trace:&lt;/p&gt;

&lt;p&gt;BUG: unable to handle kernel NULL ptr dereference at 000...38&lt;br/&gt;
IP: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;fffffffa0a8706&amp;gt;&amp;#93;&lt;/span&gt; filter_export_stats_init+0x1f1/0x500 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdfilter&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;machine_kexec&lt;br/&gt;
crash_kexec&lt;br/&gt;
oops_end&lt;br/&gt;
no_context&lt;br/&gt;
__bad_area_nosemaphore&lt;br/&gt;
bad_area_nosemaphore&lt;br/&gt;
__do_page_fault&lt;br/&gt;
do_page_fault&lt;br/&gt;
page_fault&lt;br/&gt;
filter_reconnect&lt;br/&gt;
target_handle_connect&lt;br/&gt;
ost_handle&lt;br/&gt;
ptlrpc_main&lt;/p&gt;

&lt;p&gt;We have multiple symptoms here that may or may not be due to the same bug.  We may need to open a separate issue to track the root cause. Note that our branch contains &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt; patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there.  Also note we can&apos;t send debug data from this system.&lt;/p&gt;</description>
                <environment>RHEL 6.2&lt;br/&gt;
Our branch: &lt;a href=&quot;https://github.com/chaos/lustre/commits/2.1.0-llnl&quot;&gt;https://github.com/chaos/lustre/commits/2.1.0-llnl&lt;/a&gt;</environment>
        <key id="13144">LU-1085</key>
            <summary>ASSERTION(cfs_atomic_read(&amp;exp-&gt;exp_refcount) == 0) failed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="nedbass">Ned Bass</reporter>
                        <labels>
                            <label>paj</label>
                    </labels>
                <created>Thu, 9 Feb 2012 15:43:30 +0000</created>
                <updated>Mon, 30 Apr 2012 11:56:20 +0000</updated>
                            <resolved>Mon, 30 Apr 2012 11:56:20 +0000</resolved>
                                    <version>Lustre 2.1.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="28297" author="green" created="Thu, 9 Feb 2012 16:01:41 +0000"  >&lt;p&gt;Hm, I think we had some fixes in the zombie thread area.&lt;/p&gt;

&lt;p&gt;Can you please see if this helps you by any chance: &lt;a href=&quot;http://review.whamcloud.com/#change,1896,patchset=2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,1896,patchset=2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the export stats init I don&apos;t think I have seen before, I think most of that stuff was nailed in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-106&quot; title=&quot;unable to handle kernel paging request in lprocfs_stats_collect()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-106&quot;&gt;&lt;del&gt;LU-106&lt;/del&gt;&lt;/a&gt;, see patches here: &lt;a href=&quot;http://review.whamcloud.com/#change,326&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,326&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="28307" author="nedbass" created="Thu, 9 Feb 2012 17:10:36 +0000"  >&lt;p&gt;Hi Oleg,&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Can you please see if this helps you by any chance: &lt;a href=&quot;http://review.whamcloud.com/#change,1896,patchset=2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,1896,patchset=2&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Both nodes that ASSERTed had been up for a few days, so the mount thread had long since completed.  Unless there could be some latent corruption from that race, it seems like that patch won&apos;t help.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;I think most of that stuff was nailed in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-106&quot; title=&quot;unable to handle kernel paging request in lprocfs_stats_collect()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-106&quot;&gt;&lt;del&gt;LU-106&lt;/del&gt;&lt;/a&gt;, see patches here: &lt;a href=&quot;http://review.whamcloud.com/#change,326&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,326&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ah, we don&apos;t have the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-106&quot; title=&quot;unable to handle kernel paging request in lprocfs_stats_collect()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-106&quot;&gt;&lt;del&gt;LU-106&lt;/del&gt;&lt;/a&gt; patch in our branch.  We&apos;ll pull it in.&lt;/p&gt;</comment>
                            <comment id="28320" author="nedbass" created="Thu, 9 Feb 2012 20:07:22 +0000"  >&lt;p&gt;For the case where we got into trouble in _debug_req(), I did some&lt;br/&gt;
digging in crash to see what state the ptlrpc_request was in.  I dug up&lt;br/&gt;
the pointer address from the backtrace (let&apos;s call it &amp;lt;addr1&amp;gt; to save&lt;br/&gt;
typing).  Then resolving some of the strings that get passed to&lt;br/&gt;
libcfs_debug_vmsg2() from _debug_req(), I see:&lt;/p&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct ptlrpc_request.rq_import &amp;lt;addr1&amp;gt;
 rq_import = 0x0 
crash&amp;gt; struct ptlrpc_request.rq_export &amp;lt;addr1&amp;gt;
 rq_export = &amp;lt;addr2&amp;gt;
crash&amp;gt; struct obd_export.exp_connection &amp;lt;addr2&amp;gt;
 exp_connection = 0x5a5a5a5a5a5a5a5a
crash&amp;gt; struct obd_export.exp_client_uuid &amp;lt;addr2&amp;gt;
 exp_client_uuid = { 
        uuid = &quot;ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ&quot;
 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So the presence of the poison value and bogus uuid suggests&lt;br/&gt;
this export has already been destroyed.&lt;/p&gt;

&lt;p&gt;For reference, here is a snippet from _debug_req()&lt;br/&gt;
that uses these values:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2271 void _debug_req(struct ptlrpc_request *req,
2272                 struct libcfs_debug_msg_data *msgdata,
2273                 const char *fmt, ... )
2274 {       
2275         va_list args;
2276         va_start(args, fmt);
2277         libcfs_debug_vmsg2(msgdata, fmt, args,
2278                            &quot; req@%p x&quot;LPU64&quot;/t&quot;LPD64&quot;(&quot;LPD64&quot;) o%d-&amp;gt;%s@%s:%d/%d&quot;
2279                            &quot; lens %d/%d e %d to %d dl &quot;CFS_TIME_T&quot; ref %d &quot;
2280                            &quot;fl &quot;REQ_FLAGS_FMT&quot;/%x/%x rc %d/%d\n&quot;,
2281                            req, req-&amp;gt;rq_xid, req-&amp;gt;rq_transno,
2282                            req-&amp;gt;rq_reqmsg ? lustre_msg_get_transno(req-&amp;gt;rq_reqmsg) : 0,
2283                            req-&amp;gt;rq_reqmsg &amp;amp;&amp;amp; req_ptlrpc_body_swabbed(req) ?
2284                            lustre_msg_get_opc(req-&amp;gt;rq_reqmsg) : -1, 
2285                            req-&amp;gt;rq_import ? obd2cli_tgt(req-&amp;gt;rq_import-&amp;gt;imp_obd) :
2286                            req-&amp;gt;rq_export ?
2287                            (char*)req-&amp;gt;rq_export-&amp;gt;exp_client_uuid.uuid : &quot;&amp;lt;?&amp;gt;&quot;,
2288                            req-&amp;gt;rq_import ?
2289                            (char *)req-&amp;gt;rq_import-&amp;gt;imp_connection-&amp;gt;c_remote_uuid.uuid :
2290                            req-&amp;gt;rq_export ?
2291                            (char *)req-&amp;gt;rq_export-&amp;gt;exp_connection-&amp;gt;c_remote_uuid.uuid : &quot;&amp;lt;?&amp;gt;&quot;,
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="28387" author="nedbass" created="Fri, 10 Feb 2012 17:09:03 +0000"  >&lt;p&gt;I found a third incident of a panicked obd_zombid thread from yesterday.  No ASSERT message was captured in the logs for this one, but we did get a complete crash dump.  Here is the backtrace&lt;/p&gt;

&lt;p&gt;machine_kexec&lt;br/&gt;
crash_kexec&lt;br/&gt;
panic&lt;br/&gt;
lbug_with_loc&lt;br/&gt;
obd_zombie_impexp_cull&lt;br/&gt;
obd_zombie_impexp_thread&lt;br/&gt;
kernel_thread&lt;/p&gt;</comment>
                            <comment id="28389" author="nedbass" created="Fri, 10 Feb 2012 17:25:24 +0000"  >&lt;p&gt;I am going to open separate issues for the different crashes we ran into yesterday.   We can use this issue to track the obd_zombid crash.  &lt;/p&gt;

&lt;p&gt;We disabled the OSS read and writethrough caches and have not had any crashes since then. Nearly every crash was preceded by hundreds of client reconnect attempts and hundreds of log messages of the form:&lt;/p&gt;

&lt;p&gt;LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) ls5-OST0349: disconnect stale client &lt;span class=&quot;error&quot;&gt;&amp;#91;UUID&amp;#93;&lt;/span&gt;@&amp;lt;unknown&amp;gt;&lt;/p&gt;</comment>
                            <comment id="28391" author="nedbass" created="Fri, 10 Feb 2012 18:05:23 +0000"  >&lt;p&gt;Opened the following new issues:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1092&quot; title=&quot;NULL pointer dereference in filter_export_stats_init()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1092&quot;&gt;&lt;del&gt;LU-1092&lt;/del&gt;&lt;/a&gt; NULL pointer dereference in filter_export_stats_init()&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1093&quot; title=&quot;unable to handle kernel paging request in target_handle_connect()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1093&quot;&gt;&lt;del&gt;LU-1093&lt;/del&gt;&lt;/a&gt; unable to handle kernel paging request in target_handle_connect()&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1094&quot; title=&quot;general protection fault in _debug_req()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1094&quot;&gt;&lt;del&gt;LU-1094&lt;/del&gt;&lt;/a&gt; general protection fault in _debug_req()&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="28396" author="green" created="Fri, 10 Feb 2012 18:59:13 +0000"  >&lt;p&gt;For the obd_zombid backtrace, what was the line number from which lbug_with_loc was called? I know you did not get the LASSERT message in the logs, but you can run gdb on obdclass.ko and then run &quot;l *(obd_zombie_impexp_cull+0x...)&quot; (the +0x... value from the backtrace) and it will tell you the line number.&lt;br/&gt;
Hm, I just noticed that neither obd_zombie_impexp_cull nor obd_zombie_impexp_thread contains any LBUGs or LASSERTs.&lt;/p&gt;</comment>
                            <comment id="28400" author="nedbass" created="Fri, 10 Feb 2012 19:20:12 +0000"  >&lt;p&gt;It was lustre/obdclass/genops.c:728:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 711 /* Export management functions */
 712 static void class_export_destroy(struct obd_export *exp)
 713 {
 714         struct obd_device *obd = exp-&amp;gt;exp_obd;
 715         ENTRY;
 716 
 717         LASSERT_ATOMIC_ZERO(&amp;amp;exp-&amp;gt;exp_refcount);
 718 
 719         CDEBUG(D_IOCTL, &quot;destroying export %p/%s for %s\n&quot;, exp,
 720                exp-&amp;gt;exp_client_uuid.uuid, obd-&amp;gt;obd_name);
 721 
 722         LASSERT(obd != NULL);
 723 
 724         /* &quot;Local&quot; exports (lctl, LOV-&amp;gt;{mdc,osc}) have no connection. */
 725         if (exp-&amp;gt;exp_connection)
 726                 ptlrpc_put_connection_superhack(exp-&amp;gt;exp_connection);
 727 
 728         LASSERT(cfs_list_empty(&amp;amp;exp-&amp;gt;exp_outstanding_replies));
 729         LASSERT(cfs_list_empty(&amp;amp;exp-&amp;gt;exp_uncommitted_replies));
 730         LASSERT(cfs_list_empty(&amp;amp;exp-&amp;gt;exp_req_replay_queue));
 731         LASSERT(cfs_list_empty(&amp;amp;exp-&amp;gt;exp_hp_rpcs));
 732         obd_destroy_export(exp);
 733         class_decref(obd, &quot;export&quot;, exp);
 734 
 735         OBD_FREE_RCU(exp, sizeof(*exp), &amp;amp;exp-&amp;gt;exp_handle);
 736         EXIT;
 737 }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="28418" author="green" created="Fri, 10 Feb 2012 21:56:48 +0000"  >&lt;p&gt;Also I guess I must ask how was the original recovery triggered? Some unrelated crash or was this a recovery test of some sort? (purely of purposes of seeing if there is some another bug at play here).&lt;/p&gt;</comment>
                            <comment id="28503" author="nedbass" created="Mon, 13 Feb 2012 13:17:00 +0000"  >&lt;blockquote&gt;&lt;p&gt;Also I guess I must ask how was the original recovery triggered? Some unrelated crash or was this a recovery test of some sort? (purely of purposes of seeing if there is some another bug at play here).&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;This was on a production system, not a recovery test.&lt;/p&gt;

&lt;p&gt;In the case of &lt;a href=&quot;http://jira.whamcloud.com/browse/LU-1085?focusedCommentId=28387&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-28387&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;this stack trace&lt;/a&gt; the server had crashed about twenty minutes earlier with ASSERTION(cfs_atomic_read(&amp;amp;exp-&amp;gt;exp_refcount) == 0).&lt;/p&gt;

&lt;p&gt;We are still trying to understand what led to that crash, and the others in LU-&lt;span class=&quot;error&quot;&gt;&amp;#91;1092,1093,1094&amp;#93;&lt;/span&gt;, but the factors at play seem to be high server load, slow server response, and many clients dropping their connection to the server and reconnecting.  There was a four hour window in which we had 16 crashes involving 5 OSS nodes.  As mentioned above, things seemed to stabilize after we disabled the OSS read and writethrough caches.&lt;/p&gt;</comment>
                            <comment id="34478" author="morrone" created="Tue, 10 Apr 2012 18:23:58 +0000"  >&lt;p&gt;Apparently we hit this over 20 times on various OSS nodes of multiple filesystems over the weekend.  Note that last week we re-enabled the OSS read cache (write cache is still disabled).&lt;/p&gt;

&lt;p&gt;The admins tell me that this often hit while in (or perhaps shortly after) recovery.  I&apos;ll need to look into why we were in recovery in the first place.&lt;/p&gt;</comment>
                            <comment id="34482" author="green" created="Tue, 10 Apr 2012 19:41:15 +0000"  >&lt;p&gt;I wonder if that&apos;s related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1166&quot; title=&quot;recovery never finished&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1166&quot;&gt;&lt;del&gt;LU-1166&lt;/del&gt;&lt;/a&gt; to some degree.&lt;/p&gt;

&lt;p&gt;Unfortunately the current patch is not fully complete, but does not make things worse.&lt;/p&gt;

&lt;p&gt;Any other messages before that crash that you can share with us?&lt;/p&gt;</comment>
                            <comment id="34493" author="morrone" created="Tue, 10 Apr 2012 22:50:35 +0000"  >&lt;p&gt;Ignore the comment about being in recovery.  So far I don&apos;t think the logs show that.&lt;/p&gt;

&lt;p&gt;I&apos;ve looked at a few nodes, and it looks like there is some kind of client timeout/eviction and then reconnection storm going on before the assertions.&lt;/p&gt;

&lt;p&gt;It looks like generally there are tens of thousands of &quot;haven&apos;t heard from client X in 226 seconds. I think it&apos;s dead, and I am evicting it&quot; messages.  A couple of minutes later, clients begin reconnecting in droves.  There is a mix of ost &quot;connection from&quot; and ost &quot;Not available for connect&quot; messages.&lt;/p&gt;

&lt;p&gt;The &quot;haven&apos;t heard from client&quot; and the client connect messages are both interleaved in the logs, and often repeated 30,000+ times (lustre squashes them into &quot;previous similar messages&quot; lines).&lt;/p&gt;

&lt;p&gt;And then we hit one of the two assertions listed at the beginning of this bug.&lt;/p&gt;

&lt;p&gt;Note that for several of the OSS nodes that I have looked at so far, the clients all seem to be from one particular cluster, which is running 1.8.  (servers are all 2.1.0-24chaos).&lt;/p&gt;</comment>
                            <comment id="34986" author="green" created="Tue, 17 Apr 2012 22:18:04 +0000"  >&lt;p&gt;So with many more crashes, any luck in getting the backtraces for the first two assertions?&lt;br/&gt;
Any successful crash dumps?&lt;/p&gt;

&lt;p&gt;There are two imp_zombie_chain assertions; do you know which one you hit? The one in obd_zombie_import_add or the one in class_import_put?&lt;/p&gt;

&lt;p&gt;As for the second assertion, I cannot find any such assertion in the code. I checked out your tree and here&apos;s what I see:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oleg-Drokins-MacBook-Pro-2:lustre green$ grep -r exp_refcount * | grep ASSERT
lustre/obdclass/genops.c:        LASSERT_ATOMIC_ZERO(&amp;amp;exp-&amp;gt;exp_refcount);
lustre/obdclass/genops.c:        LASSERT_ATOMIC_GT_LT(&amp;amp;exp-&amp;gt;exp_refcount, 0, 0x5a5a5a);
lustre/obdecho/echo_client.c:        LASSERT(cfs_atomic_read(&amp;amp;ec-&amp;gt;ec_exp-&amp;gt;exp_refcount) &amp;gt; 0);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first one is actually defined in terms of LASSERTF, so I think it does not exactly match your output.&lt;br/&gt;
Can you elaborate a bit more on where that might have come from?&lt;/p&gt;</comment>
                            <comment id="35051" author="nedbass" created="Wed, 18 Apr 2012 17:00:35 +0000"  >&lt;blockquote&gt;&lt;p&gt;So with many more crashes, any luck in getting the backtraces for the first two assertions?  Any successful crash dumps?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I will review the logs and crash dumps and let you know.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;There are two imp_zombie_chain assertions, do you know which one of them did you hit? in obd_zombie_import_add or in class_import_put?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;We hit the one in class_import_put.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;As for the second assertion, there is no such assertion in the code? I checked out your tree and here&apos;s what I see:&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Here&apos;s the second one in our tree:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/chaos/lustre/blob/2.1.1-3chaos/lustre/obdclass/genops.c#L717&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/chaos/lustre/blob/2.1.1-3chaos/lustre/obdclass/genops.c#L717&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="35058" author="green" created="Wed, 18 Apr 2012 18:48:18 +0000"  >&lt;p&gt;the LASSERT_ATOMIC_ZERO is defined as LASSERTF internally:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#define LASSERT_ATOMIC_EQ(a, v)                                 \
&lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; {                                                            \
        LASSERTF(cfs_atomic_read(a) == v,                       \
                 &lt;span class=&quot;code-quote&quot;&gt;&quot;value: %d\n&quot;&lt;/span&gt;, cfs_atomic_read((a)));          \
} &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (0)
#define LASSERT_ATOMIC_ZERO(a)                  LASSERT_ATOMIC_EQ(a, 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What was the value reported?&lt;/p&gt;</comment>
                            <comment id="35059" author="nedbass" created="Wed, 18 Apr 2012 18:52:48 +0000"  >&lt;p&gt;Value reported has been either 1, 2, or 3.&lt;/p&gt;</comment>
                            <comment id="35063" author="nedbass" created="Wed, 18 Apr 2012 20:01:53 +0000"  >&lt;p&gt;Here&apos;s a log message and backtrace for the exp_refcount assertion.&lt;/p&gt;

&lt;p&gt;LustreError: 24253:0:(genops.c:717:class_export_destroy()) ASSERTION(cfs_atomic_read(&amp;amp;exp-&amp;gt;exp_refcount) == 0) failed: value: 1&lt;/p&gt;

&lt;p&gt;COMMAND: &quot;obd_zombid&quot;&lt;br/&gt;
#0 machine_kexec&lt;br/&gt;
#1 crash_kexec&lt;br/&gt;
#2 panic&lt;br/&gt;
#3 lbug_with_loc&lt;br/&gt;
#4 obd_zombie_impexp_cull&lt;br/&gt;
#5 obd_zombie_impexp_thread&lt;br/&gt;
#6 kernel_thread&lt;/p&gt;</comment>
                            <comment id="35064" author="nedbass" created="Wed, 18 Apr 2012 20:04:43 +0000"  >&lt;p&gt;Here&apos;s a log message and backtrace for the cfs_list_empty assertion.&lt;/p&gt;

&lt;p&gt;LustreError: 24458:0:(genops.c:931:class_import_put()) ASSERTION(cfs_list_empty(&amp;amp;imp-&amp;gt;imp_zombie_chain)) failed&lt;/p&gt;

&lt;p&gt;COMMAND: &quot;ll_ost_54&quot;&lt;br/&gt;
#0 machine_kexec&lt;br/&gt;
#1 crash_kexec&lt;br/&gt;
#2 panic&lt;br/&gt;
#3 lbug_with_loc&lt;br/&gt;
#4 libcfs_assertion_failed&lt;br/&gt;
#5 class_import_put&lt;br/&gt;
#6 client_destroy_import&lt;br/&gt;
#7 target_handle_connect&lt;br/&gt;
#8 ost_handle&lt;br/&gt;
#9 ptlrpc_main&lt;br/&gt;
#10 kernel_thread&lt;/p&gt;</comment>
                            <comment id="35092" author="nedbass" created="Thu, 19 Apr 2012 11:41:13 +0000"  >&lt;p&gt;Oleg,&lt;/p&gt;

&lt;p&gt;Do you think the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1092&quot; title=&quot;NULL pointer dereference in filter_export_stats_init()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1092&quot;&gt;&lt;del&gt;LU-1092&lt;/del&gt;&lt;/a&gt; patch will help with these assertions?  Mikhail made a comment to that effect in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1336&quot; title=&quot;OSS GPF at ptlrpc_send_reply+0x470&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1336&quot;&gt;&lt;del&gt;LU-1336&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=893cf2014a38c5bd94890d3522fafe55f024a958&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=893cf2014a38c5bd94890d3522fafe55f024a958&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="35095" author="green" created="Thu, 19 Apr 2012 11:51:08 +0000"  >&lt;p&gt;Yes, this looks related.&lt;br/&gt;
Any chance you can try it?&lt;/p&gt;</comment>
                            <comment id="35099" author="nedbass" created="Thu, 19 Apr 2012 11:57:33 +0000"  >&lt;p&gt;Yes, we&apos;ll pull the patch in to our tree, and it will eventually get rolled out to our production systems.&lt;/p&gt;</comment>
                            <comment id="35881" author="pjones" created="Mon, 30 Apr 2012 11:56:20 +0000"  >&lt;p&gt;Believed to be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1092&quot; title=&quot;NULL pointer dereference in filter_export_stats_init()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1092&quot;&gt;&lt;del&gt;LU-1092&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvhe7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6467</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>