<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:18:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1663] MDS threads hang for over 725s, causing failover</title>
                <link>https://jira.whamcloud.com/browse/LU-1663</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run OK for a couple of days, and then whichever MDS is currently running will lock up and fail over to the other one. lfs1, however, is not affected, though it runs an identical setup as far as hardware goes.&lt;/p&gt;

&lt;p&gt;We have the stack traces that get logged, but not the lustre-logs, as they have been on tmpfs. We&apos;ve changed the debug_file location, so hopefully we&apos;ll get the next batch. I&apos;ll paste a sampling of the interesting call traces below, and attach the rest.&lt;/p&gt;

&lt;p&gt;Here is the root cause of the failover. The health_check times out and prints NOT HEALTHY, which causes HA to fail over:&lt;br/&gt;
Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s&lt;/p&gt;
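
&lt;p&gt;For context, a simplified sketch of what that check does (paraphrasing ptlrpc_service_health_check() in service.c; the field names and the threshold below are illustrative, not exact):&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;/* Sketch: a service is unhealthy when its oldest queued request has
 * been waiting longer than an obd_timeout-derived limit; the HA
 * scripts treat the resulting NOT HEALTHY status as a reason to
 * fail the MDS over. */
int ptlrpc_service_health_check(struct ptlrpc_service *svc)
{
        struct ptlrpc_request *oldest;
        long waited;

        if (list_empty(&amp;amp;svc-&amp;gt;srv_request_queue))
                return 0;
        oldest = list_entry(svc-&amp;gt;srv_request_queue.next,
                            struct ptlrpc_request, rq_list);
        waited = now_seconds() - oldest-&amp;gt;rq_arrival_time;  /* now_seconds() is a stand-in */
        if (waited &amp;gt; health_check_timeout)   /* the 725s above exceeded this limit */
                return -1;   /* reported as NOT HEALTHY */
        return 0;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;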

&lt;p&gt;This one makes it look like it might be quota-related:&lt;br/&gt;
Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:&lt;br/&gt;
Jul 17 17:14:04 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff887f9220&amp;gt;&amp;#93;&lt;/span&gt; :lnet:LNetPut+0x730/0x840&lt;br/&gt;
Jul 17 17:14:04 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff800649fb&amp;gt;&amp;#93;&lt;/span&gt; __down+0xc3/0xd8&lt;br/&gt;
Jul 17 17:14:04 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8008e421&amp;gt;&amp;#93;&lt;/span&gt; default_wake_function+0x0/0xe&lt;br/&gt;
Jul 17 17:14:04 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff88a29490&amp;gt;&amp;#93;&lt;/span&gt; :lquota:dqacq_handler+0x0/0xc20&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;This one looks a little like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1395&quot; title=&quot;MDS hangs after calltrace at ldlm_expired_completion_wait()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1395&quot;&gt;&lt;del&gt;LU-1395&lt;/del&gt;&lt;/a&gt; or &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1269&quot; title=&quot;speed up ASTs sending&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1269&quot;&gt;&lt;del&gt;LU-1269&lt;/del&gt;&lt;/a&gt;:&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: Call Trace:&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888ceb51&amp;gt;&amp;#93;&lt;/span&gt; ldlm_resource_add_lock+0xb1/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888e2a00&amp;gt;&amp;#93;&lt;/span&gt; ldlm_expired_completion_wait+0x0/0x250 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8006388b&amp;gt;&amp;#93;&lt;/span&gt; schedule_timeout+0x8a/0xad&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8009987d&amp;gt;&amp;#93;&lt;/span&gt; process_timeout+0x0/0x5&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888e4555&amp;gt;&amp;#93;&lt;/span&gt; ldlm_completion_ast+0x4d5/0x880 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888c9709&amp;gt;&amp;#93;&lt;/span&gt; ldlm_lock_enqueue+0x9d9/0xb20 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8008e421&amp;gt;&amp;#93;&lt;/span&gt; default_wake_function+0x0/0xe&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888c4b6a&amp;gt;&amp;#93;&lt;/span&gt; ldlm_lock_addref_internal_nolock+0x3a/0x90 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888e30bb&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_enqueue_local+0x46b/0x520 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff88c611a7&amp;gt;&amp;#93;&lt;/span&gt; enqueue_ordered_locks+0x387/0x4d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mds&amp;#93;&lt;/span&gt;&lt;br/&gt;
Jul  4 17:58:29 lfs-mds-2-2 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff888e09a0&amp;gt;&amp;#93;&lt;/span&gt; ldlm_blocking_ast+0x0/0x2a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
...&lt;/p&gt;
</description>
                <environment>Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443&lt;br/&gt;
CentOS 5.5&lt;br/&gt;
</environment>
        <key id="15291">LU-1663</key>
            <summary>MDS threads hang for over 725s, causing failover</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="kitwestneat">Kit Westneat</reporter>
                        <labels>
                            <label>mn8</label>
                            <label>patch</label>
                    </labels>
                <created>Mon, 23 Jul 2012 20:31:04 +0000</created>
                <updated>Fri, 4 Oct 2013 15:33:47 +0000</updated>
                            <resolved>Fri, 4 Oct 2013 15:33:37 +0000</resolved>
                                    <version>Lustre 1.8.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="42160" author="kitwestneat" created="Mon, 23 Jul 2012 20:31:33 +0000"  >&lt;p&gt;grepped out call traces&lt;/p&gt;</comment>
                            <comment id="42161" author="kitwestneat" created="Mon, 23 Jul 2012 20:38:33 +0000"  >&lt;p&gt;mds1 log&lt;/p&gt;</comment>
                            <comment id="42162" author="kitwestneat" created="Mon, 23 Jul 2012 20:39:27 +0000"  >&lt;p&gt;mds2 log&lt;/p&gt;</comment>
                            <comment id="42192" author="pjones" created="Tue, 24 Jul 2012 10:19:42 +0000"  >&lt;p&gt;Oleg is already looking into &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1395&quot; title=&quot;MDS hangs after calltrace at ldlm_expired_completion_wait()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1395&quot;&gt;&lt;del&gt;LU-1395&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="42193" author="green" created="Tue, 24 Jul 2012 10:27:06 +0000"  >&lt;p&gt;Kit, in your logs you only show mds side.&lt;br/&gt;
the ast waiting traces (loike the second one) imply that the client did not cancel a blocking lock. We need to get client side log to see why.&lt;br/&gt;
Preferably client side should have dlmtrace log level enabled&lt;/p&gt;</comment>
                            <comment id="42196" author="kitwestneat" created="Tue, 24 Jul 2012 12:17:03 +0000"  >&lt;p&gt;Oleg, I will try to get this from the customer for the next failover event. Are there any additional logs from the MDS that might be useful?&lt;/p&gt;</comment>
                            <comment id="42201" author="green" created="Tue, 24 Jul 2012 14:32:46 +0000"  >&lt;p&gt;Well, getting a matching dlmtrace-enabled log from MDS at the same time would be helpful too&lt;/p&gt;</comment>
                            <comment id="42434" author="kitwestneat" created="Mon, 30 Jul 2012 10:20:48 +0000"  >&lt;p&gt;logs from the client&lt;/p&gt;</comment>
                            <comment id="42436" author="kitwestneat" created="Mon, 30 Jul 2012 10:27:17 +0000"  >&lt;p&gt;MDS logs&lt;/p&gt;</comment>
                            <comment id="42437" author="kitwestneat" created="Mon, 30 Jul 2012 10:56:03 +0000"  >&lt;p&gt;These are logs from the last time the failover happened, before the debugging got turned on. It looks like many of the clients are dumping stacks on close operations:&lt;/p&gt;

&lt;p&gt;/var/log/messages-20120722:Jul 21 12:22:25 r32i3n11 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06dbf25&amp;gt;&amp;#93;&lt;/span&gt; ll_close_inode_openhandle+0x175/0x6a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
/var/log/messages-20120722:Jul 21 12:22:25 r32i3n11 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06dc549&amp;gt;&amp;#93;&lt;/span&gt; ll_mdc_real_close+0xf9/0x340 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
/var/log/messages-20120722:Jul 21 12:22:25 r32i3n11 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06ec4a2&amp;gt;&amp;#93;&lt;/span&gt; ll_mdc_close+0x222/0x3f0 [lustre]&lt;/p&gt;


&lt;p&gt;/var/log/messages-20120722:Jul 21 12:21:21 r32i3n5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06dbf25&amp;gt;&amp;#93;&lt;/span&gt; ll_close_inode_openhandle+0x175/0x6a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
/var/log/messages-20120722:Jul 21 12:21:21 r32i3n5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06dc549&amp;gt;&amp;#93;&lt;/span&gt; ll_mdc_real_close+0xf9/0x340 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
/var/log/messages-20120722:Jul 21 12:21:21 r32i3n5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06ec4a2&amp;gt;&amp;#93;&lt;/span&gt; ll_mdc_close+0x222/0x3f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
/var/log/messages-20120722:Jul 21 12:21:21 r32i3n5 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa07124db&amp;gt;&amp;#93;&lt;/span&gt; ? ll_stats_ops_tally+0x6b/0xd0 [lustre]&lt;/p&gt;

&lt;p&gt;etc.&lt;/p&gt;</comment>
                            <comment id="43278" author="kitwestneat" created="Wed, 15 Aug 2012 14:59:19 +0000"  >&lt;p&gt;It happened again, but we were unable to get the debug logs. I&apos;ve modified the failover script to kernel panic the MDS, so hopefully we&apos;ll get a core dump. &lt;/p&gt;

&lt;p&gt;Is there anything useful in the logs I uploaded before? I can upload the logs we did get from this last event, but it looks like more of the same.&lt;/p&gt;</comment>
                            <comment id="43633" author="kitwestneat" created="Wed, 22 Aug 2012 12:30:25 +0000"  >&lt;p&gt;NOAA has had to turn off the dlmtrace debug, as it causes too big of a performance hit. I&apos;m still trying to get lctl dk from one of the events though.&lt;/p&gt;

&lt;p&gt;Can I get an opinion on the logs that are uploaded? Anything interesting in them at all?&lt;/p&gt;</comment>
                            <comment id="43635" author="kitwestneat" created="Wed, 22 Aug 2012 12:33:26 +0000"  >&lt;p&gt;Also, this bug looks similar in terms of symptoms, though it seems to be more directly related to unlinks than to closes:&lt;br/&gt;
&lt;a href=&quot;https://bugzilla.clusterstor.com/show_bug.cgi?id=699&amp;amp;list_id=cookie&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.clusterstor.com/show_bug.cgi?id=699&amp;amp;list_id=cookie&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="43660" author="green" created="Wed, 22 Aug 2012 19:04:30 +0000"  >&lt;p&gt;The last set of logs looks to be pretty different from first set of logs.&lt;br/&gt;
Looks like some sort of journal/quota deadlock.&lt;/p&gt;</comment>
                            <comment id="43814" author="pjones" created="Mon, 27 Aug 2012 13:23:50 +0000"  >&lt;p&gt;Kit&lt;/p&gt;

&lt;p&gt;Would it be possible to collect a crashdump the next time that this occurs?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="44690" author="kitwestneat" created="Wed, 12 Sep 2012 11:11:39 +0000"  >&lt;p&gt;We&apos;ve been trying to get stack traces and crashdumps, but for one reason or another it hasn&apos;t worked. We did find a lustre-log recently that may or may not be related. I will upload it anyway. We also recorded some information after the last failover, but without the stack trace or crashdump, I&apos;m afraid it&apos;s useless. I will upload it anyway, maybe you all can suggest some other information to get. &lt;/p&gt;</comment>
                            <comment id="44691" author="kitwestneat" created="Wed, 12 Sep 2012 11:19:11 +0000"  >&lt;p&gt;lctl dk from certain clients, logs from MDS (cut off by premature forced kernel panic)&lt;/p&gt;</comment>
                            <comment id="44692" author="kitwestneat" created="Wed, 12 Sep 2012 11:22:34 +0000"  >&lt;p&gt;lustre log from 9/10&lt;/p&gt;</comment>
                            <comment id="44774" author="kitwestneat" created="Thu, 13 Sep 2012 06:23:32 +0000"  >&lt;p&gt;The customer is getting anxious about this issue. We are testing our kdump configuration now to ensure that we have a crashdump, but can you please review the logs and see if there is any more information we should gather? The next hang needs to be the last. Thanks, Kit. &lt;/p&gt;</comment>
                            <comment id="52962" author="kitwestneat" created="Mon, 25 Feb 2013 10:03:07 +0000"  >&lt;p&gt;kern.log from most recent failover (with sysrq-t)&lt;/p&gt;</comment>
                            <comment id="52963" author="kitwestneat" created="Mon, 25 Feb 2013 10:04:41 +0000"  >&lt;p&gt;Hi we were finally able to get some good debugging data from the system during a failover. Here is the sysrq-t from before it went down. I also managed to save off all the lustre-logs, there are a ton as you can see in the kern.log. Let me know if you want to see them.&lt;/p&gt;

&lt;p&gt;I noticed that there was a slab reclaim going on.. Wasn&apos;t there another issue with slab reclaims taking a long time?&lt;/p&gt;</comment>
                            <comment id="53218" author="kitwestneat" created="Fri, 1 Mar 2013 10:22:50 +0000"  >&lt;p&gt;Any update?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="53272" author="green" created="Mon, 4 Mar 2013 12:38:54 +0000"  >&lt;p&gt;Hm, I do not see any obvious culprits in the traces.&lt;br/&gt;
A bit strange how jbd thread is the first to complain it&apos;s hung and in the middle of transaction commit with no report of any sleep, yet in D state (I wish you had a crashdump so that it could be checked more closely).&lt;br/&gt;
Anyway the first log that should hold most of the data is /tmp/lustre-log.1361661252.1181 I guess? (the biggest one around that timestamp is the one we need) and I guess we need it to at least try to see what was going on at the time.&lt;/p&gt;

&lt;p&gt;Also I see an umount in the logs, but I assume that&apos;s after the system was wedged for some time and your scripts just attempt it?&lt;/p&gt;</comment>
                            <comment id="53278" author="kitwestneat" created="Mon, 4 Mar 2013 13:24:18 +0000"  >&lt;p&gt;Hi Oleg,&lt;/p&gt;

&lt;p&gt;That&apos;s correct, the system tried to umount once the OSTs went unhealthy, but failed.&lt;/p&gt;

&lt;p&gt;The lustre-log looks pretty slim; hopefully you can find something I missed.&lt;/p&gt;

&lt;p&gt;I will try to get a crashdump. Unfortunately all the ones we&apos;ve been able to get so far have been incomplete. I&apos;ll try to come up with a way to get a full crashdump. Are there any other logs/debug information we can get that won&apos;t have too much impact on the running system?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit &lt;/p&gt;</comment>
                            <comment id="53484" author="green" created="Wed, 6 Mar 2013 19:01:10 +0000"  >&lt;p&gt;Getting the full crashdump from a wedged transaction commit like this should be most useful I suspect.&lt;/p&gt;</comment>
                            <comment id="55488" author="green" created="Thu, 4 Apr 2013 15:35:01 +0000"  >&lt;p&gt;I suspect what you are seeing is a generic jbd2 transaction wraparound problem.&lt;br/&gt;
There have been several rounds at Red Hat to fix this; the previous round is here (already part of the rhel6.4 kernel; no idea if they backported it to rhel5 too): &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=735768&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugzilla.redhat.com/show_bug.cgi?id=735768&lt;/a&gt;, and the latest one seems to be unfolding here: &lt;a href=&quot;http://lists.debian.org/debian-kernel/2013/03/msg00313.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://lists.debian.org/debian-kernel/2013/03/msg00313.html&lt;/a&gt;, though I don&apos;t know when it will make it into rhel kernels either.&lt;/p&gt;
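
&lt;p&gt;For illustration, the wraparound-safe ordering jbd2 needs is the signed-difference comparison (a minimal sketch of the tid_gt() helper from include/linux/jbd2.h; details vary across kernel versions):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;/* A plain x &amp;gt; y check misorders transaction ids once the 32-bit
 * counter wraps: a tid just past zero would look older than a tid
 * near 2^32.  Comparing the signed difference instead stays correct
 * as long as the two tids are within 2^31 of each other. */
static inline int tid_gt(tid_t x, tid_t y)
{
        int difference = (x - y);
        return (difference &amp;gt; 0);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>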
                            <comment id="55490" author="kitwestneat" created="Thu, 4 Apr 2013 16:02:49 +0000"  >&lt;p&gt;I don&apos;t see that bug fixed in any RHEL5 kernel. Since Redhat no longer breaks out their patches, it seems like it would be difficult to figure out what they did to fix it. I will look at the SRPMS for the jbd stuff to see if I can figure it out. Have you seen the transaction wraparound problem in Lustre before? Are there any workarounds? It seems like we can&apos;t be the first to hit it.&lt;/p&gt;</comment>
                            <comment id="55493" author="green" created="Thu, 4 Apr 2013 16:19:53 +0000"  >&lt;p&gt;I have a high-load test system for lustre regression testing and I am hitting two sorts of jbd issues somewhat frequently, the rh bz735768 and then some jbd thread lockups that I think are related. Some of them are somewhat similar to what you see, I think. So I was just doing some more investigations into that and uncovered that they are apparently known problems if you have high transaction rates (which I certainly do).&lt;/p&gt;

&lt;p&gt;I suspect you can cherry-pick the patches from mainline linux kernel.&lt;/p&gt;</comment>
                            <comment id="55511" author="kitwestneat" created="Thu, 4 Apr 2013 17:16:38 +0000"  >&lt;p&gt;It looks like the patch RH took to solve bz735768 is this one:&lt;br/&gt;
&lt;a href=&quot;https://github.com/torvalds/linux/commit/deeeaf13b291420fe4a4a52606b9fc9128387340&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/torvalds/linux/commit/deeeaf13b291420fe4a4a52606b9fc9128387340&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oh, I just realized that the debian-kernel patch already references the two commits, doh.&lt;/p&gt;

&lt;p&gt;Are there any plans to carry the debian-kernel patch until RH backports it? How risky do you think it is to backport them to RHEL5?&lt;/p&gt;</comment>
                            <comment id="56334" author="kitwestneat" created="Mon, 15 Apr 2013 18:54:57 +0000"  >&lt;p&gt;We finally got a full vmcore. It&apos;s 12GB.. Is there a good place to put it? Alternatively I could run commands on it. What would be the best way to prove that it&apos;s the tid wraparound issue?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="56340" author="kitwestneat" created="Mon, 15 Apr 2013 19:38:31 +0000"  >&lt;p&gt;Backtrace:&lt;br/&gt;
PID: 23917  TASK: ffff810abe18c7e0  CPU: 2   COMMAND: &quot;vgck&quot;&lt;br/&gt;
 #0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd75a0&amp;#93;&lt;/span&gt; crash_kexec at ffffffff800b1192&lt;br/&gt;
 #1 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7660&amp;#93;&lt;/span&gt; __die at ffffffff80065137&lt;br/&gt;
 #2 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd76a0&amp;#93;&lt;/span&gt; die at ffffffff8006c735&lt;br/&gt;
 #3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd76d0&amp;#93;&lt;/span&gt; do_invalid_op at ffffffff8006ccf5&lt;br/&gt;
 #4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7790&amp;#93;&lt;/span&gt; error_exit at ffffffff8005ddf9&lt;br/&gt;
    &lt;span class=&quot;error&quot;&gt;&amp;#91;exception RIP: jbd2_journal_start+58&amp;#93;&lt;/span&gt;&lt;br/&gt;
    RIP: ffffffff8b32ae31  RSP: ffff810b35bd7848  RFLAGS: 00010206&lt;br/&gt;
    RAX: ffff810c1aef19c0  RBX: ffff81061476dd38  RCX: ffff810910fdcc00&lt;br/&gt;
    RDX: ffffffffffffffe2  RSI: 0000000000000012  RDI: ffff81061476dd38&lt;br/&gt;
    RBP: ffff810919769800   R8: ffff81000005b600   R9: ffff8104cd371a80&lt;br/&gt;
    R10: 0000000000000000  R11: ffffffff8b36cb60  R12: 0000000000000012&lt;br/&gt;
    R13: ffff810b35bd78c8  R14: 0000000000000080  R15: 0000000000000100&lt;br/&gt;
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018&lt;br/&gt;
 #5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7860&amp;#93;&lt;/span&gt; ldiskfs_dquot_drop at ffffffff8b36fdbe&lt;br/&gt;
 #6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7880&amp;#93;&lt;/span&gt; clear_inode at ffffffff800234e7&lt;br/&gt;
 #7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7890&amp;#93;&lt;/span&gt; dispose_list at ffffffff800352d6&lt;br/&gt;
 #8 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd78c0&amp;#93;&lt;/span&gt; shrink_icache_memory at ffffffff8002dcfb&lt;br/&gt;
 #9 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7900&amp;#93;&lt;/span&gt; shrink_slab at ffffffff8003f7cd&lt;br/&gt;
#10 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7940&amp;#93;&lt;/span&gt; zone_reclaim at ffffffff800cf9ae&lt;br/&gt;
#11 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd79f0&amp;#93;&lt;/span&gt; get_page_from_freelist at ffffffff8000a8a7&lt;br/&gt;
#12 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7a60&amp;#93;&lt;/span&gt; __alloc_pages at ffffffff8000f48a&lt;br/&gt;
#13 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7ad0&amp;#93;&lt;/span&gt; cache_grow at ffffffff80017a52&lt;br/&gt;
#14 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7b20&amp;#93;&lt;/span&gt; cache_alloc_refill at ffffffff8005c3ee&lt;br/&gt;
#15 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7b60&amp;#93;&lt;/span&gt; kmem_cache_alloc at ffffffff8000ac96&lt;br/&gt;
#16 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7b80&amp;#93;&lt;/span&gt; mb_cache_entry_alloc at ffffffff801063df&lt;br/&gt;
#17 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7ba0&amp;#93;&lt;/span&gt; ext3_xattr_cache_insert at ffffffff8805bb1e&lt;br/&gt;
#18 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7bd0&amp;#93;&lt;/span&gt; ext3_xattr_get at ffffffff8805cd3a&lt;br/&gt;
#19 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7c50&amp;#93;&lt;/span&gt; ext3_get_acl at ffffffff8805d4e2&lt;br/&gt;
#20 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7ca0&amp;#93;&lt;/span&gt; ext3_init_acl at ffffffff8805d8de&lt;br/&gt;
#21 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7ce0&amp;#93;&lt;/span&gt; ext3_new_inode at ffffffff8804e21f&lt;br/&gt;
#22 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7db0&amp;#93;&lt;/span&gt; ext3_create at ffffffff88054b6b&lt;br/&gt;
#23 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7e00&amp;#93;&lt;/span&gt; vfs_create at ffffffff8003aa1b&lt;br/&gt;
#24 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7e30&amp;#93;&lt;/span&gt; open_namei at ffffffff8001b749&lt;br/&gt;
#25 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7ea0&amp;#93;&lt;/span&gt; do_filp_open at ffffffff80027ba1&lt;br/&gt;
#26 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7f50&amp;#93;&lt;/span&gt; do_sys_open at ffffffff80019fa5&lt;br/&gt;
#27 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff810b35bd7f80&amp;#93;&lt;/span&gt; tracesys at ffffffff8005d29e (via system_call)&lt;br/&gt;
    RIP: 0000003d598c5b40  RSP: 00007fff8f6ae3b8  RFLAGS: 00000246&lt;br/&gt;
    RAX: ffffffffffffffda  RBX: ffffffff8005d29e  RCX: ffffffffffffffff&lt;br/&gt;
    RDX: 00000000000001b6  RSI: 0000000000000241  RDI: 00007fff8f6ae490&lt;br/&gt;
    RBP: 00007fff8f6ae4e0   R8: 0000000000000004   R9: 0000000000000000&lt;br/&gt;
    R10: 0000000000000241  R11: 0000000000000246  R12: 000000000e3a5e90&lt;br/&gt;
    R13: 0000000000484f87  R14: 0000000000000004  R15: 000000000e3a5e90&lt;br/&gt;
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b&lt;/p&gt;

&lt;p&gt;Dennis pointed out that it looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3071&quot; title=&quot;kernel BUG at fs/jbd2/transaction.c:293 (jbd2_journal_start)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3071&quot;&gt;&lt;del&gt;LU-3071&lt;/del&gt;&lt;/a&gt;. Reading that bug, it looks like it should affect MDTs and 1.8 as well, is that correct? The assertion hit is at the same place:&lt;br/&gt;
                J_ASSERT(handle-&amp;gt;h_transaction-&amp;gt;t_journal == journal);&lt;/p&gt;
</comment>
                            <comment id="56595" author="green" created="Thu, 18 Apr 2013 23:30:43 +0000"  >&lt;p&gt;yes, this latest trace is a different issue similar to lu-3071, but in your trace there is no lustre in the picture at all.&lt;br/&gt;
It sounds there&apos;s a bug in mballoc in your ext3 code that allocates mb_cache entry with a wrong flag, should be GFP_NOFS.&lt;/p&gt;

&lt;p&gt;I checked in a rhel5 kernel I have around, and the bug is there in fs/mbcache.c:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache)
{
        struct mb_cache_entry *ce;

        atomic_inc(&amp;amp;cache-&amp;gt;c_entry_count);
        ce = kmem_cache_alloc(cache-&amp;gt;c_entry_cache, GFP_KERNEL);   &amp;lt;===== BUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In rhel6 I see they actually pass gfp flags to this function, to be used by the alloc, to avoid this; you might be able to hunt down the patch that fixes that, I guess.&lt;/p&gt;
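
&lt;p&gt;A minimal sketch of that rhel6-style shape (signatures approximated from the mainline fix; treat the details as illustrative):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;/* The caller chooses the gfp mask; filesystem callers pass GFP_NOFS
 * so the allocation cannot recurse back into the filesystem through
 * memory reclaim, as happened in the backtrace above (shrink_slab
 * -&amp;gt; shrink_icache_memory -&amp;gt; clear_inode -&amp;gt; jbd2_journal_start). */
struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
{
        struct mb_cache_entry *ce;

        ce = kmem_cache_alloc(cache-&amp;gt;c_entry_cache, gfp_flags);
        ...
}

/* and the ext3 xattr caller becomes: */
ce = mb_cache_entry_alloc(ext3_xattr_cache, GFP_NOFS);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>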
                            <comment id="56596" author="green" created="Thu, 18 Apr 2013 23:41:19 +0000"  >&lt;p&gt;as for the vmcore, put it somewhere where we can get it? we also would need kernel-debuginfo you used at the very least. lustre modules might be useful just in case too.&lt;/p&gt;</comment>
                            <comment id="56889" author="kitwestneat" created="Tue, 23 Apr 2013 23:07:20 +0000"  >&lt;p&gt;Does this patch look correct?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://git.openvz.org/?p=linux-2.6.32-openvz;a=commitdiff;h=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://git.openvz.org/?p=linux-2.6.32-openvz;a=commitdiff;h=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are just using the default 1.8.9 kernel:&lt;br/&gt;
&lt;a href=&quot;http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/el5/server/RPMS/x86_64/kernel-debuginfo-2.6.18-348.1.1.el5_lustre.x86_64.rpm&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/el5/server/RPMS/x86_64/kernel-debuginfo-2.6.18-348.1.1.el5_lustre.x86_64.rpm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dennis is uploading the coredump now and will update the ticket when it is finished. &lt;/p&gt;</comment>
                            <comment id="56971" author="kitwestneat" created="Wed, 24 Apr 2013 19:30:09 +0000"  >&lt;p&gt;I pushed the patch we are using for the transaction wrap around in case it would be useful:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/6147&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/6147&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="57006" author="green" created="Thu, 25 Apr 2013 02:23:56 +0000"  >&lt;p&gt;Ok, I got vmcore, but the vmlinux from openvz is not it.&lt;/p&gt;

&lt;p&gt;In any case I looked inside of the core and it looks like it&apos;s for this latest crash that we already know what happened that is a different bug altogether.&lt;/p&gt;

&lt;p&gt;What I am looking for is vmcore from a situation where jbd2 commit threads locks up.&lt;/p&gt;</comment>
                            <comment id="57467" author="kitwestneat" created="Wed, 1 May 2013 18:26:24 +0000"  >&lt;p&gt;The openviz link was just supposed to be a link to the patch, the kernel we are running is a stock Centos kernel. The backtrace and the core come from the same crash. There is another core, but it was placed on the wrong disk and we need to wait for a downtime to get it (next Tuesday). If it is a different backtrace, I&apos;ll update the ticket then. &lt;/p&gt;

&lt;p&gt;As for the fix for the GFP_NOFS bug, does that patch work? I found the patch on git.kernel.org as well:&lt;br/&gt;
&lt;a href=&quot;http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Should I backport that to b1_8/RHEL5?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="57648" author="green" created="Fri, 3 May 2013 16:12:58 +0000"  >&lt;p&gt;yes, this patch looks like it would do the right thing.&lt;/p&gt;</comment>
                            <comment id="67808" author="ihara" created="Fri, 27 Sep 2013 12:41:52 +0000"  >&lt;p&gt;Hi, would you please review this patch sonner? we are hitting multple server crash due to this issue.&lt;/p&gt;</comment>
                            <comment id="67811" author="pjones" created="Fri, 27 Sep 2013 12:52:30 +0000"  >&lt;p&gt;ok Ihara. Patch is at &lt;a href=&quot;http://review.whamcloud.com/#/c/6147/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/6147/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="67812" author="ihara" created="Fri, 27 Sep 2013 12:57:36 +0000"  >&lt;p&gt;Hi Peter, yes, but, we are waiting inspection. Once code review is done, we will apply this to the kernel for servers.&lt;/p&gt;</comment>
                            <comment id="67813" author="pjones" created="Fri, 27 Sep 2013 13:09:32 +0000"  >&lt;p&gt;Ihara &lt;/p&gt;

&lt;p&gt;Sorry that my comment was not clear enough. I understand that you wish to have reviews on the patch; I was acknowledging that, and then separately adding the reference to the patch. It is required to include such a link in the JIRA ticket to cross-reference between JIRA and Gerrit, and this step had previously been overlooked.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="67860" author="kitwestneat" created="Fri, 27 Sep 2013 18:19:47 +0000"  >&lt;p&gt;I also just submitted the xattrs patch that was also referenced. We are carrying this patch already at NOAA and it seems to improve stability. &lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/7788&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7788&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="68374" author="pjones" created="Fri, 4 Oct 2013 15:33:37 +0000"  >&lt;p&gt;RH patches ok to use in production&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11842" name="09sep.tar.bz2" size="8807059" author="kitwestneat" created="Wed, 12 Sep 2012 11:19:11 +0000"/>
                            <attachment id="11711" name="call_traces" size="401767" author="kitwestneat" created="Mon, 23 Jul 2012 20:31:33 +0000"/>
                            <attachment id="11742" name="kern.log-20120721" size="195689" author="kitwestneat" created="Mon, 30 Jul 2012 10:27:17 +0000"/>
                            <attachment id="12269" name="kern.log.2013-02-23.gz" size="106016" author="kitwestneat" created="Mon, 25 Feb 2013 10:03:07 +0000"/>
                            <attachment id="12277" name="ll-1181-decoded.txt.gz" size="239" author="kitwestneat" created="Mon, 4 Mar 2013 13:41:03 +0000"/>
                            <attachment id="11741" name="log1.bz2" size="448289" author="kitwestneat" created="Mon, 30 Jul 2012 10:20:48 +0000"/>
                            <attachment id="11843" name="lustre-log.txt.bz2" size="4773924" author="kitwestneat" created="Wed, 12 Sep 2012 11:22:34 +0000"/>
                            <attachment id="11712" name="mds1.log" size="8778924" author="kitwestneat" created="Mon, 23 Jul 2012 20:38:33 +0000"/>
                            <attachment id="11713" name="mds2.log" size="3620681" author="kitwestneat" created="Mon, 23 Jul 2012 20:39:27 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv3fj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4055</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>