<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:16:57 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8370] ASSERTION( lur-&gt;lur_hdr.lrh_len &lt;= ctxt-&gt;loc_chunk_size )</title>
                <link>https://jira.whamcloud.com/browse/LU-8370</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We just had several MDS nodes crash in our 2.8.0 DNE testbed with the following assertion:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-07-05 17:13:18 [421811.048147] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size ) failed:
2016-07-05 17:13:18 [421811.064551] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And here is a bit more of the console log showing some events leading up to the assertion:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-07-05 17:00:01 [421013.769026] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2016-07-05 17:00:14 [421027.108636] Lustre: 13742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467763108/real 1467
2016-07-05 17:00:14 [421027.143423] Lustre: 13742:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
2016-07-05 17:00:14 [421027.155156] Lustre: lquake-MDT0004-osp-MDT0000: Connection to lquake-MDT0004 (at 172.19.1.115@o2ib100) was lost; in progress operations using thi
2016-07-05 17:01:01 [421073.879154] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2016-07-05 17:01:41 [421113.695183] Lustre: lquake-MDT0000: haven&apos;t heard from client lquake-MDT0000-lwp-MDT0004_UUID (at 172.19.1.115@o2ib100) in 227 seconds. I think i
2016-07-05 17:01:41 [421113.721418] Lustre: Skipped 2 previous similar messages
2016-07-05 17:04:05 [421258.363863] Lustre: lquake-MDT000b-osp-MDT0000: Connection to lquake-MDT000b (at 172.19.1.122@o2ib100) was lost; in progress operations using thi
2016-07-05 17:04:13 [421266.343370] Lustre: 121411:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/5), not sending early reply
2016-07-05 17:04:13 [421266.343370]   req@ffff887a33193600 x1538602753601316/t0(0) o36-&amp;gt;9131ef49-c8bd-b5a0-9096-bf612b95d242@192.168.128.136@o2ib18:423/0 lens 736/440 e
2016-07-05 17:04:13 [421266.379876] Lustre: 121411:0:(service.c:1336:ptlrpc_at_send_early_reply()) Skipped 6 previous similar messages
2016-07-05 17:05:03 [421316.429903] Lustre: lquake-MDT0000: Client 7f8c5dbb-5c65-9fd0-386e-4acb17a569c1 (at 192.168.128.173@o2ib18) reconnecting
2016-07-05 17:05:03 [421316.430031] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.111@o2ib18)
2016-07-05 17:05:03 [421316.430034] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.132@o2ib18)
2016-07-05 17:05:03 [421316.463306] Lustre: Skipped 6 previous similar messages
2016-07-05 17:05:21 [421334.445759] Lustre: 13738:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467763414/real 1467
2016-07-05 17:05:21 [421334.481039] Lustre: 13738:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
2016-07-05 17:05:29 [421341.717463] Lustre: lquake-MDT0000: haven&apos;t heard from client lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100) in 227 seconds. I think it&apos;s d
2016-07-05 17:05:29 [421341.743640] Lustre: Skipped 2 previous similar messages
2016-07-05 17:05:56 [421368.821761] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
2016-07-05 17:05:56 [421368.834311] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.115@o2ib100 (38): c: 0, oc: 0, rc: 8
2016-07-05 17:07:31 [421464.551678] Lustre: lquake-MDT0009-osp-MDT0000: Connection to lquake-MDT0009 (at 172.19.1.120@o2ib100) was lost; in progress operations using thi
2016-07-05 17:08:28 [421520.945572] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
2016-07-05 17:08:28 [421520.957950] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.120@o2ib100 (62): c: 0, oc: 0, rc: 8
2016-07-05 17:09:19 [421571.987133] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 7 seconds
2016-07-05 17:09:19 [421571.999514] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.122@o2ib100 (38): c: 0, oc: 0, rc: 8
2016-07-05 17:09:23 [421576.250103] Lustre: MGS: haven&apos;t heard from client 859165fd-8f11-0394-d9d3-83cdf15c6ed5 (at 172.19.1.120@o2ib100) in 227 seconds. I think it&apos;s de
2016-07-05 17:09:23 [421576.276383] Lustre: Skipped 2 previous similar messages
2016-07-05 17:10:01 [421614.394252] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2016-07-05 17:11:22 [421695.760858] Lustre: lquake-MDT000f-osp-MDT0000: Connection to lquake-MDT000f (at 172.19.1.126@o2ib100) was lost; in progress operations using thi
2016-07-05 17:12:58 [421790.944230] Lustre: MGS: haven&apos;t heard from client 4d1129cf-ad52-0cb8-c58a-0602f03cb346 (at 172.19.1.126@o2ib100) in 229 seconds. I think it&apos;s de
2016-07-05 17:12:58 [421790.970449] Lustre: Skipped 2 previous similar messages
2016-07-05 17:13:18 [421811.048147] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size ) failed:
2016-07-05 17:13:18 [421811.064551] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) LBUG
2016-07-05 17:13:18 [421811.073795] Pid: 86030, comm: mdt02_055
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The above log was from NID 172.19.1.111@o2ib100.&lt;/p&gt;

&lt;p&gt;Note that NIDs 172.19.1.111 through 172.19.1.126 are MDS nodes (172.19.1.111 hosts the MGS as well).  NIDs 172.19.1.127 through 172.19.1.130 are OSS nodes. (All @o2ib100).&lt;/p&gt;

&lt;p&gt;It looks to me like the MDS nodes are all hitting this assertion one by one.  So far 10 out of 16 have hit it and are down.&lt;/p&gt;

&lt;p&gt;Here is another node&apos;s console messages (172.19.1.112@o2ib100):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-07-05 17:24:35 [366721.789837] LustreError: Skipped 589 previous similar messages
2016-07-05 17:25:48 [366795.199563] Lustre: 14874:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1467764693/real 0]  
req@ffff883e59bc2700 x1538678743470044/t0(0) o38-&amp;gt;lquake-MDT000b-osp-MDT0001@172.19.1.122@o2ib100:24/4 lens 520/544 e 0 to 1 dl 1467764748 ref 2 fl Rpc:XN/0/ffffffff rc 
0/-1
2016-07-05 17:25:48 [366795.234284] Lustre: 14874:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 60 previous similar messages
2016-07-05 17:26:33 [366840.473716] Lustre: lquake-MDT0001: Client f3232a59-daa1-33eb-94f5-ae42e9a1f202 (at 192.168.128.143@o2ib18) reconnecting
2016-07-05 17:26:34 [366840.487104] Lustre: lquake-MDT0001: Connection restored to f3232a59-daa1-33eb-94f5-ae42e9a1f202 (at 192.168.128.143@o2ib18)
2016-07-05 17:26:39 [366846.241092] Lustre: lquake-MDT0008-osp-MDT0001: Connection to lquake-MDT0008 (at 172.19.1.119@o2ib100) was lost; in progress operations using thi
s service will wait for recovery to complete
2016-07-05 17:28:17 [366943.590377] Lustre: lquake-MDT0001: haven&apos;t heard from client lquake-MDT0008-mdtlov_UUID (at 172.19.1.119@o2ib100) in 227 seconds. I think it&apos;s d
ead, and I am evicting it. exp ffff887e999b2000, cur 1467764897 expire 1467764747 last 1467764670
2016-07-05 17:28:26 [366952.624959] LustreError: 14934:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size ) failed: 
2016-07-05 17:28:26 [366952.641374] LustreError: 14934:0:(update_trans.c:275:sub_updates_write()) LBUG
2016-07-05 17:28:26 [366952.650631] Pid: 14934, comm: mdt01_001
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="37985">LU-8370</key>
            <summary>ASSERTION( lur-&gt;lur_hdr.lrh_len &lt;= ctxt-&gt;loc_chunk_size )</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 6 Jul 2016 00:33:45 +0000</created>
                <updated>Thu, 14 Jun 2018 21:41:18 +0000</updated>
                            <resolved>Thu, 11 Aug 2016 11:57:22 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                <comments>
                            <comment id="157745" author="morrone" created="Wed, 6 Jul 2016 00:51:44 +0000"  >&lt;p&gt;This hit again immediately after recovery (at least I think recovery had finished; all the other nodes did).  So our testbed might be hosed until this is fixed, or until we writeconf or reformat or something.&lt;/p&gt;</comment>
                            <comment id="157800" author="bfaccini" created="Wed, 6 Jul 2016 13:52:03 +0000"  >&lt;p&gt;Hello Chris,&lt;br/&gt;
Well, if I understand correctly that this problem is likely to reproduce during recovery upon an MDS restart, maybe you could enable the full Lustre debug mask on all MDSs and then try to reproduce. You would then need to save the full log on all MDTs that stay up, and we could extract the log for the crashing MDS from its crash dump. BTW, do you already have crash dumps from the previous occurrences?&lt;/p&gt;
</comment>
                            <comment id="157862" author="morrone" created="Wed, 6 Jul 2016 18:43:18 +0000"  >&lt;p&gt;Unfortunately, crash dumps are not currently working on this cluster.  It&apos;s definitely a pain point for us.  I&apos;m hoping we&apos;ll get those working in the not-too-distant future.&lt;/p&gt;</comment>
                            <comment id="158075" author="morrone" created="Fri, 8 Jul 2016 00:26:50 +0000"  >&lt;p&gt;No crash dumps yet, but I changed the LASSERT to an LASSERTF and got us the numbers that fail:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-07-07 17:21:37 [ 7753.358956] LustreError: 145766:0:(update_trans.c:277:sub_updates_write()) ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size ) failed: lrh_len 32776 loc_chunk_size 32768
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Only 8 bytes over.  So close! &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="158082" author="laisiyao" created="Fri, 8 Jul 2016 03:43:47 +0000"  >&lt;p&gt;Do you have a place where we can check out your test code (2.8.0)? Or the commit in the master branch? This can help verify whether this is already fixed in the latest master branch.&lt;/p&gt;</comment>
                            <comment id="158166" author="morrone" created="Fri, 8 Jul 2016 19:13:08 +0000"  >&lt;p&gt;See your gerrit server, repository named &quot;lustre-release-fe-llnl&quot;.  Look for the highest numbered 2.8-llnl-preview* branch.&lt;/p&gt;

&lt;p&gt;There is not a whole lot that is different code-wise.  Most of the change is in packaging and scripts.&lt;/p&gt;

&lt;p&gt;But don&apos;t keep me in suspense, which commit in the latest master branch do you suspect fixes this?&lt;/p&gt;</comment>
                            <comment id="158190" author="morrone" created="Fri, 8 Jul 2016 21:24:46 +0000"  >&lt;p&gt;And now the backtrace, in case that is helpful:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-07-08 14:15:20 [  337.504715] Pid: 26705, comm: mdt00_002
2016-07-08 14:15:20 [  337.509679] 
2016-07-08 14:15:20 [  337.509679] Call Trace:
2016-07-08 14:15:20 [  337.515418]  [&amp;lt;ffffffffa08857e3&amp;gt;] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
2016-07-08 14:15:20 [  337.523920]  [&amp;lt;ffffffffa108c4cf&amp;gt;] sub_updates_write+0xb99/0xf2e [ptlrpc]
2016-07-08 14:15:20 [  337.532065]  [&amp;lt;ffffffff81179632&amp;gt;] ? __free_memcg_kmem_pages+0x22/0x50
2016-07-08 14:15:20 [  337.539953]  [&amp;lt;ffffffffa107ab4f&amp;gt;] top_trans_stop+0x62f/0x970 [ptlrpc]
2016-07-08 14:15:20 [  337.547819]  [&amp;lt;ffffffffa1307399&amp;gt;] lod_trans_stop+0x259/0x340 [lod]
2016-07-08 14:15:20 [  337.555394]  [&amp;lt;ffffffffa0e1f320&amp;gt;] ? linkea_add_buf+0x80/0x170 [obdclass]
2016-07-08 14:15:20 [  337.563543]  [&amp;lt;ffffffffa13958fa&amp;gt;] mdd_trans_stop+0x1a/0x1c [mdd]
2016-07-08 14:15:20 [  337.570904]  [&amp;lt;ffffffffa1380b08&amp;gt;] mdd_link+0x2e8/0x930 [mdd]
2016-07-08 14:15:20 [  337.577908]  [&amp;lt;ffffffffa127555a&amp;gt;] ? mdt_lookup_version_check+0xca/0x2f0 [mdt]
2016-07-08 14:15:20 [  337.586560]  [&amp;lt;ffffffffa127c08e&amp;gt;] mdt_reint_link+0xade/0xc30 [mdt]
2016-07-08 14:15:20 [  337.594113]  [&amp;lt;ffffffff816500a5&amp;gt;] ? mutex_lock+0x25/0x42
2016-07-08 14:15:20 [  337.600723]  [&amp;lt;ffffffffa1057ae6&amp;gt;] ? tgt_lookup_reply_by_xid+0x46/0x60 [ptlrpc]
2016-07-08 14:15:20 [  337.609471]  [&amp;lt;ffffffffa127e470&amp;gt;] mdt_reint_rec+0x80/0x210 [mdt]
2016-07-08 14:15:20 [  337.616845]  [&amp;lt;ffffffffa1261971&amp;gt;] mdt_reint_internal+0x5e1/0x990 [mdt]
2016-07-08 14:15:20 [  337.624811]  [&amp;lt;ffffffffa126b0d7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
2016-07-08 14:15:20 [  337.631807]  [&amp;lt;ffffffffa10665d5&amp;gt;] tgt_request_handle+0x915/0x1320 [ptlrpc]
2016-07-08 14:15:20 [  337.640164]  [&amp;lt;ffffffffa10130cb&amp;gt;] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
2016-07-08 14:15:20 [  337.649379]  [&amp;lt;ffffffffa0892758&amp;gt;] ? lc_watchdog_touch+0x68/0x180 [libcfs]
2016-07-08 14:15:20 [  337.657650]  [&amp;lt;ffffffffa1010c9b&amp;gt;] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
2016-07-08 14:15:20 [  337.665878]  [&amp;lt;ffffffff810bd4c2&amp;gt;] ? default_wake_function+0x12/0x20
2016-07-08 14:15:20 [  337.673534]  [&amp;lt;ffffffff810b33f8&amp;gt;] ? __wake_up_common+0x58/0x90
2016-07-08 14:15:20 [  337.680721]  [&amp;lt;ffffffffa1017170&amp;gt;] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
2016-07-08 14:15:20 [  337.688368]  [&amp;lt;ffffffff81015588&amp;gt;] ? __switch_to+0xf8/0x4d0
2016-07-08 14:15:20 [  337.695159]  [&amp;lt;ffffffffa10166e0&amp;gt;] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
2016-07-08 14:15:20 [  337.702804]  [&amp;lt;ffffffff810a997f&amp;gt;] kthread+0xcf/0xe0
2016-07-08 14:15:20 [  337.708875]  [&amp;lt;ffffffff810a98b0&amp;gt;] ? kthread+0x0/0xe0
2016-07-08 14:15:20 [  337.715068]  [&amp;lt;ffffffff8165d658&amp;gt;] ret_from_fork+0x58/0x90
2016-07-08 14:15:20 [  337.721710]  [&amp;lt;ffffffff810a98b0&amp;gt;] ? kthread+0x0/0xe0
2016-07-08 14:15:20 [  337.727861] 
2016-07-08 14:15:20 [  337.730107] LustreError: 26705:0:(update_trans.c:279:sub_updates_write()) ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size ) failed: lrh_len 32776 loc_chunk_size 32768
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="158464" author="laisiyao" created="Tue, 12 Jul 2016 08:18:41 +0000"  >&lt;p&gt;No, I&apos;m reviewing the code.&lt;/p&gt;</comment>
                            <comment id="158543" author="morrone" created="Tue, 12 Jul 2016 21:09:58 +0000"  >&lt;p&gt;The attached jet13.log file is from a node that is about to hit this assertion.  I put a check for the same condition and a forced log dump in the code right before the assertion to get this.&lt;/p&gt;</comment>
                            <comment id="158782" author="gerrit" created="Thu, 14 Jul 2016 05:00:35 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/21307&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21307&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8370&quot; title=&quot;ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8370&quot;&gt;&lt;del&gt;LU-8370&lt;/del&gt;&lt;/a&gt; updates: size round for update record.&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_8&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: b7ebae2b3b799aec76be207f7370c2129010fa26&lt;/p&gt;</comment>
                            <comment id="158784" author="di.wang" created="Thu, 14 Jul 2016 05:04:51 +0000"  >&lt;p&gt;Chris: it looks like when splitting the update records (by the ctxt chunk_size), it should use cfs_size_round(). I just pushed a patch (&lt;a href=&quot;http://review.whamcloud.com/21307&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21307&lt;/a&gt;), and also added some debugging messages there.&lt;/p&gt;

&lt;p&gt;Could you please try it and tell me what those console messages are if it hits the LBUG again. Thanks.&lt;/p&gt;</comment>
                            <comment id="158939" author="gerrit" created="Fri, 15 Jul 2016 13:20:18 +0000"  >&lt;p&gt;Lai Siyao (lai.siyao@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/21334&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21334&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8370&quot; title=&quot;ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8370&quot;&gt;&lt;del&gt;LU-8370&lt;/del&gt;&lt;/a&gt; dne: error in spliting update records&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 707164929f73698e9ff25eccebc67425e55efe07&lt;/p&gt;</comment>
                            <comment id="158940" author="laisiyao" created="Fri, 15 Jul 2016 13:22:47 +0000"  >&lt;p&gt;Hi Di, I made a patch to use the same function to estimate and set the update llog size; do you think it&apos;s better?&lt;/p&gt;</comment>
                            <comment id="158962" author="di.wang" created="Fri, 15 Jul 2016 15:55:11 +0000"  >&lt;p&gt;Yes, it is better. Thanks Lai.&lt;/p&gt;</comment>
                            <comment id="158974" author="morrone" created="Fri, 15 Jul 2016 17:13:58 +0000"  >&lt;p&gt;I was out sick yesterday, but I&apos;m back today.  I&apos;ll give the 21334 patch a try today.  Thanks!&lt;/p&gt;</comment>
                            <comment id="159006" author="morrone" created="Fri, 15 Jul 2016 20:46:54 +0000"  >&lt;p&gt;I was able to start the stuck filesystem and complete recovery with the 21334 patch applied.  It looks good so far.&lt;/p&gt;</comment>
                            <comment id="160831" author="morrone" created="Thu, 4 Aug 2016 17:56:24 +0000"  >&lt;p&gt;Yesterday I updated to Patch Set 5 of change &lt;a href=&quot;http://review.whamcloud.com/21334&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;21334&lt;/a&gt; and installed it on our testbed.  Last night we hit this NULL pointer dereference down pretty much the same call path:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2016-08-04 02:47:57 [115783.326226] BUG: unable to handle kernel NULL pointer dereference at           (null)
2016-08-04 02:47:57 [115783.335993] IP: [&amp;lt;ffffffff81651232&amp;gt;] down_write+0x32/0x43
2016-08-04 02:47:57 [115783.342892] PGD 0 
2016-08-04 02:47:57 [115783.345995] Oops: 0002 [#1] SMP 
2016-08-04 02:47:57 [115783.350434] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE
) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
 rdma_cm ib_cm iw_cm ib_sa ib_mad mlx5_ib iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm ib_core ib_addr pcspkr mlx5_core sb_ed
ac edac_core mei_me lpc_ich mei mfd_core ses enclosure ipmi_devintf zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) sg znvpair(POE) spl(OE) zlib_deflate
 i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs l
ockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect crct10dif_pclmul sysimgblt crct10dif_common i2c_
algo_bit crc32_pclmul crc32c_intel drm_kms_helper mxm_wmi ghash_clmulni_intel ttm ahci aesni_intel lrw ixgbe gf128mul glue_helper libahci ablk_helper 
dca drm mpt3sas ptp libata i2c_core cryptd raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi
 dm_multipath dm_mod
2016-08-04 02:47:57 [115783.476527] CPU: 14 PID: 162546 Comm: mdt03_008 Tainted: P        W  OE  ------------   3.10.0-327.22.2.2chaos.ch6.x86_64 #1
2016-08-04 02:47:57 [115783.489957] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-08-04 02:47:57 [115783.502713] task: ffff883f17ffb980 ti: ffff887ed50d4000 task.ti: ffff887ed50d4000
2016-08-04 02:47:57 [115783.511991] RIP: 0010:[&amp;lt;ffffffff81651232&amp;gt;]  [&amp;lt;ffffffff81651232&amp;gt;] down_write+0x32/0x43
2016-08-04 02:47:57 [115783.521679] RSP: 0018:ffff887ed50d78e0  EFLAGS: 00010246
2016-08-04 02:47:57 [115783.528512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff887ed50d7fd8
2016-08-04 02:47:57 [115783.537390] RDX: ffffffff00000001 RSI: 000000000000002f RDI: ffffffff81886372
2016-08-04 02:47:57 [115783.546265] RBP: ffff887ed50d78e8 R08: ffff8870f59ada00 R09: ffff883f7ec07b00
2016-08-04 02:47:57 [115783.555130] R10: ffffffffa0f6f1ec R11: ffff88768502ba60 R12: ffff887f246b6a00
2016-08-04 02:47:57 [115783.563974] R13: 0000000000000000 R14: ffff88768502ba40 R15: ffff8870f59ada00
2016-08-04 02:47:57 [115783.572814] FS:  0000000000000000(0000) GS:ffff887f7eac0000(0000) knlGS:0000000000000000
2016-08-04 02:47:57 [115783.582719] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-08-04 02:47:57 [115783.589997] CR2: 0000000000000000 CR3: 0000000001962000 CR4: 00000000001407e0
2016-08-04 02:47:57 [115783.598840] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-08-04 02:47:57 [115783.607672] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2016-08-04 02:47:57 [115783.616493] Stack:
2016-08-04 02:47:57 [115783.619577]  0000000000000000 ffff887ed50d7938 ffffffffa0ca6f84 ffff88768502ba40
2016-08-04 02:47:57 [115783.628726]  ffff8878897d8000 ffff887ef6bcd2c0 ffff887f246b6a00 ffff887ef6bcd2c0
2016-08-04 02:47:57 [115783.637872]  ffff8878897d8000 ffff88768502ba40 ffff8870f59ada00 ffff887ed50d7978
2016-08-04 02:47:57 [115783.647011] Call Trace:
2016-08-04 02:47:57 [115783.650601]  [&amp;lt;ffffffffa0ca6f84&amp;gt;] llog_cat_add_rec+0x1d4/0x780 [obdclass]
2016-08-04 02:47:57 [115783.659046]  [&amp;lt;ffffffffa0c9fa3a&amp;gt;] llog_add+0x7a/0x1a0 [obdclass]
2016-08-04 02:47:57 [115783.666616]  [&amp;lt;ffffffffa0f6f1ec&amp;gt;] ? sub_updates_write+0x7f6/0xef8 [ptlrpc]
2016-08-04 02:47:57 [115783.675142]  [&amp;lt;ffffffffa0f6f5e3&amp;gt;] sub_updates_write+0xbed/0xef8 [ptlrpc]
2016-08-04 02:47:57 [115783.683642]  [&amp;lt;ffffffffa0f5dc0f&amp;gt;] top_trans_stop+0x62f/0x970 [ptlrpc]
2016-08-04 02:47:57 [115783.691833]  [&amp;lt;ffffffffa1213399&amp;gt;] lod_trans_stop+0x259/0x340 [lod]
2016-08-04 02:47:57 [115783.699562]  [&amp;lt;ffffffffa0d02380&amp;gt;] ? linkea_add_buf+0x80/0x170 [obdclass]
2016-08-04 02:47:57 [115783.707871]  [&amp;lt;ffffffffa12a18fa&amp;gt;] mdd_trans_stop+0x1a/0x1c [mdd]
2016-08-04 02:47:57 [115783.715384]  [&amp;lt;ffffffffa128cb08&amp;gt;] mdd_link+0x2e8/0x930 [mdd]
2016-08-04 02:47:57 [115783.722511]  [&amp;lt;ffffffffa0ee9522&amp;gt;] ? lustre_msg_get_versions+0x22/0xf0 [ptlrpc]
2016-08-04 02:47:57 [115783.731563]  [&amp;lt;ffffffffa115f08e&amp;gt;] mdt_reint_link+0xade/0xc30 [mdt]
2016-08-04 02:47:57 [115783.739437]  [&amp;lt;ffffffff81309f82&amp;gt;] ? strlcpy+0x42/0x60
2016-08-04 02:47:57 [115783.745864]  [&amp;lt;ffffffffa1161470&amp;gt;] mdt_reint_rec+0x80/0x210 [mdt]
2016-08-04 02:47:57 [115783.753352]  [&amp;lt;ffffffffa1144971&amp;gt;] mdt_reint_internal+0x5e1/0x990 [mdt]
2016-08-04 02:47:57 [115783.761426]  [&amp;lt;ffffffffa114e0d7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
2016-08-04 02:47:57 [115783.768525]  [&amp;lt;ffffffffa0f49695&amp;gt;] tgt_request_handle+0x915/0x1320 [ptlrpc]
2016-08-04 02:47:57 [115783.776987]  [&amp;lt;ffffffffa0ef60cb&amp;gt;] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
2016-08-04 02:47:57 [115783.786323]  [&amp;lt;ffffffffa0784758&amp;gt;] ? lc_watchdog_touch+0x68/0x180 [libcfs]
2016-08-04 02:47:57 [115783.794830]  [&amp;lt;ffffffffa0ef3c9b&amp;gt;] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
2016-08-04 02:47:57 [115783.803172]  [&amp;lt;ffffffff810bd4c2&amp;gt;] ? default_wake_function+0x12/0x20
2016-08-04 02:47:57 [115783.810903]  [&amp;lt;ffffffff810b33f8&amp;gt;] ? __wake_up_common+0x58/0x90
2016-08-04 02:47:57 [115783.818150]  [&amp;lt;ffffffffa0efa170&amp;gt;] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
2016-08-04 02:47:57 [115783.825862]  [&amp;lt;ffffffff81015588&amp;gt;] ? __switch_to+0xf8/0x4d0
2016-08-04 02:47:57 [115783.832686]  [&amp;lt;ffffffffa0ef96e0&amp;gt;] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
2016-08-04 02:47:57 [115783.841633]  [&amp;lt;ffffffff810a997f&amp;gt;] kthread+0xcf/0xe0
2016-08-04 02:47:57 [115783.847742]  [&amp;lt;ffffffff810a98b0&amp;gt;] ? kthread_create_on_node+0x140/0x140
2016-08-04 02:47:57 [115783.855707]  [&amp;lt;ffffffff8165d658&amp;gt;] ret_from_fork+0x58/0x90
2016-08-04 02:47:57 [115783.862363]  [&amp;lt;ffffffff810a98b0&amp;gt;] ? kthread_create_on_node+0x140/0x140
2016-08-04 02:47:57 [115783.870270] Code: d2 be 2f 00 00 00 48 89 e5 53 48 89 fb 48 c7 c7 72 63 88 81 e8 00 49 a6 ff e8 8b 10 00 00 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 &amp;lt;f0&amp;gt; 48 0f c1 10 85 d2 74 05 e8 f0 e9 cb ff 5b 5d c3 55 48 89 e5 
2016-08-04 02:47:57 [115783.893325] RIP  [&amp;lt;ffffffff81651232&amp;gt;] down_write+0x32/0x43
2016-08-04 02:47:57 [115783.900088]  RSP &amp;lt;ffff887ed50d78e0&amp;gt;
2016-08-04 02:47:57 [115783.904604] CR2: 0000000000000000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="160848" author="di.wang" created="Thu, 4 Aug 2016 20:24:02 +0000"  >&lt;p&gt;This looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7800&quot; title=&quot;Panic during recovery of soak-test.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7800&quot;&gt;&lt;del&gt;LU-7800&lt;/del&gt;&lt;/a&gt;. Could you please try this patch &lt;a href=&quot;http://review.whamcloud.com/#/c/18542/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/18542/&lt;/a&gt; . &lt;/p&gt;</comment>
                            <comment id="161359" author="morrone" created="Tue, 9 Aug 2016 22:56:24 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7800&quot; title=&quot;Panic during recovery of soak-test.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7800&quot;&gt;&lt;del&gt;LU-7800&lt;/del&gt;&lt;/a&gt; caused more problems than it was intended to solve.  I opened &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8489&quot; title=&quot;llog_cat_add_rec NULL pointer dereference&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8489&quot;&gt;&lt;del&gt;LU-8489&lt;/del&gt;&lt;/a&gt; to track the NULL pointer dereference in llog_cat_add_rec(), since Intel believes that it is a separate problem.&lt;/p&gt;</comment>
                            <comment id="161539" author="gerrit" created="Thu, 11 Aug 2016 05:49:35 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/21334/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/21334/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8370&quot; title=&quot;ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8370&quot;&gt;&lt;del&gt;LU-8370&lt;/del&gt;&lt;/a&gt; dne: error in spliting update records&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: aa5d6bc2aa9abc745e6f590048e270c59265f699&lt;/p&gt;</comment>
                            <comment id="161577" author="pjones" created="Thu, 11 Aug 2016 11:57:22 +0000"  >&lt;p&gt;Landed for 2.9. Residual issues are being tracked under separate tickets.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                                                <inwardlinks description="is blocked by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="34835">LU-7800</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="22221" name="jet13.log" size="1311131" author="morrone" created="Tue, 12 Jul 2016 21:09:58 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzygnr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>