<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:08:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7419] llog corruption after hitting ASSERTION( handle-&gt;lgh_hdr == ((void *)0) ) in llog_init_handle</title>
                <link>https://jira.whamcloud.com/browse/LU-7419</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;A kernel panic occurred while initializing a new llog catalog record (details below). The panic was triggered by a failed assertion that checks that the lgh_hdr pointer (the pointer to the llog header struct) in the llog_handle is NULL. It occurred while I was running an mdtest job that was writing to a striped directory from 32 client nodes running 4 threads each. The panic left an llog file corrupted. I manually repaired the llog file and restarted my MDSs, and recovery now completes.&lt;/p&gt;

&lt;p&gt;To summarize my setup, I am running a test cluster with 4 MDSs, 2 OSTs, and 32 client nodes with the filesystem mounted. There is no failover. The Lustre version running is 2.7.62. The error messages and call stack from MDS1 are below:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) ASSERTION( handle-&amp;gt;lgh_hdr == ((void *)0) ) failed:
2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) LBUG
2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005
2015-11-04 13:36:26 Nov  4 13:36:26
...
2015-11-04 13:36:26 Kernel panic - not syncing: LBUG
2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005 Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1
2015-11-04 13:36:26 Call Trace:
2015-11-04 13:36:26  [&amp;lt;ffffffff8152d471&amp;gt;] ? panic+0xa7/0x16f
2015-11-04 13:36:26  [&amp;lt;ffffffffa0847f2b&amp;gt;] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2015-11-04 13:36:26  [&amp;lt;ffffffffa09a62cf&amp;gt;] ? llog_init_handle+0x86f/0xb10 [obdclass]
2015-11-04 13:36:26  [&amp;lt;ffffffffa09ac809&amp;gt;] ? llog_cat_new_log+0x3d9/0xdc0 [obdclass]
2015-11-04 13:36:26  [&amp;lt;ffffffffa09a4663&amp;gt;] ? llog_declare_write_rec+0x93/0x210 [obdclass]
2015-11-04 13:36:26  [&amp;lt;ffffffffa09ad616&amp;gt;] ? llog_cat_declare_add_rec+0x426/0x430 [obdclass]
2015-11-04 13:36:26  [&amp;lt;ffffffffa09a406f&amp;gt;] ? llog_declare_add+0x7f/0x1b0 [obdclass]
2015-11-04 13:36:26  [&amp;lt;ffffffffa0c9c19c&amp;gt;] ? top_trans_start+0x17c/0x960 [ptlrpc]
2015-11-04 13:36:26  [&amp;lt;ffffffffa127cc11&amp;gt;] ? lod_trans_start+0x61/0x70 [lod]
2015-11-04 13:36:26  [&amp;lt;ffffffffa13248b4&amp;gt;] ? mdd_trans_start+0x14/0x20 [mdd]
2015-11-04 13:36:26  [&amp;lt;ffffffffa1313333&amp;gt;] ? mdd_create+0xe53/0x1aa0 [mdd]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11c6784&amp;gt;] ? mdt_version_save+0x84/0x1a0 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11c8f46&amp;gt;] ? mdt_reint_create+0xbb6/0xcc0 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa0a13230&amp;gt;] ? lu_ucred+0x20/0x30 [obdclass]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11a8675&amp;gt;] ? mdt_ucred+0x15/0x20 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11c183c&amp;gt;] ? mdt_root_squash+0x2c/0x3f0 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa0c43d32&amp;gt;] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11c597d&amp;gt;] ? mdt_reint_rec+0x5d/0x200 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11b177b&amp;gt;] ? mdt_reint_internal+0x62b/0xb80 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa11b216b&amp;gt;] ? mdt_reint+0x6b/0x120 [mdt]
2015-11-04 13:36:26  [&amp;lt;ffffffffa0c8621c&amp;gt;] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
2015-11-04 13:36:26  [&amp;lt;ffffffffa0c2da21&amp;gt;] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
2015-11-04 13:36:26  [&amp;lt;ffffffff8106d740&amp;gt;] ? pick_next_task_fair+0xd0/0x130
2015-11-04 13:36:26  [&amp;lt;ffffffff8152d8f6&amp;gt;] ? schedule+0x176/0x3a0
2015-11-04 13:36:26  [&amp;lt;ffffffffa0c2cbe0&amp;gt;] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
2015-11-04 13:36:26  [&amp;lt;ffffffff8109fffe&amp;gt;] ? kthread+0x9e/0xc0
2015-11-04 13:36:27  [&amp;lt;ffffffff8100c24a&amp;gt;] ? child_rip+0xa/0x20
2015-11-04 13:36:27  [&amp;lt;ffffffff8109ff60&amp;gt;] ? kthread+0x0/0xc0
2015-11-04 13:36:27  [&amp;lt;ffffffff8100c240&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After rebooting MDS1, I started to see the llog corruption messages shown below for an llog file on MDS4 (note that the panic was on MDS1):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can&apos;t read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Eventually, the recovery timer goes negative and recovery never ends (see &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-6994&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;LU-6994&lt;/a&gt;). I manually fixed the llog file on MDS4 and recovery now completes. Attached are the original corrupted llog (0x300000401.0x1.0x0) and the version after the fix (0x300000401.0x1.0x0.replace).&lt;/p&gt;</description>
                <environment>2.6.32-504.16.2.1chaos.ch5.3.x86_64</environment>
        <key id="33100">LU-7419</key>
            <summary>llog corruption after hitting ASSERTION( handle-&gt;lgh_hdr == ((void *)0) ) in llog_init_handle</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="di.wang">Di Wang</assignee>
                                    <reporter username="dinatale2">Giuseppe Di Natale</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 11 Nov 2015 23:56:08 +0000</created>
                <updated>Sat, 9 Jan 2016 15:20:23 +0000</updated>
                            <resolved>Mon, 14 Dec 2015 05:29:32 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="133324" author="di.wang" created="Thu, 12 Nov 2015 01:20:22 +0000"  >&lt;p&gt;Hmm, we need to lock the log handle (lgh_lock) during llog_cat_new_log, otherwise it might cause a race like this.&lt;br/&gt;
I will cook up a patch. Btw: I am trying to resolve all of the update llog corruption issues in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7039&quot; title=&quot;llog_osd.c:778:llog_osd_next_block()) ASSERTION( last_rec-&amp;gt;lrh_index == tail-&amp;gt;lrt_index ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7039&quot;&gt;&lt;del&gt;LU-7039&lt;/del&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Btw: recovery most likely never ends because update recovery fails.&lt;/p&gt;</comment>
                            <comment id="133330" author="gerrit" created="Thu, 12 Nov 2015 05:17:43 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/17132&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17132&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7419&quot; title=&quot;llog corruption after hitting ASSERTION( handle-&amp;gt;lgh_hdr == ((void *)0) ) in llog_init_handle&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7419&quot;&gt;&lt;del&gt;LU-7419&lt;/del&gt;&lt;/a&gt; llog: lock new llog object creation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 1917a3075506fd7879888bf1698022ecbc7cdc5d&lt;/p&gt;</comment>
                            <comment id="133524" author="gerrit" created="Fri, 13 Nov 2015 22:54:05 +0000"  >&lt;p&gt;Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/17196&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17196&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7419&quot; title=&quot;llog corruption after hitting ASSERTION( handle-&amp;gt;lgh_hdr == ((void *)0) ) in llog_init_handle&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7419&quot;&gt;&lt;del&gt;LU-7419&lt;/del&gt;&lt;/a&gt; llog: lock new llog object creation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ef35db927b07c08ca32f5bfd0b43e1e07430439b&lt;/p&gt;</comment>
                            <comment id="134275" author="adilger" created="Mon, 23 Nov 2015 18:57:48 +0000"  >&lt;p&gt;The patch &lt;a href=&quot;http://review.whamcloud.com/17132&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17132&lt;/a&gt; is intended for landing; the 17196 patch is just for testing.&lt;/p&gt;</comment>
                            <comment id="135978" author="dinatale2" created="Thu, 10 Dec 2015 23:49:26 +0000"  >&lt;p&gt;I applied this patch locally to Lustre running on one of our test clusters. I am still using the setup mentioned in the comments of &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-6994&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;LU-6994&lt;/a&gt;. I no longer see MDSs crashing with this call stack.&lt;/p&gt;</comment>
                            <comment id="136166" author="gerrit" created="Sun, 13 Dec 2015 21:57:45 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/17132/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/17132/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7419&quot; title=&quot;llog corruption after hitting ASSERTION( handle-&amp;gt;lgh_hdr == ((void *)0) ) in llog_init_handle&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7419&quot;&gt;&lt;del&gt;LU-7419&lt;/del&gt;&lt;/a&gt; llog: lock new llog object creation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 63a3e412bddcc94b7497aecee91864813a614f83&lt;/p&gt;</comment>
                            <comment id="136182" author="pjones" created="Mon, 14 Dec 2015 05:29:33 +0000"  >&lt;p&gt;Landed for 2.8&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="31450">LU-6994</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="19579" name="0x300000401.0x1.0x0" size="34624" author="dinatale2" created="Wed, 11 Nov 2015 23:56:08 +0000"/>
                            <attachment id="19580" name="0x300000401.0x1.0x0.replace" size="34624" author="dinatale2" created="Wed, 11 Nov 2015 23:56:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxsuv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>