<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:21:29 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
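<!-- 
Illustrative example (assuming the standard JIRA XML issue view path for this issue):
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-15809/LU-15809.xml?field=key&field=summary
-->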
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15809] replay-dual test_29: timeout llog_verify_record() lustre-MDT0000-osp-MDT0001: record is too large: 0 &gt; 32768</title>
                <link>https://jira.whamcloud.com/browse/LU-15809</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://testing.whamcloud.com/test_sets/9c32c9af-c574-4023-9286-07091f92769c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/9c32c9af-c574-4023-9286-07091f92769c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;test_29 failed with the following error:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Started lustre-MDT0001

Timeout occurred after 135 mins, last suite running was replay-dual
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It looks like the MDS is having trouble reading the recovery llog and is stuck doing this forever with &quot;&lt;tt&gt;retry remote llog process&lt;/tt&gt;&quot;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 0 &amp;gt; 32768
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 0/2: rc = -22
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) lustre-MDT0000-osp-MDT0001: magic 0 is bad
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) lustre-MDT0000-osp-MDT0001 retry remote llog process
[Mon Dec 27 22:49:22 2021] Lustre: lustre-MDT0001: in recovery but waiting for the first client to connect
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 400547 &amp;gt; 32768
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) Skipped 205 previous similar messages
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 96/0: rc = -22
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) Skipped 309 previous similar messages
:
:
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) lustre-MDT0000-osp-MDT0001: magic 0 is bad
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) Skipped 129784 previous similar messages
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) lustre-MDT0000-osp-MDT0001 retry remote llog process
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) Skipped 32445 previous similar messages
[Mon Dec 27 23:36:29 2021] Lustre: 113052:0:(ldlm_lib.c:1962:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
[Mon Dec 27 23:36:29 2021] Lustre: 113052:0:(ldlm_lib.c:1962:extend_recovery_timer()) Skipped 29 previous similar messages
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 0 &amp;gt; 32768
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) Skipped 258999 previous similar messages
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 0/0: rc = -22
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) Skipped 388499 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;





&lt;p&gt;VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV&lt;br/&gt;
replay-dual test_29 - Timeout occurred after 135 mins, last suite running was replay-dual&lt;/p&gt;</description>
                <environment></environment>
        <key id="70092">LU-15809</key>
            <summary>replay-dual test_29: timeout llog_verify_record() lustre-MDT0000-osp-MDT0001: record is too large: 0 &gt; 32768</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Fri, 29 Apr 2022 20:34:36 +0000</created>
                <updated>Wed, 14 Jun 2023 00:01:20 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="333475" author="bzzz" created="Sat, 30 Apr 2022 04:27:22 +0000"  >&lt;p&gt;this is likely a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15139&quot; title=&quot;sanity test_160h: dt_record_write() ASSERTION( dt-&amp;gt;do_body_ops-&amp;gt;dbo_write ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15139&quot;&gt;&lt;del&gt;LU-15139&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="348923" author="ake_s" created="Thu, 6 Oct 2022 16:52:09 +0000"  >&lt;p&gt;I&apos;m getting this on a production system, but I don&apos;t see any traces of what&apos;s described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15139&quot; title=&quot;sanity test_160h: dt_record_write() ASSERTION( dt-&amp;gt;do_body_ops-&amp;gt;dbo_write ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15139&quot;&gt;&lt;del&gt;LU-15139&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Running DDN ExaScaler 5.2.5 with Lustre 2.12.8_ddn9&lt;/p&gt;</comment>
                            <comment id="367742" author="sthiell" created="Wed, 29 Mar 2023 16:00:57 +0000"  >&lt;p&gt;We hit the same issue today with 2.12.8 with patches (close to 2.12.9) when failing over MDT0001 to another server.&lt;/p&gt;

&lt;p&gt;symptoms:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:461:llog_verify_record()) oak-MDT0000-osp-MDT0001: record is too large: 0 &amp;gt; 32768
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:461:llog_verify_record()) Skipped 409732 previous similar messages
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:660:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid record in llog [0x1:0x181e9:0x2] record for index 0/6: rc = -22
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:660:llog_process_thread()) Skipped 409722 previous similar messages
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:785:llog_process_thread()) oak-MDT0000-osp-MDT0001 retry remote llog process
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:785:llog_process_thread()) Skipped 409721 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And a thread named &lt;tt&gt;lod0001_rec0000&lt;/tt&gt; was very active. Live backtrace for this thread:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; bt 9824
PID: 9824   TASK: ffff8c1c61d16300  CPU: 19  COMMAND: &quot;lod0001_rec0000&quot;
(active)
crash&amp;gt; bt 9824
PID: 9824   TASK: ffff8c1c61d16300  CPU: 19  COMMAND: &quot;lod0001_rec0000&quot;
(active)
crash&amp;gt; bt 9824
PID: 9824   TASK: ffff8c1c61d16300  CPU: 19  COMMAND: &quot;lod0001_rec0000&quot;
 #0 [ffff8c1bbc7c75f0] __schedule at ffffffff9e1b78d8
 #1 [ffff8c1bbc7c7650] ptlrpc_check_set at ffffffffc162afb8 [ptlrpc]
 #2 [ffff8c1bbc7c76f8] get_page_from_freelist at ffffffff9dbd34df
 #3 [ffff8c1bbc7c7810] __alloc_pages_nodemask at ffffffff9dbd3dc4
 #4 [ffff8c1bbc7c7940] alloc_pages_current at ffffffff9dc256c8
 #5 [ffff8c1bbc7c7988] kmalloc_order at ffffffff9dbf1688
 #6 [ffff8c1bbc7c79f8] llog_process_thread at ffffffffc135e1ac [obdclass]
 #7 [ffff8c1bbc7c7b38] llog_process_or_fork at ffffffffc135fca9 [obdclass]
 #8 [ffff8c1bbc7c7ba8] llog_cat_process_cb at ffffffffc1365419 [obdclass]
 #9 [ffff8c1bbc7c7c00] llog_process_thread at ffffffffc135eaba [obdclass]
#10 [ffff8c1bbc7c7d18] llog_process_or_fork at ffffffffc135fca9 [obdclass]
#11 [ffff8c1bbc7c7d88] llog_cat_process_or_fork at ffffffffc1361cb9 [obdclass]
#12 [ffff8c1bbc7c7e08] llog_cat_process at ffffffffc1361e6e [obdclass]
#13 [ffff8c1bbc7c7e28] lod_sub_recovery_thread at ffffffffc1299b49 [lod]
#14 [ffff8c1bbc7c7ec8] kthread at ffffffff9dacb511
#15 [ffff8c1bbc7c7f50] ret_from_fork_nospec_begin at ffffffff9e1c51dd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The recovery of MDT1 could not complete.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oak-md1-s2 ~]# cat /proc/fs/lustre/mdt/oak-MDT0001/recovery_status 
status: WAITING
non-ready MDTs:  0000
recovery_start: 677513
time_waited: 1679427391
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a workaround, we used a manual &lt;tt&gt;abort_recovery&lt;/tt&gt; which succeeded and stopped the thread &lt;tt&gt;lod0001_rec0000&lt;/tt&gt;.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl --device oak-MDT0001 abort_recovery
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 29 08:55:39 oak-md1-s2 kernel: Lustre: oak-MDT0001: Recovery over after 16:16, of 1885 clients 1122 recovered and 763 were evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="375346" author="adilger" created="Tue, 13 Jun 2023 23:56:47 +0000"  >&lt;p&gt;Failed 13x on master over the past week.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="71587">LU-16066</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="74391">LU-16539</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="63086">LU-15139</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02oo7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>