<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:59:23 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13215] sanity-pfl test 17 hangs with &#8220;incorrect message magic&#8221;</title>
                <link>https://jira.whamcloud.com/browse/LU-13215</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;With sanity-pfl tests 16a (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13205&quot; title=&quot;sanity-pfl test 16a fails with &#8220;setstripe /mnt/lustre/d16.sanity-pfl/f16.sanity-pfl.copy failed&#8220;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13205&quot;&gt;LU-13205&lt;/a&gt;) and 16b (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13207&quot; title=&quot;sanity-pfl test 16b crashes in &#8220;Oops: Kernel access of bad area&#8221;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13207&quot;&gt;LU-13207&lt;/a&gt;) being skipped (on the ALWAYS_EXCEPT list), now test 17 hangs repeatedly for PPC client testing. See&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/efe243fc-4908-11ea-b69a-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/efe243fc-4908-11ea-b69a-52540065bddc&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output for each of these hangs looks different, although the root cause of the crash may be the same. For example, looking at the client1 (77vm7) console log for the hang at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc&lt;/a&gt;, we see&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 1175.191331] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 20:56:28 (1580936188)
[ 1175.237552] LustreError: 2178:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1175.269569] LustreError: 2188:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: d30bd00b
[ 1175.269697] LustreError: 2188:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: d30bd00b (msg:c0000000b55af000)
[ 1175.269781] LustreError: 2188:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: d30bd00b

&amp;lt;ConMan&amp;gt; Console [trevis-77vm7] disconnected from &amp;lt;trevis-77:6006&amp;gt; at 02-05 22:26.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the client1 console log for the hang at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc&lt;/a&gt;, we see&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 1170.068238] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 16:07:54 (1581005274)
[ 1170.164526] LustreError: 2131:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1170.185519] LustreError: 2141:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: 00000200
[ 1170.185636] LustreError: 2141:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000b9a42000)
[ 1170.185719] LustreError: 2141:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: 00000200
[ 1170.192095] swap_info_get: Bad swap file entry 8000002ac500000
[ 1170.192172] BUG: Bad page map in process lfs  pte:ab14000007010088 pmd:c0000000b9a52000
[ 1170.192226] addr:0000000010000000 vm_flags:00000875 anon_vma:          (null) mapping:c0000000b84c8a20 index:0
[ 1170.192337] vma-&amp;gt;vm_ops-&amp;gt;fault: ext4_filemap_fault+0x0/0xfffffffffffe74d8 [ext4]
[ 1170.192401] vma-&amp;gt;vm_file-&amp;gt;f_op-&amp;gt;mmap: ext4_file_mmap+0x0/0xfffffffffffe7a78 [ext4]
[ 1170.192457] CPU: 1 PID: 2144 Comm: lfs Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7.ppc64 #1
[ 1170.192530] Call Trace:
[ 1170.192583] [c0000000b42a76a0] [c00000000001f078] .show_stack+0x88/0x330 (unreliable)
[ 1170.192667] [c0000000b42a7760] [c000000000b1c56c] .dump_stack+0x28/0x3c
[ 1170.192730] [c0000000b42a77d0] [c00000000030b018] .print_bad_pte+0x198/0x260
[ 1170.192788] [c0000000b42a7880] [c00000000030c8e4] .unmap_page_range+0x414/0xc20
[ 1170.192863] [c0000000b42a7a40] [c00000000031087c] .unmap_vmas+0xec/0x1c0
[ 1170.192917] [c0000000b42a7af0] [c00000000031e318] .exit_mmap+0xf8/0x250
[ 1170.192972] [c0000000b42a7c20] [c0000000000ecd98] .mmput+0xa8/0x1a0
[ 1170.193026] [c0000000b42a7ca0] [c0000000000ffd10] .do_exit+0x310/0xbb0
[ 1170.193080] [c0000000b42a7da0] [c000000000100774] .SyS_exit_group+0x54/0x100
[ 1170.193134] [c0000000b42a7e30] [c00000000000a288] system_call+0x3c/0x100
[ 1170.193189] Disabling lock debugging due to kernel taint
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the other hand, there is obviously something wrong with the output for other hangs. Looking at the hang at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc&lt;/a&gt;, the most interesting output is on client1 (77vm1):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 1190.287068] LustreError: 2154:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1190.317013] Oops: Exception in kernel mode, sig: 4 [#1]
[ 1190.317109] SMP NR_CPUS=2048 NUMA pSeries
[ 1190.317155] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt gratc_te
[ 1190.332221] Unrecoverable exception 0 at d000000004362060
[ 1190.332347] Oops: Unrecoverable exception, sig: 6 [#2]
[ 1190.332385] SMP NR_CPUS=2048 NUMA pSeries
[ 1190.332428] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) &#65533;(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) () () () () () () () () () () () () () () () () &#8230;&amp;lt; several hundred () cut out to keep sane&amp;gt;
 () () () () () () () () () () () () ()
[ 1191.364389]  &#65533;() &#65533;() &#65533;() &#65533;() &#65533;() &#65533;() &#65533;() &#65533;) &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#8230; &amp;lt; several hundred &#8220; &#65533;(&#8220; removed&amp;gt;
 &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() 
&#65533;&#65533;li&#65533; )&#65533;z( u&#65533;&apos;[/&#65533;&#65533;&#65533;6Y@ C&#65533;x&#65533;LB&#65533;&#65533;@m &#65533;*&#65533; &#65533;&#65533;^&#65533;&#65533;c &#65533;&#65533;&#65533;&#65533;wkgN1dx3 d&#65533;G	
&#65533;^	&#65533;Bj 
&#65533;() &#65533;() &#65533;() &#65533;() &#65533;&#1174;AMT&#65533; &#65533;x&#65533;&#289;&#65533;&#65533; &#293;n d&#2029;&#65533;&amp;gt;&#46683;&#65533;&#65533;A&#65533;() &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() () &#65533;() &#65533;() &#65533;() bute it and/or modify it
&#8230;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and it only gets worse from there.&lt;/p&gt;</description>
                <environment>PPC clients</environment>
        <key id="58022">LU-13215</key>
            <summary>sanity-pfl test 17 hangs with &#8220;incorrect message magic&#8221;</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                            <label>always_except</label>
                            <label>ppc</label>
                    </labels>
                <created>Fri, 7 Feb 2020 16:35:27 +0000</created>
                <updated>Fri, 21 Feb 2020 23:17:12 +0000</updated>
                                            <version>Lustre 2.14.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="262837" author="adilger" created="Fri, 7 Feb 2020 16:59:28 +0000"  >&lt;p&gt;It looks like there is either missing swabbing or incorrect swabbing of RPC message buffers. Did these failures (or those of the previously skipped tests) start after specific patch landings? That would make it relatively easy to track down the source; otherwise it needs to be done by looking through the code.&lt;/p&gt;</comment>
                            <comment id="262969" author="jamesanunez" created="Mon, 10 Feb 2020 02:32:41 +0000"  >&lt;p&gt;Here&apos;s what I can see; this information applies only to PPC testing.&lt;/p&gt;

&lt;p&gt;It looks like sanity-pfl test 17 failed with &quot;Create /mnt/lustre/d16b.sanity-pfl/f16b.sanity-pfl failed&quot; from before January 2019 through 26 JULY 2019. It didn&#8217;t run (and hang) again until 17 OCT 2019, and then again on 06 FEB 2020, until it was added to the ALWAYS_EXCEPT list by the ALWAYS_EXCEPT list patch.&lt;/p&gt;

&lt;p&gt;sanity-pfl test 16b was added on 03 JUNE 2019 by patch &lt;a href=&quot;https://review.whamcloud.com/28425/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28425/&lt;/a&gt; and failed with &quot;Create /mnt/lustre/d16b.sanity-pfl/f16b.sanity-pfl failed&quot; until it started crashing continuously from 30 JULY 2019, with one failure on 17 OCT 2019, until it was added to the ALWAYS_EXCEPT list patch on 05 FEB 2020.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="58001">LU-13205</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="58003">LU-13207</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00tdj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>