<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:16:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8362] page fault: exception RIP: lnet_mt_match_md+135</title>
                <link>https://jira.whamcloud.com/browse/LU-8362</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;OSS console errors&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;LNet: Can&apos;t send to 17456000@&amp;lt;65535:34821&amp;gt;: src 0@&amp;lt;0:0&amp;gt; is not a local nid
LNet: 46045:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@&amp;lt;65535:34821&amp;gt;: -22
LNet: Can&apos;t send to 17456000@&amp;lt;65535:34821&amp;gt;: src 0@&amp;lt;0:0&amp;gt; is not a local nid
LNet: 56154:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@&amp;lt;65535:34821&amp;gt;: -22
------------[ cut here ]------------
WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
Hardware name: SUMMIT
list_del corruption. prev-&amp;gt;next should be ffff881d63ead4d0, but was (&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt;)
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 acpi_cpufreq freq_table mperf lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc scsi_transport_fc scsi_tgt sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_mirror dm_region_hash dm_log dm_multipath dm_mod iTCO_wdt iTCO_vendor_support microcode sg wmi igb hwmon dca i2c_algo_bit ptp pps_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_bic ext3 jbd sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) ahci gru [last unloaded: scsi_wait_scan]
Pid: 8603, comm: kiblnd_sd_02_01 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
Call Trace:
 [&amp;lt;ffffffff81074127&amp;gt;] ? warn_slowpath_common+0x87/0xc0
 [&amp;lt;ffffffff81074216&amp;gt;] ? warn_slowpath_fmt+0x46/0x50
 [&amp;lt;ffffffff812bda6e&amp;gt;] ? list_del+0x6e/0xa0
 [&amp;lt;ffffffffa052c5c9&amp;gt;] ? lnet_me_unlink+0x39/0x140 [lnet]
 [&amp;lt;ffffffffa05303f8&amp;gt;] ? lnet_md_unlink+0x2f8/0x3e0 [lnet]
 [&amp;lt;ffffffffa0531b9f&amp;gt;] ? lnet_try_match_md+0x22f/0x310 [lnet]
 [&amp;lt;ffffffffa0a1f727&amp;gt;] ? kiblnd_recv+0x107/0x780 [ko2iblnd]
 [&amp;lt;ffffffffa0531d1c&amp;gt;] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
 [&amp;lt;ffffffffa0532621&amp;gt;] ? lnet_ptl_match_md+0x281/0x870 [lnet]
 [&amp;lt;ffffffffa05396e7&amp;gt;] ? lnet_parse_local+0x307/0xc60 [lnet]
 [&amp;lt;ffffffffa053a6da&amp;gt;] ? lnet_parse+0x69a/0xcf0 [lnet]
 [&amp;lt;ffffffffa0a1ff3b&amp;gt;] ? kiblnd_handle_rx+0x19b/0x620 [ko2iblnd]
 [&amp;lt;ffffffffa0a212be&amp;gt;] ? kiblnd_scheduler+0xefe/0x10d0 [ko2iblnd]
 [&amp;lt;ffffffff81064f90&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0a203c0&amp;gt;] ? kiblnd_scheduler+0x0/0x10d0 [ko2iblnd]
 [&amp;lt;ffffffff8109dc8e&amp;gt;] ? kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] ? child_rip+0xa/0x20
 [&amp;lt;ffffffff8109dbf0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
---[ end trace 1063d2ffc2578a2f ]---
------------[ cut here ]------------
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From the crash dump, the backtrace looks like this.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;PID: 8603   TASK: ffff8810271fa040  CPU: 11  COMMAND: &lt;span class=&quot;code-quote&quot;&gt;&quot;kiblnd_sd_02_01&quot;&lt;/span&gt;
 #0 [ffff880ff8b734f0] machine_kexec at ffffffff8103b5db
 #1 [ffff880ff8b73550] crash_kexec at ffffffff810c9412
 #2 [ffff880ff8b73620] kdb_kdump_check at ffffffff812973d7
 #3 [ffff880ff8b73630] kdb_main_loop at ffffffff8129a5c7
 #4 [ffff880ff8b73740] kdb_save_running at ffffffff8129472e
 #5 [ffff880ff8b73750] kdba_main_loop at ffffffff8147cd68
 #6 [ffff880ff8b73790] kdb at ffffffff812978c6
 #7 [ffff880ff8b73800] kdba_entry at ffffffff8147c687
 #8 [ffff880ff8b73810] notifier_call_chain at ffffffff81568515
 #9 [ffff880ff8b73850] atomic_notifier_call_chain at ffffffff8156857a
#10 [ffff880ff8b73860] notify_die at ffffffff810a44fe
#11 [ffff880ff8b73890] __die at ffffffff815663e2
#12 [ffff880ff8b738c0] no_context at ffffffff8104c822
#13 [ffff880ff8b73910] __bad_area_nosemaphore at ffffffff8104cad5
#14 [ffff880ff8b73960] bad_area_nosemaphore at ffffffff8104cba3
#15 [ffff880ff8b73970] __do_page_fault at ffffffff8104d29c
#16 [ffff880ff8b73a90] do_page_fault at ffffffff8156845e
#17 [ffff880ff8b73ac0] page_fault at ffffffff81565765
    [exception RIP: lnet_mt_match_md+135]
    RIP: ffffffffa0531d07  RSP: ffff880ff8b73b70  RFLAGS: 00010286
    RAX: ffff881d88420000  RBX: ffff880ff8b73c70  RCX: 0000000000000007
    RDX: 0000000000000004  RSI: ffff880ff8b73c70  RDI: ffffffffffffffff
    RBP: ffff880ff8b73bb0   R8: 0000000000000001   R9: d400000000000000
    R10: 0000000000000001  R11: 0000000000000012  R12: 0000000000000000
    R13: ffff881730ca6200  R14: 00d100120be91b91  R15: 0000000000000008
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#18 [ffff880ff8b73bb8] lnet_ptl_match_md at ffffffffa0532621 [lnet]
#19 [ffff880ff8b73c38] lnet_parse_local at ffffffffa05396e7 [lnet]
#20 [ffff880ff8b73cd8] lnet_parse at ffffffffa053a6da [lnet]
#21 [ffff880ff8b73d68] kiblnd_handle_rx at ffffffffa0a1ff3b [ko2iblnd]
#22 [ffff880ff8b73db8] kiblnd_scheduler at ffffffffa0a212be [ko2iblnd]
#23 [ffff880ff8b73ee8] kthread at ffffffff8109dc8e
#24 [ffff880ff8b73f48] kernel_thread at ffffffff8100c28a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>lustre 2.7.1-fe</environment>
        <key id="37956">LU-8362</key>
            <summary>page fault: exception RIP: lnet_mt_match_md+135</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="mhanafi">Mahmoud Hanafi</reporter>
                        <labels>
                    </labels>
                <created>Sat, 2 Jul 2016 10:03:28 +0000</created>
                <updated>Thu, 14 Jun 2018 21:41:16 +0000</updated>
                            <resolved>Thu, 22 Sep 2016 21:41:36 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="157582" author="jgmitter" created="Sat, 2 Jul 2016 12:38:45 +0000"  >&lt;p&gt;Hi Mahmoud,&lt;/p&gt;

&lt;p&gt;Can you confirm that this is a severity 1 (production filesystem out of service)?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;br/&gt;
Joe&lt;/p&gt;</comment>
                            <comment id="157584" author="bfaccini" created="Sat, 2 Jul 2016 13:44:50 +0000"  >&lt;p&gt;This problem looks similar to the one reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="157586" author="mhanafi" created="Sat, 2 Jul 2016 22:54:25 +0000"  >&lt;p&gt;should be severity 2 or 3&lt;br/&gt;
Also if it is a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; we will need a 2.7fe patch&lt;/p&gt;</comment>
                            <comment id="157587" author="pjones" created="Sat, 2 Jul 2016 23:22:51 +0000"  >&lt;p&gt;ok Mahmoud. Then we&apos;ll assess this more fully on Monday.&lt;/p&gt;</comment>
                            <comment id="157596" author="bfaccini" created="Mon, 4 Jul 2016 10:12:39 +0000"  >&lt;p&gt;Mahmoud, I know that as a non-US citizen I am not allowed to work on the crash-dump, but can you at least  provide some pieces of memory content to help qualify the corruption ?&lt;br/&gt;
Let say,  32 quad-words starting from address 0xffff881d63ead440 (ie, &quot;rd 0xffff881d63ead440 32&quot;), and 32 quad-words starting from address found at location 0xffff881d63ead4d8 minus 0x90.&lt;/p&gt;</comment>
                            <comment id="157611" author="mhanafi" created="Mon, 4 Jul 2016 17:51:22 +0000"  >&lt;p&gt;Here is the info. btw how did you come up with those address locations.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
crash&amp;gt; rd 0xffff881d63ead440 32
ffff881d63ead440:  3463346339376238 65342d616630622d   8b79c4c4-b0fa-4e
ffff881d63ead450:  2d393762652d3134 3162326230316464   41-eb79-dd10b2b1
ffff881d63ead460:  0000000064616535 000000230199f7ea   5ead........#...
ffff881d63ead470:  000574241b1d26fc 0000000000000000   .&amp;amp;..$t..........
ffff881d63ead480:  0000000000000000 0000000000000000   ................
ffff881d63ead490:  0000000000000000 0000000000000000   ................
ffff881d63ead4a0:  0000000000000000 0000000000000000   ................
ffff881d63ead4b0:  0000000000000000 0000000000000000   ................
ffff881d63ead4c0:  ffff8815da7214c0 ffff881dbcd19740   ..r.....@.......
ffff881d63ead4d0:  ffff881b6dcd3cd0 ffffc900220983f0   .&amp;lt;.m.......&quot;....
ffff881d63ead4e0:  0000001490b5d3fa ffffffffffffffff   ................
ffff881d63ead4f0:  ffff881dffffffff 000001000000001c   ................
ffff881d63ead500:  0000000000000000 ffffffffffffffff   ................
ffff881d63ead510:  0000000000000001 ffff881a20848840   ........@.. ....
ffff881d63ead520:  0000000000000000 0000000000000000   ................
ffff881d63ead530:  0000000000000000 0000000000000000   ................


crash&amp;gt; rd ffff881d63ead448 32
ffff881d63ead448:  65342d616630622d 2d393762652d3134   -b0fa-4e41-eb79-
ffff881d63ead458:  3162326230316464 0000000064616535   dd10b2b15ead....
ffff881d63ead468:  000000230199f7ea 000574241b1d26fc   ....#....&amp;amp;..$t..
ffff881d63ead478:  0000000000000000 0000000000000000   ................
ffff881d63ead488:  0000000000000000 0000000000000000   ................
ffff881d63ead498:  0000000000000000 0000000000000000   ................
ffff881d63ead4a8:  0000000000000000 0000000000000000   ................
ffff881d63ead4b8:  0000000000000000 ffff8815da7214c0   ..........r.....
ffff881d63ead4c8:  ffff881dbcd19740 ffff881b6dcd3cd0   @........&amp;lt;.m....
ffff881d63ead4d8:  ffffc900220983f0 0000001490b5d3fa   ...&quot;............
ffff881d63ead4e8:  ffffffffffffffff ffff881dffffffff   ................
ffff881d63ead4f8:  000001000000001c 0000000000000000   ................
ffff881d63ead508:  ffffffffffffffff 0000000000000001   ................
ffff881d63ead518:  ffff881a20848840 0000000000000000   @.. ............
ffff881d63ead528:  0000000000000000 0000000000000000   ................
ffff881d63ead538:  0000000000000000 ffff881d36d68d40   ........@..6....
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="157649" author="bfaccini" created="Tue, 5 Jul 2016 09:54:51 +0000"  >&lt;p&gt;Thanks Mahmoud, the address comes from the msg just before the crash,  &quot;list_del corruption. prev-&amp;gt;next should be ffff881d63ead4d0, but was (null)&quot;.&lt;/p&gt;

&lt;p&gt;Well, according to the memory content, there is no real evidence of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; for the moment.&lt;/p&gt;

&lt;p&gt;Can you provide me with the &quot;log&quot; sub-cmd output, the assembly listing of lnet_mt_match_md() (you will need to load the kernel+Lustre modules&apos; debuginfo before using the &quot;mod -S&quot; crash sub-cmd), and the memory content from &quot;rd ffff881d88420000 100&quot;?&lt;/p&gt;
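&lt;p&gt;For reference, a minimal sketch of the debug check in lib/list_debug.c that emitted the warning quoted above on 2.6.32-era kernels (simplified; the exact code varies slightly between kernel versions):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
/* The first WARN fires when entry-&amp;gt;prev-&amp;gt;next no longer points back at
 * entry, i.e. the list was already corrupted before this unlink. */
void list_del(struct list_head *entry)
{
        WARN(entry-&amp;gt;prev-&amp;gt;next != entry,
             &quot;list_del corruption. prev-&amp;gt;next should be %p, &quot;
             &quot;but was %p\n&quot;, entry, entry-&amp;gt;prev-&amp;gt;next);
        WARN(entry-&amp;gt;next-&amp;gt;prev != entry,
             &quot;list_del corruption. next-&amp;gt;prev should be %p, &quot;
             &quot;but was %p\n&quot;, entry, entry-&amp;gt;next-&amp;gt;prev);
        __list_del(entry-&amp;gt;prev, entry-&amp;gt;next);
        entry-&amp;gt;next = LIST_POISON1;
        entry-&amp;gt;prev = LIST_POISON2;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>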
                            <comment id="157765" author="bfaccini" created="Wed, 6 Jul 2016 07:25:49 +0000"  >&lt;p&gt;Thanks Mahmoud, but I had asked you the assembly listing of lnet_mt_match_md() instead of lnet_ptl_match_md(), can you also provide it ?&lt;/p&gt;</comment>
                            <comment id="157853" author="mhanafi" created="Wed, 6 Jul 2016 17:52:39 +0000"  >&lt;p&gt;attaching lnet_mt_match_md.dis and lnet_mt_match_md.withlinenumbers.dis&lt;/p&gt;</comment>
                            <comment id="158098" author="bfaccini" created="Fri, 8 Jul 2016 09:06:02 +0000"  >&lt;p&gt;Thanks, so seems we crash here in the code :&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;int
lnet_mt_match_md(struct lnet_match_table *mtable,
                 struct lnet_match_info *info, struct lnet_msg *msg)
{
        struct list_head        *head;
        lnet_me_t               *me;
        lnet_me_t               *tmp;
        int                     exhausted = 0;
        int                     rc;

        /* any ME with ignore bits? */
        if (!list_empty(&amp;amp;mtable-&amp;gt;mt_mhash[LNET_MT_HASH_IGNORE]))
                head = &amp;amp;mtable-&amp;gt;mt_mhash[LNET_MT_HASH_IGNORE];
        else
                head = lnet_mt_match_head(mtable, info-&amp;gt;mi_id, info-&amp;gt;mi_mbits);
 again:
        /* NB: only wildcard portal needs to return LNET_MATCHMD_EXHAUSTED */
        if (lnet_ptl_is_wildcard(the_lnet.ln_portals[mtable-&amp;gt;mt_portal]))
                exhausted = LNET_MATCHMD_EXHAUSTED;

        list_for_each_entry_safe(me, tmp, head, me_list) {
                /* ME attached but MD not attached yet */
                if (me-&amp;gt;me_md == NULL)
                        continue;

                LASSERT(me == me-&amp;gt;me_md-&amp;gt;md_me); &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;

                rc = lnet_try_match_md(me-&amp;gt;me_md, info, msg);
                if ((rc &amp;amp; LNET_MATCHMD_EXHAUSTED) == 0)
                        exhausted = 0; /* mlist is not empty */

                if ((rc &amp;amp; LNET_MATCHMD_FINISH) != 0) {
                        /* don&apos;t return EXHAUSTED bit because we don&apos;t know
                         * whether the mlist is empty or not */
                        return rc &amp;amp; ~LNET_MATCHMD_EXHAUSTED;
                }
        }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The crash occurs when dereferencing me-&amp;gt;me_md (loaded via rax): it is found to be non-NULL, but is in fact 0xf...f!&lt;br/&gt;
But me itself (rax=0xffff881d88420000) already looks wrong, which means the previous entry on the linked list may have a corrupted me_list ...&lt;br/&gt;
And to find it we need to walk again from the list start because, according to the assembly code, no reference to the previous entry is left in any register.&lt;br/&gt;
So, could you now provide the output of &quot;rd 0xffff880ff8b73b78 2&quot; and then also run the &quot;list &amp;lt;address&amp;gt;&quot; cmd with the 1st dumped value from the previous &quot;rd&quot; as the address argument (you may need to add a &quot;0x&quot; in front).&lt;/p&gt;
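&lt;p&gt;To illustrate why the corruption produces a garbage me rather than a NULL one: list_for_each_entry_safe() recovers each lnet_me_t from its embedded list_head by pointer arithmetic, so whatever value sits in a corrupted next pointer becomes the me pointer itself. A self-contained sketch of that expansion follows (simplified lnet_me_t layout with only the two fields relevant here, not the real LNet definitions):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
#include &amp;lt;stddef.h&amp;gt;

struct list_head { struct list_head *next, *prev; };

typedef struct lnet_me {
        struct list_head me_list;   /* linked into the match-table hash */
        void            *me_md;     /* the field dereferenced at the crash */
} lnet_me_t;

/* container_of in disguise: subtract the offset of the embedded member */
#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

static void walk(struct list_head *head)
{
        lnet_me_t *me, *tmp;

        /* expansion of list_for_each_entry_safe(me, tmp, head, me_list) */
        for (me = list_entry(head-&amp;gt;next, lnet_me_t, me_list),
             tmp = list_entry(me-&amp;gt;me_list.next, lnet_me_t, me_list);
             &amp;amp;me-&amp;gt;me_list != head;
             me = tmp,
             tmp = list_entry(tmp-&amp;gt;me_list.next, lnet_me_t, me_list)) {
                /* if a next pointer is garbage (e.g. 0xffff881d88420000),
                 * me is garbage too and me-&amp;gt;me_md reads wild memory */
                (void)me-&amp;gt;me_md;
        }
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>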
                            <comment id="158595" author="bfaccini" created="Wed, 13 Jul 2016 08:40:42 +0000"  >&lt;p&gt;Yes, only for the first value/pointer in fact!&lt;/p&gt;

&lt;p&gt;Well, this 0xffff88201fc9b000 value should be &quot;head&quot; and thus point to the chosen hash-list head in the mtable/match-table, and it also seems wrong (I mean strangely page-aligned, like the next entry on the list, 0xffff881d88420000, which is also the current one causing the crash!).&lt;/p&gt;

&lt;p&gt;Thus, I need more from the current stack frame to confirm my current thoughts, so can you also dump its full content using the &quot;rd ffff880ff8b73b70 20&quot; cmd?&lt;/p&gt;</comment>
                            <comment id="158898" author="mhanafi" created="Thu, 14 Jul 2016 21:55:29 +0000"  >&lt;p&gt;Here you go&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;crash&amp;gt; rd ffff880ff8b73b70 20
ffff880ff8b73b70:  ffff88201fa43e00 ffff88201fc9b000   .&amp;gt;.. ....... ...
ffff880ff8b73b80:  ffff880ff8b73c50 ffff880ff8b73c70   P&amp;lt;......p&amp;lt;......
ffff880ff8b73b90:  ffff881730ca6200 ffff880ffb3c6140   .b.0....@a&amp;lt;.....
ffff880ff8b73ba0:  ffff88201fa43e00 0000000000000004   .&amp;gt;.. ...........
ffff880ff8b73bb0:  ffff880ff8b73c30 ffffffffa0532621   0&amp;lt;......!&amp;amp;S.....
ffff880ff8b73bc0:  0000000000000000 ffffffff00000000   ................
ffff880ff8b73bd0:  ffff881000000000 ffff880f000000e0   ................
ffff880ff8b73be0:  000000000000ec80 ffff881026e79ac0   ...........&amp;amp;....
ffff880ff8b73bf0:  ffff881730ca6200 0000000000000002   .b.0............
ffff880ff8b73c00:  ffff880ff8b73c30 ffff881730ca6200   0&amp;lt;.......b.0....
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="159306" author="bfaccini" created="Wed, 20 Jul 2016 13:29:35 +0000"  >&lt;p&gt;Hello Mahmoud, thanks for these last infos (it confirms the wrong &quot;head&quot; value, and helps go forward in analysis) and sorry for this late feed-back.&lt;/p&gt;

&lt;p&gt;At this point of the crash-dump analysis I am almost convinced that this crash is not related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt;, as I had first suspected.&lt;/p&gt;

&lt;p&gt;In fact, I am now more inclined to suspect it could be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7324&quot; title=&quot;Race condition on deleting lnet_msg&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7324&quot;&gt;&lt;del&gt;LU-7324&lt;/del&gt;&lt;/a&gt;; BTW, can you check whether both of its patches are present in the version you are running?&lt;br/&gt;
Also, to investigate in this new direction, can you attach the result of both &quot;p/x *(struct lnet_msg *)0xffff881730ca6200&quot; and &quot;p/x *(struct lnet_match_table *)0xffff88201fa43e00&quot; from the crash dump?&lt;/p&gt;</comment>
                            <comment id="159377" author="jaylan" created="Wed, 20 Jul 2016 18:30:40 +0000"  >&lt;p&gt;We added &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; after you suspected &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; might be the cause.&lt;br/&gt;
So, we did not have &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; when we hit this page fault.&lt;/p&gt;

&lt;p&gt;We did have two &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7324&quot; title=&quot;Race condition on deleting lnet_msg&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7324&quot;&gt;&lt;del&gt;LU-7324&lt;/del&gt;&lt;/a&gt; patches in the code back then:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7324&quot; title=&quot;Race condition on deleting lnet_msg&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7324&quot;&gt;&lt;del&gt;LU-7324&lt;/del&gt;&lt;/a&gt; lnet: recv could access freed message&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7324&quot; title=&quot;Race condition on deleting lnet_msg&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7324&quot;&gt;&lt;del&gt;LU-7324&lt;/del&gt;&lt;/a&gt; lnet: Use after free in lnet_ptl_match_delay()&lt;/p&gt;
</comment>
                            <comment id="159426" author="jaylan" created="Wed, 20 Jul 2016 21:48:27 +0000"  >&lt;p&gt;The lnet_msg-lnet_match_table.data has been attached to this ticket.&lt;/p&gt;</comment>
                            <comment id="159589" author="bfaccini" created="Fri, 22 Jul 2016 15:00:10 +0000"  >&lt;p&gt;Well lnet_msg and lnet_match_table look ok, so can you get &quot;kmem 0xffff88204b64a000&quot;, &quot;rd 0xffff88204b64a000 1024&quot; (to confirm that head pointer 0xffff88201fc9b000 should be finally ok, even if strangely aligned as I stated before but this can happen due to hash-table sizing), and also &quot;kmem 0xffff881d88420000&quot; and &quot;rd ffff88138ecf5c40 32&quot; for me now??&lt;/p&gt;</comment>
                            <comment id="159659" author="jaylan" created="Fri, 22 Jul 2016 23:57:23 +0000"  >&lt;p&gt;The first two failed. The third did not return for &amp;gt; 5 minutes before I terminated it.&lt;/p&gt;

&lt;p&gt;crash&amp;gt; kmem 0xffff88204b64a000&lt;br/&gt;
kmem: WARNING: cannot find mem_map page for address: ffff88204b64a000&lt;br/&gt;
204b64a000: kernel virtual address not found in mem map&lt;br/&gt;
crash&amp;gt; rd 0xffff88204b64a000 1024&lt;br/&gt;
rd: seek error: kernel virtual address: ffff88204b64a000  type: &quot;64-bit KVADDR&quot;&lt;br/&gt;
crash&amp;gt; kmem 0xffff881d88420000&lt;/p&gt;

&lt;p&gt;crash&amp;gt; rd ffff88138ecf5c40 32&lt;br/&gt;
ffff88138ecf5c40:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ&lt;br/&gt;
ffff88138ecf5c50:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ&lt;br/&gt;
ffff88138ecf5c60:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ&lt;br/&gt;
ffff88138ecf5c70:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a   ZZZZZZZZZZZZZZZZ&lt;br/&gt;
ffff88138ecf5c80:  5a5a5a5a5a5a5a5a 0000000000000000   ZZZZZZZZ........&lt;br/&gt;
ffff88138ecf5c90:  0000000000000000 0000000000000000   ................&lt;br/&gt;
ffff88138ecf5ca0:  0000000000000000 0000000000000000   ................&lt;br/&gt;
ffff88138ecf5cb0:  0000000000000000 0000000000000000   ................&lt;br/&gt;
ffff88138ecf5cc0:  ffff881659b2c340 ffff881d1b641ec0   @..Y......d.....&lt;br/&gt;
ffff88138ecf5cd0:  ffff88166a9e5ed0 ffffc90022095750   .^.j....PW.&quot;....&lt;br/&gt;
ffff88138ecf5ce0:  0000001490b6a75a ffffffffffffffff   Z...............&lt;br/&gt;
ffff88138ecf5cf0:  ffff881effffffff 000001000000001c   ................&lt;br/&gt;
ffff88138ecf5d00:  0000000000000000 ffffffffffffffff   ................&lt;br/&gt;
ffff88138ecf5d10:  0000000000000001 ffff881b54a070c0   .........p.T....&lt;br/&gt;
ffff88138ecf5d20:  0000000000000000 0000000000000000   ................&lt;br/&gt;
ffff88138ecf5d30:  0000000000000000 0000000000000000   ................&lt;br/&gt;
crash&amp;gt; &lt;/p&gt;</comment>
                            <comment id="159721" author="bfaccini" created="Mon, 25 Jul 2016 13:48:10 +0000"  >&lt;p&gt;Humm I have made a mistake with the first address I have asked you to check and dump, it has nothing to do with your crash-dump because it comes from the one I was using to mimic what I wanted to be extracted from yours ... So replace the previous 2 first commands by &quot;kmem 0xffff88201fc9a000&quot; and &quot;rd 0xffff88201fc9a000 1024&quot;, and also let the 3rd/&quot;kmem 0xffff881d88420000&quot; command go to completion.&lt;/p&gt;</comment>
                            <comment id="159754" author="jaylan" created="Mon, 25 Jul 2016 16:02:01 +0000"  >&lt;p&gt;Requested information in attachment lu-8362.20160725.&lt;/p&gt;</comment>
                            <comment id="160279" author="bfaccini" created="Fri, 29 Jul 2016 08:27:39 +0000"  >&lt;p&gt;There is still no evidence of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; and still no understanding of the corruption.&lt;br/&gt;
To be sure that nothing has been missed, can you provide the output of &quot;list -o 8 0xffff881d8d8bff40&quot;, &quot;rd ffff881d1b641e40 32&quot;, and &quot;kmem ffff8805f040b2e8&quot;?&lt;/p&gt;</comment>
                            <comment id="160625" author="jaylan" created="Wed, 3 Aug 2016 00:19:13 +0000"  >&lt;p&gt;Uploaded information in lu8362.20160802.&lt;/p&gt;

&lt;p&gt;The list contains 24564 entries and ended with a duplicate.&lt;/p&gt;
</comment>
                            <comment id="160690" author="bfaccini" created="Wed, 3 Aug 2016 17:08:48 +0000"  >&lt;p&gt;Many thanks again Jay.&lt;/p&gt;

&lt;p&gt;With this new data I still cannot definitively conclude on the exact content/cause of the corruption. This is mainly because several list corruptions had been detected (and partially corrected during the unlink mechanism, thus very likely creating the loop in the &quot;prev&quot; linked-list of MEs shown in &quot;list -o 8 0xffff881d8d8bff40&quot;) preceding the crash.&lt;/p&gt;

&lt;p&gt;But what I can say now is that it looks very similar to several occurrences I examined during &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; tracking.&lt;br/&gt;
In order to find definitive proof of what I presume, can you provide the output of the &quot;kmem ffff882029ab2578&quot;, &quot;rd ffff881659b2c2c0 32&quot;, &quot;rd ffff881d88425c40 32&quot;, and &quot;rd ffff88161caf1040 32&quot; crash sub-cmds?&lt;/p&gt;

&lt;p&gt;Also, during &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; tracking I used the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4330&quot; title=&quot;LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4330&quot;&gt;LU-4330&lt;/a&gt; to move LNet MEs/small-MDs out of the &amp;lt;size-128&amp;gt; slabs; is it something you could also try at your site?&lt;/p&gt;</comment>
                            <comment id="160698" author="jaylan" created="Wed, 3 Aug 2016 17:33:07 +0000"  >&lt;p&gt;File lu8362-29169893 is attached that contains crash data you requested.&lt;/p&gt;</comment>
                            <comment id="160759" author="bfaccini" created="Thu, 4 Aug 2016 07:13:50 +0000"  >&lt;p&gt;Thanks one more time for these datas.&lt;br/&gt;
Mapping of ffff882029ab2578 address to a bdev_inode confirms this is a similar crash than already encountered as part of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
That means that you are safe since you have integrated patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt;. As I already indicated before, you may also want to integrate patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4330&quot; title=&quot;LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4330&quot;&gt;LU-4330&lt;/a&gt;, which causes LNet MEs/small-MDs to be allocated in their own kmem_cache and thus no longer be affected by bugs/corruptions from all others pieces of software sharing &amp;lt;size-128&amp;gt; slabs. This has been proofed to help in further debugging new occurrences without the noise of MEs/MDs activity.&lt;/p&gt;</comment>
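&lt;p&gt;For illustration, a minimal sketch of that approach (the names and wrappers here are hypothetical, not the actual LU-4330 patch): give MEs a dedicated kmem_cache so their objects no longer share pages with every other &amp;lt;size-128&amp;gt; user in the kernel:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
#include &amp;lt;linux/slab.h&amp;gt;

static struct kmem_cache *lnet_me_cachep;

int lnet_me_cache_init(void)
{
        /* dedicated slab: corruption by unrelated &amp;lt;size-128&amp;gt; users can
         * no longer land inside ME objects */
        lnet_me_cachep = kmem_cache_create(&quot;lnet_me&quot;, sizeof(lnet_me_t),
                                           0, SLAB_HWCACHE_ALIGN, NULL);
        return lnet_me_cachep != NULL ? 0 : -ENOMEM;
}

static inline lnet_me_t *lnet_me_alloc(void)
{
        return kmem_cache_zalloc(lnet_me_cachep, GFP_NOFS);
}

static inline void lnet_me_free(lnet_me_t *me)
{
        kmem_cache_free(lnet_me_cachep, me);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>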
                            <comment id="166923" author="mhanafi" created="Thu, 22 Sep 2016 16:09:20 +0000"  >&lt;p&gt;This can be closed out We will track &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7980&quot; title=&quot;Overrun in generic &amp;lt;size-128&amp;gt; kmem_cache Slabs causing OSS to crash&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7980&quot;&gt;&lt;del&gt;LU-7980&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4330&quot; title=&quot;LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4330&quot;&gt;LU-4330&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="22283">LU-4330</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="35796">LU-7980</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="22287" name="lnet_msg-lnet_match_table.data" size="3093" author="jaylan" created="Wed, 20 Jul 2016 21:48:27 +0000"/>
                            <attachment id="22124" name="lnet_mt_match_md.dis" size="8423" author="mhanafi" created="Wed, 6 Jul 2016 17:52:39 +0000"/>
                            <attachment id="22123" name="lnet_mt_match_md.withlinenumbers.dis" size="10385" author="mhanafi" created="Wed, 6 Jul 2016 17:52:39 +0000"/>
                            <attachment id="22340" name="lu-8362.20160725" size="37573" author="jaylan" created="Mon, 25 Jul 2016 16:02:00 +0000"/>
                            <attachment id="22460" name="lu8362-20160803" size="4044" author="jaylan" created="Wed, 3 Aug 2016 17:33:07 +0000"/>
                            <attachment id="22455" name="lu8362.20160802" size="419055" author="jaylan" created="Wed, 3 Aug 2016 00:19:13 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>lnet</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyghb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>