[LU-8362] page fault: exception RIP: lnet_mt_match_md+135 Created: 02/Jul/16  Updated: 14/Jun/18  Resolved: 22/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

lustre 2.7.1-fe


Attachments: File lnet_msg-lnet_match_table.data     File lnet_mt_match_md.dis     File lnet_mt_match_md.withlinenumbers.dis     File lu-8362.20160725     HTML File lu8362-20160803     File lu8362.20160802    
Issue Links:
Related
is related to LU-4330 LustreError: 46336:0:(events.c:433:pt... Reopened
is related to LU-7980 Overrun in generic <size-128> kmem_ca... Resolved
Epic/Theme: lnet
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

OSS console errors

LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
LNet: 46045:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
LNet: 56154:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
------------[ cut here ]------------
WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
Hardware name: SUMMIT
list_del corruption. prev->next should be ffff881d63ead4d0, but was (null)
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 acpi_cpufreq freq_table mperf lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc scsi_transport_fc scsi_tgt sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_mirror dm_region_hash dm_log dm_multipath dm_mod iTCO_wdt iTCO_vendor_support microcode sg wmi igb hwmon dca i2c_algo_bit ptp pps_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_bic ext3 jbd sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) ahci gru [last unloaded: scsi_wait_scan]
Pid: 8603, comm: kiblnd_sd_02_01 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
Call Trace:
 [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81074216>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff812bda6e>] ? list_del+0x6e/0xa0
 [<ffffffffa052c5c9>] ? lnet_me_unlink+0x39/0x140 [lnet]
 [<ffffffffa05303f8>] ? lnet_md_unlink+0x2f8/0x3e0 [lnet]
 [<ffffffffa0531b9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
 [<ffffffffa0a1f727>] ? kiblnd_recv+0x107/0x780 [ko2iblnd]
 [<ffffffffa0531d1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
 [<ffffffffa0532621>] ? lnet_ptl_match_md+0x281/0x870 [lnet]
 [<ffffffffa05396e7>] ? lnet_parse_local+0x307/0xc60 [lnet]
 [<ffffffffa053a6da>] ? lnet_parse+0x69a/0xcf0 [lnet]
 [<ffffffffa0a1ff3b>] ? kiblnd_handle_rx+0x19b/0x620 [ko2iblnd]
 [<ffffffffa0a212be>] ? kiblnd_scheduler+0xefe/0x10d0 [ko2iblnd]
 [<ffffffff81064f90>] ? default_wake_function+0x0/0x20
 [<ffffffffa0a203c0>] ? kiblnd_scheduler+0x0/0x10d0 [ko2iblnd]
 [<ffffffff8109dc8e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff8109dbf0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
---[ end trace 1063d2ffc2578a2f ]---
------------[ cut here ]------------

From the crash dump, the backtrace looks like this:

PID: 8603   TASK: ffff8810271fa040  CPU: 11  COMMAND: "kiblnd_sd_02_01"
 #0 [ffff880ff8b734f0] machine_kexec at ffffffff8103b5db
 #1 [ffff880ff8b73550] crash_kexec at ffffffff810c9412
 #2 [ffff880ff8b73620] kdb_kdump_check at ffffffff812973d7
 #3 [ffff880ff8b73630] kdb_main_loop at ffffffff8129a5c7
 #4 [ffff880ff8b73740] kdb_save_running at ffffffff8129472e
 #5 [ffff880ff8b73750] kdba_main_loop at ffffffff8147cd68
 #6 [ffff880ff8b73790] kdb at ffffffff812978c6
 #7 [ffff880ff8b73800] kdba_entry at ffffffff8147c687
 #8 [ffff880ff8b73810] notifier_call_chain at ffffffff81568515
 #9 [ffff880ff8b73850] atomic_notifier_call_chain at ffffffff8156857a
#10 [ffff880ff8b73860] notify_die at ffffffff810a44fe
#11 [ffff880ff8b73890] __die at ffffffff815663e2
#12 [ffff880ff8b738c0] no_context at ffffffff8104c822
#13 [ffff880ff8b73910] __bad_area_nosemaphore at ffffffff8104cad5
#14 [ffff880ff8b73960] bad_area_nosemaphore at ffffffff8104cba3
#15 [ffff880ff8b73970] __do_page_fault at ffffffff8104d29c
#16 [ffff880ff8b73a90] do_page_fault at ffffffff8156845e
#17 [ffff880ff8b73ac0] page_fault at ffffffff81565765
    [exception RIP: lnet_mt_match_md+135]
    RIP: ffffffffa0531d07  RSP: ffff880ff8b73b70  RFLAGS: 00010286
    RAX: ffff881d88420000  RBX: ffff880ff8b73c70  RCX: 0000000000000007
    RDX: 0000000000000004  RSI: ffff880ff8b73c70  RDI: ffffffffffffffff
    RBP: ffff880ff8b73bb0   R8: 0000000000000001   R9: d400000000000000
    R10: 0000000000000001  R11: 0000000000000012  R12: 0000000000000000
    R13: ffff881730ca6200  R14: 00d100120be91b91  R15: 0000000000000008
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#18 [ffff880ff8b73bb8] lnet_ptl_match_md at ffffffffa0532621 [lnet]
#19 [ffff880ff8b73c38] lnet_parse_local at ffffffffa05396e7 [lnet]
#20 [ffff880ff8b73cd8] lnet_parse at ffffffffa053a6da [lnet]
#21 [ffff880ff8b73d68] kiblnd_handle_rx at ffffffffa0a1ff3b [ko2iblnd]
#22 [ffff880ff8b73db8] kiblnd_scheduler at ffffffffa0a212be [ko2iblnd]
#23 [ffff880ff8b73ee8] kthread at ffffffff8109dc8e
#24 [ffff880ff8b73f48] kernel_thread at ffffffff8100c28a


 Comments   
Comment by Joseph Gmitter (Inactive) [ 02/Jul/16 ]

Hi Mahmoud,

Can you confirm that this is a severity 1 (production filesystem out of service)?

Thanks.
Joe

Comment by Bruno Faccini (Inactive) [ 02/Jul/16 ]

This problem looks similar to the one reported in LU-7980.

Comment by Mahmoud Hanafi [ 02/Jul/16 ]

Should be severity 2 or 3.
Also, if it is a dup of LU-7980, we will need a 2.7fe patch.

Comment by Peter Jones [ 02/Jul/16 ]

OK Mahmoud, then we'll assess this more fully on Monday.

Comment by Bruno Faccini (Inactive) [ 04/Jul/16 ]

Mahmoud, I know that as a non-US citizen I am not allowed to work on the crash dump, but can you at least provide some pieces of memory content to help qualify the corruption?
Let's say 32 quad-words starting from address 0xffff881d63ead440 (i.e., "rd 0xffff881d63ead440 32"), and 32 quad-words starting from the address found at location 0xffff881d63ead4d8 minus 0x90.

Comment by Mahmoud Hanafi [ 04/Jul/16 ]

Here is the info. BTW, how did you come up with those address locations?

crash> rd 0xffff881d63ead440 32
ffff881d63ead440:  3463346339376238 65342d616630622d   8b79c4c4-b0fa-4e
ffff881d63ead450:  2d393762652d3134 3162326230316464   41-eb79-dd10b2b1
ffff881d63ead460:  0000000064616535 000000230199f7ea   5ead........#...
ffff881d63ead470:  000574241b1d26fc 0000000000000000   .&..$t..........
ffff881d63ead480:  0000000000000000 0000000000000000   ................
ffff881d63ead490:  0000000000000000 0000000000000000   ................
ffff881d63ead4a0:  0000000000000000 0000000000000000   ................
ffff881d63ead4b0:  0000000000000000 0000000000000000   ................
ffff881d63ead4c0:  ffff8815da7214c0 ffff881dbcd19740   ..r.....@.......
ffff881d63ead4d0:  ffff881b6dcd3cd0 ffffc900220983f0   .<.m......."....
ffff881d63ead4e0:  0000001490b5d3fa ffffffffffffffff   ................
ffff881d63ead4f0:  ffff881dffffffff 000001000000001c   ................
ffff881d63ead500:  0000000000000000 ffffffffffffffff   ................
ffff881d63ead510:  0000000000000001 ffff881a20848840   ........@.. ....
ffff881d63ead520:  0000000000000000 0000000000000000   ................
ffff881d63ead530:  0000000000000000 0000000000000000   ................


crash> rd ffff881d63ead448 32
ffff881d63ead448:  65342d616630622d 2d393762652d3134   -b0fa-4e41-eb79-
ffff881d63ead458:  3162326230316464 0000000064616535   dd10b2b15ead....
ffff881d63ead468:  000000230199f7ea 000574241b1d26fc   ....#....&..$t..
ffff881d63ead478:  0000000000000000 0000000000000000   ................
ffff881d63ead488:  0000000000000000 0000000000000000   ................
ffff881d63ead498:  0000000000000000 0000000000000000   ................
ffff881d63ead4a8:  0000000000000000 0000000000000000   ................
ffff881d63ead4b8:  0000000000000000 ffff8815da7214c0   ..........r.....
ffff881d63ead4c8:  ffff881dbcd19740 ffff881b6dcd3cd0   @........<.m....
ffff881d63ead4d8:  ffffc900220983f0 0000001490b5d3fa   ..."............
ffff881d63ead4e8:  ffffffffffffffff ffff881dffffffff   ................
ffff881d63ead4f8:  000001000000001c 0000000000000000   ................
ffff881d63ead508:  ffffffffffffffff 0000000000000001   ................
ffff881d63ead518:  ffff881a20848840 0000000000000000   @.. ............
ffff881d63ead528:  0000000000000000 0000000000000000   ................
ffff881d63ead538:  0000000000000000 ffff881d36d68d40   ........@..6....
Comment by Bruno Faccini (Inactive) [ 05/Jul/16 ]

Thanks Mahmoud. The address comes from the message logged just before the crash: "list_del corruption. prev->next should be ffff881d63ead4d0, but was (null)".

Well, according to the memory content, there is no real evidence of LU-7980 for the moment.

Can you provide me with the "log" sub-command output, the assembly listing of lnet_mt_match_md() (you will need to load the kernel + Lustre modules debuginfo before using the "mod -S" crash sub-command), and also the memory content from "rd ffff881d88420000 100"?
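
For context, the console warning above comes from the kernel's CONFIG_DEBUG_LIST checks: list_del() validates the neighbouring links before unlinking, and the first check below is what printed "list_del corruption. prev->next should be ffff881d63ead4d0, but was (null)". The following is a rough paraphrase of lib/list_debug.c, not the exact RHEL6 source:

#include <linux/list.h>
#include <linux/kernel.h>

/* Approximate paraphrase of the CONFIG_DEBUG_LIST checks; renamed to
 * avoid clashing with the real list_del(). The first WARN() is the one
 * seen in the console log above, and the pointer it prints is the
 * address of the embedded list_head, which is why the corrupted ME can
 * be located starting from that address. */
static void list_del_debug(struct list_head *entry)
{
        WARN(entry->prev->next != entry,
             "list_del corruption. prev->next should be %p, but was %p\n",
             entry, entry->prev->next);
        WARN(entry->next->prev != entry,
             "list_del corruption. next->prev should be %p, but was %p\n",
             entry, entry->next->prev);
        __list_del(entry->prev, entry->next);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
}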

Comment by Bruno Faccini (Inactive) [ 06/Jul/16 ]

Thanks Mahmoud, but I had asked for the assembly listing of lnet_mt_match_md(), not lnet_ptl_match_md(); can you also provide it?

Comment by Mahmoud Hanafi [ 06/Jul/16 ]

Attaching lnet_mt_match_md.dis and lnet_mt_match_md.withlinenumbers.dis.

Comment by Bruno Faccini (Inactive) [ 08/Jul/16 ]

Thanks. So it seems we crash here in the code:

int
lnet_mt_match_md(struct lnet_match_table *mtable,
                 struct lnet_match_info *info, struct lnet_msg *msg)
{
        struct list_head        *head;
        lnet_me_t               *me;
        lnet_me_t               *tmp;
        int                     exhausted = 0;
        int                     rc;

        /* any ME with ignore bits? */
        if (!list_empty(&mtable->mt_mhash[LNET_MT_HASH_IGNORE]))
                head = &mtable->mt_mhash[LNET_MT_HASH_IGNORE];
        else
                head = lnet_mt_match_head(mtable, info->mi_id, info->mi_mbits);
 again:
        /* NB: only wildcard portal needs to return LNET_MATCHMD_EXHAUSTED */
        if (lnet_ptl_is_wildcard(the_lnet.ln_portals[mtable->mt_portal]))
                exhausted = LNET_MATCHMD_EXHAUSTED;

        list_for_each_entry_safe(me, tmp, head, me_list) {
                /* ME attached but MD not attached yet */
                if (me->me_md == NULL)
                        continue;

                LASSERT(me == me->me_md->md_me); <<<<<<<<<<<<<<<<<

                rc = lnet_try_match_md(me->me_md, info, msg);
                if ((rc & LNET_MATCHMD_EXHAUSTED) == 0)
                        exhausted = 0; /* mlist is not empty */

                if ((rc & LNET_MATCHMD_FINISH) != 0) {
                        /* don't return EXHAUSTED bit because we don't know
                         * whether the mlist is empty or not */
                        return rc & ~LNET_MATCHMD_EXHAUSTED;
                }
        }

The crash occurs when dereferencing me->me_md: loaded through rax it is found to be non-NULL, but is in fact 0xf...f!
Yet me itself (rax=0xffff881d88420000) already looks wrong, which means the previous entry on the linked list may have its me_list corrupted...
To find it we need to walk back from the list head, because, according to the assembly code, no reference to the previous entry is left in any register.
So, could you now provide the output of "rd 0xffff880ff8b73b78 2", and then also run the "list <address>" command with the first value dumped by that "rd" as the address argument (you may need to add a "0x" in front)?
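
To illustrate why a corrupted me_list link yields a bogus "me" pointer: list_for_each_entry_safe() converts each list_head back into its containing lnet_me_t with container_of()-style pointer arithmetic, so a garbage ->next value becomes a garbage "me". A minimal sketch follows, with a simplified lnet_me_t layout assumed for illustration only (not the real Lustre definition):

#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Simplified stand-in for lnet_me_t; the real structure has more
 * fields and may order them differently. */
typedef struct {
        struct list_head  me_list;   /* links the ME on its mtable hash chain */
        void             *me_md;     /* attached MD, NULL until an MD attaches */
} lnet_me_t;

/* list_for_each_entry_safe() recovers the container from the embedded
 * list_head with this arithmetic. If head->next was overwritten with a
 * garbage value (e.g. the page-aligned 0xffff881d88420000 seen in rax),
 * "me" points into unrelated memory and the me->me_md load returns
 * whatever happens to be there (here 0xffffffffffffffff), which then
 * faults when dereferenced. */
#define me_from_link(ptr) \
        ((lnet_me_t *)((char *)(ptr) - offsetof(lnet_me_t, me_list)))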

Comment by Bruno Faccini (Inactive) [ 13/Jul/16 ]

Yes, only for the first value/pointer, in fact!

Well, this 0xffff88201fc9b000 value should be "head" and thus point to the chosen hash-list head in the mtable/match-table, and it also seems wrong (I mean strangely page-aligned, like the next entry, 0xffff881d88420000, on the list, which is also the current one causing the crash!).

Thus, I need more from the current stack frame to confirm my current thoughts, so can you also dump its full content using the "rd ffff880ff8b73b70 20" command?

Comment by Mahmoud Hanafi [ 14/Jul/16 ]

Here you go

crash> rd ffff880ff8b73b70 20
ffff880ff8b73b70:  ffff88201fa43e00 ffff88201fc9b000   .>.. ....... ...
ffff880ff8b73b80:  ffff880ff8b73c50 ffff880ff8b73c70   P<......p<......
ffff880ff8b73b90:  ffff881730ca6200 ffff880ffb3c6140   .b.0....@a<.....
ffff880ff8b73ba0:  ffff88201fa43e00 0000000000000004   .>.. ...........
ffff880ff8b73bb0:  ffff880ff8b73c30 ffffffffa0532621   0<......!&S.....
ffff880ff8b73bc0:  0000000000000000 ffffffff00000000   ................
ffff880ff8b73bd0:  ffff881000000000 ffff880f000000e0   ................
ffff880ff8b73be0:  000000000000ec80 ffff881026e79ac0   ...........&....
ffff880ff8b73bf0:  ffff881730ca6200 0000000000000002   .b.0............
ffff880ff8b73c00:  ffff880ff8b73c30 ffff881730ca6200   0<.......b.0....
Comment by Bruno Faccini (Inactive) [ 20/Jul/16 ]

Hello Mahmoud, thanks for this latest information (it confirms the wrong "head" value and helps move the analysis forward), and sorry for the late feedback.

At this point in the crash-dump analysis I am almost convinced that this crash is not related to LU-7980, as I had first suspected.

In fact, I am now more inclined to suspect it could be related to LU-7324; by the way, can you check whether both of its patches are present in the version you are running?
Also, to investigate in this new direction, can you attach the result of both "p/x *(struct lnet_msg *)0xffff881730ca6200" and "p/x *(struct lnet_match_table *)0xffff88201fa43e00" from the crash dump?

Comment by Jay Lan (Inactive) [ 20/Jul/16 ]

We added the LU-7980 patch after you suspected LU-7980 might be the cause, so we did not have it when we hit this page fault.

We did have two LU-7324 patches in the code back then:
LU-7324 lnet: recv could access freed message
LU-7324 lnet: Use after free in lnet_ptl_match_delay()

Comment by Jay Lan (Inactive) [ 20/Jul/16 ]

The file lnet_msg-lnet_match_table.data has been attached to this ticket.

Comment by Bruno Faccini (Inactive) [ 22/Jul/16 ]

Well, lnet_msg and lnet_match_table look OK. So can you now get "kmem 0xffff88204b64a000" and "rd 0xffff88204b64a000 1024" (to confirm that the head pointer 0xffff88201fc9b000 is in fact OK, even if strangely aligned as I noted before; this can happen due to hash-table sizing), and also "kmem 0xffff881d88420000" and "rd ffff88138ecf5c40 32"?

Comment by Jay Lan (Inactive) [ 22/Jul/16 ]

The first two failed. The third had not returned after more than 5 minutes, so I terminated it.

crash> kmem 0xffff88204b64a000
kmem: WARNING: cannot find mem_map page for address: ffff88204b64a000
204b64a000: kernel virtual address not found in mem map
crash> rd 0xffff88204b64a000 1024
rd: seek error: kernel virtual address: ffff88204b64a000 type: "64-bit KVADDR"
crash> kmem 0xffff881d88420000

crash> rd ffff88138ecf5c40 32
ffff88138ecf5c40: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
ffff88138ecf5c50: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
ffff88138ecf5c60: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
ffff88138ecf5c70: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
ffff88138ecf5c80: 5a5a5a5a5a5a5a5a 0000000000000000 ZZZZZZZZ........
ffff88138ecf5c90: 0000000000000000 0000000000000000 ................
ffff88138ecf5ca0: 0000000000000000 0000000000000000 ................
ffff88138ecf5cb0: 0000000000000000 0000000000000000 ................
ffff88138ecf5cc0: ffff881659b2c340 ffff881d1b641ec0 @..Y......d.....
ffff88138ecf5cd0: ffff88166a9e5ed0 ffffc90022095750 .^.j....PW."....
ffff88138ecf5ce0: 0000001490b6a75a ffffffffffffffff Z...............
ffff88138ecf5cf0: ffff881effffffff 000001000000001c ................
ffff88138ecf5d00: 0000000000000000 ffffffffffffffff ................
ffff88138ecf5d10: 0000000000000001 ffff881b54a070c0 .........p.T....
ffff88138ecf5d20: 0000000000000000 0000000000000000 ................
ffff88138ecf5d30: 0000000000000000 0000000000000000 ................
crash>

Comment by Bruno Faccini (Inactive) [ 25/Jul/16 ]

Hmm, I made a mistake with the first address I asked you to check and dump; it has nothing to do with your crash dump, because it comes from the one I was using to mimic what I wanted extracted from yours... So replace the first two commands with "kmem 0xffff88201fc9a000" and "rd 0xffff88201fc9a000 1024", and also let the third command, "kmem 0xffff881d88420000", run to completion.

Comment by Jay Lan (Inactive) [ 25/Jul/16 ]

Requested information in attachment lu-8362.20160725.

Comment by Bruno Faccini (Inactive) [ 29/Jul/16 ]

There is still no evidence of LU-7980 and still no understanding of the corruption.
To be sure nothing has been overlooked, can you provide the output of "list -o 8 0xffff881d8d8bff40", "rd ffff881d1b641e40 32", and "kmem ffff8805f040b2e8"?

Comment by Jay Lan (Inactive) [ 03/Aug/16 ]

Uploaded information in lu8362.20160802.

The list contains 24564 entries and ends with a duplicate.

Comment by Bruno Faccini (Inactive) [ 03/Aug/16 ]

Many thanks again Jay.

With this new data I still cannot definitively conclude on the exact content/cause of the corruption. This is mainly because several list corruptions were detected (and partly "corrected" by the unlink mechanism, which very likely created the loop in the "prev" linked list of MEs shown by "list -o 8 0xffff881d8d8bff40") before the crash.

But what I can say now is that it looks very similar to several occurrences I examined during the LU-7980 investigation.
To find definitive proof of what I presume, can you provide the output of the "kmem ffff882029ab2578", "rd ffff881659b2c2c0 32", "rd ffff881d88425c40 32", and "rd ffff88161caf1040 32" crash sub-commands?

Also, during the LU-7980 investigation I used the patch for LU-4330 to move LNet MEs/small-MDs out of the <size-128> slabs; is that something you could also try at your site?

Comment by Jay Lan (Inactive) [ 03/Aug/16 ]

File lu8362-29169893, containing the crash data you requested, is attached.

Comment by Bruno Faccini (Inactive) [ 04/Aug/16 ]

Thanks one more time for this data.
The fact that address ffff882029ab2578 maps to a bdev_inode confirms this is a crash similar to those already encountered as part of LU-7980.
That means you are safe, since you have integrated the patch for LU-7980. As I indicated before, you may also want to integrate the patch for LU-4330, which causes LNet MEs/small-MDs to be allocated from their own kmem_cache, so they are no longer affected by bugs/corruptions from the other pieces of software sharing the <size-128> slabs. This has proven to help when debugging new occurrences, without the noise of ME/MD activity.
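
For reference, the LU-4330 approach boils down to giving LNet MEs (and small MDs) a dedicated slab cache instead of the generic <size-128> kmem_cache, so overruns from unrelated code can no longer land in ME memory. The following is a minimal sketch of the idea only, with hypothetical names, not the actual LU-4330 patch (assumes lnet_me_t from the LNet headers is in scope):

#include <linux/slab.h>

static struct kmem_cache *lnet_me_cachep;   /* hypothetical name */

static int lnet_me_cache_init(void)
{
        /* Dedicated cache: ME corruption then only implicates ME users,
         * and slab poisoning/debugging applies to MEs in isolation. */
        lnet_me_cachep = kmem_cache_create("lnet_MEs", sizeof(lnet_me_t),
                                           0, SLAB_HWCACHE_ALIGN, NULL);
        return lnet_me_cachep != NULL ? 0 : -ENOMEM;
}

static lnet_me_t *lnet_me_alloc(void)
{
        return kmem_cache_zalloc(lnet_me_cachep, GFP_NOFS);
}

static void lnet_me_free(lnet_me_t *me)
{
        kmem_cache_free(lnet_me_cachep, me);
}

static void lnet_me_cache_fini(void)
{
        kmem_cache_destroy(lnet_me_cachep);
}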

Comment by Mahmoud Hanafi [ 22/Sep/16 ]

This can be closed out. We will track LU-7980 and LU-4330.
