[LU-8362] page fault: exception RIP: lnet_mt_match_md+135

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.7.0
    • Environment: lustre 2.7.1-fe

    Description

      OSS console errors

      LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
      LNet: 46045:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
      LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
      LNet: 56154:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
      Hardware name: SUMMIT
      list_del corruption. prev->next should be ffff881d63ead4d0, but was (null)
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 acpi_cpufreq freq_table mperf lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc scsi_transport_fc scsi_tgt sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_mirror dm_region_hash dm_log dm_multipath dm_mod iTCO_wdt iTCO_vendor_support microcode sg wmi igb hwmon dca i2c_algo_bit ptp pps_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_bic ext3 jbd sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) ahci gru [last unloaded: scsi_wait_scan]
      Pid: 8603, comm: kiblnd_sd_02_01 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
      Call Trace:
       [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff81074216>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffff812bda6e>] ? list_del+0x6e/0xa0
       [<ffffffffa052c5c9>] ? lnet_me_unlink+0x39/0x140 [lnet]
       [<ffffffffa05303f8>] ? lnet_md_unlink+0x2f8/0x3e0 [lnet]
       [<ffffffffa0531b9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
       [<ffffffffa0a1f727>] ? kiblnd_recv+0x107/0x780 [ko2iblnd]
       [<ffffffffa0531d1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
       [<ffffffffa0532621>] ? lnet_ptl_match_md+0x281/0x870 [lnet]
       [<ffffffffa05396e7>] ? lnet_parse_local+0x307/0xc60 [lnet]
       [<ffffffffa053a6da>] ? lnet_parse+0x69a/0xcf0 [lnet]
       [<ffffffffa0a1ff3b>] ? kiblnd_handle_rx+0x19b/0x620 [ko2iblnd]
       [<ffffffffa0a212be>] ? kiblnd_scheduler+0xefe/0x10d0 [ko2iblnd]
       [<ffffffff81064f90>] ? default_wake_function+0x0/0x20
       [<ffffffffa0a203c0>] ? kiblnd_scheduler+0x0/0x10d0 [ko2iblnd]
       [<ffffffff8109dc8e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffff8109dbf0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      ---[ end trace 1063d2ffc2578a2f ]---
      ------------[ cut here ]------------
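
      For context, the "list_del corruption" WARNING above comes from the kernel's CONFIG_DEBUG_LIST sanity check in lib/list_debug.c: before unlinking a node, list_del() verifies that its neighbours still point back at it. The message in this log means the node reachable through "prev" had its "next" pointer overwritten with NULL, i.e. one of the neighbouring ME list nodes was corrupted, or freed and reused. A simplified, paraphrased sketch of that check (not the exact RHEL6 2.6.32-504 source) is:

      #include <linux/list.h>
      #include <linux/kernel.h>
      #include <linux/bug.h>

      /*
       * Simplified paraphrase of the CONFIG_DEBUG_LIST checks performed by
       * list_del() in lib/list_debug.c; not the exact 2.6.32-504 source.
       */
      static void debug_list_del(struct list_head *entry)
      {
              /* This is the test that fired in the console log above. */
              WARN(entry->prev->next != entry,
                   "list_del corruption. prev->next should be %p, but was %p\n",
                   entry, entry->prev->next);
              WARN(entry->next->prev != entry,
                   "list_del corruption. next->prev should be %p, but was %p\n",
                   entry, entry->next->prev);

              /* Normal unlink, then poison so a double list_del() is caught. */
              entry->next->prev = entry->prev;
              entry->prev->next = entry->next;
              entry->next = LIST_POISON1;
              entry->prev = LIST_POISON2;
      }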
      

      From the crash dump, the backtrace (bt) looks like this:

      PID: 8603   TASK: ffff8810271fa040  CPU: 11  COMMAND: "kiblnd_sd_02_01"
       #0 [ffff880ff8b734f0] machine_kexec at ffffffff8103b5db
       #1 [ffff880ff8b73550] crash_kexec at ffffffff810c9412
       #2 [ffff880ff8b73620] kdb_kdump_check at ffffffff812973d7
       #3 [ffff880ff8b73630] kdb_main_loop at ffffffff8129a5c7
       #4 [ffff880ff8b73740] kdb_save_running at ffffffff8129472e
       #5 [ffff880ff8b73750] kdba_main_loop at ffffffff8147cd68
       #6 [ffff880ff8b73790] kdb at ffffffff812978c6
       #7 [ffff880ff8b73800] kdba_entry at ffffffff8147c687
       #8 [ffff880ff8b73810] notifier_call_chain at ffffffff81568515
       #9 [ffff880ff8b73850] atomic_notifier_call_chain at ffffffff8156857a
      #10 [ffff880ff8b73860] notify_die at ffffffff810a44fe
      #11 [ffff880ff8b73890] __die at ffffffff815663e2
      #12 [ffff880ff8b738c0] no_context at ffffffff8104c822
      #13 [ffff880ff8b73910] __bad_area_nosemaphore at ffffffff8104cad5
      #14 [ffff880ff8b73960] bad_area_nosemaphore at ffffffff8104cba3
      #15 [ffff880ff8b73970] __do_page_fault at ffffffff8104d29c
      #16 [ffff880ff8b73a90] do_page_fault at ffffffff8156845e
      #17 [ffff880ff8b73ac0] page_fault at ffffffff81565765
          [exception RIP: lnet_mt_match_md+135]
          RIP: ffffffffa0531d07  RSP: ffff880ff8b73b70  RFLAGS: 00010286
          RAX: ffff881d88420000  RBX: ffff880ff8b73c70  RCX: 0000000000000007
          RDX: 0000000000000004  RSI: ffff880ff8b73c70  RDI: ffffffffffffffff
          RBP: ffff880ff8b73bb0   R8: 0000000000000001   R9: d400000000000000
          R10: 0000000000000001  R11: 0000000000000012  R12: 0000000000000000
          R13: ffff881730ca6200  R14: 00d100120be91b91  R15: 0000000000000008
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      #18 [ffff880ff8b73bb8] lnet_ptl_match_md at ffffffffa0532621 [lnet]
      #19 [ffff880ff8b73c38] lnet_parse_local at ffffffffa05396e7 [lnet]
      #20 [ffff880ff8b73cd8] lnet_parse at ffffffffa053a6da [lnet]
      #21 [ffff880ff8b73d68] kiblnd_handle_rx at ffffffffa0a1ff3b [ko2iblnd]
      #22 [ffff880ff8b73db8] kiblnd_scheduler at ffffffffa0a212be [ko2iblnd]
      #23 [ffff880ff8b73ee8] kthread at ffffffff8109dc8e
      #24 [ffff880ff8b73f48] kernel_thread at ffffffff8100c28a
      

      Attachments

        1. lnet_msg-lnet_match_table.data (3 kB)
        2. lnet_mt_match_md.dis (8 kB)
        3. lnet_mt_match_md.withlinenumbers.dis (10 kB)
        4. lu-8362.20160725 (37 kB)
        5. lu8362.20160802 (409 kB)
        6. lu8362-20160803 (4 kB)


          Activity


            bfaccini Bruno Faccini (Inactive) added a comment:

            Many thanks again Jay.

            With this new data I still cannot definitively conclude on the exact content or cause of the corruption. This is mainly because several list corruptions were detected (and partially "corrected" by the unlink mechanism, which very likely created the loop in the "prev" linked list of MEs shown by "list -o 8 0xffff881d8d8bff40") before the crash.

            What I can say now is that it looks very similar to several occurrences I examined while tracking LU-7980.

            To find definitive proof of what I presume, can you provide the output of the "kmem ffff882029ab2578", "rd ffff881659b2c2c0 32", "rd ffff881d88425c40 32", and "rd ffff88161caf1040 32" crash sub-commands?

            Also, during the LU-7980 investigation I used the patch from LU-4330 to move the LNet MEs/small MDs out of the <size-128> slabs; is that something you could also try at your site?
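
            As a side note on the LU-4330 suggestion above: the general technique is simply to give the small LNet descriptors their own kmem_cache instead of the shared generic <size-128> kmalloc slab, so a stray write from any other <size-128> user can no longer land in an ME/MD, and so any corruption is easier to attribute to its owner. A minimal, hypothetical sketch of that technique (the struct and cache names are illustrative only, not the actual LU-4330 patch):

            #include <linux/module.h>
            #include <linux/types.h>
            #include <linux/slab.h>
            #include <linux/list.h>

            /* Hypothetical stand-in for a small LNet descriptor (e.g. an ME);
             * the real LU-4330 change is in LNet's own allocation paths. */
            struct demo_me {
                    struct list_head me_list;
                    u64              me_match_bits;
                    u64              me_ignore_bits;
            };

            static struct kmem_cache *demo_me_cache;

            static int __init demo_me_init(void)
            {
                    /* A dedicated cache isolates these objects from the many
                     * unrelated allocations sharing the generic <size-128> slab. */
                    demo_me_cache = kmem_cache_create("demo_me_cache",
                                                      sizeof(struct demo_me), 0,
                                                      SLAB_HWCACHE_ALIGN, NULL);
                    return demo_me_cache ? 0 : -ENOMEM;
            }

            static void __exit demo_me_exit(void)
            {
                    kmem_cache_destroy(demo_me_cache);
            }

            module_init(demo_me_init);
            module_exit(demo_me_exit);
            MODULE_LICENSE("GPL");

            Allocations would then go through kmem_cache_zalloc(demo_me_cache, ...) and kmem_cache_free(demo_me_cache, ...) instead of plain kmalloc() from the generic slab.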

            jaylan Jay Lan (Inactive) added a comment:

            Uploaded information in lu8362.20160802.

            The list contains 24564 entries and ended with a duplicate.

            bfaccini Bruno Faccini (Inactive) added a comment:

            There is still no evidence of LU-7980 and still no understanding of the corruption.
            To be sure nothing has been missed, can you provide the output of "list -o 8 0xffff881d8d8bff40", "rd ffff881d1b641e40 32", and "kmem ffff8805f040b2e8"?

            jaylan Jay Lan (Inactive) added a comment:

            Requested information in attachment lu-8362.20160725.

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hmm, I made a mistake with the first address I asked you to check and dump; it has nothing to do with your crash dump, because it comes from the dump I was using to mimic what I wanted extracted from yours. So please replace the first two commands with "kmem 0xffff88201fc9a000" and "rd 0xffff88201fc9a000 1024", and also let the third command, "kmem 0xffff881d88420000", run to completion.

            jaylan Jay Lan (Inactive) added a comment:

            The first two commands failed. The third did not return after more than 5 minutes, so I terminated it.

            crash> kmem 0xffff88204b64a000
            kmem: WARNING: cannot find mem_map page for address: ffff88204b64a000
            204b64a000: kernel virtual address not found in mem map
            crash> rd 0xffff88204b64a000 1024
            rd: seek error: kernel virtual address: ffff88204b64a000 type: "64-bit KVADDR"
            crash> kmem 0xffff881d88420000

            crash> rd ffff88138ecf5c40 32
            ffff88138ecf5c40: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c50: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c60: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c70: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c80: 5a5a5a5a5a5a5a5a 0000000000000000 ZZZZZZZZ........
            ffff88138ecf5c90: 0000000000000000 0000000000000000 ................
            ffff88138ecf5ca0: 0000000000000000 0000000000000000 ................
            ffff88138ecf5cb0: 0000000000000000 0000000000000000 ................
            ffff88138ecf5cc0: ffff881659b2c340 ffff881d1b641ec0 @..Y......d.....
            ffff88138ecf5cd0: ffff88166a9e5ed0 ffffc90022095750 .^.j....PW."....
            ffff88138ecf5ce0: 0000001490b6a75a ffffffffffffffff Z...............
            ffff88138ecf5cf0: ffff881effffffff 000001000000001c ................
            ffff88138ecf5d00: 0000000000000000 ffffffffffffffff ................
            ffff88138ecf5d10: 0000000000000001 ffff881b54a070c0 .........p.T....
            ffff88138ecf5d20: 0000000000000000 0000000000000000 ................
            ffff88138ecf5d30: 0000000000000000 0000000000000000 ................
            crash>

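            One note on the hex dump above: when slab poisoning/debugging is enabled, a run of 0x5a bytes at the start of an object is the kernel's POISON_INUSE fill (allocated but never written), while 0x6b marks freed memory; with poisoning disabled it may of course simply be payload data. For reference, the relevant constants (values as in include/linux/poison.h of that kernel era):

            /* Quoted only to help decode the 0x5a fill in the rd output above. */
            #define POISON_INUSE    0x5a    /* for use-uninitialised poisoning */
            #define POISON_FREE     0x6b    /* for use-after-free poisoning */
            #define POISON_END      0xa5    /* end-byte of poisoning */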

            bfaccini Bruno Faccini (Inactive) added a comment:

            Well, lnet_msg and lnet_match_table look OK. Can you now get "kmem 0xffff88204b64a000" and "rd 0xffff88204b64a000 1024" (to confirm that the head pointer 0xffff88201fc9b000 is in fact OK, even though it is strangely aligned as I noted before, which can happen due to hash-table sizing), and also "kmem 0xffff881d88420000" and "rd ffff88138ecf5c40 32"?

            jaylan Jay Lan (Inactive) added a comment:

            The lnet_msg-lnet_match_table.data has been attached to this ticket.

            jaylan Jay Lan (Inactive) added a comment:

            We added the LU-7980 patch after you suspected LU-7980 might be the cause, so we did not have it when we hit this page fault.

            We did have the two LU-7324 patches in the code back then:
            LU-7324 lnet: recv could access freed message
            LU-7324 lnet: Use after free in lnet_ptl_match_delay()
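
            For readers unfamiliar with the LU-7324 class of bug referenced above ("recv could access freed message", use after free), the pattern is a receive path that keeps dereferencing a message after the point at which another thread may have finalized and freed it. A deliberately hypothetical sketch of the race, and of the usual shape of the fix (capture what you need while still protected); none of the names below are real LNet code:

            #include <linux/kernel.h>
            #include <linux/spinlock.h>
            #include <linux/list.h>

            /* Illustrative message type; not the real struct lnet_msg. */
            struct demo_msg {
                    struct list_head msg_list;
                    unsigned int     msg_len;
            };

            static DEFINE_SPINLOCK(demo_lock);

            /* Buggy pattern: once the lock is dropped, another thread may
             * finalize and free the message, so the later dereference is a
             * use-after-free. */
            static void demo_recv_buggy(struct demo_msg *msg)
            {
                    spin_lock(&demo_lock);
                    list_del_init(&msg->msg_list);
                    spin_unlock(&demo_lock);

                    pr_info("received %u bytes\n", msg->msg_len);  /* too late */
            }

            /* Fixed pattern: capture everything needed from the message before
             * giving up the protection that keeps it alive. */
            static void demo_recv_fixed(struct demo_msg *msg)
            {
                    unsigned int len;

                    spin_lock(&demo_lock);
                    list_del_init(&msg->msg_list);
                    len = msg->msg_len;
                    spin_unlock(&demo_lock);

                    pr_info("received %u bytes\n", len);
            }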

            bfaccini Bruno Faccini (Inactive) added a comment:

            Hello Mahmoud, thanks for this latest information (it confirms the wrong "head" value and helps the analysis move forward), and sorry for the late feedback.

            At this point in the crash-dump analysis I am almost convinced that this crash is not related to LU-7980, as I had first suspected.

            In fact, I am now more inclined to suspect it could be related to LU-7324; can you check whether both of its patches are present in the version you are running?

            Also, to investigate in this new direction, can you attach the output of both "p/x *(struct lnet_msg *)0xffff881730ca6200" and "p/x *(struct lnet_match_table *)0xffff88201fa43e00" from the crash dump?

            mhanafi Mahmoud Hanafi added a comment:

            Here you go:

            crash> rd ffff880ff8b73b70 20
            ffff880ff8b73b70:  ffff88201fa43e00 ffff88201fc9b000   .>.. ....... ...
            ffff880ff8b73b80:  ffff880ff8b73c50 ffff880ff8b73c70   P<......p<......
            ffff880ff8b73b90:  ffff881730ca6200 ffff880ffb3c6140   .b.0....@a<.....
            ffff880ff8b73ba0:  ffff88201fa43e00 0000000000000004   .>.. ...........
            ffff880ff8b73bb0:  ffff880ff8b73c30 ffffffffa0532621   0<......!&S.....
            ffff880ff8b73bc0:  0000000000000000 ffffffff00000000   ................
            ffff880ff8b73bd0:  ffff881000000000 ffff880f000000e0   ................
            ffff880ff8b73be0:  000000000000ec80 ffff881026e79ac0   ...........&....
            ffff880ff8b73bf0:  ffff881730ca6200 0000000000000002   .b.0............
            ffff880ff8b73c00:  ffff880ff8b73c30 ffff881730ca6200   0<.......b.0....
            

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7
