
[LU-8362] page fault: exception RIP: lnet_mt_match_md+135

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Components: None
    • Environment: lustre 2.7.1-fe
    • Severity: 2

    Description

      OSS console errors:

      LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
      LNet: 46045:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
      LNet: Can't send to 17456000@<65535:34821>: src 0@<0:0> is not a local nid
      LNet: 56154:0:(lib-move.c:2241:LNetPut()) Error sending PUT to 0-17456000@<65535:34821>: -22
      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
      Hardware name: SUMMIT
      list_del corruption. prev->next should be ffff881d63ead4d0, but was (null)
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 acpi_cpufreq freq_table mperf lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc scsi_transport_fc scsi_tgt sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_mirror dm_region_hash dm_log dm_multipath dm_mod iTCO_wdt iTCO_vendor_support microcode sg wmi igb hwmon dca i2c_algo_bit ptp pps_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_bic ext3 jbd sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) ahci gru [last unloaded: scsi_wait_scan]
      Pid: 8603, comm: kiblnd_sd_02_01 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
      Call Trace:
       [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff81074216>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffff812bda6e>] ? list_del+0x6e/0xa0
       [<ffffffffa052c5c9>] ? lnet_me_unlink+0x39/0x140 [lnet]
       [<ffffffffa05303f8>] ? lnet_md_unlink+0x2f8/0x3e0 [lnet]
       [<ffffffffa0531b9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
       [<ffffffffa0a1f727>] ? kiblnd_recv+0x107/0x780 [ko2iblnd]
       [<ffffffffa0531d1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
       [<ffffffffa0532621>] ? lnet_ptl_match_md+0x281/0x870 [lnet]
       [<ffffffffa05396e7>] ? lnet_parse_local+0x307/0xc60 [lnet]
       [<ffffffffa053a6da>] ? lnet_parse+0x69a/0xcf0 [lnet]
       [<ffffffffa0a1ff3b>] ? kiblnd_handle_rx+0x19b/0x620 [ko2iblnd]
       [<ffffffffa0a212be>] ? kiblnd_scheduler+0xefe/0x10d0 [ko2iblnd]
       [<ffffffff81064f90>] ? default_wake_function+0x0/0x20
       [<ffffffffa0a203c0>] ? kiblnd_scheduler+0x0/0x10d0 [ko2iblnd]
       [<ffffffff8109dc8e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffff8109dbf0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      ---[ end trace 1063d2ffc2578a2f ]---
      ------------[ cut here ]------------
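
      For reference, the WARNING above comes from the kernel's linked-list debug checks in lib/list_debug.c: before unlinking an entry, list_del() verifies that the entry's neighbours still point back at it, and "prev->next ... was (null)" means something else overwrote the node. The following is a minimal user-space sketch of that consistency check; the struct and helper are illustrative and simplified, not the kernel source itself.

      #include <stdio.h>
      #include <stddef.h>

      /* Same shape as the kernel's struct list_head. */
      struct list_head {
              struct list_head *next, *prev;
      };

      /*
       * Debug-checked unlink, modelled on the CONFIG_DEBUG_LIST path that
       * emitted the WARNING: refuse to unlink if the neighbours no longer
       * point back at 'entry'.
       */
      static int checked_list_del(struct list_head *entry)
      {
              if (entry->prev->next != entry) {
                      fprintf(stderr,
                              "list_del corruption. prev->next should be %p, but was %p\n",
                              (void *)entry, (void *)entry->prev->next);
                      return -1;
              }
              if (entry->next->prev != entry) {
                      fprintf(stderr,
                              "list_del corruption. next->prev should be %p, but was %p\n",
                              (void *)entry, (void *)entry->next->prev);
                      return -1;
              }
              entry->prev->next = entry->next;
              entry->next->prev = entry->prev;
              return 0;
      }

      int main(void)
      {
              struct list_head a, b, c;

              /* Build a tiny circular list: a <-> b <-> c <-> a. */
              a.next = &b; b.next = &c; c.next = &a;
              a.prev = &c; b.prev = &a; c.prev = &b;

              /* Simulate the corruption seen in the log: b's prev->next wiped. */
              a.next = NULL;
              return checked_list_del(&b) ? 1 : 0;
      }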
      

      From the crash dump, the backtrace (bt) looks like this:

      PID: 8603   TASK: ffff8810271fa040  CPU: 11  COMMAND: "kiblnd_sd_02_01"
       #0 [ffff880ff8b734f0] machine_kexec at ffffffff8103b5db
       #1 [ffff880ff8b73550] crash_kexec at ffffffff810c9412
       #2 [ffff880ff8b73620] kdb_kdump_check at ffffffff812973d7
       #3 [ffff880ff8b73630] kdb_main_loop at ffffffff8129a5c7
       #4 [ffff880ff8b73740] kdb_save_running at ffffffff8129472e
       #5 [ffff880ff8b73750] kdba_main_loop at ffffffff8147cd68
       #6 [ffff880ff8b73790] kdb at ffffffff812978c6
       #7 [ffff880ff8b73800] kdba_entry at ffffffff8147c687
       #8 [ffff880ff8b73810] notifier_call_chain at ffffffff81568515
       #9 [ffff880ff8b73850] atomic_notifier_call_chain at ffffffff8156857a
      #10 [ffff880ff8b73860] notify_die at ffffffff810a44fe
      #11 [ffff880ff8b73890] __die at ffffffff815663e2
      #12 [ffff880ff8b738c0] no_context at ffffffff8104c822
      #13 [ffff880ff8b73910] __bad_area_nosemaphore at ffffffff8104cad5
      #14 [ffff880ff8b73960] bad_area_nosemaphore at ffffffff8104cba3
      #15 [ffff880ff8b73970] __do_page_fault at ffffffff8104d29c
      #16 [ffff880ff8b73a90] do_page_fault at ffffffff8156845e
      #17 [ffff880ff8b73ac0] page_fault at ffffffff81565765
          [exception RIP: lnet_mt_match_md+135]
          RIP: ffffffffa0531d07  RSP: ffff880ff8b73b70  RFLAGS: 00010286
          RAX: ffff881d88420000  RBX: ffff880ff8b73c70  RCX: 0000000000000007
          RDX: 0000000000000004  RSI: ffff880ff8b73c70  RDI: ffffffffffffffff
          RBP: ffff880ff8b73bb0   R8: 0000000000000001   R9: d400000000000000
          R10: 0000000000000001  R11: 0000000000000012  R12: 0000000000000000
          R13: ffff881730ca6200  R14: 00d100120be91b91  R15: 0000000000000008
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      #18 [ffff880ff8b73bb8] lnet_ptl_match_md at ffffffffa0532621 [lnet]
      #19 [ffff880ff8b73c38] lnet_parse_local at ffffffffa05396e7 [lnet]
      #20 [ffff880ff8b73cd8] lnet_parse at ffffffffa053a6da [lnet]
      #21 [ffff880ff8b73d68] kiblnd_handle_rx at ffffffffa0a1ff3b [ko2iblnd]
      #22 [ffff880ff8b73db8] kiblnd_scheduler at ffffffffa0a212be [ko2iblnd]
      #23 [ffff880ff8b73ee8] kthread at ffffffff8109dc8e
      #24 [ffff880ff8b73f48] kernel_thread at ffffffff8100c28a
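
      Reading the backtrace, the faulting function lnet_mt_match_md() is walking the MEs attached to a match-table hash bucket and handing each one to lnet_try_match_md(); with a corrupted list, the walker ends up dereferencing a wild pointer (note RDI = ffffffffffffffff in the exception frame). Below is a rough, hypothetical sketch of that traversal pattern, using simplified names that are not the actual Lustre source.

      #include <stddef.h>

      /*
       * Illustrative stand-in for an LNet match entry sitting on a
       * match-table hash bucket; the real type is lnet_me_t.
       */
      struct demo_me {
              struct demo_me     *next;          /* bucket list linkage */
              struct demo_me     *prev;
              unsigned long long  match_bits;
              unsigned long long  ignore_bits;
      };

      /*
       * Walk one bucket (circular list with 'head' as sentinel) looking for
       * an ME whose match bits fit the incoming message.  If a neighbouring
       * object in a shared <size-128> slab has been overwritten, 'me' or
       * 'me->next' can become NULL/poison/garbage, and the dereference below
       * page-faults exactly as in the exception frame above.
       */
      struct demo_me *demo_bucket_match(struct demo_me *head,
                                        unsigned long long bits)
      {
              struct demo_me *me;

              for (me = head->next; me != head; me = me->next) {
                      if (((bits ^ me->match_bits) & ~me->ignore_bits) == 0)
                              return me;
              }
              return NULL;
      }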
      

      Attachments

        1. lnet_msg-lnet_match_table.data
          3 kB
        2. lnet_mt_match_md.dis
          8 kB
        3. lnet_mt_match_md.withlinenumbers.dis
          10 kB
        4. lu-8362.20160725
          37 kB
        5. lu8362.20160802
          409 kB
        6. lu8362-20160803
          4 kB


          Activity


            mhanafi Mahmoud Hanafi added a comment -

            This can be closed out. We will track LU-7980 and LU-4330.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Thanks one more time for this data.
            Mapping the ffff882029ab2578 address to a bdev_inode confirms this is a crash similar to the ones already encountered as part of LU-7980.
            That means you are safe, since you have integrated the patch for LU-7980. As I indicated before, you may also want to integrate the patch for LU-4330, which causes LNet MEs/small MDs to be allocated from their own kmem_cache, so they are no longer affected by bugs/corruptions from all the other pieces of software sharing the <size-128> slabs. This has proven helpful for debugging new occurrences without the noise of ME/MD activity.
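
            For context on the LU-4330 suggestion above: with the generic allocator, 128-byte LNet MEs/small MDs share <size-128> slab pages with every other kernel user of that size, so an overrun anywhere in that slab can corrupt them; a dedicated kmem_cache isolates them and makes their usage visible on its own in /proc/slabinfo. The kernel-module sketch below only illustrates that pattern; the cache name and structure are invented here and are not the actual LU-4330 patch.

            #include <linux/module.h>
            #include <linux/slab.h>
            #include <linux/list.h>
            #include <linux/types.h>

            /* Illustrative stand-in for an LNet match entry; not the real lnet_me_t. */
            struct demo_me {
                    struct list_head me_list;
                    u64              me_match_bits;
                    u64              me_ignore_bits;
            };

            static struct kmem_cache *demo_me_cache;

            static int __init demo_me_init(void)
            {
                    /*
                     * A dedicated cache means these objects no longer share
                     * slab pages with unrelated <size-128> allocations, so a
                     * stray overwrite elsewhere cannot land on an ME, and the
                     * cache shows up separately in /proc/slabinfo for debugging.
                     */
                    demo_me_cache = kmem_cache_create("demo_me_cache",
                                                      sizeof(struct demo_me), 0,
                                                      SLAB_HWCACHE_ALIGN, NULL);
                    return demo_me_cache ? 0 : -ENOMEM;
            }

            static void __exit demo_me_exit(void)
            {
                    kmem_cache_destroy(demo_me_cache);
            }

            module_init(demo_me_init);
            module_exit(demo_me_exit);
            MODULE_LICENSE("GPL");

            Allocations would then use kmem_cache_alloc(demo_me_cache, GFP_NOFS) and kmem_cache_free() instead of drawing from the generic 128-byte slab.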

            jaylan Jay Lan (Inactive) added a comment -

            File lu8362-29169893 is attached; it contains the crash data you requested.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Many thanks again Jay.

            With this new data I still cannot definitively conclude on the exact content/cause of the corruption. This is mainly because several list corruptions had been detected (and partially corrected during the unlink mechanism, which very likely created the loop in the "prev" linked list of MEs shown by "list -o 8 0xffff881d8d8bff40") preceding the crash.

            But what I can say now is that it looks very similar to several occurrences I examined during LU-7980 tracking.
            To obtain definitive proof of what I presume, can you provide the output of the "kmem ffff882029ab2578", "rd ffff881659b2c2c0 32", "rd ffff881d88425c40 32", and "rd ffff88161caf1040 32" crash sub-commands?

            Also, during LU-7980 tracking I used the patch for LU-4330 to move LNet MEs/small MDs out of the <size-128> slabs; is that something you could also try at your site?
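
            A side note on the "list -o 8 0xffff881d8d8bff40" walk mentioned above: crash follows the pointer stored at the given offset inside each node, and on x86_64 a struct list_head has next at offset 0 and prev at offset 8, so an offset of 8 walks the prev chain of the MEs (assuming, as the comment above suggests, that the ME's list_head sits at the start of the object). A tiny, self-contained illustration of that layout:

            #include <stdio.h>
            #include <stddef.h>

            /* Same layout as the kernel's struct list_head on x86_64. */
            struct list_head {
                    struct list_head *next;   /* offset 0 */
                    struct list_head *prev;   /* offset 8 */
            };

            int main(void)
            {
                    /*
                     * "list -o 0 <addr>" would follow the next pointers;
                     * "list -o 8 <addr>" follows the prev pointers, which is
                     * how the loop/duplicate in the ME list was exposed here.
                     */
                    printf("offsetof(next) = %zu\n", offsetof(struct list_head, next));
                    printf("offsetof(prev) = %zu\n", offsetof(struct list_head, prev));
                    return 0;
            }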

            jaylan Jay Lan (Inactive) added a comment -

            Uploaded information in lu8362.20160802.

            The list contains 24564 entries and ended with a duplicate.

            bfaccini Bruno Faccini (Inactive) added a comment -

            There is still no evidence of LU-7980 and still no understanding of the corruption.
            To be sure nothing has been missed, can you provide the output of "list -o 8 0xffff881d8d8bff40", "rd ffff881d1b641e40 32", and "kmem ffff8805f040b2e8"?

            jaylan Jay Lan (Inactive) added a comment -

            Requested information in attachment lu-8362.20160725.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hmm, I made a mistake with the first address I asked you to check and dump; it has nothing to do with your crash dump, because it comes from the one I was using to mimic what I wanted extracted from yours. So replace the first two commands with "kmem 0xffff88201fc9a000" and "rd 0xffff88201fc9a000 1024", and also let the third command, "kmem 0xffff881d88420000", run to completion.

            jaylan Jay Lan (Inactive) added a comment -

            The first two commands failed. The third did not return for more than 5 minutes before I terminated it.

            crash> kmem 0xffff88204b64a000
            kmem: WARNING: cannot find mem_map page for address: ffff88204b64a000
            204b64a000: kernel virtual address not found in mem map
            crash> rd 0xffff88204b64a000 1024
            rd: seek error: kernel virtual address: ffff88204b64a000 type: "64-bit KVADDR"
            crash> kmem 0xffff881d88420000

            crash> rd ffff88138ecf5c40 32
            ffff88138ecf5c40: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c50: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c60: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c70: 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ZZZZZZZZZZZZZZZZ
            ffff88138ecf5c80: 5a5a5a5a5a5a5a5a 0000000000000000 ZZZZZZZZ........
            ffff88138ecf5c90: 0000000000000000 0000000000000000 ................
            ffff88138ecf5ca0: 0000000000000000 0000000000000000 ................
            ffff88138ecf5cb0: 0000000000000000 0000000000000000 ................
            ffff88138ecf5cc0: ffff881659b2c340 ffff881d1b641ec0 @..Y......d.....
            ffff88138ecf5cd0: ffff88166a9e5ed0 ffffc90022095750 .^.j....PW."....
            ffff88138ecf5ce0: 0000001490b6a75a ffffffffffffffff Z...............
            ffff88138ecf5cf0: ffff881effffffff 000001000000001c ................
            ffff88138ecf5d00: 0000000000000000 ffffffffffffffff ................
            ffff88138ecf5d10: 0000000000000001 ffff881b54a070c0 .........p.T....
            ffff88138ecf5d20: 0000000000000000 0000000000000000 ................
            ffff88138ecf5d30: 0000000000000000 0000000000000000 ................
            crash>

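
            One observation on the rd output above: the long runs of 0x5a ('Z') bytes match the kernel's POISON_INUSE slab-poison value, so if slab debugging was active on this kernel they would mark an object that was allocated but never written. This is only a hint; the bytes could equally be ordinary data. For reference, the relevant constants as defined in include/linux/poison.h of this kernel generation:

            /* Slab poison bytes from include/linux/poison.h (2.6.32 era). */
            #define POISON_INUSE  0x5a   /* allocated but not initialized; 'Z' in ASCII */
            #define POISON_FREE   0x6b   /* object has been freed                       */
            #define POISON_END    0xa5   /* terminating byte of a poisoned area         */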

            bfaccini Bruno Faccini (Inactive) added a comment -

            Well, lnet_msg and lnet_match_table look OK. So can you now get "kmem 0xffff88204b64a000" and "rd 0xffff88204b64a000 1024" (to confirm that the head pointer 0xffff88201fc9b000 is in fact fine, even if strangely aligned as I stated before, which can happen due to hash-table sizing), and also "kmem 0xffff881d88420000" and "rd ffff88138ecf5c40 32"?

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7
