Lustre / LU-8022

LNet: BUG: unable to handle kernel NULL pointer dereference

Details

    Description

      The error occurred during soak testing of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). DNE is enabled. OSTs were formatted using ZFS and MDTs using ldiskfs. OSS and MDS nodes are configured in an HA active-active failover configuration.

      During system boot, an MDS node that had been restarted crashed with the following error during LNet initialization:

      Lustre: Lustre: Build Version: 2.8.51_28_gba2ac35
      LNetError: 3247:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 192.168.1.108@o2ib10 on NA (ib0:0:192.168.1.110): bad dst nid 192.168.1.110@o2ib10
      BUG: unable to handle kernel NULL pointer dereference
      LNet: Added LNI 192.168.1.110@o2ib10 [8/256/0/180]
       at 0000000000000080
      IP: [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
      PGD 839067067 PUD 8383e1067 PMD 0 
      Oops: 0000 [#1] SMP 
      last sysfs file: /sys/module/lnet/initstate
      CPU 0 
      Modules linked in: ko2iblnd(U) ptlrpc(+)(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core joydev lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod scsi_dh_rdac [last unloaded: scsi_wait_scan]
      
      Pid: 3247, comm: ib_cm/0 Tainted: P           -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1 Intel Corporation SandyBridge Platform/To be filled by O.E.M.
      RIP: 0010:[<ffffffffa0b861e6>]  [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
      RSP: 0018:ffff8804318e7b20  EFLAGS: 00010246
      RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
      RBP: ffff8804318e7be0 R08: 000000000001b9c2 R09: 00000000fffffffb
      R10: 0000000000000003 R11: 0000000000000000 R12: ffff880835a6dc20
      R13: ffffffffa0b92263 R14: ffff880432df7800 R15: ffffffffa06a1020
      FS:  0000000000000000(0000) GS:ffff880038600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ib_cm/0 (pid: 3247, threadinfo ffff8804318e4000, task ffff880431015520)
      Stack:
       ffff880835a6dc20 ffffffffa06a1020 0000000000000004 ffff8808369be5d0
      <d> ffff8804318e7b80 00000000814e7731 0005000ac0a8016c ffff880800000012
      <d> 000300120be91b91 0000000000000000 0000100000000008 ffffffffa011bcbc
      Call Trace:
       [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
       [<ffffffffa0b87c3d>] kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
       [<ffffffffa034a011>] cma_req_handler+0x371/0x640 [rdma_cm]
       [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
       [<ffffffffa0322b27>] cm_process_work+0x27/0x110 [ib_cm]
       [<ffffffffa0323735>] cm_req_handler+0x6b5/0xac0 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffffa0324275>] cm_work_handler+0x135/0x1206 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffff8109ab40>] worker_thread+0x170/0x2a0
       [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
       [<ffffffff810a138e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Code: e8 90 ff a6 ff 0f b7 95 78 ff ff ff 8b bd 78 ff ff ff 48 89 de 66 89 55 84 e8 27 46 00 00 83 bd 78 ff ff ff 11 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2 89 d0 75 05 b8 00 01 00 00 
      RIP  [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
       RSP <ffff8804318e7b20>
      CR2: 0000000000000080
      ---[ end trace 01db8c57e9900e3f ]---
      Kernel panic - not syncing: Fatal exception
      Pid: 3247, comm: ib_cm/0 Tainted: P      D    -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1
      Call Trace:
       [<ffffffff815394d1>] ? panic+0xa7/0x16f
       [<ffffffff8153e2d4>] ? oops_end+0xe4/0x100
       [<ffffffff8104e8cb>] ? no_context+0xfb/0x260
       [<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
       [<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
       [<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
       [<ffffffff81336a9f>] ? extract_buf+0x9f/0x130
       [<ffffffff815401fe>] ? do_page_fault+0x3e/0xa0
       [<ffffffff8153d5a5>] ? page_fault+0x25/0x30
       [<ffffffffa0b861e6>] ? kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
       [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
       [<ffffffffa0b87c3d>] ? kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
       [<ffffffffa034a011>] ? cma_req_handler+0x371/0x640 [rdma_cm]
       [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
       [<ffffffffa0322b27>] ? cm_process_work+0x27/0x110 [ib_cm]
       [<ffffffffa0323735>] ? cm_req_handler+0x6b5/0xac0 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffffa0324275>] ? cm_work_handler+0x135/0x1206 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffff8109ab40>] ? worker_thread+0x170/0x2a0
       [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
       [<ffffffff810a138e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Unfortunately, no crash dump was written. The only error output available is the console log of the affected node (lola-10), so only that MDS console log has been attached.
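
      Since the console log is the only evidence, a first-pass triage can be scripted directly from the oops text. The following is a minimal sketch, not part of the original report: the helper names parse_oops and decode_mov_load are hypothetical, the excerpt is copied from the log above with the Code line shortened to the bytes around the faulting instruction, and the decoder handles only the single instruction form seen here (REX.W 8B /r with a 32-bit displacement) rather than general x86.

```python
import re

# Excerpt copied from the console log in this report.
# The Code line is shortened to the bytes around the faulting instruction.
OOPS = """\
BUG: unable to handle kernel NULL pointer dereference
 at 0000000000000080
IP: [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
Pid: 3247, comm: ib_cm/0 Tainted: P           -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
Code: 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2
"""

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def parse_oops(text):
    """Extract the fields needed for a first look at an x86_64 oops."""
    ip = re.search(
        r"IP: \[<([0-9a-f]+)>\] (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+) \[(\w+)\]", text
    )
    pid = re.search(r"Pid: (\d+), comm: (\S+)", text)
    # Three-character register names followed by a 16-digit hex value.
    regs = {
        name: int(value, 16)
        for name, value in re.findall(r"\b([A-Z][A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    }
    code = re.search(r"Code: (.*)", text).group(1)
    # The kernel marks the first byte of the faulting instruction with <..>.
    tail = code[code.index("<"):].replace("<", "").replace(">", "")
    return {
        "symbol": ip.group(2),
        "offset": int(ip.group(3), 16),
        "module": ip.group(5),
        "pid": int(pid.group(1)),
        "comm": pid.group(2),
        "regs": regs,
        "fault_bytes": bytes(int(b, 16) for b in tail.split()),
    }


def decode_mov_load(insn):
    """Decode `REX.W 8B /r` with mod == 0b10 (disp32) and no SIB byte only."""
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    if rex != 0x48 or opcode != 0x8B or modrm >> 6 != 0b10 or modrm & 7 == 0b100:
        raise ValueError("unsupported instruction form")
    dest = REG64[(modrm >> 3) & 7]   # register being loaded
    base = REG64[modrm & 7]          # register holding the pointer
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return dest, base, disp


info = parse_oops(OOPS)
dest, base, disp = decode_mov_load(info["fault_bytes"])
fault_addr = info["regs"][base.upper()] + disp

print(f"{info['symbol']}+{info['offset']:#x} [{info['module']}], "
      f"pid {info['pid']} ({info['comm']})")
print(f"mov {disp:#x}(%{base}),%{dest} -> address {fault_addr:#x}, "
      f"CR2 {info['regs']['CR2']:#x}")
```

      For this log the faulting bytes decode to a load from offset 0x80 of the pointer held in RBX. RBX is zero and CR2 is 0x80, which is consistent with a field being read through a NULL structure pointer inside kiblnd_passive_connect. Which structure that pointer refers to cannot be determined from the console log alone.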

            People

              doug Doug Oucharek (Inactive)
              heckes Frank Heckes (Inactive)