LNet: BUG: unable to handle kernel NULL pointer dereference

    Description

      The error happened during soak testing of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). DNE is enabled. OSTs have been formatted using ZFS, MDTs using ldiskfs. OSS and MDT nodes are configured in an HA active-active failover configuration.

      During system boot, an MDS node that had been restarted crashed with the following error during LNet initialization:

      Lustre: Lustre: Build Version: 2.8.51_28_gba2ac35
      LNetError: 3247:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 192.168.1.108@o2ib10 on NA (ib0:0:192.168.1.110): bad dst nid 192.168.1.110@o2ib10
      BUG: unable to handle kernel NULL pointer dereference
      LNet: Added LNI 192.168.1.110@o2ib10 [8/256/0/180]
       at 0000000000000080
      IP: [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
      PGD 839067067 PUD 8383e1067 PMD 0 
      Oops: 0000 [#1] SMP 
      last sysfs file: /sys/module/lnet/initstate
      CPU 0 
      Modules linked in: ko2iblnd(U) ptlrpc(+)(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad r
      dma_cm ib_cm iw_cm dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate za
      vl(P)(U) zunicode(P)(U) sb_edac edac_core joydev lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t1
      0dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror
       dm_region_hash dm_log dm_mod scsi_dh_rdac [last unloaded: scsi_wait_scan]
      
      Pid: 3247, comm: ib_cm/0 Tainted: P           -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1 Intel Corporation SandyBridge Platform/To be
       filled by O.E.M.
      RIP: 0010:[<ffffffffa0b861e6>]  [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
      RSP: 0018:ffff8804318e7b20  EFLAGS: 00010246
      RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
      RBP: ffff8804318e7be0 R08: 000000000001b9c2 R09: 00000000fffffffb
      R10: 0000000000000003 R11: 0000000000000000 R12: ffff880835a6dc20
      R13: ffffffffa0b92263 R14: ffff880432df7800 R15: ffffffffa06a1020
      FS:  0000000000000000(0000) GS:ffff880038600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ib_cm/0 (pid: 3247, threadinfo ffff8804318e4000, task ffff880431015520)
      Stack:
       ffff880835a6dc20 ffffffffa06a1020 0000000000000004 ffff8808369be5d0
      <d> ffff8804318e7b80 00000000814e7731 0005000ac0a8016c ffff880800000012
      <d> 000300120be91b91 0000000000000000 0000100000000008 ffffffffa011bcbc
      Call Trace:
       [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
       [<ffffffffa0b87c3d>] kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
       [<ffffffffa034a011>] cma_req_handler+0x371/0x640 [rdma_cm]
       [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
       [<ffffffffa0322b27>] cm_process_work+0x27/0x110 [ib_cm]
       [<ffffffffa0323735>] cm_req_handler+0x6b5/0xac0 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffffa0324275>] cm_work_handler+0x135/0x1206 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffff8109ab40>] worker_thread+0x170/0x2a0
       [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
       [<ffffffff810a138e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Code: e8 90 ff a6 ff 0f b7 95 78 ff ff ff 8b bd 78 ff ff ff 48 89 de 66 89 55 84 e8 27 46 00 00 83 bd 78 ff ff ff 11 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2 89 d0 75 05 b8 00 01 00 00 
      RIP  [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
       RSP <ffff8804318e7b20>
      CR2: 0000000000000080
      ---[ end trace 01db8c57e9900e3f ]---
      Kernel panic - not syncing: Fatal exception
      Pid: 3247, comm: ib_cm/0 Tainted: P      D    -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1
      Call Trace:
       [<ffffffff815394d1>] ? panic+0xa7/0x16f
       [<ffffffff8153e2d4>] ? oops_end+0xe4/0x100
       [<ffffffff8104e8cb>] ? no_context+0xfb/0x260
       [<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
       [<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
       [<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
       [<ffffffff81336a9f>] ? extract_buf+0x9f/0x130
       [<ffffffff815401fe>] ? do_page_fault+0x3e/0xa0
       [<ffffffff8153d5a5>] ? page_fault+0x25/0x30
       [<ffffffffa0b861e6>] ? kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
       [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
       [<ffffffffa0b87c3d>] ? kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
       [<ffffffffa034a011>] ? cma_req_handler+0x371/0x640 [rdma_cm]
       [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
       [<ffffffffa0322b27>] ? cm_process_work+0x27/0x110 [ib_cm]
       [<ffffffffa0323735>] ? cm_req_handler+0x6b5/0xac0 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffffa0324275>] ? cm_work_handler+0x135/0x1206 [ib_cm]
       [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
       [<ffffffff8109ab40>] ? worker_thread+0x170/0x2a0
       [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
       [<ffffffff810a138e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Unfortunately, no crash dump was written. The only error output available was extracted from the console log of the affected node (lola-10), so only the MDS console log has been attached.

          Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21001/
            Subject: LU-8022 lnet: Correct position of lnet_ni_decref()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e8278552cfbcf518209a38f82548a16833686ae9

            gerrit Gerrit Updater added a comment -

            Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/21001
            Subject: LU-8022 lnet: Correct position of lnet_ni_decref()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 046a485e69dc879bf112690c1434dee86292554b

            pjones Peter Jones added a comment -

            Landed for 2.9

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19614/
            Subject: LU-8022 lnet: Don't access NULL NI on failure path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f5c7fec23cb26219d959290a4a311119747cc609

            ezell Matt Ezell added a comment -

            We hit this today on our LNET routers when upgrading a cluster to 2.8 with LU-7101. Router pinger messages come in before all the NIs have been added, causing this failure.

            heckes Frank Heckes (Inactive) added a comment -

            In the soak test session for build '20160427', which includes patch set 1 of change #19614, the error has not recurred. The soak session has now been running for 10 days.

            heckes Frank Heckes (Inactive) added a comment -

            The patch has been included in build '20160427' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160427) and will be verified in the soak test session associated with this build.

            gerrit Gerrit Updater added a comment -

            Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/19614
            Subject: LU-8022 lnet: Don't access NULL NI on failure path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7163df3dcd22199609539530a6a761acc6fd689e

            doug Doug Oucharek (Inactive) added a comment -

            OK, I see two problems here:

            1- The network interface (NI) for the IB card seems to have "disappeared", almost as if the device ib0 went down and was removed from our list of available NIs. However, we still received a connection request to that NI, and that is the failure being reported in the error log.
            2- The failure path then tries to dereference the NULL NI pointer, which causes the crash. That, of course, must be fixed.

            I am going to use this ticket to fix problem 2 (dereferencing the NULL NI pointer) so we don't crash. I don't have enough information to address the first problem, so that will have to wait until it is reproduced with this fix in place to prevent the crash.

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Doug,

            Could you have a look at this?

            Thanks.
            Joe

            People

              doug Doug Oucharek (Inactive)
              heckes Frank Heckes (Inactive)