list_del corruption - client crashes

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • Lustre 2.1.5
    • Hyperion/LLNL - SWL testing
    • 3
    • 7232

    Description

      After several hours of SWL runs, multiple clients crashed.
      First example:

      2013-03-14 06:13:47 ------------[ cut here ]------------
      2013-03-14 06:13:47 WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G        W  ---------------   )
      2013-03-14 06:13:47 Hardware name: XS23-TY
      2013-03-14 06:13:47 list_del corruption. next->prev should be ffff8801aee8bc50, but was 0504000006000001
      2013-03-14 06:13:47 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
      2013-03-14 06:13:47 Pid: 3160, comm: ipoib Tainted: G        W  ---------------    2.6.32-279.22.1.el6.x86_64 #1
      2013-03-14 06:13:47 Call Trace:
      2013-03-14 06:13:47  [<ffffffff8106a2a7>] ? warn_slowpath_common+0x87/0xc0
      2013-03-14 06:13:47  [<ffffffff8106a396>] ? warn_slowpath_fmt+0x46/0x50
      2013-03-14 06:13:47  [<ffffffff81279f0d>] ? list_del+0x8d/0xa0
      2013-03-14 06:13:47  [<ffffffffa0347619>] ? ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
      2013-03-14 06:13:47  [<ffffffffa0347550>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
      2013-03-14 06:13:47  [<ffffffff8108b370>] ? worker_thread+0x170/0x2a0
      2013-03-14 06:13:47  [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
      2013-03-14 06:13:47  [<ffffffff8108b200>] ? worker_thread+0x0/0x2a0
      2013-03-14 06:13:47  [<ffffffff81090876>] ? kthread+0x96/0xa0
      2013-03-14 06:13:47  [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      2013-03-14 06:13:47  [<ffffffff810907e0>] ? kthread+0x0/0xa0
      2013-03-14 06:13:47  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14 06:13:47 ---[ end trace e1288d85056fd00d ]---
      2013-03-14 06:13:47 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      2013-03-14 06:13:47 IP: [<ffffffff81279e9b>] list_del+0x1b/0xa0
      2013-03-14 06:13:47 PGD 174282067 PUD 145d8f067 PMD 0
      2013-03-14 06:13:47 Oops: 0000 [#1] SMP
      2013-03-14 06:13:47 last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth1/statistics/tx_errors
      2013-03-14 06:13:47 CPU 2
      2013-03-14 06:13:47 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
      2013-03-14 06:13:47
      2013-03-14 06:13:47 Pid: 3160, comm: ipoib Tainted: G        W  ---------------    2.6.32-279.22.1.el6.x86_64 #1 Dell        XS23-TY     /XS23-TY
      2013-03-14 06:13:47 RIP: 0010:[<ffffffff81279e9b>]  [<ffffffff81279e9b>] list_del+0x1b/0xa0
      2013-03-14 06:13:47 RSP: 0018:ffff880339053db0  EFLAGS: 00010046
      2013-03-14 06:13:47 RAX: 0000000000000000 RBX: ffff8801b082f8d0 RCX: 0000000000004aef
      2013-03-14 06:13:47 RDX: 0000000000000246 RSI: ffff8801bb8444d0 RDI: ffff8801b082f8d0
      2013-03-14 06:13:47 RBP: ffff880339053dc0 R08: ffff8801b082f8d0 R09: 0000000000000000
      2013-03-14 06:13:47 R10: ffff8801c0065680 R11: 0000000000000000 R12: ffff8801ba034020
      2013-03-14 06:13:47 R13: 0000000000000246 R14: ffff8801ba697e80 R15: ffff8801ba0346e0
      2013-03-14 06:13:47 FS:  0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
      2013-03-14 06:13:47 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      2013-03-14 06:13:47 CR2: 0000000000000008 CR3: 00000001a4639000 CR4: 00000000000006e0
      2013-03-14 06:13:47 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-03-14 06:13:47 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-03-14 06:13:47 Process ipoib (pid: 3160, threadinfo ffff880339052000, task ffff880339256040)
      2013-03-14 06:13:47 Stack:
      2013-03-14 06:13:47  0000000109b77ac5 ffff8801b082f8c0 ffff880339053e30 ffffffffa0347619
      2013-03-14 06:13:47 <d> ffff88033c1acaa0 ffff880339256040 ffff8801ba0352e8 ffff8801ba034340
      2013-03-14 06:13:47 <d> ffff880339053e30 ffffffff00000002 ffffe8fe62609a40 ffffe8fe62609a40
      2013-03-14 06:13:47 Call Trace:
      2013-03-14 06:13:47  [<ffffffffa0347619>] ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
      2013-03-14 06:13:47  [<ffffffffa0347550>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
      2013-03-14 06:13:47  [<ffffffff8108b370>] worker_thread+0x170/0x2a0
      2013-03-14 06:13:47  [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
      2013-03-14 06:13:47  [<ffffffff8108b200>] ? worker_thread+0x0/0x2a0
      2013-03-14 06:13:47  [<ffffffff81090876>] kthread+0x96/0xa0
      2013-03-14 06:13:47  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-03-14 06:13:47  [<ffffffff810907e0>] ? kthread+0x0/0xa0
      2013-03-14 06:13:47  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14 06:13:47 Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48
      2013-03-14 06:13:47 RIP  [<ffffffff81279e9b>] list_del+0x1b/0xa0
      2013-03-14 06:13:47  RSP <ffff880339053db0>
      2013-03-14 06:13:47 CR2: 0000000000000008
      

      Second Example:

      2013-03-14 07:15:50 ------------[ cut here ]------------
      2013-03-14 07:15:50 WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: G        W  ---------------   )
      2013-03-14 07:15:50 Hardware name: XS23-TY
      2013-03-14 07:15:50 list_add corruption. prev->next should be next (ffff8801af5ed2d0), but was ffff88033b3addd0. (prev=ffff8801ba25f2e8).
      2013-03-14 07:15:50 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ahci i7core_edac edac_core ioatdma dca shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
      2013-03-14 07:15:50 Pid: 4328, comm: kiblnd_sd_07 Tainted: G        W  ---------------    2.6.32-279.22.1.el6.x86_64 #1
      2013-03-14 07:15:50 Call Trace:
      2013-03-14 07:15:50  <IRQ>  [<ffffffff8106a2a7>] ? warn_slowpath_common+0x87/0xc0
      2013-03-14 07:15:50  [<ffffffff8106a396>] ? warn_slowpath_fmt+0x46/0x50
      2013-03-14 07:15:50  [<ffffffff81279faf>] ? __list_add+0x8f/0xa0
      2013-03-14 07:15:50  [<ffffffffa033fb7e>] ? ipoib_cm_destroy_tx+0x6e/0xc0 [ib_ipoib]
      2013-03-14 07:15:50  [<ffffffffa0337b39>] ? ipoib_neigh_dtor+0x89/0xf0 [ib_ipoib]
      2013-03-14 07:15:50  [<ffffffffa0337bc8>] ? ipoib_neigh_reclaim+0x28/0x30 [ib_ipoib]
      2013-03-14 07:15:50  [<ffffffff810de635>] ? __rcu_process_callbacks+0x135/0x350
      2013-03-14 07:15:50  [<ffffffff81012a69>] ? read_tsc+0x9/0x20
      2013-03-14 07:15:50  [<ffffffff810de87b>] ? rcu_process_callbacks+0x2b/0x50
      2013-03-14 07:15:50  [<ffffffff81072ac1>] ? __do_softirq+0xc1/0x1e0
      2013-03-14 07:15:50  [<ffffffff81095760>] ? hrtimer_interrupt+0x140/0x250
      2013-03-14 07:15:50  [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
      2013-03-14 07:15:50  [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
      2013-03-14 07:15:50  [<ffffffff810728a5>] ? irq_exit+0x85/0x90
      2013-03-14 07:15:50  [<ffffffff814f2360>] ? smp_apic_timer_interrupt+0x70/0x9b
      2013-03-14 07:15:50  [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
      2013-03-14 07:15:50  <EOI>  [<ffffffff814ec947>] ? _spin_unlock_irqrestore+0x17/0x20
      2013-03-14 07:15:50  [<ffffffffa0322a46>] ? mlx4_ib_poll_cq+0x2c6/0x7f0 [mlx4_ib]
      2013-03-14 07:15:50  [<ffffffffa07a4478>] ? kiblnd_scheduler+0xf8/0x760 [ko2iblnd]
      2013-03-14 07:15:50  [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      2013-03-14 07:15:50  [<ffffffffa07a4380>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
      2013-03-14 07:15:50  [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      2013-03-14 07:15:50  [<ffffffffa07a4380>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
      2013-03-14 07:15:50  [<ffffffffa07a4380>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
      2013-03-14 07:15:50  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14 07:15:50 ---[ end trace ceec6f0d4be48403 ]---
      2013-03-14 07:15:50 general protection fault: 0000 [#1] SMP
      2013-03-14 07:15:50 last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
      2013-03-14 07:15:50 CPU 0
      2013-03-14 07:15:50 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ahci i7core_edac edac_core ioatdma dca shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
      2013-03-14 07:15:50
      2013-03-14 07:15:50 Pid: 3208, comm: ipoib Tainted: G        W  ---------------    2.6.32-279.22.1.el6.x86_64 #1 Dell        XS23-TY     /XS23-TY
      2013-03-14 07:15:50 RIP: 0010:[<ffffffff81279e9b>]  [<ffffffff81279e9b>] list_del+0x1b/0xa0
      2013-03-14 07:15:50 RSP: 0018:ffff8801bba1ddb0  EFLAGS: 00010046
      2013-03-14 07:15:50 RAX: dead000000100100 RBX: ffff8801af5ed2d0 RCX: 000000000000b9d4
      2013-03-14 07:15:50 RDX: 0000000000000246 RSI: ffff8801bfe979d0 RDI: ffff8801af5ed2d0
      2013-03-14 07:15:50 RBP: ffff8801bba1ddc0 R08: ffff8801af5ed2d0 R09: 0000000000000000
      2013-03-14 07:15:50 R10: ffff8801c0065880 R11: 0000000000000000 R12: ffff8801ba25e020
      2013-03-14 07:15:50 R13: 0000000000000246 R14: ffff8801ba021400 R15: ffff8801ba25e6e0
      2013-03-14 07:15:50 FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
      2013-03-14 07:15:50 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      2013-03-14 07:15:50 CR2: 00002aaab80041f8 CR3: 0000000175615000 CR4: 00000000000006f0
      2013-03-14 07:15:50 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-03-14 07:15:50 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-03-14 07:15:50 Process ipoib (pid: 3208, threadinfo ffff8801bba1c000, task ffff8801bb536080)
      2013-03-14 07:15:50 Stack:
      2013-03-14 07:15:50  0000000109f05306 ffff8801af5ed2c0 ffff8801bba1de30 ffffffffa0340619
      2013-03-14 07:15:50 <d> ffffffff81a8d020 ffff8801bb536080 ffff8801ba25f2e8 ffff8801ba25e340
      2013-03-14 07:15:50 <d> 00000078bba1de30 0000000000000000 ffff8801bba1de10 ffffe8fe62609a40
      2013-03-14 07:15:50 Call Trace:
      2013-03-14 07:15:50  [<ffffffffa0340619>] ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
      2013-03-14 07:15:50  [<ffffffffa0340550>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
      2013-03-14 07:15:50  [<ffffffff8108b370>] worker_thread+0x170/0x2a0
      2013-03-14 07:15:50  [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
      2013-03-14 07:15:50  [<ffffffff8108b200>] ? worker_thread+0x0/0x2a0
      2013-03-14 07:15:50  [<ffffffff81090876>] kthread+0x96/0xa0
      2013-03-14 07:15:50  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-03-14 07:15:50  [<ffffffff810907e0>] ? kthread+0x0/0xa0
      2013-03-14 07:15:50  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14 07:15:50 Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48
      2013-03-14 07:15:50 RIP  [<ffffffff81279e9b>] list_del+0x1b/0xa0
      2013-03-14 07:15:50  RSP <ffff8801bba1ddb0>
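
The warnings above come from the list debugging checks in lib/list_debug.c, which verify that an entry's neighbours still point back at it before unlinking. The following is a minimal user-space sketch of that idea, not the kernel's code: `checked_list_del` is a hypothetical helper, and the poison constants are assumed to match the x86_64 values (RAX in the second oops, dead000000100100, equals the value used here for `LIST_POISON1`, which would be consistent with deleting an entry that had already been unlinked). Unlike the kernel routine, which warns and then proceeds, the sketch refuses to touch a list that looks inconsistent.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Assumed x86_64 poison values; used only as markers, never dereferenced.
 * The integer-to-pointer casts assume a 64-bit build. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000100100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000200200ULL)

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/*
 * Hypothetical checked delete. Returns 0 on success and -1 if the entry
 * was already unlinked or its neighbours do not point back at it.
 */
static int checked_list_del(struct list_head *entry)
{
    assert(entry != NULL);

    /* A poisoned or NULL link means the entry is not on any list. */
    if (entry->next == NULL || entry->prev == NULL ||
        entry->next == LIST_POISON1 || entry->prev == LIST_POISON2) {
        fprintf(stderr, "list_del: entry %p is not linked\n", (void *)entry);
        return -1;
    }
    if (entry->prev->next != entry) {
        fprintf(stderr, "list_del corruption. prev->next should be %p, but was %p\n",
                (void *)entry, (void *)entry->prev->next);
        return -1;
    }
    if (entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption. next->prev should be %p, but was %p\n",
                (void *)entry, (void *)entry->next->prev);
        return -1;
    }

    /* Unlink, then poison so a second delete is caught above. */
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return 0;
}
```

Calling `checked_list_del` twice on the same entry returns -1 the second time, where an unchecked delete would dereference the poison value and fault in the same way the second oops does.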
      

          Activity

            [LU-2967] list_del corruption - client crashes
            ys Yang Sheng added a comment -

            The 2.6.32-358.11.1.el6 update already includes this fix (LU-3461), so closing this one.

            green Oleg Drokin added a comment -

            Here is a change to pull in the upstream fix while Red Hat waits for confirmation that the fix is effective. The master version is at http://review.whamcloud.com/5952

            Also, I just realized that we are not really sure whether master is good enough to withstand an SWL run at this time, so I made a b2_1 patch too: http://review.whamcloud.com/5953 (it reverts back to the problematic commit that was used originally for this bug report, but with the fix added on top).

            ys Yang Sheng added a comment -

            The latest 2.6.32-358.2.1.el6 still does not include the fix (upstream fa16ebed31f336e41970f3f0ea9e8279f6be2d27).

            pjones Peter Jones added a comment -

            Yangsheng

            Please confirm when a kernel update exists which fixes this Red Hat bug

            thanks

            Peter

            green Oleg Drokin added a comment -

            RedHat bug (confirmed, with a reference to fix): https://bugzilla.redhat.com/show_bug.cgi?id=913645


            cliffw Cliff White (Inactive) added a comment -

            Yes, the test failing is SWL which is run routinely.
            mdiep Minh Diep added a comment -

            The changes around that function in ipoib_cm.c between 279.14.1 and 279.22.1 are:

            [root@fat-amd-4 infiniband]# diff ulp/ipoib/ipoib_cm.c /root/kernel14/linux-2.6.32-279.14.1.el6/drivers/infiniband/ulp/ipoib/ipoib_cm.c
            812c812,814
            < ipoib_neigh_free(neigh);
            ---
            > if (neigh->ah)
            > ipoib_put_ah(neigh->ah);
            > ipoib_neigh_free(dev, neigh);
            1229c1231,1233
            < ipoib_neigh_free(neigh);
            ---
            > if (neigh->ah)
            > ipoib_put_ah(neigh->ah);
            > ipoib_neigh_free(dev, neigh);
            1276c1280
            < tx->neigh->daddr + 4);
            ---
            > tx->neigh->dgid.raw);
            1301c1305
            < qpn = IPOIB_QPN(neigh->daddr);
            ---
            > qpn = IPOIB_QPN(neigh->neighbour->ha);
            1317c1321,1323
            < ipoib_neigh_free(neigh);
            ---
            > if (neigh->ah)
            > ipoib_put_ah(neigh->ah);
            > ipoib_neigh_free(dev, neigh);

            mdiep Minh Diep added a comment -

            Have you run the same test on master, which has version 279.19.1?


            cliffw Cliff White (Inactive) added a comment -

            279.14.1 would be the last kernel that passed.
            green Oleg Drokin added a comment -

            Looking at the changelog for 279.22.1.el6 that introduced this I see:

            BZ#880085
            Previously, the IP over Infiniband (IPoIB) driver maintained state information about neighbors on the network by attaching it to the core network's neighbor structure. However, due to a race condition between the freeing of the core network neighbor struct and the freeing of the IPoIB network struct, a use after free condition could happen, resulting in either a kernel oops or 4 or 8 bytes of kernel memory being zeroed when it was not supposed to be. These patches decouple the IPoIB neighbor struct from the core networking stack's neighbor struct so that there is no race between the freeing of one and the freeing of the other.

            So this must be it: the failure is in the neighbor handling code, but I do not have enough permissions in RH bz to check the patch.
            I think it's time to file a bug with RH.
            We first hit it going from lnxrel="279.14.1.el6" to lnxrel="279.22.1.el6"
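
The kernel releases named in this thread can be turned into a rough check of whether a given RHEL 6 kernel falls in the reported range. The sketch below is only an illustration: the function and type names are hypothetical, and it assumes that every release from 279.22.1 up to (but not including) 358.11.1 behaves like the two releases actually reported here (279.22.1 and 358.2.1), and that releases at or below 279.14.1 predate the regression. Releases between 279.14.1 and 279.22.1 are reported as untested.

```c
#include <assert.h>
#include <stdio.h>

/* The "A.B.C" part of a release string such as "2.6.32-279.22.1.el6". */
struct el6_release {
    int a, b, c;
};

enum lu2967_status {
    LU2967_UNPARSED,  /* not a 2.6.32-A.B.C release string */
    LU2967_PASSED,    /* at or below the last kernel reported to pass */
    LU2967_UNTESTED,  /* between last passing and first failing */
    LU2967_AFFECTED,  /* in the range assumed to lack the fix */
    LU2967_FIXED      /* at or above the release reported to carry the fix */
};

/* Parse "2.6.32-A.B.C..." into r; returns 0 on success, -1 otherwise. */
static int parse_release(const char *s, struct el6_release *r)
{
    assert(s != NULL && r != NULL);
    return sscanf(s, "2.6.32-%d.%d.%d", &r->a, &r->b, &r->c) == 3 ? 0 : -1;
}

/* Numeric, field-by-field comparison: 358.2.1 sorts before 358.11.1. */
static int cmp_release(const struct el6_release *x, const struct el6_release *y)
{
    if (x->a != y->a)
        return x->a < y->a ? -1 : 1;
    if (x->b != y->b)
        return x->b < y->b ? -1 : 1;
    if (x->c != y->c)
        return x->c < y->c ? -1 : 1;
    return 0;
}

static enum lu2967_status classify_release(const char *s)
{
    /* Data points taken from the comments in this issue. */
    static const struct el6_release last_pass = {279, 14, 1};
    static const struct el6_release first_fail = {279, 22, 1};
    static const struct el6_release first_fixed = {358, 11, 1};
    struct el6_release r;

    if (parse_release(s, &r) != 0)
        return LU2967_UNPARSED;
    if (cmp_release(&r, &first_fixed) >= 0)
        return LU2967_FIXED;
    if (cmp_release(&r, &first_fail) >= 0)
        return LU2967_AFFECTED;  /* assumption: the whole range lacks the fix */
    if (cmp_release(&r, &last_pass) <= 0)
        return LU2967_PASSED;
    return LU2967_UNTESTED;
}
```

The comparison is done on parsed integers rather than on the strings themselves, since a plain string comparison would order 358.11.1 before 358.2.1.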


            People

              ys Yang Sheng
              cliffw Cliff White (Inactive)