Details
- Bug
- Resolution: Fixed
- Blocker
- None
- Lustre 2.1.5
- Hyperion/LLNL - SWL testing
- 3
- 7232
Description
After several hours of SWL runs, multiple clients crashed.
First example:
2013-03-14 06:13:47 ------------[ cut here ]------------
2013-03-14 06:13:47 WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G W --------------- )
2013-03-14 06:13:47 Hardware name: XS23-TY
2013-03-14 06:13:47 list_del corruption. next->prev should be ffff8801aee8bc50, but was 0504000006000001
2013-03-14 06:13:47 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
2013-03-14 06:13:47 Pid: 3160, comm: ipoib Tainted: G W --------------- 2.6.32-279.22.1.el6.x86_64 #1
2013-03-14 06:13:47 Call Trace:
2013-03-14 06:13:47 [<ffffffff8106a2a7>] ? warn_slowpath_common+0x87/0xc0
2013-03-14 06:13:47 [<ffffffff8106a396>] ? warn_slowpath_fmt+0x46/0x50
2013-03-14 06:13:47 [<ffffffff81279f0d>] ? list_del+0x8d/0xa0
2013-03-14 06:13:47 [<ffffffffa0347619>] ? ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
2013-03-14 06:13:47 [<ffffffffa0347550>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
2013-03-14 06:13:47 [<ffffffff8108b370>] ? worker_thread+0x170/0x2a0
2013-03-14 06:13:47 [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
2013-03-14 06:13:47 [<ffffffff8108b200>] ? worker_thread+0x0/0x2a0
2013-03-14 06:13:47 [<ffffffff81090876>] ? kthread+0x96/0xa0
2013-03-14 06:13:47 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2013-03-14 06:13:47 [<ffffffff810907e0>] ? kthread+0x0/0xa0
2013-03-14 06:13:47 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14 06:13:47 ---[ end trace e1288d85056fd00d ]---
2013-03-14 06:13:47 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
2013-03-14 06:13:47 IP: [<ffffffff81279e9b>] list_del+0x1b/0xa0
2013-03-14 06:13:47 PGD 174282067 PUD 145d8f067 PMD 0
2013-03-14 06:13:47 Oops: 0000 [#1] SMP
2013-03-14 06:13:47 last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth1/statistics/tx_errors
2013-03-14 06:13:47 CPU 2
2013-03-14 06:13:47 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
2013-03-14 06:13:47
2013-03-14 06:13:47 Pid: 3160, comm: ipoib Tainted: G W --------------- 2.6.32-279.22.1.el6.x86_64 #1 Dell XS23-TY /XS23-TY
2013-03-14 06:13:47 RIP: 0010:[<ffffffff81279e9b>] [<ffffffff81279e9b>] list_del+0x1b/0xa0
2013-03-14 06:13:47 RSP: 0018:ffff880339053db0 EFLAGS: 00010046
2013-03-14 06:13:47 RAX: 0000000000000000 RBX: ffff8801b082f8d0 RCX: 0000000000004aef
2013-03-14 06:13:47 RDX: 0000000000000246 RSI: ffff8801bb8444d0 RDI: ffff8801b082f8d0
2013-03-14 06:13:47 RBP: ffff880339053dc0 R08: ffff8801b082f8d0 R09: 0000000000000000
2013-03-14 06:13:47 R10: ffff8801c0065680 R11: 0000000000000000 R12: ffff8801ba034020
2013-03-14 06:13:47 R13: 0000000000000246 R14: ffff8801ba697e80 R15: ffff8801ba0346e0
2013-03-14 06:13:47 FS: 0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
2013-03-14 06:13:47 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
2013-03-14 06:13:47 CR2: 0000000000000008 CR3: 00000001a4639000 CR4: 00000000000006e0
2013-03-14 06:13:47 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2013-03-14 06:13:47 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2013-03-14 06:13:47 Process ipoib (pid: 3160, threadinfo ffff880339052000, task ffff880339256040)
2013-03-14 06:13:47 Stack:
2013-03-14 06:13:47 0000000109b77ac5 ffff8801b082f8c0 ffff880339053e30 ffffffffa0347619
2013-03-14 06:13:47 <d> ffff88033c1acaa0 ffff880339256040 ffff8801ba0352e8 ffff8801ba034340
2013-03-14 06:13:47 <d> ffff880339053e30 ffffffff00000002 ffffe8fe62609a40 ffffe8fe62609a40
2013-03-14 06:13:47 Call Trace:
2013-03-14 06:13:47 [<ffffffffa0347619>] ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
2013-03-14 06:13:47 [<ffffffffa0347550>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
2013-03-14 06:13:47 [<ffffffff8108b370>] worker_thread+0x170/0x2a0
2013-03-14 06:13:47 [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
2013-03-14 06:13:47 [<ffffffff8108b200>] ? worker_thread+0x0/0x2a0
2013-03-14 06:13:47 [<ffffffff81090876>] kthread+0x96/0xa0
2013-03-14 06:13:47 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-03-14 06:13:47 [<ffffffff810907e0>] ? kthread+0x0/0xa0
2013-03-14 06:13:47 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14 06:13:47 Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48
2013-03-14 06:13:47 RIP [<ffffffff81279e9b>] list_del+0x1b/0xa0
2013-03-14 06:13:47 RSP <ffff880339053db0>
2013-03-14 06:13:47 CR2: 0000000000000008
Second example:
2013-03-14 07:15:50 ------------[ cut here ]------------
2013-03-14 07:15:50 WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: G W --------------- )
2013-03-14 07:15:50 Hardware name: XS23-TY
2013-03-14 07:15:50 list_add corruption. prev->next should be next (ffff8801af5ed2d0), but was ffff88033b3addd0. (prev=ffff8801ba25f2e8).
2013-03-14 07:15:50 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ahci i7core_edac edac_core ioatdma dca shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
2013-03-14 07:15:50 Pid: 4328, comm: kiblnd_sd_07 Tainted: G W --------------- 2.6.32-279.22.1.el6.x86_64 #1
2013-03-14 07:15:50 Call Trace:
2013-03-14 07:15:50 <IRQ> [<ffffffff8106a2a7>] ? warn_slowpath_common+0x87/0xc0
2013-03-14 07:15:50 [<ffffffff8106a396>] ? warn_slowpath_fmt+0x46/0x50
2013-03-14 07:15:50 [<ffffffff81279faf>] ? __list_add+0x8f/0xa0
2013-03-14 07:15:50 [<ffffffffa033fb7e>] ? ipoib_cm_destroy_tx+0x6e/0xc0 [ib_ipoib]
2013-03-14 07:15:50 [<ffffffffa0337b39>] ? ipoib_neigh_dtor+0x89/0xf0 [ib_ipoib]
2013-03-14 07:15:50 [<ffffffffa0337bc8>] ? ipoib_neigh_reclaim+0x28/0x30 [ib_ipoib]
2013-03-14 07:15:50 [<ffffffff810de635>] ? __rcu_process_callbacks+0x135/0x350
2013-03-14 07:15:50 [<ffffffff81012a69>] ? read_tsc+0x9/0x20
2013-03-14 07:15:50 [<ffffffff810de87b>] ? rcu_process_callbacks+0x2b/0x50
2013-03-14 07:15:50 [<ffffffff81072ac1>] ? __do_softirq+0xc1/0x1e0
2013-03-14 07:15:50 [<ffffffff81095760>] ? hrtimer_interrupt+0x140/0x250
2013-03-14 07:15:50 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
2013-03-14 07:15:50 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
2013-03-14 07:15:50 [<ffffffff810728a5>] ? irq_exit+0x85/0x90
2013-03-14 07:15:50 [<ffffffff814f2360>] ? smp_apic_timer_interrupt+0x70/0x9b
2013-03-14 07:15:50 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
2013-03-14 07:15:50 <EOI> [<ffffffff814ec947>] ? _spin_unlock_irqrestore+0x17/0x20
2013-03-14 07:15:50 [<ffffffffa0322a46>] ? mlx4_ib_poll_cq+0x2c6/0x7f0 [mlx4_ib]
2013-03-14 07:15:50 [<ffffffffa07a4478>] ? kiblnd_scheduler+0xf8/0x760 [ko2iblnd]
2013-03-14 07:15:50 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
2013-03-14 07:15:50 [<ffffffffa07a4380>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
2013-03-14 07:15:50 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2013-03-14 07:15:50 [<ffffffffa07a4380>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
2013-03-14 07:15:50 [<ffffffffa07a4380>] ? kiblnd_scheduler+0x0/0x760 [ko2iblnd]
2013-03-14 07:15:50 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14 07:15:50 ---[ end trace ceec6f0d4be48403 ]---
2013-03-14 07:15:50 general protection fault: 0000 [#1] SMP
2013-03-14 07:15:50 last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
2013-03-14 07:15:50 CPU 0
2013-03-14 07:15:50 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ahci i7core_edac edac_core ioatdma dca shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
2013-03-14 07:15:50
2013-03-14 07:15:50 Pid: 3208, comm: ipoib Tainted: G W --------------- 2.6.32-279.22.1.el6.x86_64 #1 Dell XS23-TY /XS23-TY
2013-03-14 07:15:50 RIP: 0010:[<ffffffff81279e9b>] [<ffffffff81279e9b>] list_del+0x1b/0xa0
2013-03-14 07:15:50 RSP: 0018:ffff8801bba1ddb0 EFLAGS: 00010046
2013-03-14 07:15:50 RAX: dead000000100100 RBX: ffff8801af5ed2d0 RCX: 000000000000b9d4
2013-03-14 07:15:50 RDX: 0000000000000246 RSI: ffff8801bfe979d0 RDI: ffff8801af5ed2d0
2013-03-14 07:15:50 RBP: ffff8801bba1ddc0 R08: ffff8801af5ed2d0 R09: 0000000000000000
2013-03-14 07:15:50 R10: ffff8801c0065880 R11: 0000000000000000 R12: ffff8801ba25e020
2013-03-14 07:15:50 R13: 0000000000000246 R14: ffff8801ba021400 R15: ffff8801ba25e6e0
2013-03-14 07:15:50 FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
2013-03-14 07:15:50 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
2013-03-14 07:15:50 CR2: 00002aaab80041f8 CR3: 0000000175615000 CR4: 00000000000006f0
2013-03-14 07:15:50 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2013-03-14 07:15:50 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2013-03-14 07:15:50 Process ipoib (pid: 3208, threadinfo ffff8801bba1c000, task ffff8801bb536080)
2013-03-14 07:15:50 Stack:
2013-03-14 07:15:50 0000000109f05306 ffff8801af5ed2c0 ffff8801bba1de30 ffffffffa0340619
2013-03-14 07:15:50 <d> ffffffff81a8d020 ffff8801bb536080 ffff8801ba25f2e8 ffff8801ba25e340
2013-03-14 07:15:50 <d> 00000078bba1de30 0000000000000000 ffff8801bba1de10 ffffe8fe62609a40
2013-03-14 07:15:50 Call Trace:
2013-03-14 07:15:50 [<ffffffffa0340619>] ipoib_cm_tx_reap+0xc9/0x510 [ib_ipoib]
2013-03-14 07:15:50 [<ffffffffa0340550>] ? ipoib_cm_tx_reap+0x0/0x510 [ib_ipoib]
2013-03-14 07:15:50 [<ffffffff8108b370>] worker_thread+0x170/0x2a0
2013-03-14 07:15:50 [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
2013-03-14 07:15:50 [<ffffffff8108b200>] ? worker_thread+0x0/0x2a0
2013-03-14 07:15:50 [<ffffffff81090876>] kthread+0x96/0xa0
2013-03-14 07:15:50 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-03-14 07:15:50 [<ffffffff810907e0>] ? kthread+0x0/0xa0
2013-03-14 07:15:50 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14 07:15:50 Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48
2013-03-14 07:15:50 RIP [<ffffffff81279e9b>] list_del+0x1b/0xa0
2013-03-14 07:15:50 RSP <ffff8801bba1ddb0>
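
Both oopses fault at list_del+0x1b when called from ipoib_cm_tx_reap. In the second one RAX holds dead000000100100, which matches the LIST_POISON1 value that list_del writes into the next pointer of an entry it has just removed on x86_64 kernels. That would be consistent with the same entry being removed twice, though it does not prove it. The following is a minimal userspace sketch in C of that mechanism, not the kernel or ib_ipoib code. struct list_head and the poison constants mirror the kernel's; checked_list_del and demo_double_delete are hypothetical names introduced here for illustration. As the traces suggest, the kernel's list_del warns and then carries on, whereas this sketch refuses the bad removal and returns an error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace model of the kernel's circular doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values as seen on x86_64 (assumes 64-bit pointers and a
 * poison delta of 0xdead000000000000). They are never dereferenced here. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000100100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000200200ULL)

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    struct list_head *next = head->next;

    next->prev = entry;
    entry->next = next;
    entry->prev = head;
    head->next = entry;
}

/* Hypothetical helper: remove entry, but refuse if it looks already
 * removed or if its neighbours no longer point back at it.
 * Returns 0 on success, -1 on a detected inconsistency. */
static int checked_list_del(struct list_head *entry)
{
    assert(entry != NULL);

    if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2) {
        fprintf(stderr, "list_del: %p was already removed\n", (void *)entry);
        return -1;
    }
    if (entry->prev->next != entry || entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption at %p\n", (void *)entry);
        return -1;
    }

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    /* Poison the removed entry so a later reuse is recognisable. */
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return 0;
}

/* Remove the same entry twice; the second attempt is rejected. */
int demo_double_delete(void)
{
    struct list_head head, tx;

    list_init(&head);
    list_add(&tx, &head);
    if (checked_list_del(&tx) != 0)
        return 1;                 /* first removal should succeed */
    return checked_list_del(&tx); /* second removal returns -1 */
}
```

Calling demo_double_delete() from a small main() exercises the double removal. In the real driver the usual remedy for this class of bug is to make sure every list manipulation on the shared list happens under the same lock; whether that applies here depends on the ib_ipoib code paths, which this ticket does not show.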