Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
Several sites have reported multiple OSS crashes with the following distinct signatures/stacks:
------------[ cut here ]------------
WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
Hardware name: PowerEdge R630
list_del corruption. prev->next should be ffff880d841d4350, but was 000001540100f30a
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 25008, comm: kiblnd_sd_03_00 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81074e47>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81074f36>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff812a01be>] ? list_del+0x6e/0xa0
 [<ffffffffa08a53b5>] ? lnet_md_unlink+0x45/0x340 [lnet]
 [<ffffffffa08a6d9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
 [<ffffffffa08a6f1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
 [<ffffffffa08a7820>] ? lnet_ptl_match_md+0x280/0x870 [lnet]
 [<ffffffffa08b9d46>] ? lnet_nid2peer_locked+0x66/0x4b0 [lnet]
 [<ffffffffa08af0fb>] ? lnet_parse+0xb9b/0x18c0 [lnet]
 [<ffffffffa063b9f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa1e86b4b>] ? kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
 [<ffffffffa1e87833>] ? kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
 [<ffffffffa1e879e2>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
 [<ffffffffa1e87d9a>] ? kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa1e87a60>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
 [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
---[ end trace 6ffab147a7d87fa2 ]---
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff812a016b>] list_del+0x1b/0xa0
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 18
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 25008, comm: kiblnd_sd_03_00 Tainted: G W --------------- 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
RIP: 0010:[<ffffffff812a016b>] [<ffffffff812a016b>] list_del+0x1b/0xa0
RSP: 0018:ffff88204cd59ae0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff880d841d4350 RCX: 000000000000cc9d
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000009
RBP: ffff88204cd59af0 R08: 000000000002fd34 R09: 0000000000000000
R10: 000000000000000f R11: 0000000000000006 R12: ffff880f4679b1c0
R13: 00000000000044e0 R14: 0000000000000000 R15: ffff88204cd59cf0
FS: 0000000000000000(0000) GS:ffff880062120000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000001066af8000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kiblnd_sd_03_00 (pid: 25008, threadinfo ffff88204cd58000, task ffff882063470040)
Stack:
 ffff880d841d4340 ffff880d841d4340 ffff88204cd59b10 ffffffffa08a53b5
<d> ffff880d841d4340 ffff8808caa1f000 ffff88204cd59b90 ffffffffa08a6d9f
<d> 00000000000000e0 0000000000000000 0000000000000001 00056cc200000000
Call Trace:
 [<ffffffffa08a53b5>] lnet_md_unlink+0x45/0x340 [lnet]
 [<ffffffffa08a6d9f>] lnet_try_match_md+0x22f/0x310 [lnet]
 [<ffffffffa08a6f1c>] lnet_mt_match_md+0x9c/0x1c0 [lnet]
 [<ffffffffa08a7820>] lnet_ptl_match_md+0x280/0x870 [lnet]
 [<ffffffffa08b9d46>] ? lnet_nid2peer_locked+0x66/0x4b0 [lnet]
 [<ffffffffa08af0fb>] lnet_parse+0xb9b/0x18c0 [lnet]
 [<ffffffffa063b9f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa1e86b4b>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
 [<ffffffffa1e87833>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
 [<ffffffffa1e879e2>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
 [<ffffffffa1e87d9a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa1e87a60>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: e8 38 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48
RIP [<ffffffff812a016b>] list_del+0x1b/0xa0
 RSP <ffff88204cd59ae0>
CR2: 0000000000000008
or
BUG: unable to handle kernel paging request at 000000004db05079
IP: [<ffffffff811c49d9>] __brelse+0x9/0x40
PGD 105ff8c067 PUD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:01.0/0000:81:00.0/host2/port-2:0/end_device-2:0/target2:0:0/2:0:0:32/state
CPU 21
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dlm sctp libcrc32c configfs vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 44291, comm: ll_ost_io04_003 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
RIP: 0010:[<ffffffff811c49d9>] [<ffffffff811c49d9>] __brelse+0x9/0x40
RSP: 0018:ffff88102447d6b0 EFLAGS: 00010202
RAX: 000000000000000b RBX: ffff881e925a6e30 RCX: 00000000000000b0
RDX: 0000000000000000 RSI: ffff881443e002e8 RDI: 000000004db05019
RBP: ffff88102447d6b0 R08: ffff88102447d780 R09: 0000000000000009
R10: 0000000000000001 R11: 00000000000000a5 R12: 0000000000000002
R13: 0000000000000002 R14: 0000000000000002 R15: ffff880e81050410
FS: 0000000000000000(0000) GS:ffff8810b8940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000004db05079 CR3: 0000001064c32000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost_io04_003 (pid: 44291, threadinfo ffff88102447c000, task ffff88102447b520)
Stack:
 ffff88102447d6e0 ffffffffa18ff332 ffff881e925a6dc0 000000000000025c
<d> ffff881e925a6dc0 ffff880e81050340 ffff88102447d770 ffffffffa18ffd9d
<d> 00000000000003e8 ffff880200000002 ffff88102447d780 ffffffffa1a76870
Call Trace:
 [<ffffffffa18ff332>] ldiskfs_ext_drop_refs+0x32/0x50 [ldiskfs]
 [<ffffffffa18ffd9d>] ldiskfs_ext_walk_space+0x14d/0x310 [ldiskfs]
 [<ffffffffa1a76870>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
 [<ffffffffa1a765dc>] osd_ldiskfs_map_nblocks+0xcc/0xf0 [osd_ldiskfs]
 [<ffffffffa1a7671c>] osd_ldiskfs_map_ext_inode_pages+0x11c/0x270 [osd_ldiskfs]
 [<ffffffffa1a76f65>] osd_ldiskfs_map_inode_pages.clone.0+0x25/0x30 [osd_ldiskfs]
 [<ffffffffa1a78b96>] osd_write_commit+0x2f6/0x610 [osd_ldiskfs]
 [<ffffffffa1c87fc4>] ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa1c8ad45>] ofd_commitrw+0x5d5/0xbc0 [ofd]
 [<ffffffffa06a40d1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa1c1e1ad>] obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa1c28151>] ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff8129717d>] ? pointer+0x8d/0x6a0
 [<ffffffff8128d9e9>] ? cpumask_next_and+0x29/0x50
 [<ffffffffa0ad5860>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa1c2e7bf>] ost_handle+0x43af/0x44e0 [ost]
 [<ffffffffa0b1e78b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
 [<ffffffffa0b258d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa057f4fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0b1e289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0b2805d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
 [<ffffffffa0b27570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: dd 81 fb ff 01 00 00 7e 91 48 83 c4 10 5b 41 5c 41 5d 41 5e c9 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <8b> 47 60 85 c0 74 10 f0 ff 4f 60 c9 c3 66 2e 0f 1f 84 00 00 00
RIP [<ffffffff811c49d9>] __brelse+0x9/0x40
 RSP <ffff88102447d6b0>
CR2: 000000004db05079
or
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:03.0/0000:82:00.0/infiniband_mad/umad0/port
CPU 6
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dlm configfs vfat fat usb_storage mpt2sas mptctl mptbase sctp libcrc32c dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core lpc_ich mfd_core shpchp tg3 ptp pps_core sg ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: configfs]
Pid: 44561, comm: ll_ost_io01_007 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
RIP: 0010:[<ffffffffa1d995c9>] [<ffffffffa1d995c9>] ldiskfs_can_extents_be_merged+0x9/0xb0 [ldiskfs]
RSP: 0018:ffff882038c61550 EFLAGS: 00010246
RAX: 0000000000000010 RBX: 0000000000000002 RCX: 5a5a5a5a5a5a5a5a
RDX: ffff882038c616a0 RSI: 5a5a5a5a5a5a5a5a RDI: ffff880eb4e2c620
RBP: ffff882038c61550 R08: 0000000000000000 R09: ffff880f17dd9798
R10: 0000000000000000 R11: 0000000000000002 R12: ffff880e279f5840
R13: ffff880e279f58b0 R14: 0000000000000002 R15: 5a5a5a5a5a5a5a5a
FS: 0000000000000000(0000) GS:ffff880062060000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f08013da000 CR3: 000000205023f000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost_io01_007 (pid: 44561, threadinfo ffff882038c60000, task ffff882038c5cab0)
Stack:
 ffff882038c61630 ffffffffa1d9c77d ffff882038c61590 ffffffff811baf4b
<d> 0000000000000002 5a5a5a5a5a5a5a5a ffff882050017000 ffff880eb4e2c620
<d> ffff882038c61630 ffffffffa1db6a61 ffff880e279f5840 ffff880eb4e2c620
Call Trace:
 [<ffffffffa1d9c77d>] ldiskfs_ext_insert_extent+0x81d/0x1190 [ldiskfs]
 [<ffffffff811baf4b>] ? __mark_inode_dirty+0x3b/0x160
 [<ffffffffa1db6a61>] ? ldiskfs_mb_new_blocks+0x241/0x630 [ldiskfs]
 [<ffffffffa2036e49>] ldiskfs_ext_new_extent_cb+0x5d9/0x6d0 [osd_ldiskfs]
 [<ffffffffa1d9bd92>] ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
 [<ffffffffa2036870>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
 [<ffffffffa20365dc>] osd_ldiskfs_map_nblocks+0xcc/0xf0 [osd_ldiskfs]
 [<ffffffffa203671c>] osd_ldiskfs_map_ext_inode_pages+0x11c/0x270 [osd_ldiskfs]
 [<ffffffffa2036f65>] osd_ldiskfs_map_inode_pages.clone.0+0x25/0x30 [osd_ldiskfs]
 [<ffffffffa2038b96>] osd_write_commit+0x2f6/0x610 [osd_ldiskfs]
 [<ffffffffa2247fc4>] ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa224ad45>] ofd_commitrw+0x5d5/0xbc0 [ofd]
 [<ffffffffa099d0d1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa21de1ad>] obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa21e8151>] ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffffa0e4aeb5>] ? null_authorize+0x75/0x100 [ptlrpc]
 [<ffffffffa0e08a4e>] ? ptlrpc_send_reply+0x28e/0x7f0 [ptlrpc]
 [<ffffffff8128d9e9>] ? cpumask_next_and+0x29/0x50
 [<ffffffffa0dce860>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa21ee7bf>] ost_handle+0x43af/0x44e0 [ost]
 [<ffffffffa0e1778b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
 [<ffffffffa0e1e8d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa05e34fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0e17289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0e2105d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
 [<ffffffffa0e20570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: 01 89 f0 c9 c3 0f 1f 44 00 00 41 83 c0 01 44 89 81 a8 03 00 00 eb e2 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <0f> b7 4e 04 0f b7 42 04 66 81 f9 00 80 40 0f 97 c7 66 3d 00 80
RIP [<ffffffffa1d995c9>] ldiskfs_can_extents_be_merged+0x9/0xb0 [ldiskfs]
 RSP <ffff882038c61550>
After analyzing almost all of these crash dumps, I found that:
_ all of these different crashes are caused by corruption of a <size-128> slab object
_ the corruption is an overrun from the previous object in the same slab
_ the previous object is a [ldiskfs,ext4]_ext_path[] array, used to walk an inode's extent tree, that was allocated too small (sized from the tree depth evaluated at allocation time) for a tree whose depth then grew concurrently
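
To make the failure mode concrete, here is a minimal userspace sketch of the sizing problem, not the actual ldiskfs/ext4 code. The struct `demo_ext_path` and all `demo_*` helpers are hypothetical names introduced for illustration. The `depth + 2` sizing is an assumption modelled on how the path array is conventionally allocated. The sketch only computes whether a walk over levels 0..current depth would stay inside an array sized for an older depth; it never performs the out-of-bounds access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one level of an extent-tree path. */
struct demo_ext_path {
    unsigned long long block; /* physical block of this tree level */
    unsigned short depth;     /* depth recorded when the level was read */
    void *ext;                /* extent entry, if this is a leaf level */
    void *idx;                /* index entry, if this is an interior level */
    void *hdr;                /* extent header of this level */
    void *bh;                 /* buffer backing this level */
};

/* Assumption: the array is sized as (depth + 2) entries at allocation time. */
static size_t demo_path_entries(unsigned alloc_depth)
{
    return (size_t)alloc_depth + 2;
}

/* Bytes requested from the allocator for a tree of the given depth. */
static size_t demo_path_bytes(unsigned alloc_depth)
{
    return demo_path_entries(alloc_depth) * sizeof(struct demo_ext_path);
}

/*
 * A walk over a tree of depth cur_depth touches levels 0..cur_depth,
 * i.e. cur_depth + 1 entries. Returns 1 when that stays within an array
 * that was sized for alloc_depth, 0 when it would run past the end.
 */
static int demo_walk_fits(unsigned alloc_depth, unsigned cur_depth)
{
    return (size_t)cur_depth + 1 <= demo_path_entries(alloc_depth);
}

/* Bytes the walk would touch beyond the end of the allocation. */
static size_t demo_overrun_bytes(unsigned alloc_depth, unsigned cur_depth)
{
    size_t touched = (size_t)cur_depth + 1;
    size_t entries = demo_path_entries(alloc_depth);

    if (touched <= entries)
        return 0;
    return (touched - entries) * sizeof(struct demo_ext_path);
}

/* Bounds-checked accessor: NULL instead of a pointer past the array. */
static struct demo_ext_path *demo_path_level(struct demo_ext_path *path,
                                             unsigned alloc_depth,
                                             unsigned level)
{
    assert(path != NULL);
    if ((size_t)level >= demo_path_entries(alloc_depth))
        return NULL;
    return &path[level];
}
```

Under these assumptions, an array sized for depth 0 holds two entries, so a walk that re-reads the depth after the tree has grown to depth 2 would touch a third entry, which lands in whatever object follows it in the slab. That matches the pattern described above, where the victim is the next <size-128> object rather than the path array itself.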