
Overrun in generic <size-128> kmem_cache Slabs causing OSS to crash

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Severity: 3

    Description

      Some sites have reported multiple OSS crashes with the following different signatures/stacks:

      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
      Hardware name: PowerEdge R630
      list_del corruption. prev->next should be ffff880d841d4350, but was 000001540100f30a
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) vfat fat 
      usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3
       jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_cor
      e lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      Pid: 25008, comm: kiblnd_sd_03_00 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1
      Call Trace:
       [<ffffffff81074e47>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff81074f36>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffff812a01be>] ? list_del+0x6e/0xa0
       [<ffffffffa08a53b5>] ? lnet_md_unlink+0x45/0x340 [lnet]
       [<ffffffffa08a6d9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
       [<ffffffffa08a6f1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
       [<ffffffffa08a7820>] ? lnet_ptl_match_md+0x280/0x870 [lnet]
       [<ffffffffa08b9d46>] ? lnet_nid2peer_locked+0x66/0x4b0 [lnet]
       [<ffffffffa08af0fb>] ? lnet_parse+0xb9b/0x18c0 [lnet]
       [<ffffffffa063b9f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa1e86b4b>] ? kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
       [<ffffffffa1e87833>] ? kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
       [<ffffffffa1e879e2>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
       [<ffffffffa1e87d9a>] ? kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
       [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
       [<ffffffffa1e87a60>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
       [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c28a>] ? child_rip+0xa/0x20
       [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      ---[ end trace 6ffab147a7d87fa2 ]---
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      IP: [<ffffffff812a016b>] list_del+0x1b/0xa0
      PGD 0 
      Oops: 0000 [#1] SMP 
      last sysfs file: /sys/devices/system/cpu/online
      CPU 18 
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      
      Pid: 25008, comm: kiblnd_sd_03_00 Tainted: G        W  ---------------    2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
      RIP: 0010:[<ffffffff812a016b>]  [<ffffffff812a016b>] list_del+0x1b/0xa0
      RSP: 0018:ffff88204cd59ae0  EFLAGS: 00010286
      RAX: 0000000000000000 RBX: ffff880d841d4350 RCX: 000000000000cc9d
      RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000009
      RBP: ffff88204cd59af0 R08: 000000000002fd34 R09: 0000000000000000
      R10: 000000000000000f R11: 0000000000000006 R12: ffff880f4679b1c0
      R13: 00000000000044e0 R14: 0000000000000000 R15: ffff88204cd59cf0
      FS:  0000000000000000(0000) GS:ffff880062120000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000008 CR3: 0000001066af8000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kiblnd_sd_03_00 (pid: 25008, threadinfo ffff88204cd58000, task ffff882063470040)
      Stack:
       ffff880d841d4340 ffff880d841d4340 ffff88204cd59b10 ffffffffa08a53b5
      <d> ffff880d841d4340 ffff8808caa1f000 ffff88204cd59b90 ffffffffa08a6d9f
      <d> 00000000000000e0 0000000000000000 0000000000000001 00056cc200000000
      Call Trace:
       [<ffffffffa08a53b5>] lnet_md_unlink+0x45/0x340 [lnet]
       [<ffffffffa08a6d9f>] lnet_try_match_md+0x22f/0x310 [lnet]
       [<ffffffffa08a6f1c>] lnet_mt_match_md+0x9c/0x1c0 [lnet]
       [<ffffffffa08a7820>] lnet_ptl_match_md+0x280/0x870 [lnet]
       [<ffffffffa08b9d46>] ? lnet_nid2peer_locked+0x66/0x4b0 [lnet]
       [<ffffffffa08af0fb>] lnet_parse+0xb9b/0x18c0 [lnet]
       [<ffffffffa063b9f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa1e86b4b>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
       [<ffffffffa1e87833>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
       [<ffffffffa1e879e2>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
       [<ffffffffa1e87d9a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
       [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
       [<ffffffffa1e87a60>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
       [<ffffffff8109e78e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Code: e8 38 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48 
      RIP  [<ffffffff812a016b>] list_del+0x1b/0xa0
       RSP <ffff88204cd59ae0>
      CR2: 0000000000000008
      

      or

      BUG: unable to handle kernel paging request at 000000004db05079
      IP: [<ffffffff811c49d9>] __brelse+0x9/0x40
      PGD 105ff8c067 PUD 0 
      Oops: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:80/0000:80:01.0/0000:81:00.0/host2/port-2:0/end_device-2:0/target2:0:0/2:0:0:32/state
      CPU 21 
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dlm sctp 
      libcrc32c configfs vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_u
      mad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_co
      re sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [las
      t unloaded: speedstep_lib]
      
      Pid: 44291, comm: ll_ost_io04_003 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
      RIP: 0010:[<ffffffff811c49d9>]  [<ffffffff811c49d9>] __brelse+0x9/0x40
      RSP: 0018:ffff88102447d6b0  EFLAGS: 00010202
      RAX: 000000000000000b RBX: ffff881e925a6e30 RCX: 00000000000000b0
      RDX: 0000000000000000 RSI: ffff881443e002e8 RDI: 000000004db05019
      RBP: ffff88102447d6b0 R08: ffff88102447d780 R09: 0000000000000009
      R10: 0000000000000001 R11: 00000000000000a5 R12: 0000000000000002
      R13: 0000000000000002 R14: 0000000000000002 R15: ffff880e81050410
      FS:  0000000000000000(0000) GS:ffff8810b8940000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 000000004db05079 CR3: 0000001064c32000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ll_ost_io04_003 (pid: 44291, threadinfo ffff88102447c000, task ffff88102447b520)
      Stack:
       ffff88102447d6e0 ffffffffa18ff332 ffff881e925a6dc0 000000000000025c
      <d> ffff881e925a6dc0 ffff880e81050340 ffff88102447d770 ffffffffa18ffd9d
      <d> 00000000000003e8 ffff880200000002 ffff88102447d780 ffffffffa1a76870
      Call Trace:
       [<ffffffffa18ff332>] ldiskfs_ext_drop_refs+0x32/0x50 [ldiskfs]
       [<ffffffffa18ffd9d>] ldiskfs_ext_walk_space+0x14d/0x310 [ldiskfs]
       [<ffffffffa1a76870>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
       [<ffffffffa1a765dc>] osd_ldiskfs_map_nblocks+0xcc/0xf0 [osd_ldiskfs]
       [<ffffffffa1a7671c>] osd_ldiskfs_map_ext_inode_pages+0x11c/0x270 [osd_ldiskfs]
       [<ffffffffa1a76f65>] osd_ldiskfs_map_inode_pages.clone.0+0x25/0x30 [osd_ldiskfs]
       [<ffffffffa1a78b96>] osd_write_commit+0x2f6/0x610 [osd_ldiskfs]
       [<ffffffffa1c87fc4>] ofd_commitrw_write+0x684/0x11b0 [ofd]
       [<ffffffffa1c8ad45>] ofd_commitrw+0x5d5/0xbc0 [ofd]
       [<ffffffffa06a40d1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
       [<ffffffffa1c1e1ad>] obd_commitrw+0x11d/0x390 [ost]
       [<ffffffffa1c28151>] ost_brw_write+0xea1/0x15d0 [ost]
       [<ffffffff8129717d>] ? pointer+0x8d/0x6a0
       [<ffffffff8128d9e9>] ? cpumask_next_and+0x29/0x50
       [<ffffffffa0ad5860>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa1c2e7bf>] ost_handle+0x43af/0x44e0 [ost]
       [<ffffffffa0b1e78b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
       [<ffffffffa0b258d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa057f4fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
       [<ffffffffa0b1e289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
       [<ffffffffa0b2805d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
       [<ffffffffa0b27570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
       [<ffffffff8109e78e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Code: dd 81 fb ff 01 00 00 7e 91 48 83 c4 10 5b 41 5c 41 5d 41 5e c9 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <8b> 47 60 85 c0 74 10 f0 ff 4f 60 c9 c3 66 2e 0f 1f 84 00 00 00 
      RIP  [<ffffffff811c49d9>] __brelse+0x9/0x40
       RSP <ffff88102447d6b0>
      CR2: 000000004db05079
      

      or

      general protection fault: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:80/0000:80:03.0/0000:82:00.0/infiniband_mad/umad0/port
      CPU 6 
      Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dlm confi
      gfs vfat fat usb_storage mpt2sas mptctl mptbase sctp libcrc32c dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_u
      mad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_co
      re lpc_ich mfd_core shpchp tg3 ptp pps_core sg ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [las
      t unloaded: configfs]
      
      Pid: 44561, comm: ll_ost_io01_007 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
      RIP: 0010:[<ffffffffa1d995c9>]  [<ffffffffa1d995c9>] ldiskfs_can_extents_be_merged+0x9/0xb0 [ldiskfs]
      RSP: 0018:ffff882038c61550  EFLAGS: 00010246
      RAX: 0000000000000010 RBX: 0000000000000002 RCX: 5a5a5a5a5a5a5a5a
      RDX: ffff882038c616a0 RSI: 5a5a5a5a5a5a5a5a RDI: ffff880eb4e2c620
      RBP: ffff882038c61550 R08: 0000000000000000 R09: ffff880f17dd9798
      R10: 0000000000000000 R11: 0000000000000002 R12: ffff880e279f5840
      R13: ffff880e279f58b0 R14: 0000000000000002 R15: 5a5a5a5a5a5a5a5a
      FS:  0000000000000000(0000) GS:ffff880062060000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00007f08013da000 CR3: 000000205023f000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ll_ost_io01_007 (pid: 44561, threadinfo ffff882038c60000, task ffff882038c5cab0)
      Stack:
       ffff882038c61630 ffffffffa1d9c77d ffff882038c61590 ffffffff811baf4b
      <d> 0000000000000002 5a5a5a5a5a5a5a5a ffff882050017000 ffff880eb4e2c620
      <d> ffff882038c61630 ffffffffa1db6a61 ffff880e279f5840 ffff880eb4e2c620
      Call Trace:
       [<ffffffffa1d9c77d>] ldiskfs_ext_insert_extent+0x81d/0x1190 [ldiskfs]
       [<ffffffff811baf4b>] ? __mark_inode_dirty+0x3b/0x160
       [<ffffffffa1db6a61>] ? ldiskfs_mb_new_blocks+0x241/0x630 [ldiskfs]
       [<ffffffffa2036e49>] ldiskfs_ext_new_extent_cb+0x5d9/0x6d0 [osd_ldiskfs]
       [<ffffffffa1d9bd92>] ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
       [<ffffffffa2036870>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
       [<ffffffffa20365dc>] osd_ldiskfs_map_nblocks+0xcc/0xf0 [osd_ldiskfs]
       [<ffffffffa203671c>] osd_ldiskfs_map_ext_inode_pages+0x11c/0x270 [osd_ldiskfs]
       [<ffffffffa2036f65>] osd_ldiskfs_map_inode_pages.clone.0+0x25/0x30 [osd_ldiskfs]
       [<ffffffffa2038b96>] osd_write_commit+0x2f6/0x610 [osd_ldiskfs]
       [<ffffffffa2247fc4>] ofd_commitrw_write+0x684/0x11b0 [ofd]
       [<ffffffffa224ad45>] ofd_commitrw+0x5d5/0xbc0 [ofd]
       [<ffffffffa099d0d1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
       [<ffffffffa21de1ad>] obd_commitrw+0x11d/0x390 [ost]
       [<ffffffffa21e8151>] ost_brw_write+0xea1/0x15d0 [ost]
       [<ffffffffa0e4aeb5>] ? null_authorize+0x75/0x100 [ptlrpc]
       [<ffffffffa0e08a4e>] ? ptlrpc_send_reply+0x28e/0x7f0 [ptlrpc]
       [<ffffffff8128d9e9>] ? cpumask_next_and+0x29/0x50
       [<ffffffffa0dce860>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa21ee7bf>] ost_handle+0x43af/0x44e0 [ost]
       [<ffffffffa0e1778b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
       [<ffffffffa0e1e8d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa05e34fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
       [<ffffffffa0e17289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
       [<ffffffffa0e2105d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
       [<ffffffffa0e20570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
       [<ffffffff8109e78e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Code: 01 89 f0 c9 c3 0f 1f 44 00 00 41 83 c0 01 44 89 81 a8 03 00 00 eb e2 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <0f> b7 4e 04 0f b7 42 04 66 81 f9 00 80 40 0f 97 c7 66 3d 00 80 
      RIP  [<ffffffffa1d995c9>] ldiskfs_can_extents_be_merged+0x9/0xb0 [ldiskfs]
       RSP <ffff882038c61550>
      

      After analyzing almost all of these crash dumps, I have found that:
      _ all of these different crashes are caused by corruption of a <size-128> slab object
      _ the corruption is an overrun from the previous object in the same slab
      _ the previous object is a [ldiskfs,ext4]_ext_path[] array, used to walk an inode's extent tree, that was allocated too small (sized from the tree depth evaluated at allocation time) while the tree depth grew concurrently

          Activity

            This patch was landed for rhel7.x as commit v2_8_52_0-42-g655255b651. The comment from Ted was:

            The problem with the max possible extent depth assumption is that this assumes non-pathological trees. Unfortunately, at the moment we don't ever shrink the extent tree as we delete entries from the tree, and we aren't obeying the requirements of a formal B+ tree, which is that all nodes (except for a trivial tree consisting of a single leaf node at the root) must be at least half-full. So while it is highly unlikely, it is possible to create highly pathological trees that could potentially be deeper than five deep.

            They are extremely unlikely to happen in practice, granted, but if we are relying on this to prevent array bound overflow attacks, a malicious attacker could potentially be very happy to arrange such a situation.

            So at least in the short run, we may be better off finding all of the places where we drop i_data_sem after we've allocated the struct path array, and after we grab it again for writing, double check to see if we need to reallocate it. For performance reasons I'm happy always
            allocating an extra array element or two to minimize the need to do the reallocation, but for correctness's sake it would be good if we could easily test the code path where we need to do a reallocation, as well as demonstrate that we do the right thing if the reallocation fails...

            However, the functions affected by this patch (ext4_ext_find_extent() and ext4_find_extent()) also have a check that the 5-level tree limit is never exceeded (from commit v4.17-rc4-30-gbc890a602471, backported to at least RHEL7.7 but not to 7.3):

                    if (depth < 0 || depth > EXT4_MAX_EXTENT_DEPTH) {
                            EXT4_ERROR_INODE(inode, "inode has invalid extent depth: %d",
                                             depth);
                            return ERR_PTR(-EFSCORRUPTED);
                    }
            

            so it is impossible for the tree to exceed EXT4_MAX_EXTENT_DEPTH, contrary to Ted's concern, and the "s_max_ext_tree_depth" calculation is at least not needed. The "depth was changed" chunk was removed in commit v3.17-rc2-138-g10809df84a4d, but not backported to RHEL7. The "(ppos > depth))" check was removed in commit v4.6-rc4-18-g816cd71b0c72.

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            Removing fixversion to decouple upstream work from fix for 2.9


            A discussion came up on the list this week, and apparently the "maximum depth" of the extent tree is not actually fixed as we thought because the ext4 extent code does not balance the extent tree. This means in certain extremely rare situations the extent tree may become unbalanced and grow beyond the calculated s_max_ext_tree_depth. This is very unlikely to happen because the Lustre clients are already buffering and merging the data blocks, and we do not (yet) allow fallocate() to allocate or punch arbitrary block numbers. See http://marc.info/?l=linux-ext4&m=146790955304474&w=4 for more details.

            adilger Andreas Dilger added a comment

            I have pushed a renewed version of my patch to the upstream kernel ext4 tree, via emails to linux-ext4@vger.kernel.org, under the title "ext4: always pre-allocate max depth for path".

            bfaccini Bruno Faccini (Inactive) added a comment

            Reopening this issue for Bruno to submit this patch to the upstream kernel as a simplified and more robust version of commit 10809df84.

            Bruno, from your comments, even the upstream kernel is not totally robust in this case and would benefit from this patch. If that is incorrect, please close again.

            adilger Andreas Dilger added a comment

            Landed for 2.9.0

            jgmitter Joseph Gmitter (Inactive) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19349/
            Subject: LU-7980 ldiskfs: always pre-allocate max depth for path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 655255b651d80d115fbde6729bd58e22afce6483

            gerrit Gerrit Updater added a comment

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/19349
            Subject: LU-7980 ldiskfs: always pre-allocate max depth for path
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 874cfbd471dff33e2a0a1ab9626a25fd7d3f3eff

            gerrit Gerrit Updater added a comment

            I have identified that recent kernels (>= v3.17-rc2-138-g10809df8) already contain ext4 patches that try to implement safe _ext_path[] array re-sizing/re-allocation, like:
            _ commit 705912ca9 ("ext4: teach ext4_ext_find_extent() to free path on error")
            _ commit dfe508093 ("ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code")
            _ commit 10809df84 ("ext4: teach ext4_ext_find_extent() to realloc path if necessary")
            but even with them, not all cases/scenarios of concurrent depth growth appear to be addressed.

            So the new approach I have implemented avoids any need to re-allocate _ext_path[] upon an extent tree depth change, by allocating it up front with the maximum possible depth dimension. That maximum is evaluated at mount time from the filesystem's basic internal data-structure sizes/units/limits: max file size, block size, number of index entries that fit in a block, and so on.

            I should point out that this is not overkill or costly: the present maximum extent tree depth is 5, requiring 5x48 (or 56 in ldiskfs) = 240 (280) bytes allocated once, for the full lifetime of the _ext_path[] array, versus possibly re-allocating 3-5 entries multiple times.

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 8