[LU-7980] Overrun in generic <size-128> kmem_cache Slabs causing OSS to crash Created: 04/Apr/16  Updated: 11/Sep/21  Resolved: 11/Sep/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-8334 OSS lockup Resolved
is related to LU-8362 page fault: exception RIP: lnet_mt_ma... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Several sites have reported multiple OSS crashes with the following distinct signatures/stacks:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
Hardware name: PowerEdge R630
list_del corruption. prev->next should be ffff880d841d4350, but was 000001540100f30a
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) vfat fat 
usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3
 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_cor
e lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 25008, comm: kiblnd_sd_03_00 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81074e47>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81074f36>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff812a01be>] ? list_del+0x6e/0xa0
 [<ffffffffa08a53b5>] ? lnet_md_unlink+0x45/0x340 [lnet]
 [<ffffffffa08a6d9f>] ? lnet_try_match_md+0x22f/0x310 [lnet]
 [<ffffffffa08a6f1c>] ? lnet_mt_match_md+0x9c/0x1c0 [lnet]
 [<ffffffffa08a7820>] ? lnet_ptl_match_md+0x280/0x870 [lnet]
 [<ffffffffa08b9d46>] ? lnet_nid2peer_locked+0x66/0x4b0 [lnet]
 [<ffffffffa08af0fb>] ? lnet_parse+0xb9b/0x18c0 [lnet]
 [<ffffffffa063b9f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa1e86b4b>] ? kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
 [<ffffffffa1e87833>] ? kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
 [<ffffffffa1e879e2>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
 [<ffffffffa1e87d9a>] ? kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa1e87a60>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
 [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
---[ end trace 6ffab147a7d87fa2 ]---
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff812a016b>] list_del+0x1b/0xa0
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/online
CPU 18 
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 25008, comm: kiblnd_sd_03_00 Tainted: G        W  ---------------    2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
RIP: 0010:[<ffffffff812a016b>]  [<ffffffff812a016b>] list_del+0x1b/0xa0
RSP: 0018:ffff88204cd59ae0  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff880d841d4350 RCX: 000000000000cc9d
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000009
RBP: ffff88204cd59af0 R08: 000000000002fd34 R09: 0000000000000000
R10: 000000000000000f R11: 0000000000000006 R12: ffff880f4679b1c0
R13: 00000000000044e0 R14: 0000000000000000 R15: ffff88204cd59cf0
FS:  0000000000000000(0000) GS:ffff880062120000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000001066af8000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kiblnd_sd_03_00 (pid: 25008, threadinfo ffff88204cd58000, task ffff882063470040)
Stack:
 ffff880d841d4340 ffff880d841d4340 ffff88204cd59b10 ffffffffa08a53b5
<d> ffff880d841d4340 ffff8808caa1f000 ffff88204cd59b90 ffffffffa08a6d9f
<d> 00000000000000e0 0000000000000000 0000000000000001 00056cc200000000
Call Trace:
 [<ffffffffa08a53b5>] lnet_md_unlink+0x45/0x340 [lnet]
 [<ffffffffa08a6d9f>] lnet_try_match_md+0x22f/0x310 [lnet]
 [<ffffffffa08a6f1c>] lnet_mt_match_md+0x9c/0x1c0 [lnet]
 [<ffffffffa08a7820>] lnet_ptl_match_md+0x280/0x870 [lnet]
 [<ffffffffa08b9d46>] ? lnet_nid2peer_locked+0x66/0x4b0 [lnet]
 [<ffffffffa08af0fb>] lnet_parse+0xb9b/0x18c0 [lnet]
 [<ffffffffa063b9f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa1e86b4b>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
 [<ffffffffa1e87833>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
 [<ffffffffa1e879e2>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
 [<ffffffffa1e87d9a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa1e87a60>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: e8 38 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48 
RIP  [<ffffffff812a016b>] list_del+0x1b/0xa0
 RSP <ffff88204cd59ae0>
CR2: 0000000000000008

or

BUG: unable to handle kernel paging request at 000000004db05079
IP: [<ffffffff811c49d9>] __brelse+0x9/0x40
PGD 105ff8c067 PUD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:80/0000:80:01.0/0000:81:00.0/host2/port-2:0/end_device-2:0/target2:0:0/2:0:0:32/state
CPU 21 
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dlm sctp 
libcrc32c configfs vfat fat usb_storage mpt2sas mptctl mptbase dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_u
mad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_co
re sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [las
t unloaded: speedstep_lib]

Pid: 44291, comm: ll_ost_io04_003 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
RIP: 0010:[<ffffffff811c49d9>]  [<ffffffff811c49d9>] __brelse+0x9/0x40
RSP: 0018:ffff88102447d6b0  EFLAGS: 00010202
RAX: 000000000000000b RBX: ffff881e925a6e30 RCX: 00000000000000b0
RDX: 0000000000000000 RSI: ffff881443e002e8 RDI: 000000004db05019
RBP: ffff88102447d6b0 R08: ffff88102447d780 R09: 0000000000000009
R10: 0000000000000001 R11: 00000000000000a5 R12: 0000000000000002
R13: 0000000000000002 R14: 0000000000000002 R15: ffff880e81050410
FS:  0000000000000000(0000) GS:ffff8810b8940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000004db05079 CR3: 0000001064c32000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost_io04_003 (pid: 44291, threadinfo ffff88102447c000, task ffff88102447b520)
Stack:
 ffff88102447d6e0 ffffffffa18ff332 ffff881e925a6dc0 000000000000025c
<d> ffff881e925a6dc0 ffff880e81050340 ffff88102447d770 ffffffffa18ffd9d
<d> 00000000000003e8 ffff880200000002 ffff88102447d780 ffffffffa1a76870
Call Trace:
 [<ffffffffa18ff332>] ldiskfs_ext_drop_refs+0x32/0x50 [ldiskfs]
 [<ffffffffa18ffd9d>] ldiskfs_ext_walk_space+0x14d/0x310 [ldiskfs]
 [<ffffffffa1a76870>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
 [<ffffffffa1a765dc>] osd_ldiskfs_map_nblocks+0xcc/0xf0 [osd_ldiskfs]
 [<ffffffffa1a7671c>] osd_ldiskfs_map_ext_inode_pages+0x11c/0x270 [osd_ldiskfs]
 [<ffffffffa1a76f65>] osd_ldiskfs_map_inode_pages.clone.0+0x25/0x30 [osd_ldiskfs]
 [<ffffffffa1a78b96>] osd_write_commit+0x2f6/0x610 [osd_ldiskfs]
 [<ffffffffa1c87fc4>] ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa1c8ad45>] ofd_commitrw+0x5d5/0xbc0 [ofd]
 [<ffffffffa06a40d1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa1c1e1ad>] obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa1c28151>] ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff8129717d>] ? pointer+0x8d/0x6a0
 [<ffffffff8128d9e9>] ? cpumask_next_and+0x29/0x50
 [<ffffffffa0ad5860>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa1c2e7bf>] ost_handle+0x43af/0x44e0 [ost]
 [<ffffffffa0b1e78b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
 [<ffffffffa0b258d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa057f4fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0b1e289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0b2805d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
 [<ffffffffa0b27570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: dd 81 fb ff 01 00 00 7e 91 48 83 c4 10 5b 41 5c 41 5d 41 5e c9 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <8b> 47 60 85 c0 74 10 f0 ff 4f 60 c9 c3 66 2e 0f 1f 84 00 00 00 
RIP  [<ffffffff811c49d9>] __brelse+0x9/0x40
 RSP <ffff88102447d6b0>
CR2: 000000004db05079

or

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:80/0000:80:03.0/0000:82:00.0/infiniband_mad/umad0/port
CPU 6 
Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) dlm confi
gfs vfat fat usb_storage mpt2sas mptctl mptbase sctp libcrc32c dell_rbu 8021q garp stp llc autofs4 bonding ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_u
mad rdma_cm ib_cm iw_cm ext3 jbd scsi_dh_rdac dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm microcode iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_co
re lpc_ich mfd_core shpchp tg3 ptp pps_core sg ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core ahci mpt3sas scsi_transport_sas raid_class megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [las
t unloaded: configfs]

Pid: 44561, comm: ll_ost_io01_007 Not tainted 2.6.32-504.30.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
RIP: 0010:[<ffffffffa1d995c9>]  [<ffffffffa1d995c9>] ldiskfs_can_extents_be_merged+0x9/0xb0 [ldiskfs]
RSP: 0018:ffff882038c61550  EFLAGS: 00010246
RAX: 0000000000000010 RBX: 0000000000000002 RCX: 5a5a5a5a5a5a5a5a
RDX: ffff882038c616a0 RSI: 5a5a5a5a5a5a5a5a RDI: ffff880eb4e2c620
RBP: ffff882038c61550 R08: 0000000000000000 R09: ffff880f17dd9798
R10: 0000000000000000 R11: 0000000000000002 R12: ffff880e279f5840
R13: ffff880e279f58b0 R14: 0000000000000002 R15: 5a5a5a5a5a5a5a5a
FS:  0000000000000000(0000) GS:ffff880062060000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f08013da000 CR3: 000000205023f000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost_io01_007 (pid: 44561, threadinfo ffff882038c60000, task ffff882038c5cab0)
Stack:
 ffff882038c61630 ffffffffa1d9c77d ffff882038c61590 ffffffff811baf4b
<d> 0000000000000002 5a5a5a5a5a5a5a5a ffff882050017000 ffff880eb4e2c620
<d> ffff882038c61630 ffffffffa1db6a61 ffff880e279f5840 ffff880eb4e2c620
Call Trace:
 [<ffffffffa1d9c77d>] ldiskfs_ext_insert_extent+0x81d/0x1190 [ldiskfs]
 [<ffffffff811baf4b>] ? __mark_inode_dirty+0x3b/0x160
 [<ffffffffa1db6a61>] ? ldiskfs_mb_new_blocks+0x241/0x630 [ldiskfs]
 [<ffffffffa2036e49>] ldiskfs_ext_new_extent_cb+0x5d9/0x6d0 [osd_ldiskfs]
 [<ffffffffa1d9bd92>] ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
 [<ffffffffa2036870>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
 [<ffffffffa20365dc>] osd_ldiskfs_map_nblocks+0xcc/0xf0 [osd_ldiskfs]
 [<ffffffffa203671c>] osd_ldiskfs_map_ext_inode_pages+0x11c/0x270 [osd_ldiskfs]
 [<ffffffffa2036f65>] osd_ldiskfs_map_inode_pages.clone.0+0x25/0x30 [osd_ldiskfs]
 [<ffffffffa2038b96>] osd_write_commit+0x2f6/0x610 [osd_ldiskfs]
 [<ffffffffa2247fc4>] ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa224ad45>] ofd_commitrw+0x5d5/0xbc0 [ofd]
 [<ffffffffa099d0d1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa21de1ad>] obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa21e8151>] ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffffa0e4aeb5>] ? null_authorize+0x75/0x100 [ptlrpc]
 [<ffffffffa0e08a4e>] ? ptlrpc_send_reply+0x28e/0x7f0 [ptlrpc]
 [<ffffffff8128d9e9>] ? cpumask_next_and+0x29/0x50
 [<ffffffffa0dce860>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa21ee7bf>] ost_handle+0x43af/0x44e0 [ost]
 [<ffffffffa0e1778b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
 [<ffffffffa0e1e8d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa05e34fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0e17289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0e2105d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
 [<ffffffffa0e20570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: 01 89 f0 c9 c3 0f 1f 44 00 00 41 83 c0 01 44 89 81 a8 03 00 00 eb e2 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <0f> b7 4e 04 0f b7 42 04 66 81 f9 00 80 40 0f 97 c7 66 3d 00 80 
RIP  [<ffffffffa1d995c9>] ldiskfs_can_extents_be_merged+0x9/0xb0 [ldiskfs]
 RSP <ffff882038c61550>

After analyzing almost all of these crash-dumps, I have found that:
_ all of these different crashes are caused by corruption of a <size-128> Slab object
_ the corruption is an overrun by the previous object in the same Slab
_ that previous object is a [ldiskfs,ext4]_ext_path[] array, used to walk an inode's extent tree, that was allocated with a size too small (based on the depth evaluated at allocation time) to withstand concurrent depth growth



 Comments   
Comment by Bruno Faccini (Inactive) [ 04/Apr/16 ]

I have identified that recent kernels (>= v3.17-rc2-138-g10809df8) already carry ext4 patches that try to implement safe _ext_path[] array re-sizing/re-allocation:
_ commit 705912ca9 ("ext4: teach ext4_ext_find_extent() to free path on error")
_ commit dfe508093 ("ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code")
_ commit 10809df84 ("ext4: teach ext4_ext_find_extent() to realloc path if necessary")
but even with them, not all cases/scenarios of concurrent depth growth seem to be addressed.

So the new approach I have implemented avoids any need to re-allocate _ext_path[] upon an extent tree depth change, by initially allocating it with the maximum possible depth dimension. That maximum is evaluated at mount time from the filesystem's basic internal data-structure sizes/units/limits: maximum file size, block size, number of index/extent entries that fit in a block, etc.

And I must point out that this should be neither overkill nor costly: the present maximum extent tree depth is 5, requiring 5x48 (or 5x56 in ldiskfs) = 240 (280) bytes to be allocated in one shot for the full lifetime of the _ext_path[] array, versus possibly re-allocating 3-5x entries multiple times.

Comment by Gerrit Updater [ 06/Apr/16 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/19349
Subject: LU-7980 ldiskfs: always pre-allocate max depth for path
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 874cfbd471dff33e2a0a1ab9626a25fd7d3f3eff

Comment by Gerrit Updater [ 28/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19349/
Subject: LU-7980 ldiskfs: always pre-allocate max depth for path
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 655255b651d80d115fbde6729bd58e22afce6483

Comment by Joseph Gmitter (Inactive) [ 28/Apr/16 ]

Landed for 2.9.0

Comment by Andreas Dilger [ 18/May/16 ]

Reopening this issue for Bruno to submit this patch to the upstream kernel as a simplified and more robust version of commit 10809df84.

Bruno, from your comments, even the upstream kernel is not totally robust in this case and would benefit from this patch. If that is incorrect, please close again.

Comment by Bruno Faccini (Inactive) [ 23/Jun/16 ]

I have pushed a renewed version of my patch to the upstream kernel/ext4 developers, via email to linux-ext4@vger.kernel.org, under the title "ext4: always pre-allocate max depth for path".

Comment by Andreas Dilger [ 07/Jul/16 ]

A discussion came up on the list this week, and apparently the "maximum depth" of the extent tree is not actually fixed as we thought because the ext4 extent code does not balance the extent tree. This means in certain extremely rare situations the extent tree may become unbalanced and grow beyond the calculated s_max_ext_tree_depth. This is very unlikely to happen because the Lustre clients are already buffering and merging the data blocks, and we do not (yet) allow fallocate() to allocate or punch arbitrary block numbers. See http://marc.info/?l=linux-ext4&m=146790955304474&w=4 for more details.

Comment by Peter Jones [ 05/Aug/16 ]

Removing fixversion to decouple upstream work from fix for 2.9

Comment by Andreas Dilger [ 11/Sep/21 ]

This patch was landed for rhel7.x as commit v2_8_52_0-42-g655255b651. The comment from Ted was:

The problem with the max possible extent depth assumption is that this assumes non-pathological trees. Unfortunately, at the moment we don't ever shrink the extent tree as we delete entries from the tree, and we aren't obeying the requirements of a formal B+ tree, which is that all nodes (except for a trivial tree consisting of a single leaf node at the root) must be at least half-full. So while it is highly unlikely, it is possible to create highly pathological trees that could potentially be deeper than five deep.

They are extremely unlikely to happen in practice, granted, but if we are relying on this to prevent array bound overflow attacks, a malicious attacker could potentially be very happy to arrange such a situation.

So at least in the short run, we may be better off finding all of the places where we drop i_data_sem after we've allocated the struct path array, and after we grab it again for writing, double check to see if we need to reallocate it. For performance reasons I'm happy always allocating an extra array element or two to minimize the need to do the reallocation, but for correctness's sake it would be good if we could easily test the code path where we need to do a reallocation, as well as demonstrate that we do the right thing if the reallocation fails...

However, the functions affected by this patch (ext4_ext_find_extent() and ext4_find_extent()) also have a check that the 5-level tree depth is never exceeded (from commit v4.17-rc4-30-gbc890a602471, backported to at least RHEL7.7 but not to 7.3):

        if (depth < 0 || depth > EXT4_MAX_EXTENT_DEPTH) {
                EXT4_ERROR_INODE(inode, "inode has invalid extent depth: %d",
                                 depth);
                return ERR_PTR(-EFSCORRUPTED);
        }

so, contrary to Ted's concern, it is impossible for the tree to exceed EXT4_MAX_EXTENT_DEPTH, and the "s_max_ext_tree_depth" calculation is at least not needed. The "depth was changed" chunk was removed in commit v3.17-rc2-138-g10809df84a4d, but not backported to RHEL7. The "(ppos > depth))" check was removed in commit v4.6-rc4-18-g816cd71b0c72.

Generated at Sat Feb 10 02:13:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.