[LU-13168] Client panic "Freechain corrupt"/"Redzone Overwritten" Created: 22/Jan/20  Updated: 27/May/20  Resolved: 14/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Major
Reporter: Christopher Mountford Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None
Environment:

Centos 7.7
MDS/OSS Lustre version: 2.12.3
clients Lustre: 2.12.3_93_gb75f04d-1


Issue Links:
Duplicate
duplicates LU-12911 Setting a LOV EA can access or change... Resolved
Related
is related to LU-13589 PFL "lfs setstripe -E 1M -S 65536" in... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

 

We are using 2.12.3_93_gb75f04d-1 on clients to fix a panic when deleting files and using data on MDT (LU-12462). This has resolved the panic on deleting files, however we are now experiencing 2-3 kernel panics a day between our 6 cluster login machines.

We do not yet know what is triggering these, however they all start from with either a kmalloc-8 freechain corrupt or kmalloc-8 Redzone overwritten, I've reproduced samples of both the vmcore-dmesg file generated by kdump/, this looks similar to me to LU-12581

Typical dmesg from crashed client:

[171210.346747] =============================================================================
[171210.346754] BUG kmalloc-8 (Tainted: G OE ------------ ): Freechain corrupt
[171210.346756] -----------------------------------------------------------------------------
[171210.346759] Disabling lock debugging due to kernel taint
[171210.346763] INFO: Slab 0xffffeb5450defb40 objects=102 used=6 fp=0xffff8eb6b7bedfa8 flags=0x6fffff00000081
[171210.346765] INFO: Object 0xffff8eb6b7bedf30 @offset=3888 fp=0x7fff8eb6b7bedf08
[171210.346770] Redzone ffff8eb6b7bedf28: bb bb bb bb bb bb bb bb ........
[171210.346773] Object ffff8eb6b7bedf30: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
[171210.346775] Redzone ffff8eb6b7bedf38: bb bb bb bb bb bb bb bb ........
[171210.346778] Padding ffff8eb6b7bedf48: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[171210.346783] CPU: 21 PID: 8721 Comm: pool Kdump: loaded Tainted: G B OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
[171210.346785] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/21/2019
[171210.346787] Call Trace:
[171210.346799] [<ffffffffa777ac23>] dump_stack+0x19/0x1b
[171210.346805] [<ffffffffa7221561>] print_trailer+0x161/0x280
[171210.346808] [<ffffffffa7221ebf>] on_freelist+0xff/0x270
[171210.346813] [<ffffffffa77774cc>] free_debug_processing+0x18d/0x270
[171210.346818] [<ffffffffa71ddcb5>] ? kvfree+0x35/0x40
[171210.346822] [<ffffffffa7223bee>] __slab_free+0x1ce/0x290
[171210.346829] [<ffffffffa7272e58>] ? generic_setxattr+0x68/0x80
[171210.346834] [<ffffffffa7273635>] ? __vfs_setxattr_noperm+0x65/0x1b0
[171210.346840] [<ffffffffa732b7ae>] ? evm_inode_setxattr+0xe/0x10
[171210.346844] [<ffffffffa71ddcb5>] ? kvfree+0x35/0x40
[171210.346847] [<ffffffffa7223db6>] kfree+0x106/0x140
[171210.346851] [<ffffffffa71ddcb5>] kvfree+0x35/0x40
[171210.346855] [<ffffffffa727399b>] setxattr+0x15b/0x1e0
[171210.346861] [<ffffffffa725c3ed>] ? putname+0x3d/0x60
[171210.346865] [<ffffffffa725d602>] ? user_path_at_empty+0x72/0xc0
[171210.346871] [<ffffffffa724d828>] ? __sb_start_write+0x58/0x120
[171210.346876] [<ffffffffa7273c87>] SyS_setxattr+0xb7/0x100
[171210.346882] [<ffffffffa778dede>] system_call_fastpath+0x25/0x2a
[171210.346885] =============================================================================
[171210.346888] BUG kmalloc-8 (Tainted: G B OE ------------ ): Wrong object count. Counter is 6 but counted were 98
[171210.346889] -----------------------------------------------------------------------------
[171210.346893] INFO: Slab 0xffffeb5450defb40 objects=102 used=6 fp=0xffff8eb6b7bedfa8 flags=0x6fffff00000081
[171210.346897] CPU: 21 PID: 8721 Comm: pool Kdump: loaded Tainted: G B OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
[171210.346899] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/21/2019
[171210.346901] Call Trace:
[171210.346905] [<ffffffffa777ac23>] dump_stack+0x19/0x1b
[171210.346908] [<ffffffffa7221b54>] slab_err+0xb4/0xe0
[171210.346915] [<ffffffffa7030a1e>] ? show_stack+0x4e/0x60
[171210.346918] [<ffffffffa7221561>] ? print_trailer+0x161/0x280
[171210.346921] [<ffffffffa7221f85>] on_freelist+0x1c5/0x270
[171210.346925] [<ffffffffa77774cc>] free_debug_processing+0x18d/0x270
[171210.346929] [<ffffffffa71ddcb5>] ? kvfree+0x35/0x40
[171210.346932] [<ffffffffa7223bee>] __slab_free+0x1ce/0x290
[171210.346937] [<ffffffffa7272e58>] ? generic_setxattr+0x68/0x80
[171210.346941] [<ffffffffa7273635>] ? __vfs_setxattr_noperm+0x65/0x1b0
[171210.346944] [<ffffffffa732b7ae>] ? evm_inode_setxattr+0xe/0x10
[171210.346948] [<ffffffffa71ddcb5>] ? kvfree+0x35/0x40
[171210.346951] [<ffffffffa7223db6>] kfree+0x106/0x140
[171210.346955] [<ffffffffa71ddcb5>] kvfree+0x35/0x40
[171210.346959] [<ffffffffa727399b>] setxattr+0x15b/0x1e0
[171210.346963] [<ffffffffa725c3ed>] ? putname+0x3d/0x60
[171210.346967] [<ffffffffa725d602>] ? user_path_at_empty+0x72/0xc0
[171210.346971] [<ffffffffa724d828>] ? __sb_start_write+0x58/0x120
[171210.346976] [<ffffffffa7273c87>] SyS_setxattr+0xb7/0x100
[171210.346980] [<ffffffffa778dede>] system_call_fastpath+0x25/0x2a
[171210.346983] FIX kmalloc-8: Object count adjusted.
[171210.346985] =============================================================================
[171210.346988] BUG kmalloc-8 (Tainted: G B OE ------------ ): Redzone overwritten
[171210.346989] -----------------------------------------------------------------------------
[171210.346993] INFO: 0xffff8eb6b7bed0b0-0xffff8eb6b7bed0b7. First byte 0x4c instead of 0xcc
[171210.346996] INFO: Slab 0xffffeb5450defb40 objects=102 used=98 fp=0xffff8eb6b7bedfa8 flags=0x6fffff00000081
[171210.346998] INFO: Object 0xffff8eb6b7bed0a8 @offset=168 fp=0x7f7f0e36373e5050
[171210.347001] Redzone ffff8eb6b7bed0a0: cc cc cc cc cc cc cc cc ........
[171210.347004] Object ffff8eb6b7bed0a8: d0 0b d6 0b 88 01 00 25 .......%
[171210.347006] Redzone ffff8eb6b7bed0b0: 4c 4c 4c 4c 4c 4c 4c 4c LLLLLLLL
[171210.347009] Padding ffff8eb6b7bed0c0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[171210.347012] CPU: 21 PID: 8721 Comm: pool Kdump: loaded Tainted: G B OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
[171210.347014] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/21/2019
[171210.347016] Call Trace:
[171210.347020] [<ffffffffa777ac23>] dump_stack+0x19/0x1b
[171210.347023] [<ffffffffa7221561>] print_trailer+0x161/0x280
[171210.347026] [<ffffffffa72217ef>] check_bytes_and_report+0xcf/0x110
[171210.347030] [<ffffffffa722237d>] check_object+0x1dd/0x2a0
[171210.347033] [<ffffffffa77773cc>] free_debug_processing+0x8d/0x270
[171210.347037] [<ffffffffa71ddcb5>] ? kvfree+0x35/0x40
[171210.347040] [<ffffffffa7223bee>] __slab_free+0x1ce/0x290
[171210.347045] [<ffffffffa7272e58>] ? generic_setxattr+0x68/0x80
[171210.347049] [<ffffffffa7273635>] ? __vfs_setxattr_noperm+0x65/0x1b0
[171210.347258] [<ffffffffa732b7ae>] ? evm_inode_setxattr+0xe/0x10
[171210.347262] [<ffffffffa71ddcb5>] ? kvfree+0x35/0x40
[171210.347265] [<ffffffffa7223db6>] kfree+0x106/0x140
[171210.347449] [<ffffffffa71ddcb5>] kvfree+0x35/0x40
[171210.347627] [<ffffffffa727399b>] setxattr+0x15b/0x1e0
[171210.347823] [<ffffffffa725c3ed>] ? putname+0x3d/0x60
[171210.348010] [<ffffffffa725d602>] ? user_path_at_empty+0x72/0xc0
[171210.348204] [<ffffffffa724d828>] ? __sb_start_write+0x58/0x120
[171210.348209] [<ffffffffa7273c87>] SyS_setxattr+0xb7/0x100
[171210.348392] [<ffffffffa778dede>] system_call_fastpath+0x25/0x2a
[171210.348578] FIX kmalloc-8: Restoring 0xffff8eb6b7bed0b0-0xffff8eb6b7bed0b7=0xcc
[171210.349139] FIX kmalloc-8: Object at 0xffff8eb6b7bed0a8 not freed
[171210.462694] general protection fault: 0000 [#1] SMP 
[171210.488281] Modules linked in: fuse can_bcm sctp can_raw can nfsd mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) cts lnet(OE) rpcsec_gss_krb5 nfsv4 dns_resolver libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_recent xt_conntrack nf_conntrack iptable_filter dm_mirror dm_region_hash dm_log dm_mod mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mlx4_core(OE) mgag200 ttm irqbypass drm_kms_helper crc32_pclmul iTCO_wdt crc32c_intel iTCO_vendor_support ghash_clmulni_intel aesni_intel
[171210.866605] syscopyarea lrw sysfillrect sysimgblt fb_sys_fops gf128mul glue_helper drm ablk_helper cryptd ses enclosure drm_panel_orientation_quirks ipmi_si mlx_compat(OE) pcspkr ipmi_devintf devlink ioatdma ipmi_msghandler pcc_cpufreq wmi i2c_i801 hpwdt lpc_ich acpi_power_meter binfmt_misc knem(OE) auth_rpcgss ip_tables smartpqi bridge stp llc xfs isci libsas qla3xxx e1000e igb i2c_algo_bit megaraid_sas aacraid aic79xx ata_piix mpt2sas raid_class mptspi scsi_transport_spi mptsas mptscsih mptbase arcmsr ahci libahci sata_nv sata_svw bnx2x libcrc32c bnx2 ext4 mbcache jbd2 sata_sil libata tg3 e1000 nfsv3 nfs_acl nfs lockd grace sunrpc fscache tun sd_mod crc_t10dif crct10dif_generic sg ixgbe crct10dif_pclmul crct10dif_common hpsa dca mdio ptp hpilo scsi_transport_sas pps_core [last unloaded: ipmi_msghandler]
[171211.240594] 
[171211.242206] CPU: 21 PID: 8721 Comm: pool Kdump: loaded Tainted: G B OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
[171211.300030] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/21/2019
[171211.336094] task: ffff8e972f73b150 ti: ffff8e9cdf368000 task.ti: ffff8e9cdf368000
[171211.375938] RIP: 0010:[<ffffffffc112fcdc>] [<ffffffffc112fcdc>] cl_page_delete0+0x6c/0x220 [obdclass]
[171211.423586] RSP: 0018:ffff8e9cdf36bb98 EFLAGS: 00010287
[171211.451848] RAX: 7fffffffc1439900 RBX: ffff8eb76f7e4a90 RCX: 000000000000001c
[171211.490610] RDX: ffff8e88df67bb50 RSI: ffff8ea253a26b58 RDI: ffff8e8fd9ff89a4
[171211.530173] RBP: ffff8e9cdf36bbb0 R08: ffff8ea253a26b58 R09: 0000000000000046
[171211.568509] R10: 0000000000000230 R11: ffff8eae63eebc00 R12: ffff8eb76f7e4a28
[171211.606963] R13: ffffffffc118d878 R14: ffff8e9cdf36bcd0 R15: ffff8e9cdf36bc60
[171211.645074] FS: 00002aaabe446700(0000) GS:ffff8ec27f3c0000(0000) knlGS:0000000000000000
[171211.689665] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[171211.722223] CR2: 0000000001440b30 CR3: 0000003bd6a22000 CR4: 00000000003607e0
[171211.762135] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[171211.801929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[171211.842015] Call Trace:
[171211.855677] [<ffffffffc112fec3>] cl_page_delete+0x33/0x110 [obdclass]
[171211.892125] [<ffffffffc158a9ff>] ll_invalidatepage+0x7f/0x170 [lustre]
[171211.931786] [<ffffffffa71ce22d>] do_invalidatepage_range+0x7d/0x90
[171211.965295] [<ffffffffa71ce2d7>] truncate_inode_page+0x77/0x80
[171211.996796] [<ffffffffa71ce50a>] truncate_inode_pages_range+0x1ea/0x750
[171212.033187] [<ffffffffa71ceadf>] truncate_inode_pages_final+0x4f/0x60
[171212.071886] [<ffffffffc1570acf>] ll_delete_inode+0x4f/0x230 [lustre]
[171212.106401] [<ffffffffa7268544>] evict+0xb4/0x180
[171212.133368] [<ffffffffa726896c>] iput+0xfc/0x190
[171212.162024] [<ffffffffa725cbde>] do_unlinkat+0x1ae/0x2d0
[171212.191747] [<ffffffffa725dc96>] SyS_unlink+0x16/0x20
[171212.221730] [<ffffffffa778dede>] system_call_fastpath+0x25/0x2a
[171212.253488] Code: 89 e6 ba 04 00 00 00 4c 89 ef e8 80 fb ff ff 49 8b 44 24 30 49 83 c4 28 49 39 c4 48 8d 58 e0 74 2b 66 0f 1f 44 00 00 48 8b 43 18 <48> 8b 40 40 48 85 c0 74 0b 48 89 de 4c 89 ef e8 20 2e 26 e6 48 
[171212.353283] RIP [<ffffffffc112fcdc>] cl_page_delete0+0x6c/0x220 [obdclass]
[171212.391650] RSP <ffff8e9cdf36bb98>
 

second dmesg (Redzone overwritten message)

 

[588058.598892] =============================================================================
[588058.598898] BUG kmalloc-8 (Tainted: G OE ------------ ): Redzone overwritten
[588058.598900] -----------------------------------------------------------------------------
[588058.598903] Disabling lock debugging due to kernel taint
[588058.598906] INFO: 0xffff8b70ccbdde48-0xffff8b70ccbdde4f. First byte 0x4c instead of 0xcc
[588058.598908] INFO: Slab 0xffffd60ee632f740 objects=102 used=93 fp=0xffff8b70ccbddcd8 flags=0x6fffff00000081
[588058.598910] INFO: Object 0xffff8b70ccbdde40 @offset=3648 fp=0x7f7f0b704c3d5d48
[588058.598914] Redzone ffff8b70ccbdde38: cc cc cc cc cc cc cc cc ........
[588058.598916] Object ffff8b70ccbdde40: d0 0b d6 0b 88 01 00 25 .......%
[588058.598918] Redzone ffff8b70ccbdde48: 4c 4c 4c 4c 4c 4c 4c 4c LLLLLLLL
[588058.598920] Padding ffff8b70ccbdde58: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[588058.598924] CPU: 27 PID: 12194 Comm: pool Kdump: loaded Tainted: G B OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
[588058.598926] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/21/2019
[588058.598927] Call Trace:
[588058.598939] [<ffffffffb477ac23>] dump_stack+0x19/0x1b
[588058.598944] [<ffffffffb4221561>] print_trailer+0x161/0x280
[588058.598947] [<ffffffffb42217ef>] check_bytes_and_report+0xcf/0x110
[588058.598950] [<ffffffffb422237d>] check_object+0x1dd/0x2a0
[588058.598953] [<ffffffffb47773cc>] free_debug_processing+0x8d/0x270
[588058.598958] [<ffffffffb41ddcb5>] ? kvfree+0x35/0x40
[588058.598962] [<ffffffffb4223bee>] __slab_free+0x1ce/0x290
[588058.598968] [<ffffffffb4272e58>] ? generic_setxattr+0x68/0x80
[588058.598972] [<ffffffffb4273635>] ? __vfs_setxattr_noperm+0x65/0x1b0
[588058.598977] [<ffffffffb432b7ae>] ? evm_inode_setxattr+0xe/0x10
[588058.598980] [<ffffffffb41ddcb5>] ? kvfree+0x35/0x40
[588058.598982] [<ffffffffb4223db6>] kfree+0x106/0x140
[588058.598985] [<ffffffffb41ddcb5>] kvfree+0x35/0x40
[588058.598989] [<ffffffffb427399b>] setxattr+0x15b/0x1e0
[588058.598994] [<ffffffffb425c3ed>] ? putname+0x3d/0x60
[588058.598998] [<ffffffffb425d602>] ? user_path_at_empty+0x72/0xc0
[588058.599003] [<ffffffffb424d828>] ? __sb_start_write+0x58/0x120
[588058.599008] [<ffffffffb42802f1>] ? do_utimes+0xf1/0x180
[588058.599011] [<ffffffffb4273c87>] SyS_setxattr+0xb7/0x100
[588058.599016] [<ffffffffb478dede>] system_call_fastpath+0x25/0x2a
[588058.599019] FIX kmalloc-8: Restoring 0xffff8b70ccbdde48-0xffff8b70ccbdde4f=0xcc
[588058.599022] FIX kmalloc-8: Object at 0xffff8b70ccbdde40 not freed
[588060.269020] WebExtensions[13188]: segfault at 1fff8 ip 00001f17cb0e5fbb sp 00007fffffffb998 error 4
[588076.827561] atom[21965]: segfault at 21ea75682310 ip 00002aaaab0d6550 sp 00007fffffffc3a8 error 4 in libnode.so[2aaaaaccf000+12ba000]
[588128.154889] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) page@ffff8b70fd832600[0 ffff8b5e72b18270 4 1 (null)]
[588128.154903] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) vvp-page@ffff8b70fd832650(0:0) vm@ffffd60ec75c3dc0 6fffff00000009 3:0 0 449438 lru
[588128.154910] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) lov-page@ffff8b70fd832690, comp index: 30002, gen: 8
[588128.154924] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) osc-page@ffff8b70fd8326c8 112542: 1< 0x845fed 2 0 - - > 2< 460972032 0 4096 0x0 0x420 | (null) ffff8b872b233648 ffff8b49c2ff9540 > 3< 0 0 0 > 4< 0 0 8 1879965695 - | - - - - > 5< - - - - | 0 - | 0 - ->
[588128.154930] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) end page@ffff8b70fd832600
[588128.154935] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) list_empty(&page->cp_batch)
[588128.154939] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) ASSERTION( 0 ) failed: 
[588128.196374] LustreError: 32046:0:(cl_page.c:394:cl_pagevec_put()) LBUG
[588128.230830] Pid: 32046, comm: wget 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
[588128.230832] Call Trace:
[588128.230848] [<ffffffffc10d97cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[588128.230867] [<ffffffffc10d987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[588128.230876] [<ffffffffc12fd1f3>] cl_pagevec_put+0x3a3/0x3e0 [obdclass]
[588128.230922] [<ffffffffc12fd240>] cl_page_put+0x10/0x20 [obdclass]
[588128.230944] [<ffffffffc175d895>] ll_releasepage+0xb5/0x1a0 [lustre]
[588128.230971] [<ffffffffb41bd565>] try_to_release_page+0x35/0x50
[588128.230979] [<ffffffffb41d2a19>] shrink_page_list+0xa09/0xc30
[588128.230985] [<ffffffffb41d3266>] shrink_inactive_list+0x1c6/0x5d0
[588128.230989] [<ffffffffb41d3d65>] shrink_lruvec+0x385/0x740
[588128.230993] [<ffffffffb41d4196>] shrink_zone+0x76/0x1a0
[588128.230997] [<ffffffffb41d4680>] do_try_to_free_pages+0xf0/0x520
[588128.231002] [<ffffffffb41d4d0a>] try_to_free_mem_cgroup_pages+0xda/0x190
[588128.231006] [<ffffffffb423c7ce>] mem_cgroup_reclaim+0x4e/0x120
[588128.231011] [<ffffffffb423d19c>] __mem_cgroup_try_charge+0x4ec/0x670
[588128.231014] [<ffffffffb423db09>] mem_cgroup_charge_common+0x59/0xc0
[588128.231018] [<ffffffffb423f4ca>] mem_cgroup_cache_charge+0x8a/0xb0
[588128.231022] [<ffffffffb41be1ee>] __add_to_page_cache_locked+0x4e/0x190
[588128.231026] [<ffffffffb41be387>] add_to_page_cache_lru+0x37/0xb0
[588128.231030] [<ffffffffb41be449>] grab_cache_page_nowait+0x49/0xa0
[588128.231033] [<ffffffffc175e1d5>] ll_write_begin+0xd5/0xc00 [lustre]
[588128.231048] [<ffffffffb41bd28f>] generic_file_buffered_write+0x10f/0x270
[588128.231052] [<ffffffffb41bfaf2>] __generic_file_aio_write+0x1e2/0x400
[588128.231056] [<ffffffffc176c51b>] __generic_file_write_iter+0xcb/0x340 [lustre]
[588128.231072] [<ffffffffc1770704>] vvp_io_write_start+0x4c4/0x970 [lustre]
[588128.231088] [<ffffffffc13011a8>] cl_io_start+0x68/0x130 [obdclass]
[588128.231112] [<ffffffffc130338c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[588128.231134] [<ffffffffc1725f4b>] ll_file_io_generic+0x63b/0xc90 [lustre]
[588128.231147] [<ffffffffc1726a39>] ll_file_aio_write+0x289/0x660 [lustre]
[588128.231158] [<ffffffffc1726f10>] ll_file_write+0x100/0x1c0 [lustre]
[588128.231170] [<ffffffffb424a7f0>] vfs_write+0xc0/0x1f0
[588128.231175] [<ffffffffb424b60f>] SyS_write+0x7f/0xf0
[588128.231179] [<ffffffffb478dede>] system_call_fastpath+0x25/0x2a
[588128.231185] [<ffffffffffffffff>] 0xffffffffffffffff
[588128.231226] Kernel panic - not syncing: LBUG
[588128.255989] CPU: 14 PID: 32046 Comm: wget Kdump: loaded Tainted: G B OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
[588128.315914] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/21/2019
[588128.353586] Call Trace:
[588128.367273] [<ffffffffb477ac23>] dump_stack+0x19/0x1b
[588128.395924] [<ffffffffb4774967>] panic+0xe8/0x21f
[588128.423537] [<ffffffffc10d98cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[588128.458128] [<ffffffffc12fd1f3>] cl_pagevec_put+0x3a3/0x3e0 [obdclass]
[588128.496480] [<ffffffffc12fbcf0>] ? cl_page_delete0+0x80/0x220 [obdclass]
[588128.536173] [<ffffffffc12fd240>] cl_page_put+0x10/0x20 [obdclass]
[588128.569680] [<ffffffffc175d895>] ll_releasepage+0xb5/0x1a0 [lustre]
[588128.606245] [<ffffffffb41bd565>] try_to_release_page+0x35/0x50
[588128.637150] [<ffffffffb41d2a19>] shrink_page_list+0xa09/0xc30
[588128.668853] [<ffffffffb41d3266>] shrink_inactive_list+0x1c6/0x5d0
[588128.701180] [<ffffffffb41d3d65>] shrink_lruvec+0x385/0x740
[588128.729829] [<ffffffffb41d4196>] shrink_zone+0x76/0x1a0
[588128.757029] [<ffffffffb41d4680>] do_try_to_free_pages+0xf0/0x520
[588128.788834] [<ffffffffb41d4d0a>] try_to_free_mem_cgroup_pages+0xda/0x190
[588128.825660] [<ffffffffb423c7ce>] mem_cgroup_reclaim+0x4e/0x120
[588128.857225] [<ffffffffb423d19c>] __mem_cgroup_try_charge+0x4ec/0x670
[588128.890392] [<ffffffffb423db09>] mem_cgroup_charge_common+0x59/0xc0
[588128.926221] [<ffffffffb423f4ca>] mem_cgroup_cache_charge+0x8a/0xb0
[588128.959155] [<ffffffffb41be1ee>] __add_to_page_cache_locked+0x4e/0x190
[588128.997956] [<ffffffffb41be387>] add_to_page_cache_lru+0x37/0xb0
[588129.030741] [<ffffffffb41be449>] grab_cache_page_nowait+0x49/0xa0
[588129.068075] [<ffffffffc175e1d5>] ll_write_begin+0xd5/0xc00 [lustre]
[588129.108147] [<ffffffffb41bd28f>] generic_file_buffered_write+0x10f/0x270
[588129.147957] [<ffffffffb41bfaf2>] __generic_file_aio_write+0x1e2/0x400
[588129.187721] [<ffffffffc176c51b>] __generic_file_write_iter+0xcb/0x340 [lustre]
[588129.227843] [<ffffffffc1770704>] vvp_io_write_start+0x4c4/0x970 [lustre]
[588129.267789] [<ffffffffc13011a8>] cl_io_start+0x68/0x130 [obdclass]
[588129.300695] [<ffffffffc130338c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[588129.337964] [<ffffffffc1725f4b>] ll_file_io_generic+0x63b/0xc90 [lustre]
[588129.377915] [<ffffffffc1726a39>] ll_file_aio_write+0x289/0x660 [lustre]
[588129.418350] [<ffffffffc1726f10>] ll_file_write+0x100/0x1c0 [lustre]
[588129.458398] [<ffffffffb424a7f0>] vfs_write+0xc0/0x1f0
[588129.488082] [<ffffffffb424b60f>] SyS_write+0x7f/0xf0
[588129.518291] [<ffffffffb478dede>] system_call_fastpath+0x25/0x2a

 

 



 Comments   
Comment by Andreas Dilger [ 23/Jan/20 ]

I assume that the redzone/slab debugging is because your clients have "CONFIG_DEBUG_SLAB" enabled or similar.

Do you have any idea if there is a particular workload that triggers this problem, or just general usage?

If you are able to reproduce the problem easily, would you be able to try "git bisect" to isolate the source of this problem to a specific patch? You would need to apply the LU-12462 patch after each bisect step so that you don't hit the other problem.

The only thing in the slab dump that looks familiar is the data "d0 0b d6 0b 88 01 00 25" which looks to be "LOV_USER_MAGIC_COMP_V1=0x0BD60BD0" but the following bytes do not look valid. One possibility is that a "struct lov_comp_md_v1" was allocated from one slab, but incorrectly freed from a second slab? It definitely shouldn't be allocated from the kmalloc-8 slab.

Comment by Christopher Mountford [ 23/Jan/20 ]

We are trying to isolate the workload causing this - it is only occurring on login nodes. If we are able to reproduce it we will try to identify is a specific patch causes this on a test node.

Comment by Christopher Mountford [ 28/Feb/20 ]

We've updated the clients to 2.12.4 and are still seeing the same issue.

Not yet identified a way of reproducing the problem, it seems to strike randomly.

We can confirm we only see these crashes on our login nodes which support virtual desktops (using nomachine)  - 22 desktop login node crashes (between 4 login nodes), no similar problems on 2 ssh only login nodes and 170 compute nodes. All login nodes (NoMachine and ssh) have identical login images and hardware.

The mate desktop we provide does seem to open (and keep open) a lot of files - particularly small memory mapped files. Not sure if this might contribute to the problem.

 

Comment by Christopher Mountford [ 28/Apr/20 ]

I think we've finally tracked down a trigger for this crash - it occurs when files are being copied (intermittently) and moved (almost always triggers a panic) using the gnome io libraries (for example using ctrl-x/ctrl-v in the Caja or Nautilus file browsers. Critically, it only occurs when the move is between 2 lustre file systems. GNU utils and kio do not seem to cause the same problem.

We've taken a look at the strace output from a gio move command, it ends (kernel panics) with:

.
.
lstat("/lustre/old_data/admin/cjm14/pr_E1hr_CNRM-CM6-1_amip_r1i1p1f2_gr_197901010030-198312312330.nc", {st_mode=S_IFREG|0644,st_size=4899999744, ...}) = 0
chmod("/lustre/old_data/admin/cjm14/pr_E1hr_CNRM-CM6-1_amip_r1i1p1f2_gr_197901010030-198312312330.nc", 0100644) = 0
utimes("/lustre/old_data/admin/cjm14/pr_E1hr_CNRM-CM6-1_amip_r1i1p1f2_gr_197901010030-198312312330.nc", [{1588072285, 0}, {1587982928, 0}]) = 0
setxattr("/lustre/old_data/admin/cjm14/pr_E1hr_CNRM-CM6-1_amip_r1i1p1f2_gr_197901010030-198312312330.nc", "lustre.lov", "\320\v\326\v\10\3", 7, 0packet_write_wait: Connection to 10.141.32.19 port 22: Broken pipe

 
I believe the lustre.lov attribute is related to the file layout?

Critically, gio writes only 7 bytes to this attribute, if we compare this to strace gnu mv -

fsetxattr(4, "lustre.lov", "\320\v\326\v\10\3\0\0\6\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\340\0\0\0008\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\20\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\1\0\0\0\30\1\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\20\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\10\0\0\0h\1\0\0\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\377\377\377\377\377\377\377\377\350\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\320\v\321\v\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\1\0\0\0lb\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\320\v\321\v\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\2\0\0\0\207\302\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0Aj\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\320\v\321\v\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0@\0\4\0\0\0Gg\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0mb\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\210\302\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0Bj\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\320\v\321\v\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0@\0\10\0\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 776, 0) = 0

The attribute value is 776 bytes long - it looks to me like gio truncates it at the first null character.

 

The following code reproduces this crash via a truncated call to setxattr (and confirms that the xattr for the above file is indeed 776 bytes), so far we've confirmed this with 2.12.4 client and server, I'm going to check it again on a system with 2.12.4 client and 2.10.8 server:

 

int main(int argc, char *argv[]) {
        char *file = argv[1];
        int trunc_to = 7;
        // call with a zero size buffer, should return the buffer size required
        ssize_t asz = getxattr(file, "lustre.lov", NULL, 0);
        // call with a suitable buffer:
        unsigned char xattr_buf[asz];
        ssize_t asz2 = getxattr(file, "lustre.lov", xattr_buf,asz);
        if (asz2 != asz)
        {
                printf("asz2 (%i) does not match asz (%i)\n", asz2, asz );
                return -1;
        }

        // if we get this far we have the attr - lets take a look at it:
        printf( "xattr - %i characters\n", asz2 );

        int ii;
        for( ii=0; ii<asz2; ii++ )
              printf( "0x%02x:", xattr_buf[ii] );
        printf("\n");
        // Now try writing it back to the file - this should be ok.
        int rc = setxattr(file, "lustre.lov", xattr_buf, asz, 0);
        if (rc) 
        {
                printf( "call to setxattr with full attribute value failed with errno %i\n", errno );
                return -1;
        }
        printf( "wrote xattr with %i bytes\n", asz);

        // Finally, lets delibrately truncate it and write it (this   appears 
        // be what happens at the end of glib io move) - this crashes our
        // clients.
        rc = setxattr(file, "lustre.lov", xattr_buf, trunc_to, 0);
        if (rc)
        {
                printf( "call to setxattr with partial attribute value failed with errno %i\n", errno );
                return -1;
        }
        printf("wrote truncated xattr with %i bytes\n", asz);
        return 0;
}

It often crashes immediately, sometimes crashes on a second or 3rd run and occasionally crashes a minute or 2 after the code is run. If it doesn't crash immediately it reports errno as 34.

Both client and server run lustre 2.12.4, MLNX 4.7-1.0.0.1.

kernels 3.10.0-1062.4.3.el7.x86_64 and 3.10.0-1062.12.1.el7.x86_64 on clients, 3.10.0-1062.9.1.el7_lustre.x86_64 on servers.

Comment by Christopher Mountford [ 29/Apr/20 ]

Testing on a second system with 2.12.4 client and 2.10.8 server (which has a much simpler file layout), the truncated setxattr returns error code 34 and the client does not crash.

Comment by Andreas Dilger [ 30/Apr/20 ]

The attribute value is 776 bytes long - it looks to me like gio truncates it at the first null character.

Could you please file a bug with gio about this. There definitely isn't a valid reason to assume that all xattrs are ASCII strings.

It seems that this gio bug is particularly evil, in that it results in the LOV_MAGIC_V1 and LOV_PATTERN_RAID0 to be written to the kernel, so it passes the basic validity checks, but the rest of the structure is probably zero or uninitialized garbage.

Comment by Andreas Dilger [ 30/Apr/20 ]

I was trying to reproduce this locally on master with your test program, but only every got error 34 and no crash. It looks like this was fixed on master via patch https://review.whamcloud.com/36589 "LU-12911 llite: Don't access lov_md fields before size check" but that was only landed as commit v2_13_52-17-gf2d06d3c76 and not backported to the b2_12 LTS release.

I've cherry-picked that patch as patch https://review.whamcloud.com/38433 so you can download a packages from build.whamcloud.com when it finishes, or apply it locally and rebuild to verify it fixes your problem.

Thanks for doing the legwork on this, it made it straight forward to identify the problem once you had a good reproducer. Even though it was coincidentally fixed (due to static code analysis), your efforts at least identified that this was a real issue for the 2.12 release that needed to be fixed.

Comment by Gerrit Updater [ 30/Apr/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38434
Subject: LU-13168 tests: verify truncated xattr is handled
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e0e7f0e4ef0aa868c0d9d38bb517357ef50cea25

Comment by Andreas Dilger [ 01/May/20 ]

Christopher, it looks like there was still a problem with PFL layouts, even after the 36589 patch. I've updated the 38434 patch to also fix the PFL layout problem.

Comment by Christopher Mountford [ 01/May/20 ]

I've tested the 38434 patch, I downloaded the tarball and built the client rpms from there. This seems to have fix the problem, the system call now returns an error without causing a panic. I'll try applying the diff to the 2.12.4 source and check that works as well.

I'll grab the latest gio as well and check if the handling of extended attributes is fixed, if not I'll put a bug report in to gnome.

Comment by Christopher Mountford [ 06/May/20 ]

I've applied the patch to our 2.12.4 clients. This appears to build fine. Our test code now returns an error as expected rather than triggering a panic. Will this be included in the next 2.12 LTS release?

 

 

Comment by Peter Jones [ 06/May/20 ]

> Will this be included in the next 2.12 LTS release?

Very likely but it needs to land to master first.

Comment by Gerrit Updater [ 14/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38434/
Subject: LU-13168 tests: verify truncated xattr is handled
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cb74546354201434a6fd3d53acd1a0808fbfcb1c

Comment by Peter Jones [ 14/May/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 14/May/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38604
Subject: LU-13168 tests: verify truncated xattr is handled
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7d506d0989c0467e5a5fce7f6295eae578aef82e

Comment by Gerrit Updater [ 27/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38604/
Subject: LU-13168 tests: verify truncated xattr is handled
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 75c0eb51332639a09c720fb41f3a2cdb5b029afb

Comment by Peter Jones [ 27/May/20 ]

Fix confirmed in 2.12.5

Generated at Sat Feb 10 02:58:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.