[LU-5376] slab error in kmem_cache_destroy(): cache `xattr_kmem': Can't free all objects
Created: 20/Jul/14  Updated: 27/Feb/20  Resolved: 27/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Oleg Drokin
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 14983

Description

Running racer with an extra patch that increases the number of operations: http://review.whamcloud.com/#/c/5936/5

I appear to be hitting a memory leak in the xattr code when the modules are unloaded:

<3>[362074.759838] slab error in kmem_cache_destroy(): cache `xattr_kmem': Can't free all objects
<4>[362074.760559] Pid: 3154, comm: rmmod Not tainted 2.6.32-rhe6.5-debug #2
<4>[362074.760962] Call Trace:
<4>[362074.761282]  [<ffffffff8116d069>] ? __slab_error+0x29/0x30
<4>[362074.761882]  [<ffffffff81171506>] ? kmem_cache_destroy+0xa6/0xf0
<4>[362074.762308]  [<ffffffffa0e4aabd>] ? lu_kmem_fini+0x2d/0x50 [obdclass]
<4>[362074.762878]  [<ffffffffa0c4c015>] ? ll_xattr_fini+0x15/0x20 [lustre]
<4>[362074.763507]  [<ffffffffa0c6a3c2>] ? exit_lustre_lite+0xe/0xd3 [lustre]
<4>[362074.764163]  [<ffffffff810b81b4>] ? sys_delete_module+0x194/0x260
<4>[362074.764813]  [<ffffffff8151989e>] ? do_page_fault+0x3e/0xa0
<4>[362074.765379]  [<ffffffff8100b0b2>] ? system_call_fastpath+0x16/0x1b
<6>[362080.901752] LNet: Removed LNI 192.168.10.220@tcp
<3>[362081.031031] LustreError: 3254:0:(class_obd.c:708:cleanup_obdclass()) obd_memory max: 149654698, leaked: 551
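
For triage, the two leak indicators in the unload log above (the slab complaint and the obd_memory summary) can be pulled out mechanically. This is a minimal sketch that assumes log lines in the same format as shown here; the function name summarize_unload and the regular expressions are illustrative assumptions, not part of any Lustre tooling.

```python
import re

# Assumed pattern for the obd_memory summary printed at cleanup, as in the log above.
LEAK_RE = re.compile(r"obd_memory max: (\d+), leaked: (\d+)")
# Assumed pattern for the slab complaint; captures the cache name between ` and '.
SLAB_RE = re.compile(r"slab error in (\w+)\(\): cache `([^']+)': (.+)$")


def summarize_unload(lines):
    """Collect leak indicators from module-unload log lines (hypothetical helper)."""
    summary = {"leaky_caches": [], "obd_max": None, "obd_leaked": None}
    for line in lines:
        m = SLAB_RE.search(line)
        if m:
            summary["leaky_caches"].append(m.group(2))
        m = LEAK_RE.search(line)
        if m:
            summary["obd_max"] = int(m.group(1))
            summary["obd_leaked"] = int(m.group(2))
    return summary


log = [
    "<3>[362074.759838] slab error in kmem_cache_destroy(): cache `xattr_kmem': Can't free all objects",
    "<3>[362081.031031] LustreError: 3254:0:(class_obd.c:708:cleanup_obdclass()) obd_memory max: 149654698, leaked: 551",
]
print(summarize_unload(log))
```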

This is followed by problems loading the Lustre modules again, after which the node dies on an invalid pointer dereference, suggesting that something is handling allocation failures incorrectly:


<6>[362101.036189] Lustre: Lustre: Build Version: 2.6.50-gc5e9f13-CHANGED-2.6.32-rhe6.5-debug
<3>[362101.039867] SLAB: cache with size 64 has lost its name
...
<6>[362104.420882] LNet: Added LNI 192.168.10.220@tcp [8/256/0/180]
<6>[362104.421503] LNet: Accept secure, port 988
<3>[362104.424302] SLAB: cache with size 64 has lost its name
... (repeated many times)
<3>[362111.779957] SLAB: cache with size 64 has lost its name
<3>[362115.274250] kmem_cache_create: duplicate cache xattr_kmem
<4>[362115.274649] Pid: 4220, comm: insmod Not tainted 2.6.32-rhe6.5-debug #2
<4>[362115.275063] Call Trace:
<4>[362115.275425] [<ffffffff81172465>] ? kmem_cache_create+0x655/0x6e0
<4>[362115.275845] [<ffffffffa11ec66e>] ? lu_env_init+0x1e/0x30 [obdclass]
<4>[362115.276281] [<ffffffffa0c6048f>] ? ccc_global_init+0x5f/0xb0 [lustre]
<4>[362115.276941] [<ffffffffa11f48fd>] ? cl_env_new+0x15d/0x350 [obdclass]
<4>[362115.277604] [<ffffffffa11e8b28>] ? lu_kmem_init+0x48/0x80 [obdclass]
<4>[362115.278317] [<ffffffffa0c4c035>] ? ll_xattr_init+0x15/0x20 [lustre]
<4>[362115.278989] [<ffffffffa0a171e7>] ? init_lustre_lite+0x1e7/0x280 [lustre]
<4>[362115.279449] [<ffffffffa0a17000>] ? init_lustre_lite+0x0/0x280 [lustre]
<4>[362115.279910] [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
<4>[362115.280346] [<ffffffff810bb291>] ? sys_init_module+0xe1/0x250
<4>[362115.280890] [<ffffffff8100b0b2>] ? system_call_fastpath+0x16/0x1b
<1>[362115.428554] BUG: unable to handle kernel paging request at ffffffffa0c97860
<1>[362115.429294] IP: [<ffffffffa11e7e84>] keys_fill+0x54/0x190 [obdclass]
<4>[362115.430109] PGD 1a27067 PUD 1a2b063 PMD b7af4067 PTE 0
<4>[362115.430756] Oops: 0000 1 SMP DEBUG_PAGEALLOC
<4>[362115.431365] last sysfs file: /sys/devices/system/cpu/online
<4>[362115.432021] CPU 2
<4>[362115.432112] Modules linked in: ofd osp lod ost mdt mdd mgs nodemap osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[362115.432512]
<4>[362115.432512] Pid: 3956, comm: ptlrpcd_3 Not tainted 2.6.32-rhe6.5-debug #2 Red Hat KVM
<4>[362115.432512] RIP: 0010:[<ffffffffa11e7e84>] [<ffffffffa11e7e84>] keys_fill+0x54/0x190 [obdclass]
<4>[362115.432512] RSP: 0018:ffff880079cb3cf0 EFLAGS: 00010286
<4>[362115.432512] RAX: ffff880056b0bdf0 RBX: 00000000000000e0 RCX: 0000000000000000
<4>[362115.442964] RDX: ffff88000c5a0f70 RSI: ffff880026cd63b0 RDI: ffff880079cb3e00
<4>[362115.443718] RBP: ffff880079cb3d30 R08: 0000000000000000 R09: 0000000000000000
<4>[362115.444089] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880079cb3e00
<4>[362115.444089] R13: ffffffffa0c97860 R14: 0000000000000000 R15: 0000000000000000
<4>[362115.444089] FS: 0000000000000000(0000) GS:ffff880006280000(0000) knlGS:0000000000000000
<4>[362115.447312] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[362115.447312] CR2: ffffffffa0c97860 CR3: 0000000001a25000 CR4: 00000000000006e0
<4>[362115.449181] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[362115.449181] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[362115.449181] Process ptlrpcd_3 (pid: 3956, threadinfo ffff880079cb2000, task ffff8800b57c6500)
<4>[362115.449181] Stack:
<4>[362115.449181] ffff880079cb3d10 ffffffff810829b2 ffff8800bb1e8000 ffff880079cb3e00
<4>[362115.449181] <d> ffff880026cd63b0 ffff880079cb3e00 0000000000000000 0000000000000000
<4>[362115.449181] <d> ffff880079cb3d40 ffffffffa11e7fdd ffff880079cb3d60 ffffffffa11e8006
<4>[362115.449181] Call Trace:
<4>[362115.449181] [<ffffffff810829b2>] ? del_timer_sync+0x22/0x30
<4>[362115.449181] [<ffffffffa11e7fdd>] lu_context_refill+0x1d/0x30 [obdclass]
<4>[362115.449181] [<ffffffffa11e8006>] lu_env_refill+0x16/0x30 [obdclass]
<4>[362115.449181] [<ffffffffa13e859f>] ptlrpcd_check+0x4f/0x590 [ptlrpc]
<4>[362115.449181] [<ffffffffa13e906d>] ptlrpcd+0x2ad/0x3f0 [ptlrpc]
<4>[362115.449181] [<ffffffff8105de00>] ? default_wake_function+0x0/0x20
<4>[362115.449181] [<ffffffffa13e8dc0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
<4>[362115.449181] [<ffffffff81098c06>] kthread+0x96/0xa0
<4>[362115.449181] [<ffffffff8100c24a>] child_rip+0xa/0x20
<4>[362115.449181] [<ffffffff81098b70>] ? kthread+0x0/0xa0
<4>[362115.449181] [<ffffffff8100c240>] ? child_rip+0x0/0x20
<4>[362115.449181] Code: 08 48 81 fb 40 01 00 00 41 89 44 24 28 0f 84 c4 00 00 00 49 8b 44 24 10 4c 8b ab 20 7e 27 a1 48 83 3c 18 00 75 d1 4d 85 ed 74 cc <41> 8b 45 00 41 85 04 24 74 c2 a9 00 00 00 40 75 bb 4c 89 ee 4c
<1>[362115.449181] RIP [<ffffffffa11e7e84>] keys_fill+0x54/0x190 [obdclass]
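
One possible reading of the sequence above, offered as an assumption rather than a confirmed root cause: the failed kmem_cache_destroy() leaves the xattr_kmem cache registered, so the later kmem_cache_create() reports a duplicate and module init fails partway through. If the failing init path does not undo what it registered earlier, stale references can remain. The toy model below is a userspace sketch of that idea only. ToySlab, module_init and "toy_key" are hypothetical names, and none of this is kernel or Lustre code.

```python
class ToySlab:
    """Toy registry of named caches; NOT the kernel slab implementation."""

    def __init__(self):
        self.caches = {}  # cache name -> number of live objects

    def create(self, name):
        if name in self.caches:
            return False  # analogous to "duplicate cache" in the log
        self.caches[name] = 0
        return True

    def alloc(self, name):
        self.caches[name] += 1

    def free(self, name):
        self.caches[name] -= 1

    def destroy(self, name):
        if self.caches[name] > 0:
            # analogous to "Can't free all objects": the cache stays registered
            return False
        del self.caches[name]
        return True


def module_init(slab, registered_keys):
    """Hypothetical init: register a key, then create the cache; unwind on failure."""
    registered_keys.append("toy_key")
    if not slab.create("xattr_kmem"):
        registered_keys.remove("toy_key")  # undo the earlier step before failing
        return False
    return True


slab = ToySlab()
keys = []
assert module_init(slab, keys)
slab.alloc("xattr_kmem")                # one object is never freed (the leak)
destroyed = slab.destroy("xattr_kmem")  # fails, cache stays registered
keys.clear()                            # module unload drops its keys
reloaded = module_init(slab, keys)      # fails: the stale cache is still registered
print(destroyed, reloaded, keys)        # False False []
```

In this sketch the failed reload leaves no key behind because module_init unwinds its earlier registration. An init path that skipped that unwind would leave "toy_key" registered after the failure, which is the kind of mishandled failure the oops in keys_fill may point to.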



Comments
Comment by Andreas Dilger [ 27/Feb/20 ]

Close old bug that hasn't been seen in a long time.

Generated at Sat Feb 10 01:50:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.