[LU-4007] BUG at mm/slab.c lcw_dispatch_main Created: 25/Sep/13  Updated: 31/Dec/13  Resolved: 01/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: mn1, patch

Severity: 3
Rank (Obsolete): 10726

 Description   

During testing we hit the following bug:

------------[ cut here ]------------
kernel BUG at mm/slab.c:522!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/possible
CPU 0 
Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) ldiskfs(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) exportfs jbd sha512_generic sha256_generic sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]

Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) ldiskfs(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) exportfs jbd sha512_generic sha256_generic sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
Pid: 2997, comm: lc_watchdogd Not tainted 2.6.32-131.17.1.el6_lustreb_neo_stable_166_1 #1 KVM
RIP: 0010:[<ffffffff8115c2fa>]  [<ffffffff8115c2fa>] kfree+0x29a/0x320
RSP: 0018:ffff880054937e10  EFLAGS: 00010046
RAX: ffffea000071d180 RBX: ffffffffa0850af8 RCX: 0000000000000001
RDX: 0020000000000000 RSI: ffff880054937f00 RDI: ffffffffa0850af8
RBP: ffff880054937e70 R08: ffffffffa0850b50 R09: 0000000000000001
R10: 00000000ffffffff R11: 0000000000000000 R12: ffffffffa082b94e
R13: 0000000000000286 R14: 20c49ba5e353f7cf R15: ffff880054937ed0
FS:  00007fdc14c95700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff9ef6d21f CR3: 000000005cbed000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lc_watchdogd (pid: 2997, threadinfo ffff880054936000, task ffff88003782b500)
Stack:
 ffffffffa084c580 ffffffffa086d200 0000000000000000 ffff88003782b500
<0> 20c49ba5e353f7cf 0000000000000000 ffff880054937e70 ffffffffa0850af8
<0> ffffffffa0850b50 ffff88003782b500 20c49ba5e353f7cf ffff880054937ed0
Call Trace:
 [<ffffffffa082b94e>] cfs_free+0xe/0x10 [libcfs]
 [<ffffffffa0839826>] lcw_dispatch_main+0x5b6/0x960 [libcfs]
 [<ffffffff8108e1e0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0839270>] ? lcw_dispatch_main+0x0/0x960 [libcfs]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffa0839270>] ? lcw_dispatch_main+0x0/0x960 [libcfs]
 [<ffffffffa0839270>] ? lcw_dispatch_main+0x0/0x960 [libcfs]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Code: 4d c8 89 c2 83 c0 01 49 89 4c d4 18 41 89 04 24 c7 03 00 00 00 00 e9 91 fe ff ff 0f 0b eb fe 48 8b 40 10 48 8b 10 e9 46 fe ff ff <0f> 0b 0f 1f 40 00 eb fa 48 8b 40 10 48 8b 10 66 85 d2 0f 89 d3 
RIP  [<ffffffff8115c2fa>] kfree+0x29a/0x320
 RSP <ffff880054937e10>
---[ end trace 7bc8f64a86a5459d ]---
Kernel panic - not syncing: Fatal exception
Pid: 2997, comm: lc_watchdogd Tainted: G      D    ----------------   2.6.32-131.17.1.el6_lustreb_neo_stable_166_1 #1
Call Trace:
 [<ffffffff814db962>] ? panic+0x78/0x143
 [<ffffffff814df9a4>] ? oops_end+0xe4/0x100
 [<ffffffff8100f2fb>] ? die+0x5b/0x90
 [<ffffffff814df274>] ? do_trap+0xc4/0x160
 [<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0
 [<ffffffff8115c2fa>] ? kfree+0x29a/0x320
 [<ffffffff814dca3c>] ? wait_for_common+0x14c/0x180
 [<ffffffffa082b94e>] ? cfs_free+0xe/0x10 [libcfs]
 [<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20
 [<ffffffffa082b94e>] ? cfs_free+0xe/0x10 [libcfs]
 [<ffffffff8115c2fa>] ? kfree+0x29a/0x320
 [<ffffffffa082b94e>] ? cfs_free+0xe/0x10 [libcfs]
 [<ffffffffa0839826>] ? lcw_dispatch_main+0x5b6/0x960 [libcfs]
 [<ffffffff8108e1e0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0839270>] ? lcw_dispatch_main+0x0/0x960 [libcfs]
 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
 [<ffffffffa0839270>] ? lcw_dispatch_main+0x0/0x960 [libcfs]
 [<ffffffffa0839270>] ? lcw_dispatch_main+0x0/0x960 [libcfs]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20


 Comments   
Comment by Alexander Boyko [ 25/Sep/13 ]

lcw_dispatch_main() creates a zombies list and moves each lcw onto it,
but the loop that drains the zombies list then removes the entry from a different list.

http://review.whamcloud.com/7755
Xyratex-bug-id: MRP-1179
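
The list handling described above can be illustrated with a small user-space sketch. It is not the libcfs code: the list helpers, struct wd, drain_zombies() and demo_drain() are hypothetical names made up for this example, and free() stands in for the kernel allocator. It shows a drain loop that takes its entry from the same list its loop condition tests. Taking the entry from another list's head instead would, whenever that other list is empty, turn the head itself into a bogus object pointer and free it, which appears consistent with the kfree() BUG in the trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;
}

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for a watchdog object, not the real struct lc_watchdog. */
struct wd {
    int id;
    struct list_head link;
};

/*
 * Free every entry on `zombies` and return how many were freed.
 * The entry is taken from zombies->next, i.e. from the same list that the
 * loop condition checks, so the loop can never pick up a list head.
 */
static int drain_zombies(struct list_head *zombies)
{
    int freed = 0;

    while (!list_empty(zombies)) {
        struct wd *w = list_entry(zombies->next, struct wd, link);

        list_del(&w->link);
        free(w);
        freed++;
    }
    return freed;
}

/* Put n entries on a pending list, move them all to a local zombies list, drain it. */
static int demo_drain(int n)
{
    struct list_head pending, zombies;

    list_init(&pending);
    list_init(&zombies);
    for (int i = 0; i < n; i++) {
        struct wd *w = malloc(sizeof(*w));

        assert(w != NULL);
        w->id = i;
        list_add_tail(&w->link, &pending);
    }
    while (!list_empty(&pending)) {
        struct list_head *l = pending.next;

        list_del(l);
        list_add_tail(l, &zombies);
    }

    int freed = drain_zombies(&zombies);

    assert(list_empty(&pending) && list_empty(&zombies));
    return freed;
}
```

Calling demo_drain(3) frees three entries and leaves both lists empty.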

Comment by Oleg Drokin [ 25/Sep/13 ]

Can you please include version information for lustre and what the test was that triggered this? (for this and future bugs)

Comment by Alexander Boyko [ 25/Sep/13 ]

conf-sanity test_35b
Our version is based on Lustre 2.1 and includes other patches. During analysis I found that this problem has existed for a long time but was not hit earlier, probably because there are no lcw zombies 99.99% of the time.

Comment by Dmitry Eremin (Inactive) [ 24/Oct/13 ]

This was introduced by commit 5508471cb0cc6a7fde28472973e5c881ae25e820
Author: Liang Zhen <Zhen.Liang@sun.com>
Date: Sun Aug 15 20:46:52 2010 +0400

b23289 smp improvement for watchdog i=eric.mei i=maxim

Comment by Andreas Dilger [ 28/Oct/13 ]

That commit was landed for 2.0.50, so it is in all 2.1 - 2.5 releases. I've updated the Affects Version appropriately. Patch should be cherry-picked to b2_4 and b2_1 also.

Comment by Peter Jones [ 01/Nov/13 ]

Landed for 2.6
