[LU-1844] sanityn, subtest test_16: list_del corruption when run ofd + ldiskfs Created: 06/Sep/12  Updated: 26/Dec/13  Resolved: 26/Dec/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: USE_OFD

Issue Links:
Related
is related to LU-1823 sanity/103: slab corruption Resolved
is related to LU-1883 osd-ldiskfs fills file offsets into l... Resolved
is related to LU-1847 sanityn test 16 fail when run ofd+ldi... Closed
Severity: 3
Rank (Obsolete): 10232

 Description   

This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2e5162be-f3ce-11e1-9654-52540035b04c.

21:21:07:Lustre: DEBUG MARKER: == sanityn test 16: 2500 iterations of dual-mount fsx ================================================ 21:20:54 (1346473254)
21:21:08:-----------[ cut here ]-----------
21:21:08:WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Not tainted)
21:21:08:Hardware name: KVM
21:21:08:list_del corruption. next->prev should be ffff8800700f2040, but was (null)
21:21:08:Modules linked in: nfs fscache osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) ofd(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) jbd2 obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
21:21:08:Pid: 5941, comm: jbd2/dm-1-8 Not tainted 2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1
21:21:08:Call Trace:
21:21:09: [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
21:21:09: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
21:21:09: [<ffffffff812833bd>] ? list_del+0x8d/0xa0
21:21:09: [<ffffffff81162695>] ? cache_alloc_refill+0x145/0x240
21:21:09: [<ffffffff8116364f>] ? kmem_cache_alloc+0x15f/0x190
21:21:09: [<ffffffff81116985>] ? mempool_alloc_slab+0x15/0x20
21:21:09: [<ffffffff81116a93>] ? mempool_alloc+0x63/0x140
21:21:09: [<ffffffffa084361e>] ? jbd2_journal_file_buffer+0x4e/0x90 [jbd2]
21:21:09: [<ffffffff811b259e>] ? bio_alloc_bioset+0x3e/0xf0
21:21:09: [<ffffffff811b26f5>] ? bio_alloc+0x15/0x30
21:21:09: [<ffffffff811acde1>] ? submit_bh+0x81/0x150
21:21:09: [<ffffffffa0844d18>] ? jbd2_journal_commit_transaction+0x578/0x1530 [jbd2]
21:21:09: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
21:21:09: [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
21:21:09: [<ffffffffa084b128>] ? kjournald2+0xb8/0x220 [jbd2]
21:21:09: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
21:21:09: [<ffffffffa084b070>] ? kjournald2+0x0/0x220 [jbd2]
21:21:09: [<ffffffff81091d66>] ? kthread+0x96/0xa0
21:21:09: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
21:21:09: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
21:21:10: [<ffffffff8100c140>] ? child_rip+0x0/0x20
21:21:10:--[ end trace 306e4f94d0887a84 ]--
21:21:10:BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
21:21:10:IP: [<ffffffff8128334b>] list_del+0x1b/0xa0
21:21:10:PGD 7d32a067 PUD 748d1067 PMD 0
21:21:10:Oops: 0000 [#1] SMP
21:21:10:last sysfs file: /sys/module/lockd/initstate
21:21:10:CPU 0
21:21:10:Modules linked in: nfs fscache osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) ofd(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) jbd2 obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
21:21:10:
21:21:11:Pid: 5941, comm: jbd2/dm-1-8 Tainted: G W --------------- 2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Red Hat KVM
21:21:11:RIP: 0010:[<ffffffff8128334b>] [<ffffffff8128334b>] list_del+0x1b/0xa0
21:21:11:RSP: 0018:ffff8800783e5b30 EFLAGS: 00010046
21:21:11:RAX: 0000000000000000 RBX: ffff880058bcc040 RCX: 0000000000000014
21:21:11:RDX: 0000000000000000 RSI: ffff880058bcc070 RDI: ffff880058bcc040
21:21:11:RBP: ffff8800783e5b40 R08: ffff880058bcc040 R09: 0000000000000000
21:21:11:R10: 000000000000000f R11: 0000000000000003 R12: ffff88007dd47800
21:21:11:R13: ffff88007dd46c40 R14: 0000000000000016 R15: ffff880058bcc040
21:21:11:FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
21:21:11:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
21:21:11:CR2: 0000000000000008 CR3: 000000007d323000 CR4: 00000000000006f0
21:21:11:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
21:21:11:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
21:21:11:Process jbd2/dm-1-8 (pid: 5941, threadinfo ffff8800783e4000, task ffff880070a50080)
21:21:11:Stack:
21:21:11: ffff88007dd47800 ffff88007de016c0 ffff8800783e5bb0 ffffffff81162695
21:21:12:<d> ffff8800783e5bd0 0000000081258fa0 ffff88007dd46c80 0005120001008161
21:21:12:<d> ffff88007dd46c60 ffff88007dd46c50 ffff88005e9fba40 0000000000000000
21:21:12:Call Trace:
21:21:12: [<ffffffff81162695>] cache_alloc_refill+0x145/0x240
21:21:12: [<ffffffff8116364f>] kmem_cache_alloc+0x15f/0x190
21:21:12: [<ffffffff81116985>] mempool_alloc_slab+0x15/0x20
21:21:12: [<ffffffff81116a93>] mempool_alloc+0x63/0x140
21:21:12: [<ffffffffa084361e>] ? jbd2_journal_file_buffer+0x4e/0x90 [jbd2]
21:21:12: [<ffffffff811b259e>] bio_alloc_bioset+0x3e/0xf0
21:21:12: [<ffffffff811b26f5>] bio_alloc+0x15/0x30
21:21:12: [<ffffffff811acde1>] submit_bh+0x81/0x150
21:21:12: [<ffffffffa0844d18>] jbd2_journal_commit_transaction+0x578/0x1530 [jbd2]
21:21:12: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
21:21:12: [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
21:21:12: [<ffffffffa084b128>] kjournald2+0xb8/0x220 [jbd2]
21:21:12: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
21:21:12: [<ffffffffa084b070>] ? kjournald2+0x0/0x220 [jbd2]
21:21:12: [<ffffffff81091d66>] kthread+0x96/0xa0
21:21:12: [<ffffffff8100c14a>] child_rip+0xa/0x20
21:21:14: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
21:21:14: [<ffffffff8100c140>] ? child_rip+0x0/0x20
21:21:14:Code: 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 4c 8b 00 4c 39 c7 75 39 48 8b 03 <4c> 8b 40 08 4c 39 c3 75 4c 48 8b 53 08 48 89 50 08 48 89 02 48
21:21:14:RIP [<ffffffff8128334b>] list_del+0x1b/0xa0
21:21:14: RSP <ffff8800783e5b30>
21:21:14:CR2: 0000000000000008
21:21:14:Initializing cgroup subsys cpuset
21:21:14:Initializing cgroup subsys cpu
21:21:14:Linux version 2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 (jenkins@builder-1-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Tue Aug 21 01:32:12 PDT 2012
21:21:14:Command line: ro root=UUID=c2b3ff8f-353b-4a4d-9205-2b60b4c5168e rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off memmap=exactmap memmap=613K@4K memmap=131463K@49765K elfcorehdr=181228K memmap=4K$0K memmap=64K$960K memmap=12K$2097140K memmap=272K$4194032K
21:21:15:KERNEL supported cpus:



 Comments   
Comment by Minh Diep [ 06/Sep/12 ]

This might affect 2.3 as well.

Comment by Minh Diep [ 06/Sep/12 ]

Hit a similar failure on b2.3:

15:29:56:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-25vm5,client-25vm6.lab.whamcloud.com DURATION=900 == 15:29:51 (1346884191)
15:29:56:-----------[ cut here ]-----------
15:29:56:WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Not tainted)
15:29:56:Hardware name: KVM
15:29:56:list_add corruption. prev->next should be next (ffff88003740c020), but was (null). (prev=ffff8800739759c0).
15:29:56:Modules linked in: osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) ofd(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache jbd2 sha512_generic sha256_generic nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
15:29:56:Pid: 16, comm: kblockd/0 Not tainted 2.6.32-279.5.1.el6_lustre.g293c36b.x86_64 #1
15:29:56:Call Trace:
15:29:56: [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
15:29:56: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
15:29:56: [<ffffffff8128345f>] ? __list_add+0x8f/0xa0
15:29:56: [<ffffffffa006a9ee>] ? do_virtblk_request+0x20e/0x418 [virtio_blk]
15:29:56: [<ffffffff81250520>] ? blk_unplug_work+0x0/0x70
15:29:56: [<ffffffff81255952>] ? __generic_unplug_device+0x32/0x40
15:29:56: [<ffffffff8125598e>] ? generic_unplug_device+0x2e/0x50
15:29:56: [<ffffffff81250556>] ? blk_unplug_work+0x36/0x70
15:29:56: [<ffffffff81250520>] ? blk_unplug_work+0x0/0x70
15:29:56: [<ffffffff8108c760>] ? worker_thread+0x170/0x2a0
15:29:56: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
15:29:56: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
15:29:56: [<ffffffff81091d66>] ? kthread+0x96/0xa0
15:29:56: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
15:29:56: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
15:29:56: [<ffffffff8100c140>] ? child_rip+0x0/0x20
15:29:56:--[ end trace f5bdc39561532b09 ]--

Comment by Andreas Dilger [ 06/Sep/12 ]

List corruption is generally one of two things:

  • list_add() of the same item to two different lists
  • random memory corruption somewhere else (often seen in long lists because they touch a lot of memory)

I suspect that this may be related to LU-1823, which is another apparent case of random memory corruption.

It might make sense to add an extra check, based on Yu Jian's change http://review.whamcloud.com/3876, to watch for these list_add()/list_del() corruption messages as well.

Comment by Andreas Dilger [ 06/Sep/12 ]

Minh, please always include the Maloo URL for any logs, since this helps analysis later on.

Could you please comment on whether this problem is isolated to a single node type (MDS, OSS, client), or whether it happens on different nodes? It appears that both of these are on the server, but I can't tell whether it is an MDS or OSS.

Comment by Li Wei (Inactive) [ 11/Sep/12 ]

Could this be related to LU-1883?

Comment by Andreas Dilger [ 12/Sep/12 ]

It may also be LU-1823. Let's hope we don't have more than two different memory corruptions at the same time.

Comment by Andreas Dilger [ 26/Dec/13 ]

OFD has been running for a long time now without this recurring; closing as Cannot Reproduce.

Generated at Sat Feb 10 01:20:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.