  1. Lustre
  2. LU-7363

sanity test_116a timeout due to list_del corruption

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.8.0
    • None
    • autotest
    • 3

    Description

      sanity test_116a times out because a list_del corruption crashes the OST. Logs are at https://testing.hpdd.intel.com/test_sets/4438716a-7e97-11e5-991d-5254006e85c2

      From the OST console log, we see:

      15:04:48:WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Tainted: P           -- ------------   )
      15:04:49:Hardware name: KVM
      15:04:49:list_del corruption. prev->next should be ffff880028897040, but was (null)
      15:04:49:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      15:04:49:Pid: 11, comm: events/0 Tainted: P           -- ------------    2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1
      15:04:49:Call Trace:
      15:04:49: [<ffffffff81077461>] ? warn_slowpath_common+0x91/0xe0
      15:04:49: [<ffffffff81077566>] ? warn_slowpath_fmt+0x46/0x60
      15:04:49: [<ffffffff812a42be>] ? list_del+0x6e/0xa0
      15:04:49: [<ffffffff811796b8>] ? free_block+0xc8/0x170
      15:04:49: [<ffffffff81179991>] ? drain_array+0xc1/0x100
      15:04:49: [<ffffffff8117a8b0>] ? cache_reap+0xc0/0x250
      15:04:49: [<ffffffff8117a7f0>] ? cache_reap+0x0/0x250
      15:04:49: [<ffffffff8109a780>] ? worker_thread+0x170/0x2a0
      15:04:49: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      15:04:49: [<ffffffff8109a610>] ? worker_thread+0x0/0x2a0
      15:04:49: [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
      15:04:49: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      15:04:49: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      15:04:49: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      15:04:50:---[ end trace d1b3c3625885cac7 ]---
      15:04:50:BUG: unable to handle kernel NULL pointer dereference at (null)
      15:04:50:IP: [<ffffffff812a4260>] list_del+0x10/0xa0
      15:04:50:PGD 7c938067 PUD 7c936067 PMD 0 
      15:04:50:Oops: 0000 [#1] SMP 
      15:04:50:last sysfs file: /sys/devices/system/cpu/online
      15:04:50:CPU 1 
      15:04:50:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      15:04:50:
      15:04:50:Pid: 12, comm: events/1 Tainted: P        W  -- ------------    2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 Red Hat KVM
      15:04:50:RIP: 0010:[<ffffffff812a4260>]  [<ffffffff812a4260>] list_del+0x10/0xa0
      15:04:50:RSP: 0018:ffff88007e537d10  EFLAGS: 00010082
      15:04:50:RAX: 0000000000000000 RBX: ffff88003dfb4000 RCX: 000000000000100c
      15:04:50:RDX: ffff88007f8213c0 RSI: ffff88007e4f9818 RDI: ffff88003dfb4000
      15:04:50:RBP: ffff88007e537d20 R08: 0000000000000000 R09: 0000000000000000
      15:04:50:R10: 0000000000000061 R11: ffff88002bd17000 R12: 0000000000000018
      15:04:50:R13: ffff88007e4f9818 R14: 0000000000000000 R15: ffffea0000000000
      15:04:50:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      15:04:50:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      15:04:50:CR2: 0000000000000000 CR3: 000000007c5d4000 CR4: 00000000000006e0
      15:04:50:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      15:04:50:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      15:04:51:Process events/1 (pid: 12, threadinfo ffff88007e534000, task ffff88007e527520)
      15:04:51:Stack:
      15:04:51: 0000000000000005 ffff88007f890240 ffff88007e537d80 ffffffff811796b8
      15:04:51:<d> ffff88007f8213c0 ffff88003dfb4000 000000000000100c ffff88003dfb4080
      15:04:51:<d> ffff88007ef92268 ffff88007e4f9800 ffff88007f890240 0000000000000018
      15:04:51:Call Trace:
      15:04:51: [<ffffffff811796b8>] free_block+0xc8/0x170
      15:04:51: [<ffffffff81179991>] drain_array+0xc1/0x100
      15:04:51: [<ffffffff8117a87e>] cache_reap+0x8e/0x250
      15:04:51: [<ffffffff8117a7f0>] ? cache_reap+0x0/0x250
      15:04:51: [<ffffffff8109a780>] worker_thread+0x170/0x2a0
      15:04:51: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      15:04:51: [<ffffffff8109a610>] ? worker_thread+0x0/0x2a0
      15:04:51: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      15:04:51: [<ffffffff8100c28a>] child_rip+0xa/0x20
      15:04:51: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      15:04:51: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      15:04:52:Code: 89 3a 48 c7 c2 a0 40 2a 81 e8 3d 37 ff ff c9 c3 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 <4c> 8b 00 4c 39 c7 75 39 48 8b 03 4c 8b 40 08 4c 39 c3 75 4c 48 
      15:04:52:RIP  [<ffffffff812a4260>] list_del+0x10/0xa0
      15:04:52: RSP <ffff88007e537d10>
      15:04:52:CR2: 0000000000000000
      

      Several list_del corruption tickets are still open, including LU-6326, LU-7246, LU-4644, and LU-4526. In this case, the prev->next pointer is NULL and the call trace differs from those in the open JIRA tickets.
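
      To make the two failure signatures in the log easier to compare, here is a minimal userspace sketch of the kind of consistency check a debug list_del performs. It is an illustration only, not the kernel's lib/list_debug.c source: the names check_list_del, enum del_check, and self_check are hypothetical. The first trace (events/0) matches the "prev->next is not the entry" case. In the second trace (events/1), RAX is 0 at the faulting instruction, which appears consistent with entry->prev itself being NULL, so the check faults before it can print a warning.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical result codes for this sketch. */
enum del_check {
    DEL_OK = 0,
    DEL_PREV_NULL,      /* entry->prev is NULL: reading prev->next would fault */
    DEL_BAD_PREV_NEXT,  /* prev->next != entry: "prev->next should be ..." */
    DEL_NEXT_NULL,      /* entry->next is NULL: reading next->prev would fault */
    DEL_BAD_NEXT_PREV   /* next->prev != entry: "next->prev should be ..." */
};

/* Classify the state of an entry before it would be unlinked. */
enum del_check check_list_del(const struct list_head *entry)
{
    if (entry->prev == NULL)
        return DEL_PREV_NULL;
    if (entry->prev->next != entry)
        return DEL_BAD_PREV_NEXT;
    if (entry->next == NULL)
        return DEL_NEXT_NULL;
    if (entry->next->prev != entry)
        return DEL_BAD_NEXT_PREV;
    return DEL_OK;
}

/* Walk through the healthy case and the two cases seen in the log. */
void self_check(void)
{
    struct list_head head, a;

    /* Healthy two-node circular list: head <-> a. */
    head.next = &a;
    head.prev = &a;
    a.next = &head;
    a.prev = &head;
    assert(check_list_del(&a) == DEL_OK);

    /* Like the events/0 warning: prev->next was overwritten with NULL. */
    head.next = NULL;
    assert(check_list_del(&a) == DEL_BAD_PREV_NEXT);

    /* Like the events/1 oops: the entry's own prev pointer is NULL. */
    a.prev = NULL;
    assert(check_list_del(&a) == DEL_PREV_NULL);
}
```

      Calling self_check() from a small main() exercises all three cases. Both signatures fit a slab whose list pointers were zeroed, but the sketch does not identify what overwrote them.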

          People

            wc-triage WC Triage
            jamesanunez James Nunez (Inactive)