Lustre / LU-7163

replay-single test_70c: OSS memory corruption during recovery

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/974a9cae-5b77-11e5-bdf5-5254006e85c2.

      The sub-test test_70c timed out with the following error in the OSS console log:

      23:32:48:LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0001 from 12345-10.1.4.189@tcp inode [0x20000560a:0x3254:0x0] object 0x0:8236 extent [2097152-3143167]: client csum a73c8811, server csum 9be5c892
      23:32:48:general protection fault: 0000 [#1] SMP 
      23:32:48:last sysfs file: /sys/devices/system/cpu/online
      23:32:48:CPU 0 
      23:32:48:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      23:32:48:
      23:32:48:Pid: 4821, comm: socknal_sd00_01 Tainted: P           -- ------------    2.6.32-573.3.1.el6_lustre.g43c6468.x86_64 #1 Red Hat KVM
      23:32:48:RIP: 0010:[<ffffffff8113e229>]  [<ffffffff8113e229>] put_page+0x9/0x40
      23:32:48:RSP: 0018:ffff88003710f900  EFLAGS: 00010206
      23:32:48:RAX: 0000000000000030 RBX: 0000000000000001 RCX: ffff880068090000
      23:32:48:RDX: ffff880068090640 RSI: ffff88006809060c RDI: 00f8100c00000003
      23:32:48:RBP: ffff88003710f900 R08: 00f80ed400000003 R09: 00f80e1c00000190
      23:32:48:R10: ffff880077cfe840 R11: ffff880077cfe8f0 R12: ffff88006d4950c0
      23:32:48:R13: ffff88006d4950f8 R14: ffff880077cfec9c R15: 0000000000000000
      23:32:48:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      23:32:48:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      23:32:48:CR2: 00007fce6bd77000 CR3: 000000007b976000 CR4: 00000000000006f0
      23:32:48:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      23:32:48:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      23:32:48:Process socknal_sd00_01 (pid: 4821, threadinfo ffff88003710c000, task ffff880037a31520)
      23:32:48:Stack:
      23:32:48: ffff88003710f920 ffffffff8145e84f ffff88006d4950c0 0000000000000000
      23:32:48:<d> ffff88003710f940 ffffffff8145e3de ffff880077cfec9c ffff88006d4950c0
      23:32:48:<d> ffff88003710fa70 ffffffff814b7326 ffff88003710f970 ffff8800378c1080
      23:32:48:Call Trace:
      23:32:48: [<ffffffff8145e84f>] skb_release_data+0x7f/0x110
      23:32:48: [<ffffffff8145e3de>] __kfree_skb+0x1e/0xa0
      23:32:48: [<ffffffff814b7326>] tcp_recvmsg+0xfe6/0x10f0
      23:32:48: [<ffffffff814d812a>] inet_recvmsg+0x5a/0x90
      23:32:48: [<ffffffff814584d3>] sock_recvmsg+0x133/0x160
      23:32:48: [<ffffffff81458544>] kernel_recvmsg+0x44/0x60
      23:32:48: [<ffffffffa0d60965>] ksocknal_lib_recv_kiov+0x165/0x3d0 [ksocklnd]
      23:32:48: [<ffffffffa0d5a07f>] ksocknal_process_receive+0x2af/0xed0 [ksocklnd]
      23:32:48: [<ffffffffa0d5c62b>] ksocknal_scheduler+0x12b/0x1390 [ksocklnd]
      23:32:48: [<ffffffff810a101e>] kthread+0x9e/0xc0
      

      Other failures show different kinds of memory corruption, for example:
      https://testing.hpdd.intel.com/test_sets/a2f995dc-59ab-11e5-aac5-5254006e85c2

      17:39:05:WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Tainted: P           -- ------------   )
      17:39:05:Hardware name: KVM
      17:39:05:list_del corruption. prev->next should be ffff88006b844000, but was 00040010042802a8
      17:39:05:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      17:39:05:Pid: 11, comm: events/0 Tainted: P           -- ------------    2.6.32-573.3.1.el6_lustre.gde57418.x86_64 #1
      17:39:05:Call Trace:
      17:39:05: [<ffffffff81077491>] ? warn_slowpath_common+0x91/0xe0
      17:39:05: [<ffffffff81077596>] ? warn_slowpath_fmt+0x46/0x60
      17:39:05: [<ffffffff812a40ae>] ? list_del+0x6e/0xa0
      17:39:05: [<ffffffff811796f8>] ? free_block+0xc8/0x170
      17:39:05: [<ffffffff811799d1>] ? drain_array+0xc1/0x100
      17:39:05: [<ffffffff8117a8be>] ? cache_reap+0x8e/0x250
      17:39:05: [<ffffffff8117a830>] ? cache_reap+0x0/0x250
      17:39:05: [<ffffffff8109a7d0>] ? worker_thread+0x170/0x2a0
      17:39:05: [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
      17:39:05: [<ffffffff8109a660>] ? worker_thread+0x0/0x2a0
      17:39:05: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
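
      When comparing traces like the two above across test runs, it can help to reduce each console log to its list of symbols. The following is a minimal sketch, not part of the Maloo tooling: the names FRAME_RE and parse_frames are hypothetical, and the pattern only assumes the "[<address>] symbol+0xoffset/0xsize [module]" frame layout visible in these logs. Any line carrying such a token matches, so the RIP line is picked up as well.

```python
import re

# One stack frame as printed in the console logs above, e.g.
#   " [<ffffffffa0d60965>] ksocknal_lib_recv_kiov+0x165/0x3d0 [ksocklnd]"
#   " [<ffffffff812a40ae>] ? list_del+0x6e/0xa0"
# A leading "?" marks a frame the kernel unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(console_log):
    """Return one dict per stack frame found in the console log text."""
    frames = []
    for line in console_log.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # timestamps, register dumps, module lists, etc.
        frames.append({
            "addr": int(match.group("addr"), 16),
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
            "module": match.group("module"),  # None for core kernel symbols
            "unreliable": match.group("unreliable") is not None,
        })
    return frames


# Three frames copied from the logs in this ticket.
SAMPLE = """\
23:32:48: [<ffffffff8145e84f>] skb_release_data+0x7f/0x110
23:32:48: [<ffffffffa0d60965>] ksocknal_lib_recv_kiov+0x165/0x3d0 [ksocklnd]
17:39:05: [<ffffffff812a40ae>] ? list_del+0x6e/0xa0
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["symbol"], frame["module"], frame["unreliable"])
```

      Filtering the result on the module field separates Lustre/LNet frames (here ksocklnd) from core kernel frames, which makes it easier to see that the two failures crash in unrelated code paths.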
      

      Info required for matching: replay-single 70c

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Maloo (maloo)