[LU-7163] replay-single test_70c: OSS memory corruption during recovery Created: 15/Sep/15  Updated: 28/Feb/20  Resolved: 28/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: zfs

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/974a9cae-5b77-11e5-bdf5-5254006e85c2.

The sub-test test_70c timed out with the following error in the OSS console log:

23:32:48:LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0001 from 12345-10.1.4.189@tcp inode [0x20000560a:0x3254:0x0] object 0x0:8236 extent [2097152-3143167]: client csum a73c8811, server csum 9be5c892
23:32:48:general protection fault: 0000 [#1] SMP 
23:32:48:last sysfs file: /sys/devices/system/cpu/online
23:32:48:CPU 0 
23:32:48:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
23:32:48:
23:32:48:Pid: 4821, comm: socknal_sd00_01 Tainted: P           -- ------------    2.6.32-573.3.1.el6_lustre.g43c6468.x86_64 #1 Red Hat KVM
23:32:48:RIP: 0010:[<ffffffff8113e229>]  [<ffffffff8113e229>] put_page+0x9/0x40
23:32:48:RSP: 0018:ffff88003710f900  EFLAGS: 00010206
23:32:48:RAX: 0000000000000030 RBX: 0000000000000001 RCX: ffff880068090000
23:32:48:RDX: ffff880068090640 RSI: ffff88006809060c RDI: 00f8100c00000003
23:32:48:RBP: ffff88003710f900 R08: 00f80ed400000003 R09: 00f80e1c00000190
23:32:48:R10: ffff880077cfe840 R11: ffff880077cfe8f0 R12: ffff88006d4950c0
23:32:48:R13: ffff88006d4950f8 R14: ffff880077cfec9c R15: 0000000000000000
23:32:48:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
23:32:48:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
23:32:48:CR2: 00007fce6bd77000 CR3: 000000007b976000 CR4: 00000000000006f0
23:32:48:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
23:32:48:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
23:32:48:Process socknal_sd00_01 (pid: 4821, threadinfo ffff88003710c000, task ffff880037a31520)
23:32:48:Stack:
23:32:48: ffff88003710f920 ffffffff8145e84f ffff88006d4950c0 0000000000000000
23:32:48:<d> ffff88003710f940 ffffffff8145e3de ffff880077cfec9c ffff88006d4950c0
23:32:48:<d> ffff88003710fa70 ffffffff814b7326 ffff88003710f970 ffff8800378c1080
23:32:48:Call Trace:
23:32:48: [<ffffffff8145e84f>] skb_release_data+0x7f/0x110
23:32:48: [<ffffffff8145e3de>] __kfree_skb+0x1e/0xa0
23:32:48: [<ffffffff814b7326>] tcp_recvmsg+0xfe6/0x10f0
23:32:48: [<ffffffff814d812a>] inet_recvmsg+0x5a/0x90
23:32:48: [<ffffffff814584d3>] sock_recvmsg+0x133/0x160
23:32:48: [<ffffffff81458544>] kernel_recvmsg+0x44/0x60
23:32:48: [<ffffffffa0d60965>] ksocknal_lib_recv_kiov+0x165/0x3d0 [ksocklnd]
23:32:48: [<ffffffffa0d5a07f>] ksocknal_process_receive+0x2af/0xed0 [ksocklnd]
23:32:48: [<ffffffffa0d5c62b>] ksocknal_scheduler+0x12b/0x1390 [ksocklnd]
23:32:48: [<ffffffff810a101e>] kthread+0x9e/0xc0
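
For context, the 168-f message above means the OSS recomputed a checksum over the received bulk write data and got a different value than the one the client computed before sending, which points at the data being modified in flight or in memory. A minimal userspace sketch of that comparison, using zlib crc32 as a stand-in for whichever algorithm the client negotiated (the struct and function names here are illustrative, not Lustre's actual code):

/* Illustrative sketch only, not Lustre's code: the server recomputes
 * a checksum over the received bulk pages and compares it with the
 * checksum the client computed before sending.  A mismatch such as
 * "client csum a73c8811, server csum 9be5c892" means the data changed
 * somewhere in between, e.g. via the memory corruption seen above. */
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>               /* crc32() */

struct bulk_page {              /* hypothetical stand-in for one bulk I/O page */
        const unsigned char *data;
        size_t               len;
};

static uint32_t bulk_checksum(const struct bulk_page *pages, int npages)
{
        uint32_t csum = crc32(0L, Z_NULL, 0);
        int i;

        for (i = 0; i < npages; i++)
                csum = crc32(csum, pages[i].data, pages[i].len);
        return csum;
}

static int verify_bulk_write(const struct bulk_page *pages, int npages,
                             uint32_t client_csum)
{
        uint32_t server_csum = bulk_checksum(pages, npages);

        if (server_csum != client_csum) {
                fprintf(stderr,
                        "BAD WRITE CHECKSUM: client csum %x, server csum %x\n",
                        client_csum, server_csum);
                return -1;      /* server rejects; client must resend */
        }
        return 0;
}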

Other failures show different types of memory corruption:
https://testing.hpdd.intel.com/test_sets/a2f995dc-59ab-11e5-aac5-5254006e85c2

17:39:05:WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Tainted: P           -- ------------   )
17:39:05:Hardware name: KVM
17:39:05:list_del corruption. prev->next should be ffff88006b844000, but was 00040010042802a8
17:39:05:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
17:39:05:Pid: 11, comm: events/0 Tainted: P           -- ------------    2.6.32-573.3.1.el6_lustre.gde57418.x86_64 #1
17:39:05:Call Trace:
17:39:05: [<ffffffff81077491>] ? warn_slowpath_common+0x91/0xe0
17:39:05: [<ffffffff81077596>] ? warn_slowpath_fmt+0x46/0x60
17:39:05: [<ffffffff812a40ae>] ? list_del+0x6e/0xa0
17:39:05: [<ffffffff811796f8>] ? free_block+0xc8/0x170
17:39:05: [<ffffffff811799d1>] ? drain_array+0xc1/0x100
17:39:05: [<ffffffff8117a8be>] ? cache_reap+0x8e/0x250
17:39:05: [<ffffffff8117a830>] ? cache_reap+0x0/0x250
17:39:05: [<ffffffff8109a7d0>] ? worker_thread+0x170/0x2a0
17:39:05: [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
17:39:05: [<ffffffff8109a660>] ? worker_thread+0x0/0x2a0
17:39:05: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
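
The list_del corruption warning is the kernel's CONFIG_DEBUG_LIST sanity check firing: before unlinking an entry, lib/list_debug.c verifies that the neighbouring nodes still point back at it. Approximately, from the 2.6.32-era source:

/* Simplified from lib/list_debug.c (2.6.32 era, CONFIG_DEBUG_LIST):
 * verify both neighbours' links before unlinking.  The message
 * "prev->next should be X, but was Y" means something scribbled over
 * the list node, exactly the symptom in the trace above. */
void list_del(struct list_head *entry)
{
        WARN(entry->prev->next != entry,
             "list_del corruption. prev->next should be %p, but was %p\n",
             entry, entry->prev->next);
        WARN(entry->next->prev != entry,
             "list_del corruption. next->prev should be %p, but was %p\n",
             entry, entry->next->prev);
        __list_del(entry->prev, entry->next);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
}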

Info required for matching: replay-single 70c



 Comments   
Comment by Andreas Dilger [ 15/Sep/15 ]

Another failure:
https://testing.hpdd.intel.com/test_sets/e20d0f78-5b28-11e5-af09-5254006e85c2

16:03:08:WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Tainted: P           -- ------------   )
16:03:08:Hardware name: KVM
16:03:08:list_add corruption. next->prev should be prev (ffff88007d4f8088), but was 0000000400010002. (next=ffffc90008d4d010).
16:03:08:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) microcode serio_raw virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
16:03:08:Pid: 1290, comm: ll_ost_io00_024 Tainted: P           -- ------------    2.6.32-573.3.1.el6_lustre.gde57418.x86_64 #1
16:03:08:Call Trace:
16:03:08: [<ffffffff81077491>] ? warn_slowpath_common+0x91/0xe0
16:03:08: [<ffffffff81077596>] ? warn_slowpath_fmt+0x46/0x60
16:03:08: [<ffffffff812a414d>] ? __list_add+0x6d/0xa0
16:03:08: [<ffffffffa01a00c8>] ? spl_kmem_cache_alloc+0x3a8/0x980 [spl]
16:03:08: [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
16:03:08: [<ffffffffa02a6a73>] ? zio_data_buf_alloc+0x23/0x30 [zfs]
16:03:08: [<ffffffffa01fedaf>] ? arc_get_data_buf+0x9f/0x4c0 [zfs]
16:03:08: [<ffffffffa01ffb40>] ? arc_buf_alloc+0x120/0x160 [zfs]
16:03:08: [<ffffffffa01ffb9f>] ? arc_loan_buf+0x1f/0x30 [zfs]
16:03:08: [<ffffffffa020d509>] ? dmu_request_arcbuf+0x19/0x20 [zfs]
16:03:08: [<ffffffffa0f95198>] ? osd_bufs_get+0x738/0xb50 [osd_zfs]
16:03:08: [<ffffffffa10e7d89>] ? ofd_preprw+0x519/0x1550 [ofd]
16:03:08: [<ffffffffa0ac18d4>] ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
16:03:08: [<ffffffffa0af871f>] ? obd_preprw+0x10f/0x380 [ptlrpc]
16:03:08: [<ffffffffa0b01d04>] ? tgt_brw_write+0xaf4/0x1540 [ptlrpc]
16:03:08: [<ffffffff8153a07e>] ? mutex_lock+0x1e/0x50
16:03:08: [<ffffffffa0b009cc>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
16:03:08: [<ffffffffa0aa85a1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
16:03:08: [<ffffffffa0aa7760>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
16:03:08: [<ffffffff810a101e>] ? kthread+0x9e/0xc0
16:03:08: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
16:03:08: [<ffffffff810a0f80>] ? kthread+0x0/0xc0
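
This trace trips the matching insertion-time check in lib/list_debug.c, here while spl_kmem_cache_alloc() manipulates an SPL kmem cache list during a ZFS ARC buffer allocation, i.e. the list node itself had been overwritten. Approximately:

/* The insertion-time counterpart (also lib/list_debug.c): before
 * linking a new entry between prev and next, verify that prev and
 * next still point at each other.  "next->prev should be prev (...),
 * but was ..." is the first WARN below firing. */
void __list_add(struct list_head *new,
                struct list_head *prev,
                struct list_head *next)
{
        WARN(next->prev != prev,
             "list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
             prev, next->prev, next);
        WARN(prev->next != next,
             "list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
             next, prev->next, prev);
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}
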
Comment by James Nunez (Inactive) [ 19/Oct/15 ]

More failures:
2015-10-16 05:16:54 - https://testing.hpdd.intel.com/test_sets/f9101686-7403-11e5-ada9-5254006e85c2
2015-10-17 23:40:40 - https://testing.hpdd.intel.com/test_sets/c0225b5c-755e-11e5-b12f-5254006e85c2
2015-10-28 13:02:33 - https://testing.hpdd.intel.com/test_sets/d9ad2d1a-7dae-11e5-bca9-5254006e85c2
2015-11-04 19:49:18 - https://testing.hpdd.intel.com/test_sets/8215f85c-8367-11e5-b9d3-5254006e85c2
2015-11-07 18:42:15 - https://testing.hpdd.intel.com/test_sets/8db0e7c4-85e4-11e5-9c46-5254006e85c2

Comment by Nathaniel Clark [ 21/Oct/15 ]

This appears to be showing up only on review-zfs-part-2 (it predates the landing of ZFS 0.6.5.2).

Comment by nasf (Inactive) [ 04/Nov/15 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/fd9e656a-828e-11e5-a6c5-5254006e85c2

Comment by Andreas Dilger [ 28/Feb/20 ]

Closing this old bug, which hasn't been seen in a long time.
