Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.4.0
    • None
    • CONFIG_DEBUG_SLAB=y
    • 3
    • 4237

    Description

      Lustre: DEBUG MARKER: == sanity test 103: acl test ========================================================================= 19:57:07 (1346774227)
      /work/lustre/head/clean/lustre/utils/l_getidentity
      Slab corruption (Tainted: P --------------- ): size-2048 start=dac6c470, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      310: 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 05 00
      320: 01 00 00 00 02 00 07 00 02 00 00 00 04 00 07 00
      330: ff ff ff ff 10 00 07 00 ff ff ff ff 20 00 05 00
      340: ff ff ff ff 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      Next obj: start=dac6cc88, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

      02000000:00000010:1.0:1346774231.327841:1804:3373:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 2048 at dac6c470.
      ...

      02000000:00000010:1.0:1346774231.328361:836:3373:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 2048 at dac6c470.

      Attachments

        Issue Links

          Activity

            [LU-1823] sanity/103: slab corruption
            sarah Sarah Liu added a comment -

            review-dne failed as:

            12:47:05:[ 3488.673687] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1386362637/real 1386362637]  req@ffff880036623bb8 x1453703367244336/t0(0) o38->lustre-MDT0002-osp-MDT0000@10.10.16.232@tcp:24/4 lens 400/544 e 0 to 1 dl 1386362662 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            12:47:06:[ 3488.676610] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 305 previous similar messages
            12:47:06:[ 3673.675126] BUG: unable to handle kernel paging request at ffff880022a6ae30
            12:47:06:[ 3673.675968] IP: [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd]
            12:47:06:[ 3673.676583] PGD 1a26063 PUD 1a2a063 PMD 215067 PTE 22a6a060
            12:47:06:[ 3673.677103] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
            12:47:06:[ 3673.677103] last sysfs file: /sys/devices/system/cpu/possible
            12:47:07:[ 3673.677103] CPU 1 
            12:47:07:[ 3673.677103] Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
            12:47:07:[ 3673.677103] 
            12:47:07:[ 3673.677103] Pid: 3069, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6_lustre.g3ddc521.x86_64 #1 Red Hat KVM
            12:47:07:[ 3673.677103] RIP: 0010:[<ffffffffa09ef395>]  [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd]
            12:47:07:[ 3673.677103] RSP: 0018:ffff88005f7bf980  EFLAGS: 00010286
            12:47:07:[ 3673.677103] RAX: 0000000000000000 RBX: ffff88000eb08ef0 RCX: 0000000000000009
            12:47:07:[ 3673.677103] RDX: ffff880002300000 RSI: ffff880002311b20 RDI: ffffffffa09feb38
            12:47:07:[ 3673.677103] RBP: ffff88005f7bf9f0 R08: 0000000000000000 R09: 0000000000000000
            12:47:07:[ 3673.677103] R10: ffff8800497d37e0 R11: 0000000000000000 R12: ffff880022a6adf0
            12:47:07:[ 3673.677103] R13: 0000000000000001 R14: ffff880064cd64e0 R15: 00000000000000e0
            12:47:07:[ 3673.677103] FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
            12:47:07:[ 3673.677103] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            12:47:07:[ 3673.677103] CR2: ffff880022a6ae30 CR3: 0000000001a25000 CR4: 00000000000006e0
            12:47:07:[ 3673.677103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            12:47:07:[ 3673.677103] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            12:47:07:[ 3673.677103] Process ptlrpcd_1 (pid: 3069, threadinfo ffff88005f7be000, task ffff88005c46c340)
            12:47:07:[ 3673.677103] Stack:
            12:47:07:[ 3673.677103]  ffff880015f79c01 0000000122a6adf0 0000000000000000 ffff880052054df0
            12:47:07:[ 3673.677103] <d> 0000000000000000 000000e800000000 000200000a0a10ed 0000000000003039
            12:47:07:[ 3673.677103] <d> ffff88005f7bfa10 ffff880052054df0 ffff880022a6adf0 ffff880022a6adf0
            12:47:07:[ 3673.677103] Call Trace:
            12:47:08:[ 3673.677103]  [<ffffffffa0532dfb>] lnet_ni_send+0x4b/0xf0 [lnet]
            12:47:08:[ 3673.677103]  [<ffffffffa0537005>] lnet_send+0x655/0xb80 [lnet]
            12:47:08:[ 3673.677103]  [<ffffffffa053806a>] LNetPut+0x31a/0x860 [lnet]
            12:47:08:[ 3673.677103]  [<ffffffffa0808dc0>] ptl_send_buf+0x1e0/0x550 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffff8103b82c>] ? kvm_clock_read+0x1c/0x20
            12:47:08:[ 3673.677103]  [<ffffffffa080c2bd>] ptl_send_rpc+0x4dd/0xcc0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa0800994>] ptlrpc_send_new_req+0x454/0x7c0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa0804728>] ptlrpc_check_set+0x898/0x1da0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa083061b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa0830bdd>] ptlrpcd+0x2ad/0x3f0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffff8105bca0>] ? default_wake_function+0x0/0x20
            12:47:08:[ 3673.677103]  [<ffffffffa0830930>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffff81095696>] kthread+0x96/0xa0
            12:47:08:[ 3673.677103]  [<ffffffff8100c10a>] child_rip+0xa/0x20
            12:47:08:[ 3673.677103]  [<ffffffff81095600>] ? kthread+0x0/0xa0
            12:47:09:[ 3673.677103]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
            12:47:09:[ 3673.677103] Code: 00 c7 43 60 c1 00 00 00 48 89 de 48 c7 43 70 00 00 00 00 48 c7 43 68 00 00 00 00 48 8b 55 c0 8b 4d c8 48 8b 7d a8 e8 7b fa ff ff <41> f6 44 24 40 08 74 1b 8b 4d b4 85 c9 0f 84 c8 00 00 00 65 48 
            12:47:09:[ 3673.677103] RIP  [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd]
            12:47:09:[ 3673.677103]  RSP <ffff88005f7bf980>
            12:47:09:[ 3673.677103] CR2: ffff880022a6ae30
            12:47:09:[    0.000000] Initializing cgroup subsys cpuset
            12:47:09:[    0.000000] Initializing cgroup subsys cpu
            
            sarah Sarah Liu added a comment - review-dne failed as: 12:47:05:[ 3488.673687] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1386362637/real 1386362637] req@ffff880036623bb8 x1453703367244336/t0(0) o38->lustre-MDT0002-osp-MDT0000@10.10.16.232@tcp:24/4 lens 400/544 e 0 to 1 dl 1386362662 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 12:47:06:[ 3488.676610] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 305 previous similar messages 12:47:06:[ 3673.675126] BUG: unable to handle kernel paging request at ffff880022a6ae30 12:47:06:[ 3673.675968] IP: [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd] 12:47:06:[ 3673.676583] PGD 1a26063 PUD 1a2a063 PMD 215067 PTE 22a6a060 12:47:06:[ 3673.677103] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC 12:47:06:[ 3673.677103] last sysfs file: /sys/devices/system/cpu/possible 12:47:07:[ 3673.677103] CPU 1 12:47:07:[ 3673.677103] Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] 12:47:07:[ 3673.677103] 12:47:07:[ 3673.677103] Pid: 3069, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6_lustre.g3ddc521.x86_64 #1 Red Hat KVM 12:47:07:[ 3673.677103] RIP: 0010:[<ffffffffa09ef395>] [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd] 12:47:07:[ 3673.677103] RSP: 0018:ffff88005f7bf980 EFLAGS: 00010286 12:47:07:[ 3673.677103] RAX: 0000000000000000 RBX: ffff88000eb08ef0 RCX: 0000000000000009 12:47:07:[ 3673.677103] RDX: ffff880002300000 RSI: ffff880002311b20 RDI: ffffffffa09feb38 12:47:07:[ 3673.677103] RBP: ffff88005f7bf9f0 R08: 0000000000000000 R09: 0000000000000000 12:47:07:[ 3673.677103] R10: ffff8800497d37e0 R11: 0000000000000000 R12: ffff880022a6adf0 12:47:07:[ 3673.677103] R13: 0000000000000001 R14: ffff880064cd64e0 R15: 00000000000000e0 12:47:07:[ 3673.677103] FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000 12:47:07:[ 3673.677103] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b 12:47:07:[ 3673.677103] CR2: ffff880022a6ae30 CR3: 0000000001a25000 CR4: 00000000000006e0 12:47:07:[ 3673.677103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 12:47:07:[ 3673.677103] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 12:47:07:[ 3673.677103] Process ptlrpcd_1 (pid: 3069, threadinfo ffff88005f7be000, task ffff88005c46c340) 12:47:07:[ 3673.677103] Stack: 12:47:07:[ 3673.677103] ffff880015f79c01 0000000122a6adf0 0000000000000000 ffff880052054df0 12:47:07:[ 3673.677103] <d> 0000000000000000 000000e800000000 000200000a0a10ed 0000000000003039 12:47:07:[ 3673.677103] <d> ffff88005f7bfa10 ffff880052054df0 ffff880022a6adf0 ffff880022a6adf0 12:47:07:[ 3673.677103] Call Trace: 12:47:08:[ 3673.677103] [<ffffffffa0532dfb>] lnet_ni_send+0x4b/0xf0 [lnet] 12:47:08:[ 3673.677103] [<ffffffffa0537005>] lnet_send+0x655/0xb80 [lnet] 12:47:08:[ 3673.677103] [<ffffffffa053806a>] LNetPut+0x31a/0x860 [lnet] 12:47:08:[ 3673.677103] [<ffffffffa0808dc0>] ptl_send_buf+0x1e0/0x550 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffff8103b82c>] ? kvm_clock_read+0x1c/0x20 12:47:08:[ 3673.677103] [<ffffffffa080c2bd>] ptl_send_rpc+0x4dd/0xcc0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa0800994>] ptlrpc_send_new_req+0x454/0x7c0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa0804728>] ptlrpc_check_set+0x898/0x1da0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa083061b>] ptlrpcd_check+0x55b/0x590 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa0830bdd>] ptlrpcd+0x2ad/0x3f0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffff8105bca0>] ? default_wake_function+0x0/0x20 12:47:08:[ 3673.677103] [<ffffffffa0830930>] ? ptlrpcd+0x0/0x3f0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffff81095696>] kthread+0x96/0xa0 12:47:08:[ 3673.677103] [<ffffffff8100c10a>] child_rip+0xa/0x20 12:47:08:[ 3673.677103] [<ffffffff81095600>] ? kthread+0x0/0xa0 12:47:09:[ 3673.677103] [<ffffffff8100c100>] ? child_rip+0x0/0x20 12:47:09:[ 3673.677103] Code: 00 c7 43 60 c1 00 00 00 48 89 de 48 c7 43 70 00 00 00 00 48 c7 43 68 00 00 00 00 48 8b 55 c0 8b 4d c8 48 8b 7d a8 e8 7b fa ff ff <41> f6 44 24 40 08 74 1b 8b 4d b4 85 c9 0f 84 c8 00 00 00 65 48 12:47:09:[ 3673.677103] RIP [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd] 12:47:09:[ 3673.677103] RSP <ffff88005f7bf980> 12:47:09:[ 3673.677103] CR2: ffff880022a6ae30 12:47:09:[ 0.000000] Initializing cgroup subsys cpuset 12:47:09:[ 0.000000] Initializing cgroup subsys cpu
            pjones Peter Jones added a comment -

            Landed for 2.3 and 2.4

            pjones Peter Jones added a comment - Landed for 2.3 and 2.4

            I have almost 24 hours of testing and still no recreation.

            Oleg I will leave the cherry-pick to you.

            keith Keith Mannthey (Inactive) added a comment - I have almost 24 hours of testing and still no recreation. Oleg I will leave the cherry-pick to you.
            yujian Jian Yu added a comment -

            I submitted a patch which contained Oleg's patch and enabled the kernel debug options to verify whether the issue is resolved or not on master branch: http://review.whamcloud.com/3936.

            sanity test 103 passed without detecting slab corruption issue.

            yujian Jian Yu added a comment - I submitted a patch which contained Oleg's patch and enabled the kernel debug options to verify whether the issue is resolved or not on master branch: http://review.whamcloud.com/3936 . sanity test 103 passed without detecting slab corruption issue.
            green Oleg Drokin added a comment -

            Don't create separate patches for other branches, please/ I'll cherry-pick this patch directly since that part of the code have not changed since landing.

            green Oleg Drokin added a comment - Don't create separate patches for other branches, please/ I'll cherry-pick this patch directly since that part of the code have not changed since landing.

            I been running http://review.whamcloud.com/3928 with Master for a few hours now and I do not seem to be able to reproduce the errors. It looks to be the fix to me.

            If Yu's latest build agrees I will cherry pick it to b2_3 and b2_2.

            Thanks for finding this Oleg.

            keith Keith Mannthey (Inactive) added a comment - I been running http://review.whamcloud.com/3928 with Master for a few hours now and I do not seem to be able to reproduce the errors. It looks to be the fix to me. If Yu's latest build agrees I will cherry pick it to b2_3 and b2_2. Thanks for finding this Oleg.

            People

              green Oleg Drokin
              bzzz Alex Zhuravlev
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: