Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.4.0
    • None
    • CONFIG_DEBUG_SLAB=y
    • 3
    • 4237

    Description

      Lustre: DEBUG MARKER: == sanity test 103: acl test ========================================================================= 19:57:07 (1346774227)
      /work/lustre/head/clean/lustre/utils/l_getidentity
      Slab corruption (Tainted: P --------------- ): size-2048 start=dac6c470, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      310: 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 05 00
      320: 01 00 00 00 02 00 07 00 02 00 00 00 04 00 07 00
      330: ff ff ff ff 10 00 07 00 ff ff ff ff 20 00 05 00
      340: ff ff ff ff 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      Next obj: start=dac6cc88, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

      02000000:00000010:1.0:1346774231.327841:1804:3373:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 2048 at dac6c470.
      ...

      02000000:00000010:1.0:1346774231.328361:836:3373:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 2048 at dac6c470.

      Attachments

        Issue Links

          Activity

            [LU-1823] sanity/103: slab corruption
            sarah Sarah Liu added a comment -

            review-dne failed as:

            12:47:05:[ 3488.673687] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1386362637/real 1386362637]  req@ffff880036623bb8 x1453703367244336/t0(0) o38->lustre-MDT0002-osp-MDT0000@10.10.16.232@tcp:24/4 lens 400/544 e 0 to 1 dl 1386362662 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            12:47:06:[ 3488.676610] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 305 previous similar messages
            12:47:06:[ 3673.675126] BUG: unable to handle kernel paging request at ffff880022a6ae30
            12:47:06:[ 3673.675968] IP: [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd]
            12:47:06:[ 3673.676583] PGD 1a26063 PUD 1a2a063 PMD 215067 PTE 22a6a060
            12:47:06:[ 3673.677103] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
            12:47:06:[ 3673.677103] last sysfs file: /sys/devices/system/cpu/possible
            12:47:07:[ 3673.677103] CPU 1 
            12:47:07:[ 3673.677103] Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
            12:47:07:[ 3673.677103] 
            12:47:07:[ 3673.677103] Pid: 3069, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6_lustre.g3ddc521.x86_64 #1 Red Hat KVM
            12:47:07:[ 3673.677103] RIP: 0010:[<ffffffffa09ef395>]  [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd]
            12:47:07:[ 3673.677103] RSP: 0018:ffff88005f7bf980  EFLAGS: 00010286
            12:47:07:[ 3673.677103] RAX: 0000000000000000 RBX: ffff88000eb08ef0 RCX: 0000000000000009
            12:47:07:[ 3673.677103] RDX: ffff880002300000 RSI: ffff880002311b20 RDI: ffffffffa09feb38
            12:47:07:[ 3673.677103] RBP: ffff88005f7bf9f0 R08: 0000000000000000 R09: 0000000000000000
            12:47:07:[ 3673.677103] R10: ffff8800497d37e0 R11: 0000000000000000 R12: ffff880022a6adf0
            12:47:07:[ 3673.677103] R13: 0000000000000001 R14: ffff880064cd64e0 R15: 00000000000000e0
            12:47:07:[ 3673.677103] FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
            12:47:07:[ 3673.677103] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            12:47:07:[ 3673.677103] CR2: ffff880022a6ae30 CR3: 0000000001a25000 CR4: 00000000000006e0
            12:47:07:[ 3673.677103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            12:47:07:[ 3673.677103] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            12:47:07:[ 3673.677103] Process ptlrpcd_1 (pid: 3069, threadinfo ffff88005f7be000, task ffff88005c46c340)
            12:47:07:[ 3673.677103] Stack:
            12:47:07:[ 3673.677103]  ffff880015f79c01 0000000122a6adf0 0000000000000000 ffff880052054df0
            12:47:07:[ 3673.677103] <d> 0000000000000000 000000e800000000 000200000a0a10ed 0000000000003039
            12:47:07:[ 3673.677103] <d> ffff88005f7bfa10 ffff880052054df0 ffff880022a6adf0 ffff880022a6adf0
            12:47:07:[ 3673.677103] Call Trace:
            12:47:08:[ 3673.677103]  [<ffffffffa0532dfb>] lnet_ni_send+0x4b/0xf0 [lnet]
            12:47:08:[ 3673.677103]  [<ffffffffa0537005>] lnet_send+0x655/0xb80 [lnet]
            12:47:08:[ 3673.677103]  [<ffffffffa053806a>] LNetPut+0x31a/0x860 [lnet]
            12:47:08:[ 3673.677103]  [<ffffffffa0808dc0>] ptl_send_buf+0x1e0/0x550 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffff8103b82c>] ? kvm_clock_read+0x1c/0x20
            12:47:08:[ 3673.677103]  [<ffffffffa080c2bd>] ptl_send_rpc+0x4dd/0xcc0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa0800994>] ptlrpc_send_new_req+0x454/0x7c0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa0804728>] ptlrpc_check_set+0x898/0x1da0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa083061b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffffa0830bdd>] ptlrpcd+0x2ad/0x3f0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffff8105bca0>] ? default_wake_function+0x0/0x20
            12:47:08:[ 3673.677103]  [<ffffffffa0830930>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
            12:47:08:[ 3673.677103]  [<ffffffff81095696>] kthread+0x96/0xa0
            12:47:08:[ 3673.677103]  [<ffffffff8100c10a>] child_rip+0xa/0x20
            12:47:08:[ 3673.677103]  [<ffffffff81095600>] ? kthread+0x0/0xa0
            12:47:09:[ 3673.677103]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
            12:47:09:[ 3673.677103] Code: 00 c7 43 60 c1 00 00 00 48 89 de 48 c7 43 70 00 00 00 00 48 c7 43 68 00 00 00 00 48 8b 55 c0 8b 4d c8 48 8b 7d a8 e8 7b fa ff ff <41> f6 44 24 40 08 74 1b 8b 4d b4 85 c9 0f 84 c8 00 00 00 65 48 
            12:47:09:[ 3673.677103] RIP  [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd]
            12:47:09:[ 3673.677103]  RSP <ffff88005f7bf980>
            12:47:09:[ 3673.677103] CR2: ffff880022a6ae30
            12:47:09:[    0.000000] Initializing cgroup subsys cpuset
            12:47:09:[    0.000000] Initializing cgroup subsys cpu
            
            sarah Sarah Liu added a comment - review-dne failed as: 12:47:05:[ 3488.673687] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1386362637/real 1386362637] req@ffff880036623bb8 x1453703367244336/t0(0) o38->lustre-MDT0002-osp-MDT0000@10.10.16.232@tcp:24/4 lens 400/544 e 0 to 1 dl 1386362662 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 12:47:06:[ 3488.676610] Lustre: 3067:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 305 previous similar messages 12:47:06:[ 3673.675126] BUG: unable to handle kernel paging request at ffff880022a6ae30 12:47:06:[ 3673.675968] IP: [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd] 12:47:06:[ 3673.676583] PGD 1a26063 PUD 1a2a063 PMD 215067 PTE 22a6a060 12:47:06:[ 3673.677103] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC 12:47:06:[ 3673.677103] last sysfs file: /sys/devices/system/cpu/possible 12:47:07:[ 3673.677103] CPU 1 12:47:07:[ 3673.677103] Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] 12:47:07:[ 3673.677103] 12:47:07:[ 3673.677103] Pid: 3069, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6_lustre.g3ddc521.x86_64 #1 Red Hat KVM 12:47:07:[ 3673.677103] RIP: 0010:[<ffffffffa09ef395>] [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd] 12:47:07:[ 3673.677103] RSP: 0018:ffff88005f7bf980 EFLAGS: 00010286 12:47:07:[ 3673.677103] RAX: 0000000000000000 RBX: ffff88000eb08ef0 RCX: 0000000000000009 12:47:07:[ 3673.677103] RDX: ffff880002300000 RSI: ffff880002311b20 RDI: ffffffffa09feb38 12:47:07:[ 3673.677103] RBP: ffff88005f7bf9f0 R08: 0000000000000000 R09: 0000000000000000 12:47:07:[ 3673.677103] R10: ffff8800497d37e0 R11: 0000000000000000 R12: ffff880022a6adf0 12:47:07:[ 3673.677103] R13: 0000000000000001 R14: ffff880064cd64e0 R15: 00000000000000e0 12:47:07:[ 3673.677103] FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000 12:47:07:[ 3673.677103] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b 12:47:07:[ 3673.677103] CR2: ffff880022a6ae30 CR3: 0000000001a25000 CR4: 00000000000006e0 12:47:07:[ 3673.677103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 12:47:07:[ 3673.677103] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 12:47:07:[ 3673.677103] Process ptlrpcd_1 (pid: 3069, threadinfo ffff88005f7be000, task ffff88005c46c340) 12:47:07:[ 3673.677103] Stack: 12:47:07:[ 3673.677103] ffff880015f79c01 0000000122a6adf0 0000000000000000 ffff880052054df0 12:47:07:[ 3673.677103] <d> 0000000000000000 000000e800000000 000200000a0a10ed 0000000000003039 12:47:07:[ 3673.677103] <d> ffff88005f7bfa10 ffff880052054df0 ffff880022a6adf0 ffff880022a6adf0 12:47:07:[ 3673.677103] Call Trace: 12:47:08:[ 3673.677103] [<ffffffffa0532dfb>] lnet_ni_send+0x4b/0xf0 [lnet] 12:47:08:[ 3673.677103] [<ffffffffa0537005>] lnet_send+0x655/0xb80 [lnet] 12:47:08:[ 3673.677103] [<ffffffffa053806a>] LNetPut+0x31a/0x860 [lnet] 12:47:08:[ 3673.677103] [<ffffffffa0808dc0>] ptl_send_buf+0x1e0/0x550 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffff8103b82c>] ? kvm_clock_read+0x1c/0x20 12:47:08:[ 3673.677103] [<ffffffffa080c2bd>] ptl_send_rpc+0x4dd/0xcc0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa0800994>] ptlrpc_send_new_req+0x454/0x7c0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa0804728>] ptlrpc_check_set+0x898/0x1da0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa083061b>] ptlrpcd_check+0x55b/0x590 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffffa0830bdd>] ptlrpcd+0x2ad/0x3f0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffff8105bca0>] ? default_wake_function+0x0/0x20 12:47:08:[ 3673.677103] [<ffffffffa0830930>] ? ptlrpcd+0x0/0x3f0 [ptlrpc] 12:47:08:[ 3673.677103] [<ffffffff81095696>] kthread+0x96/0xa0 12:47:08:[ 3673.677103] [<ffffffff8100c10a>] child_rip+0xa/0x20 12:47:08:[ 3673.677103] [<ffffffff81095600>] ? kthread+0x0/0xa0 12:47:09:[ 3673.677103] [<ffffffff8100c100>] ? child_rip+0x0/0x20 12:47:09:[ 3673.677103] Code: 00 c7 43 60 c1 00 00 00 48 89 de 48 c7 43 70 00 00 00 00 48 c7 43 68 00 00 00 00 48 8b 55 c0 8b 4d c8 48 8b 7d a8 e8 7b fa ff ff <41> f6 44 24 40 08 74 1b 8b 4d b4 85 c9 0f 84 c8 00 00 00 65 48 12:47:09:[ 3673.677103] RIP [<ffffffffa09ef395>] ksocknal_send+0x165/0x450 [ksocklnd] 12:47:09:[ 3673.677103] RSP <ffff88005f7bf980> 12:47:09:[ 3673.677103] CR2: ffff880022a6ae30 12:47:09:[ 0.000000] Initializing cgroup subsys cpuset 12:47:09:[ 0.000000] Initializing cgroup subsys cpu
            pjones Peter Jones added a comment -

            Landed for 2.3 and 2.4

            pjones Peter Jones added a comment - Landed for 2.3 and 2.4

            I have almost 24 hours of testing and still no recreation.

            Oleg I will leave the cherry-pick to you.

            keith Keith Mannthey (Inactive) added a comment - I have almost 24 hours of testing and still no recreation. Oleg I will leave the cherry-pick to you.
            yujian Jian Yu added a comment -

            I submitted a patch which contained Oleg's patch and enabled the kernel debug options to verify whether the issue is resolved or not on master branch: http://review.whamcloud.com/3936.

            sanity test 103 passed without detecting slab corruption issue.

            yujian Jian Yu added a comment - I submitted a patch which contained Oleg's patch and enabled the kernel debug options to verify whether the issue is resolved or not on master branch: http://review.whamcloud.com/3936 . sanity test 103 passed without detecting slab corruption issue.
            green Oleg Drokin added a comment -

            Don't create separate patches for other branches, please/ I'll cherry-pick this patch directly since that part of the code have not changed since landing.

            green Oleg Drokin added a comment - Don't create separate patches for other branches, please/ I'll cherry-pick this patch directly since that part of the code have not changed since landing.

            I been running http://review.whamcloud.com/3928 with Master for a few hours now and I do not seem to be able to reproduce the errors. It looks to be the fix to me.

            If Yu's latest build agrees I will cherry pick it to b2_3 and b2_2.

            Thanks for finding this Oleg.

            keith Keith Mannthey (Inactive) added a comment - I been running http://review.whamcloud.com/3928 with Master for a few hours now and I do not seem to be able to reproduce the errors. It looks to be the fix to me. If Yu's latest build agrees I will cherry pick it to b2_3 and b2_2. Thanks for finding this Oleg.
            green Oleg Drokin added a comment -

            Keith, ok. Well, the original issue my patch fixes is there since long ago, it might have been hidden to because nobody was leaving incorrect pointers in the mti_info buf, and then later somebody started to do it and that's why everything started to fall apart lately.
            Granted I did not try to go back all the way to see if I can reproduce it on just the original commit

            green Oleg Drokin added a comment - Keith, ok. Well, the original issue my patch fixes is there since long ago, it might have been hidden to because nobody was leaving incorrect pointers in the mti_info buf, and then later somebody started to do it and that's why everything started to fall apart lately. Granted I did not try to go back all the way to see if I can reproduce it on just the original commit
            yujian Jian Yu added a comment -

            Hi Keith,
            I submitted a patch which contained Oleg's patch and enabled the kernel debug options to verify whether the issue is resolved or not on master branch: http://review.whamcloud.com/3936.

            yujian Jian Yu added a comment - Hi Keith, I submitted a patch which contained Oleg's patch and enabled the kernel debug options to verify whether the issue is resolved or not on master branch: http://review.whamcloud.com/3936 .

            Oleg I am familiar with PAGEALLOC. I am just wondering if it is a separate issue or not. I didn't get a chance to test your patch but I will do it first thing tomorrow.

            On my nodes can produce the issue I was able to test 2.2.59 and I don't see the issue but lu-998 code should be present it the release.

            I am building on the client nodes now to speed thing up I hope tomorrow I can get thing greatly narrowed down if Oleg's patch is not the root issue.

            Sorry I didn't get a change to push Yu's build errors yet.

            keith Keith Mannthey (Inactive) added a comment - Oleg I am familiar with PAGEALLOC. I am just wondering if it is a separate issue or not. I didn't get a chance to test your patch but I will do it first thing tomorrow. On my nodes can produce the issue I was able to test 2.2.59 and I don't see the issue but lu-998 code should be present it the release. I am building on the client nodes now to speed thing up I hope tomorrow I can get thing greatly narrowed down if Oleg's patch is not the root issue. Sorry I didn't get a change to push Yu's build errors yet.
            green Oleg Drokin added a comment -

            Keith, PAGEALLOC debug is a more powerful (and slow) kernel feature to track random memory writes. What it does is when you free a page instead of merely writing it with some poison value to check later (what you are hitting, and once hit - no way to know who made the write) it actually unmaps the page from address space, so when the offender comes to do the write, the page is not there and we get the nice oops pointing at the red-handed writer to the place they are not supposed to be writing.

            green Oleg Drokin added a comment - Keith, PAGEALLOC debug is a more powerful (and slow) kernel feature to track random memory writes. What it does is when you free a page instead of merely writing it with some poison value to check later (what you are hitting, and once hit - no way to know who made the write) it actually unmaps the page from address space, so when the offender comes to do the write, the page is not there and we get the nice oops pointing at the red-handed writer to the place they are not supposed to be writing.

            People

              green Oleg Drokin
              bzzz Alex Zhuravlev
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: