[LU-1480] failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
    • Severity: 3
    • 4293

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8506fd4e-ad5b-11e1-8152-52540035b04c.

      The sub-test test_74 failed with the following error:

      test failed to respond and timed out

      Info required for matching: replay-single 74
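
      For context, the assertion named in the title fires when a device is being
      finalized while something still holds a reference on it (ld_ref != 0). Below is
      a minimal, self-contained userspace C sketch of that refcount-at-teardown
      pattern; toy_device, toy_object_get/put and toy_device_fini are hypothetical
      stand-ins, not the actual lu_device code in obdclass/lu_object.c.

          #include <assert.h>
          #include <stdatomic.h>
          #include <stdio.h>

          /* Toy stand-in for a lu_device: a name plus a reference count. */
          struct toy_device {
                  const char *name;
                  atomic_int  ref;            /* plays the role of ld_ref */
          };

          /* Objects pin their device for as long as they exist ... */
          static void toy_object_get(struct toy_device *d)
          {
                  atomic_fetch_add(&d->ref, 1);
          }

          /* ... and drop the reference when they are freed. */
          static void toy_object_put(struct toy_device *d)
          {
                  atomic_fetch_sub(&d->ref, 1);
          }

          /* Finalization requires that every object has let go of the device;
           * the real lu_device_fini() LBUGs here, this sketch just asserts. */
          static void toy_device_fini(struct toy_device *d)
          {
                  assert(atomic_load(&d->ref) == 0 && "device still referenced");
                  printf("%s finalized cleanly\n", d->name);
          }

          int main(void)
          {
                  struct toy_device dev = { .name = "lovsub-like device", .ref = 0 };

                  toy_object_get(&dev);   /* an object is set up on the device    */
                  toy_object_put(&dev);   /* ... and released before teardown     */
                  toy_device_fini(&dev);  /* refcount is 0, finalization succeeds */

                  /* If the put above were missing -- i.e. an object leaked, as in
                   * this ticket -- the assertion would fail with refcount 1, which
                   * is exactly the "Refcount is 1" LBUG seen at umount. */
                  return 0;
          }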

    Attachments

    Issue Links

    Activity

            [LU-1480] failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            bobijam Zhenyu Xu added a comment - - edited

            status update:

            A debugging patch has landed on the master branch; we are waiting for re-hits with the debug message.

            adilger Andreas Dilger added a comment -

            Alex reported in LU-2070:

            please use http://review.whamcloud.com/4151 to debug

            liwei Li Wei (Inactive) added a comment -

            Promoted to Blocker for 2.4.

            yujian Jian Yu added a comment -

            Lustre Tag: v2_3_0_RC3
            Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/198
            Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/36
            Distro/Arch: RHEL6.3/x86_64

            The same issue occurred in parallel-scale-nfsv3 test nfsread_orphan_file:
            https://maloo.whamcloud.com/test_sets/0101c7c8-16a0-11e2-962d-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_3_0_RC3
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
            Distro/Arch: RHEL6.3/x86_64(server), FC15/x86_64(client)
            Network: TCP
            ENABLE_QUOTA=yes

            parallel-scale-nfsv3 test iorfpp failed with the same issue:
            https://maloo.whamcloud.com/test_sets/7ca42e16-168c-11e2-962d-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Tag: v2_3_0_RC2
            Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/198
            Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/32
            Distro/Arch: RHEL6.3/x86_64

            The same issue occurred in parallel-scale-nfsv3 test nfsread_orphan_file:
            https://maloo.whamcloud.com/test_sets/d53d4086-1370-11e2-808f-52540035b04c

            bobijam Zhenyu Xu added a comment - - edited

            http://review.whamcloud.com/4108

            LU-1480 lov: lov_delete_raid0 need wait

            If lov_delete_raid0 does not wait for its layout to become stable, the lov
            object's deletion will leave lovsub objects hanging in memory.
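
            The reasoning above is that the raid0 layout must be quiescent (no
            in-flight users) before the lov object tears down its sub-objects;
            otherwise lovsub objects stay pinned and later trip the refcount
            assertion at umount. Below is a simplified, self-contained userspace C
            sketch of that wait-before-delete pattern; raid0_layout, active_ios,
            layout_io_start/end and layout_delete are hypothetical names, not the
            actual code in the patch.

                #include <pthread.h>
                #include <stdio.h>

                /* Hypothetical stand-in for the raid0 layout state of a lov object. */
                struct raid0_layout {
                        pthread_mutex_t lock;
                        pthread_cond_t  idle;
                        int             active_ios;  /* in-flight users of the layout */
                        int             nr_sub;      /* number of sub-objects         */
                };

                /* Users bump active_ios while they operate on the layout ... */
                static void layout_io_start(struct raid0_layout *r)
                {
                        pthread_mutex_lock(&r->lock);
                        r->active_ios++;
                        pthread_mutex_unlock(&r->lock);
                }

                /* ... and wake any waiting deleter when the last user finishes. */
                static void layout_io_end(struct raid0_layout *r)
                {
                        pthread_mutex_lock(&r->lock);
                        if (--r->active_ios == 0)
                                pthread_cond_broadcast(&r->idle);
                        pthread_mutex_unlock(&r->lock);
                }

                /* The point of the fix: block until the layout is stable (no active
                 * users) before releasing sub-objects.  Skipping this wait is what
                 * left lovsub objects hanging in memory. */
                static void layout_delete(struct raid0_layout *r)
                {
                        pthread_mutex_lock(&r->lock);
                        while (r->active_ios > 0)
                                pthread_cond_wait(&r->idle, &r->lock);
                        pthread_mutex_unlock(&r->lock);

                        for (int i = 0; i < r->nr_sub; i++)
                                printf("releasing sub-object %d\n", i);
                        r->nr_sub = 0;
                }

                int main(void)
                {
                        struct raid0_layout r = {
                                .lock   = PTHREAD_MUTEX_INITIALIZER,
                                .idle   = PTHREAD_COND_INITIALIZER,
                                .nr_sub = 4,
                        };

                        layout_io_start(&r);  /* an I/O is still using the layout     */
                        layout_io_end(&r);    /* once it completes ...                */
                        layout_delete(&r);    /* ... deletion may release sub-objects */
                        return 0;
                }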

            sarah Sarah Liu added a comment -

            Hit this issue when testing interop between 1.8 and 2.3-RC1, after parallel-scale-nfsv3 passed all the sub-tests.

            https://maloo.whamcloud.com/test_sets/71e39924-0626-11e2-9b17-52540035b04c

            MDS console log of test_nfsread_orphan_file shows:

            19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) header@ffff8800517c9bf0[0x0, 1, [0x100060000:0x4769:0x0] hash]{ 
            19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) ....lovsub@ffff8800517c9c88[0]
            19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) ....osc@ffff8800545352a8id: 18281 gr: 0 idx: 6 gen: 0 kms_valid: 1 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 2942 mtime: 1348454797 atime: 0 ctime: 1348454797 blocks: 0
            19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) } header@ffff8800517c9bf0
            19:47:03:LustreError: 20620:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            19:47:03:LustreError: 20620:0:(lu_object.c:1081:lu_device_fini()) LBUG
            19:47:03:Pid: 20620, comm: umount
            19:47:03:
            19:47:03:Call Trace:
            19:47:03: [<ffffffffa0d31905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            19:47:03: [<ffffffffa0d31f17>] lbug_with_loc+0x47/0xb0 [libcfs]
            19:47:03: [<ffffffffa040b2dc>] lu_device_fini+0xcc/0xd0 [obdclass]
            19:47:03: [<ffffffffa089f114>] lovsub_device_free+0x24/0x200 [lov]
            19:47:03: [<ffffffffa040e826>] lu_stack_fini+0x96/0xf0 [obdclass]
            19:47:03: [<ffffffffa04137ae>] cl_stack_fini+0xe/0x10 [obdclass]
            19:47:03: [<ffffffffa088e6a8>] lov_device_fini+0x58/0x130 [lov]
            19:47:03: [<ffffffffa040e7d9>] lu_stack_fini+0x49/0xf0 [obdclass]
            19:47:03: [<ffffffffa04137ae>] cl_stack_fini+0xe/0x10 [obdclass]
            19:47:03: [<ffffffffa0b744cd>] cl_sb_fini+0x6d/0x190 [lustre]
            19:47:03: [<ffffffffa0b3959c>] client_common_put_super+0x14c/0xe60 [lustre]
            19:47:03: [<ffffffffa0b3a380>] ll_put_super+0xd0/0x360 [lustre]
            19:47:03: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
            19:47:03: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
            19:47:03: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
            19:47:03: [<ffffffffa03f8eaa>] lustre_kill_super+0x4a/0x60 [obdclass]
            19:47:03: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
            19:47:03: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
            19:47:03: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
            19:47:03: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
            19:47:03: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
            19:47:03:
            19:47:03:Kernel panic - not syncing: LBUG
            19:47:03:Pid: 20620, comm: umount Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1
            19:47:03:Call Trace:
            19:47:03: [<ffffffff814fd58a>] ? panic+0xa0/0x168
            19:47:03: [<ffffffffa0d31f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            19:47:03: [<ffffffffa040b2dc>] ? lu_device_fini+0xcc/0xd0 [obdclass]
            19:47:03: [<ffffffffa089f114>] ? lovsub_device_free+0x24/0x200 [lov]
            19:47:03: [<ffffffffa040e826>] ? lu_stack_fini+0x96/0xf0 [obdclass]
            19:47:03: [<ffffffffa04137ae>] ? cl_stack_fini+0xe/0x10 [obdclass]
            19:47:03: [<ffffffffa088e6a8>] ? lov_device_fini+0x58/0x130 [lov]
            19:47:03: [<ffffffffa040e7d9>] ? lu_stack_fini+0x49/0xf0 [obdclass]
            19:47:03: [<ffffffffa04137ae>] ? cl_stack_fini+0xe/0x10 [obdclass]
            19:47:03: [<ffffffffa0b744cd>] ? cl_sb_fini+0x6d/0x190 [lustre]
            19:47:03: [<ffffffffa0b3959c>] ? client_common_put_super+0x14c/0xe60 [lustre]
            19:47:03: [<ffffffffa0b3a380>] ? ll_put_super+0xd0/0x360 [lustre]
            19:47:03: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
            19:47:03: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0
            19:47:03: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60
            19:47:03: [<ffffffffa03f8eaa>] ? lustre_kill_super+0x4a/0x60 [obdclass]
            19:47:03: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90
            19:47:03: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110
            19:47:03: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0
            19:47:03: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
            19:47:03: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
            19:47:03:Initializing cgroup subsys cpuset
            19:47:03:Initializing cgroup subsys cpu
            
            pjones Peter Jones added a comment -

            ok

            bobijam Zhenyu Xu added a comment -

            Git commit 35920b759ed78441db0cd9de6ac8ec66da862f22 has changed the mount logic; please lower the priority and wait to see whether we hit the issue again.


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Bobi, the above piece of code looks good, because the inode state I_NEW seems enough to protect the assignment of lli_clob. From the log, I think there was an object leak on the lovsub layer. Let's reserve some nodes on toro to reproduce this problem.
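
            For reference, the I_NEW protection mentioned above is the standard VFS
            pattern: iget_locked() returns a freshly allocated inode with I_NEW set,
            so only the allocating thread initializes its private fields (such as
            lli_clob) before unlock_new_inode() publishes it. A kernel-style sketch
            of that pattern follows; toy_inode_info, toy_iget and clob are
            hypothetical names, not the actual llite code.

                #include <linux/err.h>
                #include <linux/fs.h>

                /* Sketch only: toy_inode_info stands in for llite's ll_inode_info,
                 * and clob for its lli_clob field. */
                struct toy_inode_info {
                        void         *clob;
                        struct inode  vfs_inode;
                };

                static struct inode *toy_iget(struct super_block *sb, unsigned long ino)
                {
                        struct inode *inode = iget_locked(sb, ino);

                        if (!inode)
                                return ERR_PTR(-ENOMEM);

                        /* While I_NEW is set, no other lookup can return this inode,
                         * so the one-time initialization below (including the clob
                         * pointer) is serialized by the VFS itself -- the point made
                         * in the comment above.  unlock_new_inode() clears I_NEW and
                         * makes the inode visible to other threads. */
                        if (inode->i_state & I_NEW) {
                                struct toy_inode_info *info =
                                        container_of(inode, struct toy_inode_info,
                                                     vfs_inode);

                                info->clob = NULL;  /* assign private state exactly once */
                                unlock_new_inode(inode);
                        }
                        return inode;
                }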


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: