LU-1480: failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
    • Severity: 3
    • 4293

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8506fd4e-ad5b-11e1-8152-52540035b04c.

      The sub-test test_74 failed with the following error:

      test failed to respond and timed out

      Info required for matching: replay-single 74

    Activity
            yujian Jian Yu added a comment -

            Lustre b1_8 client build: http://build.whamcloud.com/job/lustre-b1_8/258 (1.8.9-wc1)
            Lustre b2_1 server build: http://build.whamcloud.com/job/lustre-b2_1/215 (2.1.6 RC2)

            After running parallel-scale-nfsv3, unmounting NFS server/Lustre client on the MDS node hit the same failure:
            https://maloo.whamcloud.com/test_sets/25c65126-de8e-11e2-afb2-52540035b04c

            bobijam Zhenyu Xu added a comment -

            Updated the patch set at http://review.whamcloud.com/6105

            During file lov object initialization, we need to protect access to and changes of its subobj->coh_parent, since there could be a concurrent layout-change race there, which leaves an unreferenced lovsub object in the site object hash table.
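            A minimal userspace sketch of the serialization this describes (illustrative only; the real change is the patch at http://review.whamcloud.com/6105, and apart from coh_parent the field and helper names here are hypothetical):

                /*
                 * Two threads (file lov object initialization vs. a concurrent
                 * layout change) race to set subobj->coh_parent.  Without a
                 * guard lock, the loser can leave a lovsub object in the site
                 * hash table whose reference is never dropped.
                 */
                #include <pthread.h>
                #include <stddef.h>

                struct cl_object_header {
                        pthread_mutex_t          coh_guard;   /* hypothetical guard lock,
                                                               * initialized at object creation */
                        struct cl_object_header *coh_parent;  /* parent in the cl-object tree */
                };

                /* Serialize the check-and-set so only one parent is ever
                 * installed and a racing thread observes the winner's value. */
                static struct cl_object_header *
                subobj_set_parent(struct cl_object_header *sub,
                                  struct cl_object_header *parent)
                {
                        struct cl_object_header *cur;

                        pthread_mutex_lock(&sub->coh_guard);
                        if (sub->coh_parent == NULL)
                                sub->coh_parent = parent;
                        cur = sub->coh_parent;
                        pthread_mutex_unlock(&sub->coh_guard);
                        return cur;
                }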

            yujian Jian Yu added a comment -

            Lustre b1_8 client build: http://build.whamcloud.com/job/lustre-b1_8/258 (1.8.9-wc1)
            Lustre b2_1 server build: http://build.whamcloud.com/job/lustre-b2_1/205

            After running parallel-scale-nfsv3, unmounting NFS server/Lustre client on the MDS node hit the same failure:

            19:29:19:LustreError: 10728:0:(lu_object.c:1018:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            19:29:19:LustreError: 10728:0:(lu_object.c:1018:lu_device_fini()) LBUG
            19:29:20:Pid: 10728, comm: umount
            19:29:20:
            19:29:20:Call Trace:
            19:29:20: [<ffffffffa049f785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            19:29:20: [<ffffffffa049fd97>] lbug_with_loc+0x47/0xb0 [libcfs]
            19:29:20: [<ffffffffa05df3dc>] lu_device_fini+0xcc/0xd0 [obdclass]
            19:29:20: [<ffffffffa0a798b4>] lovsub_device_free+0x24/0x1e0 [lov]
            19:29:20: [<ffffffffa05e25f6>] lu_stack_fini+0x96/0xf0 [obdclass]
            19:29:20: [<ffffffffa05e6bfe>] cl_stack_fini+0xe/0x10 [obdclass]
            19:29:20: [<ffffffffa0a699d8>] lov_device_fini+0x58/0x130 [lov]
            19:29:20: [<ffffffffa05e25a9>] lu_stack_fini+0x49/0xf0 [obdclass]
            19:29:20: [<ffffffffa05e6bfe>] cl_stack_fini+0xe/0x10 [obdclass]
            19:29:20: [<ffffffffa0b52b6d>] cl_sb_fini+0x6d/0x190 [lustre]
            19:29:20: [<ffffffffa0b1ac9c>] client_common_put_super+0x14c/0xe60 [lustre]
            19:29:20: [<ffffffffa0b1ba80>] ll_put_super+0xd0/0x360 [lustre]
            19:29:20: [<ffffffff8119d546>] ? invalidate_inodes+0xf6/0x190
            19:29:20: [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0
            19:29:20: [<ffffffff81183436>] kill_anon_super+0x16/0x60
            19:29:21: [<ffffffffa05ceaca>] lustre_kill_super+0x4a/0x60 [obdclass]
            19:29:21: [<ffffffff81183bd7>] deactivate_super+0x57/0x80
            19:29:21: [<ffffffff811a1bff>] mntput_no_expire+0xbf/0x110
            19:29:21: [<ffffffff811a266b>] sys_umount+0x7b/0x3a0
            19:29:21: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            19:29:21:
            19:29:21:Kernel panic - not syncing: LBUG
            

            Maloo report: https://maloo.whamcloud.com/test_sets/84843efe-c8e9-11e2-97fe-52540035b04c
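            For context, a simplified reconstruction of the check that trips during umount (sketch only; upstream lu_object.c uses the libcfs LASSERTF/LBUG machinery rather than assert):

                /*
                 * lu_device_fini() requires every reference on the device to
                 * have been dropped.  A stray lovsub object left in the site
                 * hash table keeps ld_ref at 1, so the assertion fires and
                 * the node panics with an LBUG.
                 */
                #include <assert.h>

                typedef struct { int counter; } cfs_atomic_t;

                struct lu_device {
                        cfs_atomic_t ld_ref;   /* outstanding references */
                        /* ... */
                };

                static int cfs_atomic_read(const cfs_atomic_t *a)
                {
                        return a->counter;
                }

                void lu_device_fini(struct lu_device *d)
                {
                        /* "ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed" */
                        assert(cfs_atomic_read(&d->ld_ref) == 0);
                        /* ... remainder of device teardown ... */
                }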

            mdiep Minh Diep added a comment -

            I have reproduced this issue and triggered a kernel crash dump, available at /scratch/ftp/uploads/LU-1480

            bobijam Zhenyu Xu added a comment -

            Pushed a debug patch at http://review.whamcloud.com/6105

            mdiep Minh Diep added a comment -

            I hit this very frequently using an fc18 client running the sanity test.

            yujian Jian Yu added a comment -

            Lustre Build: https://build.whamcloud.com/job/lustre-master/1340/

            Server: el6, x86_64
            Clients: fc18, x86_64

            The issue occurred again and was reported in LU-3116.

            pjones Peter Jones added a comment -

            Dropping priority as no longer occurring regularly

            bobijam Zhenyu Xu added a comment -

            We haven't seen this issue in recent tests; I think we can lower the severity.

            adilger Andreas Dilger added a comment -

            Bobijam, this bug was reported hit 14 times in the past week, according to:
            https://maloo.whamcloud.com/test_sets/query?utf8=%E2%9C%93&test_set[test_set_script_id]=&test_set[status]=&test_set[query_bugs]=LU-1480&test_session[test_host]=&test_session[test_group]=&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node_network[network_type_id]=&commit=Update+results

            The most recent is at:
            https://maloo.whamcloud.com/test_sets/8ab52280-3536-11e2-918f-52540035b04c

            Can you please check whether the information you need is in one of these failures? If not, is there something that can be done to improve the debugging patch to capture the information you need?
            bobijam Zhenyu Xu added a comment -

            I didn't find it. The failure shown in the above Maloo report is that a client cannot finish inode sync while trying to unmount the mount point; it is not related to the device refcount issue.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13
