Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1480

failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.5.0
    • Lustre 2.4.0, Lustre 2.4.1
    • 3
    • 4293

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8506fd4e-ad5b-11e1-8152-52540035b04c.

      The sub-test test_74 failed with the following error:

      test failed to respond and timed out

      Info required for matching: replay-single 74

      Attachments

        Issue Links

          Activity

            [LU-1480] failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            pjones Peter Jones added a comment -

            ok

            pjones Peter Jones added a comment - ok
            bobijam Zhenyu Xu added a comment -

            git commit 35920b759ed78441db0cd9de6ac8ec66da862f22 has changed the mount logic, please lower the priority and wait to see whether we can hit the issue again.

            bobijam Zhenyu Xu added a comment - git commit 35920b759ed78441db0cd9de6ac8ec66da862f22 has changed the mount logic, please lower the priority and wait to see whether we can hit the issue again.

            Hi Bobi, the above piece of code looks good because inode state I_NEW seems enough to protect the assignment of lli_clob. From the log, I think there were object leak on the lovsub layer. Let's reserve some nodes on toro to reproduce this problem.

            jay Jinshan Xiong (Inactive) added a comment - Hi Bobi, the above piece of code looks good because inode state I_NEW seems enough to protect the assignment of lli_clob. From the log, I think there were object leak on the lovsub layer. Let's reserve some nodes on toro to reproduce this problem.
            bobijam Zhenyu Xu added a comment -

            Xiong,

            Can we sure about that there is no race to set lli->lli_clob in the following code?

            cl_file_inode_init()
            //
                    if (lli->lli_clob == NULL) {
                            /* clob is slave of inode, empty lli_clob means for new inode,
                             * there is no clob in cache with the given fid, so it is
                             * unnecessary to perform lookup-alloc-lookup-insert, just
                             * alloc and insert directly. */
            #ifdef __KERNEL__
                            LASSERT(inode->i_state & I_NEW);
            #endif
                            conf.coc_lu.loc_flags = LOC_F_NEW;
                            clob = cl_object_find(env, lu2cl_dev(site->ls_top_dev),
                                                  fid, &conf);
                            if (!IS_ERR(clob)) {
                                    /*
                                     * No locking is necessary, as new inode is
                                     * locked by I_NEW bit.
                                     *
                                     * XXX not true for call from ll_update_inode().
                                     */
                                    lli->lli_clob = clob;
                                    lu_object_ref_add(&clob->co_lu, "inode", inode);
                            } else
                                    result = PTR_ERR(clob);
                    }
            
            bobijam Zhenyu Xu added a comment - Xiong, Can we sure about that there is no race to set lli->lli_clob in the following code? cl_file_inode_init() // if (lli->lli_clob == NULL) { /* clob is slave of inode, empty lli_clob means for new inode, * there is no clob in cache with the given fid, so it is * unnecessary to perform lookup-alloc-lookup-insert, just * alloc and insert directly. */ #ifdef __KERNEL__ LASSERT(inode->i_state & I_NEW); #endif conf.coc_lu.loc_flags = LOC_F_NEW; clob = cl_object_find(env, lu2cl_dev(site->ls_top_dev), fid, &conf); if (!IS_ERR(clob)) { /* * No locking is necessary, as new inode is * locked by I_NEW bit. * * XXX not true for call from ll_update_inode(). */ lli->lli_clob = clob; lu_object_ref_add(&clob->co_lu, "inode" , inode); } else result = PTR_ERR(clob); }
            adilger Andreas Dilger added a comment - Failed a third time in the same test run: https://maloo.whamcloud.com/test_sets/80724b1a-f1b4-11e1-9def-52540035b04c
            adilger Andreas Dilger added a comment - It failed again later in the same test run: https://maloo.whamcloud.com/test_sets/47f9f3d0-f1b6-11e1-9def-52540035b04c

            This problem was still hit in conf-sanity.sh:

            https://maloo.whamcloud.com/test_sets/f46209aa-f1b6-11e1-9def-52540035b04c

            but the patch being tested (git hash e65e329d84ba05370cb7f8e7a789cad129bcc287 == patch set 9) shows that it is based on master git hash 34a60922e2b94fb35872a9bd6fe3eefa6a95826c (LU-1661 tests: improve setup_posix_users()), which should contain the fix for LU-1773 already.

            adilger Andreas Dilger added a comment - This problem was still hit in conf-sanity.sh: https://maloo.whamcloud.com/test_sets/f46209aa-f1b6-11e1-9def-52540035b04c but the patch being tested (git hash e65e329d84ba05370cb7f8e7a789cad129bcc287 == patch set 9) shows that it is based on master git hash 34a60922e2b94fb35872a9bd6fe3eefa6a95826c ( LU-1661 tests: improve setup_posix_users()), which should contain the fix for LU-1773 already.

            duplicate of LU-1773

            jay Jinshan Xiong (Inactive) added a comment - duplicate of LU-1773
            pjones Peter Jones added a comment -

            Bobijam will look into this one

            pjones Peter Jones added a comment - Bobijam will look into this one
            sarah Sarah Liu added a comment - - edited

            Hit this failure when trying to umount lustre after test parallel-scale-nfsv3 finished.
            https://maloo.whamcloud.com/sub_tests/7c54f460-e93a-11e1-a508-52540035b04c

            Server: lustre-master-tag2.2.93 RHEL6
            Client: 1.8.8

            dmesg log from MDS

            00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) LBUG
            00:44:32:Pid: 12315, comm: umount
            00:44:32:
            00:44:32:Call Trace:
            00:44:32: [<ffffffffa044b905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            00:44:32: [<ffffffffa044bf17>] lbug_with_loc+0x47/0xb0 [libcfs]
            00:44:32: [<ffffffffa05d503c>] lu_device_fini+0xcc/0xd0 [obdclass]
            00:44:33: [<ffffffffa09ce234>] lovsub_device_free+0x24/0x200 [lov]
            00:44:33: [<ffffffffa05d8586>] lu_stack_fini+0x96/0xf0 [obdclass]
            00:44:33: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass]
            00:44:33: [<ffffffffa09bd6a8>] lov_device_fini+0x58/0x130 [lov]
            00:44:34: [<ffffffffa05d8539>] lu_stack_fini+0x49/0xf0 [obdclass]
            00:44:34: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass]
            00:44:34: [<ffffffffa0a8d31d>] cl_sb_fini+0x6d/0x190 [lustre]
            00:44:34: [<ffffffffa0a5255c>] client_common_put_super+0x14c/0xe60 [lustre]
            00:44:34: [<ffffffffa0a53340>] ll_put_super+0xd0/0x360 [lustre]
            00:44:34: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
            00:44:34: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
            00:44:34: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
            00:44:34: [<ffffffffa05c2c7a>] lustre_kill_super+0x4a/0x60 [obdclass]
            00:44:34: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
            00:44:34: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
            00:44:34: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
            00:44:34: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
            00:44:34: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
            00:44:35:
            00:44:36:Kernel panic - not syncing: LBUG
            00:44:36:Pid: 12315, comm: umount Not tainted 2.6.32-279.2.1.el6_lustre.x86_64 #1
            00:44:36:Call Trace:
            00:44:36: [<ffffffff814fd58a>] ? panic+0xa0/0x168
            00:44:36: [<ffffffffa044bf6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            00:44:37: [<ffffffffa05d503c>] ? lu_device_fini+0xcc/0xd0 [obdclass]
            00:44:37: [<ffffffffa09ce234>] ? lovsub_device_free+0x24/0x200 [lov]
            00:44:37: [<ffffffffa05d8586>] ? lu_stack_fini+0x96/0xf0 [obdclass]
            00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass]
            00:44:37: [<ffffffffa09bd6a8>] ? lov_device_fini+0x58/0x130 [lov]
            00:44:37: [<ffffffffa05d8539>] ? lu_stack_fini+0x49/0xf0 [obdclass]
            00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass]
            00:44:37: [<ffffffffa0a8d31d>] ? cl_sb_fini+0x6d/0x190 [lustre]
            00:44:37: [<ffffffffa0a5255c>] ? client_common_put_super+0x14c/0xe60 [lustre]
            00:44:37: [<ffffffffa0a53340>] ? ll_put_super+0xd0/0x360 [lustre]
            00:44:37: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
            00:44:37: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0
            00:44:38: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60
            00:44:38: [<ffffffffa05c2c7a>] ? lustre_kill_super+0x4a/0x60 [obdclass]
            00:44:39: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90
            00:44:39: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110
            00:44:39: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0
            00:44:39: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
            00:44:39: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
            00:44:39:Initializing cgroup subsys cpuset
            
            sarah Sarah Liu added a comment - - edited Hit this failure when trying to umount lustre after test parallel-scale-nfsv3 finished. https://maloo.whamcloud.com/sub_tests/7c54f460-e93a-11e1-a508-52540035b04c Server: lustre-master-tag2.2.93 RHEL6 Client: 1.8.8 dmesg log from MDS 00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) LBUG 00:44:32:Pid: 12315, comm: umount 00:44:32: 00:44:32:Call Trace: 00:44:32: [<ffffffffa044b905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 00:44:32: [<ffffffffa044bf17>] lbug_with_loc+0x47/0xb0 [libcfs] 00:44:32: [<ffffffffa05d503c>] lu_device_fini+0xcc/0xd0 [obdclass] 00:44:33: [<ffffffffa09ce234>] lovsub_device_free+0x24/0x200 [lov] 00:44:33: [<ffffffffa05d8586>] lu_stack_fini+0x96/0xf0 [obdclass] 00:44:33: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass] 00:44:33: [<ffffffffa09bd6a8>] lov_device_fini+0x58/0x130 [lov] 00:44:34: [<ffffffffa05d8539>] lu_stack_fini+0x49/0xf0 [obdclass] 00:44:34: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass] 00:44:34: [<ffffffffa0a8d31d>] cl_sb_fini+0x6d/0x190 [lustre] 00:44:34: [<ffffffffa0a5255c>] client_common_put_super+0x14c/0xe60 [lustre] 00:44:34: [<ffffffffa0a53340>] ll_put_super+0xd0/0x360 [lustre] 00:44:34: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190 00:44:34: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0 00:44:34: [<ffffffff8117d436>] kill_anon_super+0x16/0x60 00:44:34: [<ffffffffa05c2c7a>] lustre_kill_super+0x4a/0x60 [obdclass] 00:44:34: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90 00:44:34: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110 00:44:34: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0 00:44:34: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0 00:44:34: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 00:44:35: 00:44:36:Kernel panic - not syncing: LBUG 00:44:36:Pid: 12315, comm: umount Not tainted 2.6.32-279.2.1.el6_lustre.x86_64 #1 00:44:36:Call Trace: 00:44:36: [<ffffffff814fd58a>] ? panic+0xa0/0x168 00:44:36: [<ffffffffa044bf6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 00:44:37: [<ffffffffa05d503c>] ? lu_device_fini+0xcc/0xd0 [obdclass] 00:44:37: [<ffffffffa09ce234>] ? lovsub_device_free+0x24/0x200 [lov] 00:44:37: [<ffffffffa05d8586>] ? lu_stack_fini+0x96/0xf0 [obdclass] 00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass] 00:44:37: [<ffffffffa09bd6a8>] ? lov_device_fini+0x58/0x130 [lov] 00:44:37: [<ffffffffa05d8539>] ? lu_stack_fini+0x49/0xf0 [obdclass] 00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass] 00:44:37: [<ffffffffa0a8d31d>] ? cl_sb_fini+0x6d/0x190 [lustre] 00:44:37: [<ffffffffa0a5255c>] ? client_common_put_super+0x14c/0xe60 [lustre] 00:44:37: [<ffffffffa0a53340>] ? ll_put_super+0xd0/0x360 [lustre] 00:44:37: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190 00:44:37: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0 00:44:38: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60 00:44:38: [<ffffffffa05c2c7a>] ? lustre_kill_super+0x4a/0x60 [obdclass] 00:44:39: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90 00:44:39: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110 00:44:39: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0 00:44:39: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0 00:44:39: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b 00:44:39:Initializing cgroup subsys cpuset
            23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) LBUG
            23:24:46:Pid: 12357, comm: umount
            23:24:46:
            23:24:46:Call Trace:
            23:24:46: [<ffffffff887dc751>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
            23:24:46: [<ffffffff887dcc58>] lbug_with_loc+0x48/0x90 [libcfs]
            23:24:46: [<ffffffff88978a08>] lu_device_fini+0x68/0xe0 [obdclass]
            23:24:46: [<ffffffff88d43110>] lovsub_device_free+0x70/0x1f0 [lov]
            23:24:46: [<ffffffff88978020>] lu_cdebug_printer+0x0/0x170 [obdclass]
            23:24:46: [<ffffffff88977686>] lu_stack_fini+0xa6/0xd0 [obdclass]
            23:24:46: [<ffffffff88d3327a>] lov_device_fini+0xaa/0x110 [lov]
            23:24:46: [<ffffffff88977618>] lu_stack_fini+0x38/0xd0 [obdclass]
            23:24:46: [<ffffffff88e609f7>] cl_sb_fini+0xb7/0x180 [lustre]
            23:24:46: [<ffffffff88e2680d>] client_common_put_super+0x3fd/0xe10 [lustre]
            23:24:47: [<ffffffff88e27575>] ll_put_super+0x1a5/0x330 [lustre]
            23:24:47: [<ffffffff800eecd4>] invalidate_inodes+0xce/0xe0
            23:24:47: [<ffffffff800e6332>] generic_shutdown_super+0x79/0xfb
            23:24:47: [<ffffffff800e6402>] kill_anon_super+0x9/0x35
            23:24:47: [<ffffffff800e64b3>] deactivate_super+0x6a/0x82
            23:24:47: [<ffffffff800f0413>] sys_umount+0x245/0x27b
            
            adilger Andreas Dilger added a comment - 23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) LBUG 23:24:46:Pid: 12357, comm: umount 23:24:46: 23:24:46:Call Trace: 23:24:46: [<ffffffff887dc751>] libcfs_debug_dumpstack+0x51/0x60 [libcfs] 23:24:46: [<ffffffff887dcc58>] lbug_with_loc+0x48/0x90 [libcfs] 23:24:46: [<ffffffff88978a08>] lu_device_fini+0x68/0xe0 [obdclass] 23:24:46: [<ffffffff88d43110>] lovsub_device_free+0x70/0x1f0 [lov] 23:24:46: [<ffffffff88978020>] lu_cdebug_printer+0x0/0x170 [obdclass] 23:24:46: [<ffffffff88977686>] lu_stack_fini+0xa6/0xd0 [obdclass] 23:24:46: [<ffffffff88d3327a>] lov_device_fini+0xaa/0x110 [lov] 23:24:46: [<ffffffff88977618>] lu_stack_fini+0x38/0xd0 [obdclass] 23:24:46: [<ffffffff88e609f7>] cl_sb_fini+0xb7/0x180 [lustre] 23:24:46: [<ffffffff88e2680d>] client_common_put_super+0x3fd/0xe10 [lustre] 23:24:47: [<ffffffff88e27575>] ll_put_super+0x1a5/0x330 [lustre] 23:24:47: [<ffffffff800eecd4>] invalidate_inodes+0xce/0xe0 23:24:47: [<ffffffff800e6332>] generic_shutdown_super+0x79/0xfb 23:24:47: [<ffffffff800e6402>] kill_anon_super+0x9/0x35 23:24:47: [<ffffffff800e64b3>] deactivate_super+0x6a/0x82 23:24:47: [<ffffffff800f0413>] sys_umount+0x245/0x27b

            People

              bobijam Zhenyu Xu
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: