[LU-1480] failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Created: 05/Jun/12  Updated: 21/Jul/20  Resolved: 24/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: yuc2

Issue Links:
Duplicate
is duplicated by LU-3116 (lu_object.c:1140:lu_device_fini()) A... Resolved
Related
is related to LU-1773 Oops in lov_delete_raid0() Resolved
is related to LU-3228 fc18: sanity test_183: @@@@@@ FAIL: t... Resolved
is related to LU-2070 lu_object.c:1114:lu_device_fini()) AS... Closed
Severity: 3
Rank (Obsolete): 4293

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8506fd4e-ad5b-11e1-8152-52540035b04c.

The sub-test test_74 failed with the following error:

test failed to respond and timed out

Info required for matching: replay-single 74



 Comments   
Comment by Andreas Dilger [ 16/Aug/12 ]
23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) LBUG
23:24:46:Pid: 12357, comm: umount
23:24:46:
23:24:46:Call Trace:
23:24:46: [<ffffffff887dc751>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
23:24:46: [<ffffffff887dcc58>] lbug_with_loc+0x48/0x90 [libcfs]
23:24:46: [<ffffffff88978a08>] lu_device_fini+0x68/0xe0 [obdclass]
23:24:46: [<ffffffff88d43110>] lovsub_device_free+0x70/0x1f0 [lov]
23:24:46: [<ffffffff88978020>] lu_cdebug_printer+0x0/0x170 [obdclass]
23:24:46: [<ffffffff88977686>] lu_stack_fini+0xa6/0xd0 [obdclass]
23:24:46: [<ffffffff88d3327a>] lov_device_fini+0xaa/0x110 [lov]
23:24:46: [<ffffffff88977618>] lu_stack_fini+0x38/0xd0 [obdclass]
23:24:46: [<ffffffff88e609f7>] cl_sb_fini+0xb7/0x180 [lustre]
23:24:46: [<ffffffff88e2680d>] client_common_put_super+0x3fd/0xe10 [lustre]
23:24:47: [<ffffffff88e27575>] ll_put_super+0x1a5/0x330 [lustre]
23:24:47: [<ffffffff800eecd4>] invalidate_inodes+0xce/0xe0
23:24:47: [<ffffffff800e6332>] generic_shutdown_super+0x79/0xfb
23:24:47: [<ffffffff800e6402>] kill_anon_super+0x9/0x35
23:24:47: [<ffffffff800e64b3>] deactivate_super+0x6a/0x82
23:24:47: [<ffffffff800f0413>] sys_umount+0x245/0x27b
Comment by Sarah Liu [ 20/Aug/12 ]

Hit this failure when trying to umount Lustre after the parallel-scale-nfsv3 test finished.
https://maloo.whamcloud.com/sub_tests/7c54f460-e93a-11e1-a508-52540035b04c

Server: lustre-master-tag2.2.93 RHEL6
Client: 1.8.8

dmesg log from MDS

00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) LBUG
00:44:32:Pid: 12315, comm: umount
00:44:32:
00:44:32:Call Trace:
00:44:32: [<ffffffffa044b905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
00:44:32: [<ffffffffa044bf17>] lbug_with_loc+0x47/0xb0 [libcfs]
00:44:32: [<ffffffffa05d503c>] lu_device_fini+0xcc/0xd0 [obdclass]
00:44:33: [<ffffffffa09ce234>] lovsub_device_free+0x24/0x200 [lov]
00:44:33: [<ffffffffa05d8586>] lu_stack_fini+0x96/0xf0 [obdclass]
00:44:33: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass]
00:44:33: [<ffffffffa09bd6a8>] lov_device_fini+0x58/0x130 [lov]
00:44:34: [<ffffffffa05d8539>] lu_stack_fini+0x49/0xf0 [obdclass]
00:44:34: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass]
00:44:34: [<ffffffffa0a8d31d>] cl_sb_fini+0x6d/0x190 [lustre]
00:44:34: [<ffffffffa0a5255c>] client_common_put_super+0x14c/0xe60 [lustre]
00:44:34: [<ffffffffa0a53340>] ll_put_super+0xd0/0x360 [lustre]
00:44:34: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
00:44:34: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
00:44:34: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
00:44:34: [<ffffffffa05c2c7a>] lustre_kill_super+0x4a/0x60 [obdclass]
00:44:34: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
00:44:34: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
00:44:34: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
00:44:34: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
00:44:34: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
00:44:35:
00:44:36:Kernel panic - not syncing: LBUG
00:44:36:Pid: 12315, comm: umount Not tainted 2.6.32-279.2.1.el6_lustre.x86_64 #1
00:44:36:Call Trace:
00:44:36: [<ffffffff814fd58a>] ? panic+0xa0/0x168
00:44:36: [<ffffffffa044bf6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
00:44:37: [<ffffffffa05d503c>] ? lu_device_fini+0xcc/0xd0 [obdclass]
00:44:37: [<ffffffffa09ce234>] ? lovsub_device_free+0x24/0x200 [lov]
00:44:37: [<ffffffffa05d8586>] ? lu_stack_fini+0x96/0xf0 [obdclass]
00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass]
00:44:37: [<ffffffffa09bd6a8>] ? lov_device_fini+0x58/0x130 [lov]
00:44:37: [<ffffffffa05d8539>] ? lu_stack_fini+0x49/0xf0 [obdclass]
00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass]
00:44:37: [<ffffffffa0a8d31d>] ? cl_sb_fini+0x6d/0x190 [lustre]
00:44:37: [<ffffffffa0a5255c>] ? client_common_put_super+0x14c/0xe60 [lustre]
00:44:37: [<ffffffffa0a53340>] ? ll_put_super+0xd0/0x360 [lustre]
00:44:37: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
00:44:37: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0
00:44:38: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60
00:44:38: [<ffffffffa05c2c7a>] ? lustre_kill_super+0x4a/0x60 [obdclass]
00:44:39: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90
00:44:39: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110
00:44:39: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0
00:44:39: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
00:44:39: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
00:44:39:Initializing cgroup subsys cpuset
Comment by Peter Jones [ 21/Aug/12 ]

Bobijam will look into this one

Comment by Jinshan Xiong (Inactive) [ 22/Aug/12 ]

duplicate of LU-1773

Comment by Andreas Dilger [ 30/Aug/12 ]

This problem was still hit in conf-sanity.sh:

https://maloo.whamcloud.com/test_sets/f46209aa-f1b6-11e1-9def-52540035b04c

but the patch being tested (git hash e65e329d84ba05370cb7f8e7a789cad129bcc287 == patch set 9) is based on master git hash 34a60922e2b94fb35872a9bd6fe3eefa6a95826c (LU-1661 tests: improve setup_posix_users()), which should already contain the fix for LU-1773.

Comment by Andreas Dilger [ 30/Aug/12 ]

It failed again later in the same test run:

https://maloo.whamcloud.com/test_sets/47f9f3d0-f1b6-11e1-9def-52540035b04c

Comment by Andreas Dilger [ 30/Aug/12 ]

Failed a third time in the same test run:
https://maloo.whamcloud.com/test_sets/80724b1a-f1b4-11e1-9def-52540035b04c

Comment by Zhenyu Xu [ 03/Sep/12 ]

Xiong,

Can we be sure that there is no race setting lli->lli_clob in the following code?

cl_file_inode_init()
//
        if (lli->lli_clob == NULL) {
                /* clob is slave of inode, empty lli_clob means for new inode,
                 * there is no clob in cache with the given fid, so it is
                 * unnecessary to perform lookup-alloc-lookup-insert, just
                 * alloc and insert directly. */
#ifdef __KERNEL__
                LASSERT(inode->i_state & I_NEW);
#endif
                conf.coc_lu.loc_flags = LOC_F_NEW;
                clob = cl_object_find(env, lu2cl_dev(site->ls_top_dev),
                                      fid, &conf);
                if (!IS_ERR(clob)) {
                        /*
                         * No locking is necessary, as new inode is
                         * locked by I_NEW bit.
                         *
                         * XXX not true for call from ll_update_inode().
                         */
                        lli->lli_clob = clob;
                        lu_object_ref_add(&clob->co_lu, "inode", inode);
                } else
                        result = PTR_ERR(clob);
        }
Comment by Jinshan Xiong (Inactive) [ 03/Sep/12 ]

Hi Bobi, the above piece of code looks good because the inode state I_NEW seems sufficient to protect the assignment of lli_clob. From the log, I think there was an object leak at the lovsub layer. Let's reserve some nodes on toro to reproduce this problem.
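For illustration, the I_NEW invariant discussed above can be modeled in plain C (a standalone sketch; all names here are illustrative stand-ins, not the kernel or Lustre API): while I_NEW is set, the inode is visible only to the allocating thread, so the one-time assignment needs no lock, and the ll_update_inode() path flagged in the XXX comment is exactly the case where that assumption breaks.

```c
#include <assert.h>
#include <stddef.h>

#define I_NEW 0x1  /* illustrative stand-in for the VFS I_NEW flag */

struct model_inode {
	unsigned int state;  /* stands in for inode->i_state */
	void        *clob;   /* stands in for lli->lli_clob  */
};

/* Assign the clob exactly once.  No lock is taken: while I_NEW is
 * set, the inode is private to the allocating thread, so the plain
 * store cannot race.  Returns -1 if called on an inode that is
 * neither new nor already initialized - the ll_update_inode()
 * caveat from the XXX comment in the quoted code. */
int model_clob_init(struct model_inode *inode, void *clob)
{
	if (inode->clob == NULL) {
		if (!(inode->state & I_NEW))
			return -1;
		inode->clob = clob;
	}
	return 0;
}
```

The sketch only captures the protocol, not the cl_object_find() allocation itself; the point is that correctness rests entirely on the I_NEW bit, which is why the later comments look for the leak elsewhere (at the lovsub layer) rather than in this assignment.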

Comment by Zhenyu Xu [ 12/Sep/12 ]

git commit 35920b759ed78441db0cd9de6ac8ec66da862f22 has changed the mount logic; please lower the priority and wait to see whether we hit the issue again.

Comment by Peter Jones [ 12/Sep/12 ]

ok

Comment by Sarah Liu [ 24/Sep/12 ]

Hit this issue when testing interop between 1.8 and 2.3-RC1 after parallel-scale-nfsv3 passed all the sub-tests.

https://maloo.whamcloud.com/test_sets/71e39924-0626-11e2-9b17-52540035b04c

MDS console log of test_nfsread_orphan_file shows:

19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) header@ffff8800517c9bf0[0x0, 1, [0x100060000:0x4769:0x0] hash]{ 
19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) ....lovsub@ffff8800517c9c88[0]
19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) ....osc@ffff8800545352a8id: 18281 gr: 0 idx: 6 gen: 0 kms_valid: 1 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 2942 mtime: 1348454797 atime: 0 ctime: 1348454797 blocks: 0
19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) } header@ffff8800517c9bf0
19:47:03:LustreError: 20620:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
19:47:03:LustreError: 20620:0:(lu_object.c:1081:lu_device_fini()) LBUG
19:47:03:Pid: 20620, comm: umount
19:47:03:
19:47:03:Call Trace:
19:47:03: [<ffffffffa0d31905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:47:03: [<ffffffffa0d31f17>] lbug_with_loc+0x47/0xb0 [libcfs]
19:47:03: [<ffffffffa040b2dc>] lu_device_fini+0xcc/0xd0 [obdclass]
19:47:03: [<ffffffffa089f114>] lovsub_device_free+0x24/0x200 [lov]
19:47:03: [<ffffffffa040e826>] lu_stack_fini+0x96/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa088e6a8>] lov_device_fini+0x58/0x130 [lov]
19:47:03: [<ffffffffa040e7d9>] lu_stack_fini+0x49/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa0b744cd>] cl_sb_fini+0x6d/0x190 [lustre]
19:47:03: [<ffffffffa0b3959c>] client_common_put_super+0x14c/0xe60 [lustre]
19:47:03: [<ffffffffa0b3a380>] ll_put_super+0xd0/0x360 [lustre]
19:47:03: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
19:47:03: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
19:47:03: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
19:47:03: [<ffffffffa03f8eaa>] lustre_kill_super+0x4a/0x60 [obdclass]
19:47:03: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
19:47:03: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
19:47:03: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
19:47:03: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
19:47:03: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
19:47:03:
19:47:03:Kernel panic - not syncing: LBUG
19:47:03:Pid: 20620, comm: umount Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1
19:47:03:Call Trace:
19:47:03: [<ffffffff814fd58a>] ? panic+0xa0/0x168
19:47:03: [<ffffffffa0d31f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
19:47:03: [<ffffffffa040b2dc>] ? lu_device_fini+0xcc/0xd0 [obdclass]
19:47:03: [<ffffffffa089f114>] ? lovsub_device_free+0x24/0x200 [lov]
19:47:03: [<ffffffffa040e826>] ? lu_stack_fini+0x96/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] ? cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa088e6a8>] ? lov_device_fini+0x58/0x130 [lov]
19:47:03: [<ffffffffa040e7d9>] ? lu_stack_fini+0x49/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] ? cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa0b744cd>] ? cl_sb_fini+0x6d/0x190 [lustre]
19:47:03: [<ffffffffa0b3959c>] ? client_common_put_super+0x14c/0xe60 [lustre]
19:47:03: [<ffffffffa0b3a380>] ? ll_put_super+0xd0/0x360 [lustre]
19:47:03: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
19:47:03: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0
19:47:03: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60
19:47:03: [<ffffffffa03f8eaa>] ? lustre_kill_super+0x4a/0x60 [obdclass]
19:47:03: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90
19:47:03: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110
19:47:03: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0
19:47:03: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
19:47:03: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
19:47:03:Initializing cgroup subsys cpuset
19:47:03:Initializing cgroup subsys cpu
Comment by Zhenyu Xu [ 01/Oct/12 ]

http://review.whamcloud.com/4108

LU-1480 lov: lov_delete_raid0 need wait

If lov_delete_raid0 does not wait for its layout to become stable, the lov
object's deletion will leave lovsub objects hanging in memory.
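The wait the patch introduces can be sketched as a standalone model (pthreads and C11 atomics; the names are illustrative, not the actual lov API): deletion blocks until the last in-flight user of the layout drops its reference, so no sub-object can outlive the teardown and trip the ld_ref == 0 assertion in lu_device_fini().

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical model of a raid0 layout with active users. */
struct raid0_layout {
	atomic_int      active;   /* in-flight users of the layout */
	pthread_mutex_t lock;
	pthread_cond_t  waitq;    /* delete waits here, like a Lustre waitq */
	int             deleted;
};

void layout_init(struct raid0_layout *lo)
{
	atomic_store(&lo->active, 0);
	pthread_mutex_init(&lo->lock, NULL);
	pthread_cond_init(&lo->waitq, NULL);
	lo->deleted = 0;
}

void layout_get(struct raid0_layout *lo)
{
	atomic_fetch_add(&lo->active, 1);
}

void layout_put(struct raid0_layout *lo)
{
	pthread_mutex_lock(&lo->lock);
	if (atomic_fetch_sub(&lo->active, 1) == 1)
		pthread_cond_broadcast(&lo->waitq);  /* last user: wake delete */
	pthread_mutex_unlock(&lo->lock);
}

/* The fix sketched by the patch: block until the layout is stable
 * (no users) instead of deleting immediately. */
void layout_delete(struct raid0_layout *lo)
{
	pthread_mutex_lock(&lo->lock);
	while (atomic_load(&lo->active) != 0)
		pthread_cond_wait(&lo->waitq, &lo->lock);
	lo->deleted = 1;
	pthread_mutex_unlock(&lo->lock);
}

static void *user_thread(void *arg)
{
	struct raid0_layout *lo = arg;

	usleep(10000);        /* hold the layout briefly */
	layout_put(lo);
	return NULL;
}

/* Returns 1 iff delete completed only after the user dropped its ref. */
int demo_delete_waits(void)
{
	struct raid0_layout lo;
	pthread_t t;

	layout_init(&lo);
	layout_get(&lo);      /* one outstanding user */
	pthread_create(&t, NULL, user_thread, &lo);
	layout_delete(&lo);   /* must block until layout_put() runs */
	pthread_join(&t, NULL);
	return lo.deleted && atomic_load(&lo.active) == 0;
}
```

Without the while-loop wait, layout_delete() would mark the layout deleted while a user still held a reference, which is the leaked-lovsub scenario in the traces above.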

Comment by Jian Yu [ 11/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/198
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64

The same issue occurred in parallel-scale-nfsv3 test nfsread_orphan_file:
https://maloo.whamcloud.com/test_sets/d53d4086-1370-11e2-808f-52540035b04c

Comment by Jian Yu [ 15/Oct/12 ]

Lustre Tag: v2_3_0_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64(server), FC15/x86_64(client)
Network: TCP
ENABLE_QUOTA=yes

parallel-scale-nfsv3 test iorfpp failed with the same issue:
https://maloo.whamcloud.com/test_sets/7ca42e16-168c-11e2-962d-52540035b04c

Comment by Jian Yu [ 15/Oct/12 ]

Lustre Tag: v2_3_0_RC3
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/198
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64

The same issue occurred in parallel-scale-nfsv3 test nfsread_orphan_file:
https://maloo.whamcloud.com/test_sets/0101c7c8-16a0-11e2-962d-52540035b04c

Comment by Li Wei (Inactive) [ 16/Oct/12 ]

Promoted to Blocker for 2.4.

Comment by Andreas Dilger [ 19/Oct/12 ]

Alex reported in LU-2070:

please use http://review.whamcloud.com/4151 to debug

Comment by Zhenyu Xu [ 12/Nov/12 ]

status update:

A debugging patch has landed on the master branch; we are waiting for re-hits with the debug message.

Comment by Ian Colle (Inactive) [ 13/Nov/12 ]

https://maloo.whamcloud.com/test_sets/be5714e2-2ce7-11e2-9af4-52540035b04c

Comment by Andreas Dilger [ 20/Nov/12 ]

Bobijam, can you please look at https://maloo.whamcloud.com/test_sets/be5714e2-2ce7-11e2-9af4-52540035b04c to see whether your debugging patch captured the information you need to resolve this problem?

Comment by Zhenyu Xu [ 21/Nov/12 ]

Didn't find it; the failure shown in the above Maloo report is that a client cannot finish inode sync while trying to umount the mount point, which is unrelated to the device refcount issue.

Comment by Andreas Dilger [ 23/Nov/12 ]

Bobijam, this bug was reported as hit 14 times in the past week, according to:

https://maloo.whamcloud.com/test_sets/query?utf8=%E2%9C%93&test_set[test_set_script_id]=&test_set[status]=&test_set[query_bugs]=LU-1480&test_session[test_host]=&test_session[test_group]=&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node_network[network_type_id]=&commit=Update+results

The most recent is at:
https://maloo.whamcloud.com/test_sets/8ab52280-3536-11e2-918f-52540035b04c

Can you please check whether the information you need is in one of these failures? If not, is there something that can be done to improve the debugging patch to capture the information you need?

Comment by Zhenyu Xu [ 28/Nov/12 ]

We haven't seen this issue in recent tests; I think we can lower the severity.

Comment by Peter Jones [ 30/Nov/12 ]

Dropping priority as no longer occurring regularly

Comment by Jian Yu [ 19/Apr/13 ]

Lustre Build: https://build.whamcloud.com/job/lustre-master/1340/

Server: el6, x86_64
clients: fc18, x86_64

The issue occurred again and was reported in LU-3116.

Comment by Minh Diep [ 19/Apr/13 ]

I hit this very frequently using an FC18 client running the sanity test suite.

Comment by Zhenyu Xu [ 20/Apr/13 ]

pushed a debug patch at http://review.whamcloud.com/6105

Comment by Minh Diep [ 23/Apr/13 ]

I have reproduced this issue and triggered a kernel crash dump, available at /scratch/ftp/uploads/LU-1480.

Comment by Jian Yu [ 30/May/13 ]

Lustre b1_8 client build: http://build.whamcloud.com/job/lustre-b1_8/258 (1.8.9-wc1)
Lustre b2_1 server build: http://build.whamcloud.com/job/lustre-b2_1/205

After running parallel-scale-nfsv3, unmounting NFS server/Lustre client on the MDS node hit the same failure:

19:29:19:LustreError: 10728:0:(lu_object.c:1018:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
19:29:19:LustreError: 10728:0:(lu_object.c:1018:lu_device_fini()) LBUG
19:29:20:Pid: 10728, comm: umount
19:29:20:
19:29:20:Call Trace:
19:29:20: [<ffffffffa049f785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:29:20: [<ffffffffa049fd97>] lbug_with_loc+0x47/0xb0 [libcfs]
19:29:20: [<ffffffffa05df3dc>] lu_device_fini+0xcc/0xd0 [obdclass]
19:29:20: [<ffffffffa0a798b4>] lovsub_device_free+0x24/0x1e0 [lov]
19:29:20: [<ffffffffa05e25f6>] lu_stack_fini+0x96/0xf0 [obdclass]
19:29:20: [<ffffffffa05e6bfe>] cl_stack_fini+0xe/0x10 [obdclass]
19:29:20: [<ffffffffa0a699d8>] lov_device_fini+0x58/0x130 [lov]
19:29:20: [<ffffffffa05e25a9>] lu_stack_fini+0x49/0xf0 [obdclass]
19:29:20: [<ffffffffa05e6bfe>] cl_stack_fini+0xe/0x10 [obdclass]
19:29:20: [<ffffffffa0b52b6d>] cl_sb_fini+0x6d/0x190 [lustre]
19:29:20: [<ffffffffa0b1ac9c>] client_common_put_super+0x14c/0xe60 [lustre]
19:29:20: [<ffffffffa0b1ba80>] ll_put_super+0xd0/0x360 [lustre]
19:29:20: [<ffffffff8119d546>] ? invalidate_inodes+0xf6/0x190
19:29:20: [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0
19:29:20: [<ffffffff81183436>] kill_anon_super+0x16/0x60
19:29:21: [<ffffffffa05ceaca>] lustre_kill_super+0x4a/0x60 [obdclass]
19:29:21: [<ffffffff81183bd7>] deactivate_super+0x57/0x80
19:29:21: [<ffffffff811a1bff>] mntput_no_expire+0xbf/0x110
19:29:21: [<ffffffff811a266b>] sys_umount+0x7b/0x3a0
19:29:21: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
19:29:21:
19:29:21:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/84843efe-c8e9-11e2-97fe-52540035b04c

Comment by Zhenyu Xu [ 04/Jun/13 ]

update a patchset at http://review.whamcloud.com/6105

During file lov object initialization, we need to protect access to and changes of its subobj->coh_parent, since another layout change could race there, which leaves an unreferenced lovsub object in the site object hash table.
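The locking rule described here can be sketched as a standalone model (illustrative names only, not the cl_object API): both the init path and the layout-change path take the same per-child lock around coh_parent, so a racing init can never link the child to a stale parent.

```c
#include <pthread.h>
#include <stddef.h>

struct parent;  /* opaque in this sketch */

/* Hypothetical model of a lovsub-like child whose coh_parent pointer
 * may be re-pointed during a layout change.  If init and layout
 * change did not share this lock, a race could leave an unreferenced
 * child in the site hash table - the leak this ticket tracks. */
struct child {
	pthread_mutex_t lock;
	struct parent  *coh_parent;
};

/* Called from both object init and layout change. */
void child_set_parent(struct child *c, struct parent *p)
{
	pthread_mutex_lock(&c->lock);
	c->coh_parent = p;
	pthread_mutex_unlock(&c->lock);
}

struct parent *child_get_parent(struct child *c)
{
	struct parent *p;

	pthread_mutex_lock(&c->lock);
	p = c->coh_parent;
	pthread_mutex_unlock(&c->lock);
	return p;
}
```

The real patch (http://review.whamcloud.com/6105) serializes the equivalent update inside Lustre's own structures; the sketch only shows the shape of the fix, not its mechanics.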

Comment by Jian Yu [ 27/Jun/13 ]

Lustre b1_8 client build: http://build.whamcloud.com/job/lustre-b1_8/258 (1.8.9-wc1)
Lustre b2_1 server build: http://build.whamcloud.com/job/lustre-b2_1/215 (2.1.6 RC2)

After running parallel-scale-nfsv3, unmounting NFS server/Lustre client on the MDS node hit the same failure:
https://maloo.whamcloud.com/test_sets/25c65126-de8e-11e2-afb2-52540035b04c

Comment by Jian Yu [ 09/Aug/13 ]

Lustre Branch: b2_4
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/27/
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)

Sanity test 232 failed as follows:

== sanity test 232: failed lock should not block umount ============================================== 08:02:59 (1375628579)
fail_loc=0x31c
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0207861 s, 50.4 MB/s
fail_loc=0
10.10.4.122@tcp:/lustre /mnt/lustre lustre rw,flock,user_xattr 0 0
CMD: client-16vm2.lab.whamcloud.com grep -c /mnt/lustre' ' /proc/mounts
Stopping client client-16vm2.lab.whamcloud.com /mnt/lustre (opts:)
CMD: client-16vm2.lab.whamcloud.com lsof -t /mnt/lustre

Syslog on client node client-16vm2 showed that:

Aug  4 08:02:59 client-16vm2 kernel: [ 6221.766422] Lustre: DEBUG MARKER: == sanity test 232: failed lock should not block umount ============================================== 08:02:59 (1375628579)
Aug  4 08:02:59 client-16vm2 mrshd[30197]: pam_unix(mrsh:session): session closed for user root
Aug  4 08:02:59 client-16vm2 xinetd[427]: EXIT: mshell status=0 pid=30197 duration=0(sec)
Aug  4 08:02:59 client-16vm2 systemd-logind[299]: Removed session c2300.
Aug  4 08:02:59 client-16vm2 kernel: [ 6221.842241] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
Aug  4 08:02:59 client-16vm2 kernel: [ 6221.853369] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
Aug  4 08:03:00 client-16vm2 kernel: [ 6221.973212] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.052122] LustreError: 30246:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.058312] LustreError: 30246:0:(lu_object.c:1141:lu_device_fini()) LBUG
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063666] Pid: 30246, comm: umount
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063669]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063669] Call Trace:
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063690]  [<ffffffffa02477e7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063698]  [<ffffffffa0247df5>] lbug_with_loc+0x45/0xc0 [libcfs]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063735]  [<ffffffffa03b3c0a>] lu_device_fini+0xba/0xc0 [obdclass]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063754]  [<ffffffffa0704816>] lovsub_device_free+0x66/0x1b0 [lov]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063776]  [<ffffffffa03b7a0e>] lu_stack_fini+0x7e/0xc0 [obdclass]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063797]  [<ffffffffa03bd24e>] cl_stack_fini+0xe/0x10 [obdclass]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063807]  [<ffffffffa06f48c9>] lov_device_fini+0x59/0x120 [lov]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063829]  [<ffffffffa03b79d9>] lu_stack_fini+0x49/0xc0 [obdclass]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063861]  [<ffffffffa03bd24e>] cl_stack_fini+0xe/0x10 [obdclass]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063892]  [<ffffffffa07e34d1>] cl_sb_fini+0x61/0x1a0 [lustre]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063905]  [<ffffffffa07a5d84>] client_common_put_super+0x54/0x7c0 [lustre]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063918]  [<ffffffffa07a6e84>] ll_put_super+0xd4/0x370 [lustre]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063922]  [<ffffffff811aa7ee>] ? dispose_list+0x3e/0x60
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063925]  [<ffffffff81192761>] generic_shutdown_super+0x61/0xe0
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063928]  [<ffffffff81192876>] kill_anon_super+0x16/0x30
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063949]  [<ffffffffa03a8b7a>] lustre_kill_super+0x4a/0x60 [obdclass]
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063951]  [<ffffffff81192bf7>] deactivate_locked_super+0x57/0x90
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063953]  [<ffffffff811937ae>] deactivate_super+0x4e/0x70
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063956]  [<ffffffff811ae167>] mntput_no_expire+0xd7/0x130
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063958]  [<ffffffff811af146>] sys_umount+0x76/0x390
Aug  4 08:03:00 client-16vm2 kernel: [ 6222.063961]  [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/996b2278-fd79-11e2-9fdb-52540035b04c

The failure occurs regularly on FC18 clients and is blocking the remaining sanity sub-tests from running.

Comment by Jian Yu [ 04/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)

sanity test 232 hit the same failure:
https://maloo.whamcloud.com/test_sets/0cbde1d0-14ee-11e3-ac48-52540035b04c

Comment by Sarah Liu [ 11/Sep/13 ]

Also hit this error when running interop between 2.4.0 server and 2.5 client:
https://maloo.whamcloud.com/test_sets/a58ce5fe-19c7-11e3-bb73-52540035b04c

server: 2.4.0
client: lustre-master build #1652

19:44:33:LustreError: 10863:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
19:44:33:LustreError: 10863:0:(lu_object.c:1141:lu_device_fini()) LBUG
19:44:34:Pid: 10863, comm: umount
19:44:34:
19:44:35:Call Trace:
19:44:35: [<ffffffffa0478895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:44:35: [<ffffffffa0478e97>] lbug_with_loc+0x47/0xb0 [libcfs]
19:44:36: [<ffffffffa05ec4b8>] lu_device_fini+0xb8/0xc0 [obdclass]
19:44:36: [<ffffffffa05d0727>] ls_device_put+0x87/0x1e0 [obdclass]
19:44:36: [<ffffffffa05d0a3c>] local_oid_storage_fini+0x1bc/0x270 [obdclass]
19:44:36: [<ffffffffa0d6bd74>] mgs_fs_cleanup+0x64/0x80 [mgs]
19:44:36: [<ffffffffa0d68ae0>] mgs_device_fini+0x1d0/0x5a0 [mgs]
19:44:36: [<ffffffffa05ddba7>] class_cleanup+0x577/0xda0 [obdclass]
19:44:36: [<ffffffffa05b2b36>] ? class_name2dev+0x56/0xe0 [obdclass]
19:44:37: [<ffffffffa05df48c>] class_process_config+0x10bc/0x1c80 [obdclass]
19:44:37: [<ffffffffa05d8cb3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
19:44:38: [<ffffffffa05e01c9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
19:44:38: [<ffffffffa05b2b36>] ? class_name2dev+0x56/0xe0 [obdclass]
19:44:38: [<ffffffffa0614d7d>] server_put_super+0x46d/0xf00 [obdclass]
19:44:38: [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0
19:44:38: [<ffffffff81183436>] kill_anon_super+0x16/0x60
19:44:39: [<ffffffffa05e2026>] lustre_kill_super+0x36/0x60 [obdclass]
19:44:39: [<ffffffff81183bd7>] deactivate_super+0x57/0x80
19:44:39: [<ffffffff811a1c4f>] mntput_no_expire+0xbf/0x110
19:44:39: [<ffffffff811a26bb>] sys_umount+0x7b/0x3a0
19:44:39: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
19:44:39:
19:44:40:Kernel panic - not syncing: LBUG
19:44:40:Pid: 10863, comm: umount Not tainted 2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1
Comment by Peter Jones [ 24/Sep/13 ]

Landed for 2.5.0

Generated at Sat Feb 10 01:17:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.