[LU-1480] failure on replay-single test_74: ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Created: 05/Jun/12 Updated: 21/Jul/20 Resolved: 24/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | yuc2 | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 4293 | ||||||||||||||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8506fd4e-ad5b-11e1-8152-52540035b04c. The sub-test test_74 failed with the following error:
Info required for matching: replay-single 74 |
| Comments |
| Comment by Andreas Dilger [ 16/Aug/12 ] |
23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 23:24:45:LustreError: 12357:0:(lu_object.c:1081:lu_device_fini()) LBUG 23:24:46:Pid: 12357, comm: umount 23:24:46: 23:24:46:Call Trace: 23:24:46: [<ffffffff887dc751>] libcfs_debug_dumpstack+0x51/0x60 [libcfs] 23:24:46: [<ffffffff887dcc58>] lbug_with_loc+0x48/0x90 [libcfs] 23:24:46: [<ffffffff88978a08>] lu_device_fini+0x68/0xe0 [obdclass] 23:24:46: [<ffffffff88d43110>] lovsub_device_free+0x70/0x1f0 [lov] 23:24:46: [<ffffffff88978020>] lu_cdebug_printer+0x0/0x170 [obdclass] 23:24:46: [<ffffffff88977686>] lu_stack_fini+0xa6/0xd0 [obdclass] 23:24:46: [<ffffffff88d3327a>] lov_device_fini+0xaa/0x110 [lov] 23:24:46: [<ffffffff88977618>] lu_stack_fini+0x38/0xd0 [obdclass] 23:24:46: [<ffffffff88e609f7>] cl_sb_fini+0xb7/0x180 [lustre] 23:24:46: [<ffffffff88e2680d>] client_common_put_super+0x3fd/0xe10 [lustre] 23:24:47: [<ffffffff88e27575>] ll_put_super+0x1a5/0x330 [lustre] 23:24:47: [<ffffffff800eecd4>] invalidate_inodes+0xce/0xe0 23:24:47: [<ffffffff800e6332>] generic_shutdown_super+0x79/0xfb 23:24:47: [<ffffffff800e6402>] kill_anon_super+0x9/0x35 23:24:47: [<ffffffff800e64b3>] deactivate_super+0x6a/0x82 23:24:47: [<ffffffff800f0413>] sys_umount+0x245/0x27b |
| Comment by Sarah Liu [ 20/Aug/12 ] |
|
Hit this failure when trying to umount lustre after test parallel-scale-nfsv3 finished. Server: lustre-master-tag2.2.93 RHEL6 dmesg log from MDS 00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 00:44:32:LustreError: 12315:0:(lu_object.c:1081:lu_device_fini()) LBUG 00:44:32:Pid: 12315, comm: umount 00:44:32: 00:44:32:Call Trace: 00:44:32: [<ffffffffa044b905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 00:44:32: [<ffffffffa044bf17>] lbug_with_loc+0x47/0xb0 [libcfs] 00:44:32: [<ffffffffa05d503c>] lu_device_fini+0xcc/0xd0 [obdclass] 00:44:33: [<ffffffffa09ce234>] lovsub_device_free+0x24/0x200 [lov] 00:44:33: [<ffffffffa05d8586>] lu_stack_fini+0x96/0xf0 [obdclass] 00:44:33: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass] 00:44:33: [<ffffffffa09bd6a8>] lov_device_fini+0x58/0x130 [lov] 00:44:34: [<ffffffffa05d8539>] lu_stack_fini+0x49/0xf0 [obdclass] 00:44:34: [<ffffffffa05dd50e>] cl_stack_fini+0xe/0x10 [obdclass] 00:44:34: [<ffffffffa0a8d31d>] cl_sb_fini+0x6d/0x190 [lustre] 00:44:34: [<ffffffffa0a5255c>] client_common_put_super+0x14c/0xe60 [lustre] 00:44:34: [<ffffffffa0a53340>] ll_put_super+0xd0/0x360 [lustre] 00:44:34: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190 00:44:34: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0 00:44:34: [<ffffffff8117d436>] kill_anon_super+0x16/0x60 00:44:34: [<ffffffffa05c2c7a>] lustre_kill_super+0x4a/0x60 [obdclass] 00:44:34: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90 00:44:34: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110 00:44:34: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0 00:44:34: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0 00:44:34: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 00:44:35: 00:44:36:Kernel panic - not syncing: LBUG 00:44:36:Pid: 12315, comm: umount Not tainted 2.6.32-279.2.1.el6_lustre.x86_64 #1 00:44:36:Call Trace: 00:44:36: [<ffffffff814fd58a>] ? panic+0xa0/0x168 00:44:36: [<ffffffffa044bf6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 00:44:37: [<ffffffffa05d503c>] ? lu_device_fini+0xcc/0xd0 [obdclass] 00:44:37: [<ffffffffa09ce234>] ? lovsub_device_free+0x24/0x200 [lov] 00:44:37: [<ffffffffa05d8586>] ? lu_stack_fini+0x96/0xf0 [obdclass] 00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass] 00:44:37: [<ffffffffa09bd6a8>] ? lov_device_fini+0x58/0x130 [lov] 00:44:37: [<ffffffffa05d8539>] ? lu_stack_fini+0x49/0xf0 [obdclass] 00:44:37: [<ffffffffa05dd50e>] ? cl_stack_fini+0xe/0x10 [obdclass] 00:44:37: [<ffffffffa0a8d31d>] ? cl_sb_fini+0x6d/0x190 [lustre] 00:44:37: [<ffffffffa0a5255c>] ? client_common_put_super+0x14c/0xe60 [lustre] 00:44:37: [<ffffffffa0a53340>] ? ll_put_super+0xd0/0x360 [lustre] 00:44:37: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190 00:44:37: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0 00:44:38: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60 00:44:38: [<ffffffffa05c2c7a>] ? lustre_kill_super+0x4a/0x60 [obdclass] 00:44:39: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90 00:44:39: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110 00:44:39: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0 00:44:39: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0 00:44:39: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b 00:44:39:Initializing cgroup subsys cpuset |
| Comment by Peter Jones [ 21/Aug/12 ] |
|
Bobijam will look into this one |
| Comment by Jinshan Xiong (Inactive) [ 22/Aug/12 ] |
|
duplicate of |
| Comment by Andreas Dilger [ 30/Aug/12 ] |
|
This problem was still hit in conf-sanity.sh: https://maloo.whamcloud.com/test_sets/f46209aa-f1b6-11e1-9def-52540035b04c but the patch being tested (git hash e65e329d84ba05370cb7f8e7a789cad129bcc287 == patch set 9) shows that it is based on master git hash 34a60922e2b94fb35872a9bd6fe3eefa6a95826c ( |
| Comment by Andreas Dilger [ 30/Aug/12 ] |
|
It failed again later in the same test run: https://maloo.whamcloud.com/test_sets/47f9f3d0-f1b6-11e1-9def-52540035b04c |
| Comment by Andreas Dilger [ 30/Aug/12 ] |
|
Failed a third time in the same test run: |
| Comment by Zhenyu Xu [ 03/Sep/12 ] |
|
Xiong, Can we sure about that there is no race to set lli->lli_clob in the following code? cl_file_inode_init() // if (lli->lli_clob == NULL) { /* clob is slave of inode, empty lli_clob means for new inode, * there is no clob in cache with the given fid, so it is * unnecessary to perform lookup-alloc-lookup-insert, just * alloc and insert directly. */ #ifdef __KERNEL__ LASSERT(inode->i_state & I_NEW); #endif conf.coc_lu.loc_flags = LOC_F_NEW; clob = cl_object_find(env, lu2cl_dev(site->ls_top_dev), fid, &conf); if (!IS_ERR(clob)) { /* * No locking is necessary, as new inode is * locked by I_NEW bit. * * XXX not true for call from ll_update_inode(). */ lli->lli_clob = clob; lu_object_ref_add(&clob->co_lu, "inode", inode); } else result = PTR_ERR(clob); } |
| Comment by Jinshan Xiong (Inactive) [ 03/Sep/12 ] |
|
Hi Bobi, the above piece of code looks good because inode state I_NEW seems enough to protect the assignment of lli_clob. From the log, I think there were object leak on the lovsub layer. Let's reserve some nodes on toro to reproduce this problem. |
| Comment by Zhenyu Xu [ 12/Sep/12 ] |
|
git commit 35920b759ed78441db0cd9de6ac8ec66da862f22 has changed the mount logic, please lower the priority and wait to see whether we can hit the issue again. |
| Comment by Peter Jones [ 12/Sep/12 ] |
|
ok |
| Comment by Sarah Liu [ 24/Sep/12 ] |
|
Hit this issue when testing interop between 1.8 and 2.3-RC1 after paralles-scale-nfsv3 passed all the sub tests https://maloo.whamcloud.com/test_sets/71e39924-0626-11e2-9b17-52540035b04c MDS console log of test_nfsread_orphan_file shows: 19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) header@ffff8800517c9bf0[0x0, 1, [0x100060000:0x4769:0x0] hash]{
19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) ....lovsub@ffff8800517c9c88[0]
19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) ....osc@ffff8800545352a8id: 18281 gr: 0 idx: 6 gen: 0 kms_valid: 1 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 2942 mtime: 1348454797 atime: 0 ctime: 1348454797 blocks: 0
19:47:03:LustreError: 20620:0:(lu_object.c:1220:lu_stack_fini()) } header@ffff8800517c9bf0
19:47:03:LustreError: 20620:0:(lu_object.c:1081:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
19:47:03:LustreError: 20620:0:(lu_object.c:1081:lu_device_fini()) LBUG
19:47:03:Pid: 20620, comm: umount
19:47:03:
19:47:03:Call Trace:
19:47:03: [<ffffffffa0d31905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:47:03: [<ffffffffa0d31f17>] lbug_with_loc+0x47/0xb0 [libcfs]
19:47:03: [<ffffffffa040b2dc>] lu_device_fini+0xcc/0xd0 [obdclass]
19:47:03: [<ffffffffa089f114>] lovsub_device_free+0x24/0x200 [lov]
19:47:03: [<ffffffffa040e826>] lu_stack_fini+0x96/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa088e6a8>] lov_device_fini+0x58/0x130 [lov]
19:47:03: [<ffffffffa040e7d9>] lu_stack_fini+0x49/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa0b744cd>] cl_sb_fini+0x6d/0x190 [lustre]
19:47:03: [<ffffffffa0b3959c>] client_common_put_super+0x14c/0xe60 [lustre]
19:47:03: [<ffffffffa0b3a380>] ll_put_super+0xd0/0x360 [lustre]
19:47:03: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
19:47:03: [<ffffffff8117d34b>] generic_shutdown_super+0x5b/0xe0
19:47:03: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
19:47:03: [<ffffffffa03f8eaa>] lustre_kill_super+0x4a/0x60 [obdclass]
19:47:03: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
19:47:03: [<ffffffff8119a4ff>] mntput_no_expire+0xbf/0x110
19:47:03: [<ffffffff8119af9b>] sys_umount+0x7b/0x3a0
19:47:03: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
19:47:03: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
19:47:03:
19:47:03:Kernel panic - not syncing: LBUG
19:47:03:Pid: 20620, comm: umount Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1
19:47:03:Call Trace:
19:47:03: [<ffffffff814fd58a>] ? panic+0xa0/0x168
19:47:03: [<ffffffffa0d31f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
19:47:03: [<ffffffffa040b2dc>] ? lu_device_fini+0xcc/0xd0 [obdclass]
19:47:03: [<ffffffffa089f114>] ? lovsub_device_free+0x24/0x200 [lov]
19:47:03: [<ffffffffa040e826>] ? lu_stack_fini+0x96/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] ? cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa088e6a8>] ? lov_device_fini+0x58/0x130 [lov]
19:47:03: [<ffffffffa040e7d9>] ? lu_stack_fini+0x49/0xf0 [obdclass]
19:47:03: [<ffffffffa04137ae>] ? cl_stack_fini+0xe/0x10 [obdclass]
19:47:03: [<ffffffffa0b744cd>] ? cl_sb_fini+0x6d/0x190 [lustre]
19:47:03: [<ffffffffa0b3959c>] ? client_common_put_super+0x14c/0xe60 [lustre]
19:47:03: [<ffffffffa0b3a380>] ? ll_put_super+0xd0/0x360 [lustre]
19:47:03: [<ffffffff811961a6>] ? invalidate_inodes+0xf6/0x190
19:47:03: [<ffffffff8117d34b>] ? generic_shutdown_super+0x5b/0xe0
19:47:03: [<ffffffff8117d436>] ? kill_anon_super+0x16/0x60
19:47:03: [<ffffffffa03f8eaa>] ? lustre_kill_super+0x4a/0x60 [obdclass]
19:47:03: [<ffffffff8117e4b0>] ? deactivate_super+0x70/0x90
19:47:03: [<ffffffff8119a4ff>] ? mntput_no_expire+0xbf/0x110
19:47:03: [<ffffffff8119af9b>] ? sys_umount+0x7b/0x3a0
19:47:03: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
19:47:03: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
19:47:03:Initializing cgroup subsys cpuset
19:47:03:Initializing cgroup subsys cpu
|
| Comment by Zhenyu Xu [ 01/Oct/12 ] |
|
http://review.whamcloud.com/4108
If lov_delete_raid0 does not wait for its layout stable, lov object's |
| Comment by Jian Yu [ 11/Oct/12 ] |
|
Lustre Tag: v2_3_0_RC2 The same issue occurred in parallel-scale-nfsv3 test nfsread_orphan_file: |
| Comment by Jian Yu [ 15/Oct/12 ] |
|
Lustre Tag: v2_3_0_RC3 parallel-scale-nfsv3 test iorfpp failed with the same issue: |
| Comment by Jian Yu [ 15/Oct/12 ] |
|
Lustre Tag: v2_3_0_RC3 The same issue occurred in parallel-scale-nfsv3 test nfsread_orphan_file: |
| Comment by Li Wei (Inactive) [ 16/Oct/12 ] |
|
Prompted to Blocker for 2.4. |
| Comment by Andreas Dilger [ 19/Oct/12 ] |
|
Alex reported in
|
| Comment by Zhenyu Xu [ 12/Nov/12 ] |
|
status update: A debugging patch has landed in master branch and waiting for re-hits with the debug message. |
| Comment by Ian Colle (Inactive) [ 13/Nov/12 ] |
|
https://maloo.whamcloud.com/test_sets/be5714e2-2ce7-11e2-9af4-52540035b04c |
| Comment by Andreas Dilger [ 20/Nov/12 ] |
|
Bobijam, can you please look at the https://maloo.whamcloud.com/test_sets/be5714e2-2ce7-11e2-9af4-52540035b04c to see if your debugging patch contains the information you need to resolve this problem. |
| Comment by Zhenyu Xu [ 21/Nov/12 ] |
|
Didn't find it, the failure illustrated in the above maloo report is that a client cannot finish inode sync while trying to umount the mount point, not relating to the device refcount issue. |
| Comment by Andreas Dilger [ 23/Nov/12 ] |
|
Bobijam, this bug was reported hit 14 times in the past week, according to: The most recent is at: Can you please check if the information you need is in one of these failures. If not, is there something that can be done to improve the debugging patch to capture the information you need? |
| Comment by Zhenyu Xu [ 28/Nov/12 ] |
|
haven't seen this issue for recent tests, I think we can lower the severity. |
| Comment by Peter Jones [ 30/Nov/12 ] |
|
Dropping priority as no longer occurring regularly |
| Comment by Jian Yu [ 19/Apr/13 ] |
|
Lustre Build: https://build.whamcloud.com/job/lustre-master/1340/ Server: el6, x86_64 The issue occurred again and was reported in |
| Comment by Minh Diep [ 19/Apr/13 ] |
|
I hit this very frequent using fc18 client running sanity test |
| Comment by Zhenyu Xu [ 20/Apr/13 ] |
|
pushed a debug patch at http://review.whamcloud.com/6105 |
| Comment by Minh Diep [ 23/Apr/13 ] |
|
I have reproduced this issue and trigger kernel crash dump available at /scratch/ftp/uploads/ |
| Comment by Jian Yu [ 30/May/13 ] |
|
Lustre b1_8 client build: http://build.whamcloud.com/job/lustre-b1_8/258 (1.8.9-wc1) After running parallel-scale-nfsv3, unmounting NFS server/Lustre client on the MDS node hit the same failure: 19:29:19:LustreError: 10728:0:(lu_object.c:1018:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 19:29:19:LustreError: 10728:0:(lu_object.c:1018:lu_device_fini()) LBUG 19:29:20:Pid: 10728, comm: umount 19:29:20: 19:29:20:Call Trace: 19:29:20: [<ffffffffa049f785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 19:29:20: [<ffffffffa049fd97>] lbug_with_loc+0x47/0xb0 [libcfs] 19:29:20: [<ffffffffa05df3dc>] lu_device_fini+0xcc/0xd0 [obdclass] 19:29:20: [<ffffffffa0a798b4>] lovsub_device_free+0x24/0x1e0 [lov] 19:29:20: [<ffffffffa05e25f6>] lu_stack_fini+0x96/0xf0 [obdclass] 19:29:20: [<ffffffffa05e6bfe>] cl_stack_fini+0xe/0x10 [obdclass] 19:29:20: [<ffffffffa0a699d8>] lov_device_fini+0x58/0x130 [lov] 19:29:20: [<ffffffffa05e25a9>] lu_stack_fini+0x49/0xf0 [obdclass] 19:29:20: [<ffffffffa05e6bfe>] cl_stack_fini+0xe/0x10 [obdclass] 19:29:20: [<ffffffffa0b52b6d>] cl_sb_fini+0x6d/0x190 [lustre] 19:29:20: [<ffffffffa0b1ac9c>] client_common_put_super+0x14c/0xe60 [lustre] 19:29:20: [<ffffffffa0b1ba80>] ll_put_super+0xd0/0x360 [lustre] 19:29:20: [<ffffffff8119d546>] ? invalidate_inodes+0xf6/0x190 19:29:20: [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0 19:29:20: [<ffffffff81183436>] kill_anon_super+0x16/0x60 19:29:21: [<ffffffffa05ceaca>] lustre_kill_super+0x4a/0x60 [obdclass] 19:29:21: [<ffffffff81183bd7>] deactivate_super+0x57/0x80 19:29:21: [<ffffffff811a1bff>] mntput_no_expire+0xbf/0x110 19:29:21: [<ffffffff811a266b>] sys_umount+0x7b/0x3a0 19:29:21: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 19:29:21: 19:29:21:Kernel panic - not syncing: LBUG Maloo report: https://maloo.whamcloud.com/test_sets/84843efe-c8e9-11e2-97fe-52540035b04c |
| Comment by Zhenyu Xu [ 04/Jun/13 ] |
|
update a patchset at http://review.whamcloud.com/6105 During a file lov object initialization, we need protect the access and change of its subobj->coh_parent, since it could be another layout change race there, which makes an unreferenced lovsub obj in the site object hash table. |
| Comment by Jian Yu [ 27/Jun/13 ] |
|
Lustre b1_8 client build: http://build.whamcloud.com/job/lustre-b1_8/258 (1.8.9-wc1) After running parallel-scale-nfsv3, unmounting NFS server/Lustre client on the MDS node hit the same failure: |
| Comment by Jian Yu [ 09/Aug/13 ] |
|
Lustre Branch: b2_4 Sanity test 232 failed as follows: == sanity test 232: failed lock should not block umount ============================================== 08:02:59 (1375628579) fail_loc=0x31c 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0207861 s, 50.4 MB/s fail_loc=0 10.10.4.122@tcp:/lustre /mnt/lustre lustre rw,flock,user_xattr 0 0 CMD: client-16vm2.lab.whamcloud.com grep -c /mnt/lustre' ' /proc/mounts Stopping client client-16vm2.lab.whamcloud.com /mnt/lustre (opts:) CMD: client-16vm2.lab.whamcloud.com lsof -t /mnt/lustre Syslog on client node client-16vm2 showed that: Aug 4 08:02:59 client-16vm2 kernel: [ 6221.766422] Lustre: DEBUG MARKER: == sanity test 232: failed lock should not block umount ============================================== 08:02:59 (1375628579) Aug 4 08:02:59 client-16vm2 mrshd[30197]: pam_unix(mrsh:session): session closed for user root Aug 4 08:02:59 client-16vm2 xinetd[427]: EXIT: mshell status=0 pid=30197 duration=0(sec) Aug 4 08:02:59 client-16vm2 systemd-logind[299]: Removed session c2300. Aug 4 08:02:59 client-16vm2 kernel: [ 6221.842241] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts Aug 4 08:02:59 client-16vm2 kernel: [ 6221.853369] Lustre: DEBUG MARKER: lsof -t /mnt/lustre Aug 4 08:03:00 client-16vm2 kernel: [ 6221.973212] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.052122] LustreError: 30246:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.058312] LustreError: 30246:0:(lu_object.c:1141:lu_device_fini()) LBUG Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063666] Pid: 30246, comm: umount Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063669] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063669] Call Trace: Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063690] [<ffffffffa02477e7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063698] [<ffffffffa0247df5>] lbug_with_loc+0x45/0xc0 [libcfs] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063735] [<ffffffffa03b3c0a>] lu_device_fini+0xba/0xc0 [obdclass] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063754] [<ffffffffa0704816>] lovsub_device_free+0x66/0x1b0 [lov] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063776] [<ffffffffa03b7a0e>] lu_stack_fini+0x7e/0xc0 [obdclass] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063797] [<ffffffffa03bd24e>] cl_stack_fini+0xe/0x10 [obdclass] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063807] [<ffffffffa06f48c9>] lov_device_fini+0x59/0x120 [lov] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063829] [<ffffffffa03b79d9>] lu_stack_fini+0x49/0xc0 [obdclass] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063861] [<ffffffffa03bd24e>] cl_stack_fini+0xe/0x10 [obdclass] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063892] [<ffffffffa07e34d1>] cl_sb_fini+0x61/0x1a0 [lustre] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063905] [<ffffffffa07a5d84>] client_common_put_super+0x54/0x7c0 [lustre] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063918] [<ffffffffa07a6e84>] ll_put_super+0xd4/0x370 [lustre] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063922] [<ffffffff811aa7ee>] ? dispose_list+0x3e/0x60 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063925] [<ffffffff81192761>] generic_shutdown_super+0x61/0xe0 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063928] [<ffffffff81192876>] kill_anon_super+0x16/0x30 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063949] [<ffffffffa03a8b7a>] lustre_kill_super+0x4a/0x60 [obdclass] Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063951] [<ffffffff81192bf7>] deactivate_locked_super+0x57/0x90 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063953] [<ffffffff811937ae>] deactivate_super+0x4e/0x70 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063956] [<ffffffff811ae167>] mntput_no_expire+0xd7/0x130 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063958] [<ffffffff811af146>] sys_umount+0x76/0x390 Aug 4 08:03:00 client-16vm2 kernel: [ 6222.063961] [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Maloo report: https://maloo.whamcloud.com/test_sets/996b2278-fd79-11e2-9fdb-52540035b04c The failure occurs regularly on FC18 client and is blocking the remaining sanity sub-tests from running. |
| Comment by Jian Yu [ 04/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1) sanity test 232 hit the same failure: |
| Comment by Sarah Liu [ 11/Sep/13 ] |
|
Also hit this error when running interop between 2.4.0 server and 2.5 client: server: 2.4.0 19:44:33:LustreError: 10863:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 19:44:33:LustreError: 10863:0:(lu_object.c:1141:lu_device_fini()) LBUG 19:44:34:Pid: 10863, comm: umount 19:44:34: 19:44:35:Call Trace: 19:44:35: [<ffffffffa0478895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 19:44:35: [<ffffffffa0478e97>] lbug_with_loc+0x47/0xb0 [libcfs] 19:44:36: [<ffffffffa05ec4b8>] lu_device_fini+0xb8/0xc0 [obdclass] 19:44:36: [<ffffffffa05d0727>] ls_device_put+0x87/0x1e0 [obdclass] 19:44:36: [<ffffffffa05d0a3c>] local_oid_storage_fini+0x1bc/0x270 [obdclass] 19:44:36: [<ffffffffa0d6bd74>] mgs_fs_cleanup+0x64/0x80 [mgs] 19:44:36: [<ffffffffa0d68ae0>] mgs_device_fini+0x1d0/0x5a0 [mgs] 19:44:36: [<ffffffffa05ddba7>] class_cleanup+0x577/0xda0 [obdclass] 19:44:36: [<ffffffffa05b2b36>] ? class_name2dev+0x56/0xe0 [obdclass] 19:44:37: [<ffffffffa05df48c>] class_process_config+0x10bc/0x1c80 [obdclass] 19:44:37: [<ffffffffa05d8cb3>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] 19:44:38: [<ffffffffa05e01c9>] class_manual_cleanup+0x179/0x6f0 [obdclass] 19:44:38: [<ffffffffa05b2b36>] ? class_name2dev+0x56/0xe0 [obdclass] 19:44:38: [<ffffffffa0614d7d>] server_put_super+0x46d/0xf00 [obdclass] 19:44:38: [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0 19:44:38: [<ffffffff81183436>] kill_anon_super+0x16/0x60 19:44:39: [<ffffffffa05e2026>] lustre_kill_super+0x36/0x60 [obdclass] 19:44:39: [<ffffffff81183bd7>] deactivate_super+0x57/0x80 19:44:39: [<ffffffff811a1c4f>] mntput_no_expire+0xbf/0x110 19:44:39: [<ffffffff811a26bb>] sys_umount+0x7b/0x3a0 19:44:39: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 19:44:39: 19:44:40:Kernel panic - not syncing: LBUG 19:44:40:Pid: 10863, comm: umount Not tainted 2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1 |
| Comment by Peter Jones [ 24/Sep/13 ] |
|
Landed for 2.5.0 |