[LU-12571] sanity-dom test_150: ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Created: 22/Jul/19  Updated: 28/Jan/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: PNG File Screen Shot 2019-12-10 at 20.19.21.png    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d79edebe-abc3-11e9-a0be-52540065bddc

test_150 failed with the following error:

trevis-70vm1 crashed during sanity-dom test_150
LustreError: 19240:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
LustreError: 19240:0:(lu_object.c:1196:lu_device_fini()) LBUG
Pid: 19240, comm: umount 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
Call Trace:
 [<ffffffffc06658bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
 [<ffffffffc066596c>] lbug_with_loc+0x4c/0xa0 [libcfs]
 [<ffffffffc07d293b>] lu_device_fini+0xbb/0xc0 [obdclass]
 [<ffffffffc0c98994>] lovsub_device_free+0x94/0x170 [lov]
 [<ffffffffc07d7828>] lu_stack_fini+0x98/0xe0 [obdclass]
 [<ffffffffc07db89e>] cl_stack_fini+0xe/0x10 [obdclass]
 [<ffffffffc0c750dc>] lov_device_fini+0x7c/0x170 [lov]
 [<ffffffffc07d77eb>] lu_stack_fini+0x5b/0xe0 [obdclass]
 [<ffffffffc07db89e>] cl_stack_fini+0xe/0x10 [obdclass]
 [<ffffffffc0d240b0>] cl_sb_fini+0x70/0x1b0 [lustre]
 [<ffffffffc0cf2688>] ll_put_super+0x528/0x8a0 [lustre]
 [<ffffffff8a84433d>] generic_shutdown_super+0x6d/0x100
 [<ffffffff8a844732>] kill_anon_super+0x12/0x20
 [<ffffffffc07c88d5>] lustre_kill_super+0x45/0x50 [obdclass]
 [<ffffffff8a844aee>] deactivate_locked_super+0x4e/0x70
 [<ffffffff8a845276>] deactivate_super+0x46/0x60
 [<ffffffff8a86387f>] cleanup_mnt+0x3f/0x80
 [<ffffffff8a863912>] __cleanup_mnt+0x12/0x20
 [<ffffffff8a6be90b>] task_work_run+0xbb/0xe0
 [<ffffffff8a62bc65>] do_notify_resume+0xa5/0xc0
 [<ffffffff8ad76134>] int_signal+0x12/0x17
 [<ffffffffffffffff>] 0xffffffffffffffff
Kernel panic - not syncing: LBUG
CPU: 0 PID: 19240 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.21.3.el7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
 [<ffffffff8ad63107>] dump_stack+0x19/0x1b
 [<ffffffff8ad5c810>] panic+0xe8/0x21f
 [<ffffffffc06659bb>] lbug_with_loc+0x9b/0xa0 [libcfs]
 [<ffffffffc07d293b>] lu_device_fini+0xbb/0xc0 [obdclass]
 [<ffffffffc0c98994>] lovsub_device_free+0x94/0x170 [lov]
 [<ffffffffc07d7828>] lu_stack_fini+0x98/0xe0 [obdclass]
 [<ffffffffc07db89e>] cl_stack_fini+0xe/0x10 [obdclass]
 [<ffffffffc0c750dc>] lov_device_fini+0x7c/0x170 [lov]
 [<ffffffffc07d77eb>] lu_stack_fini+0x5b/0xe0 [obdclass]
 [<ffffffffc07db89e>] cl_stack_fini+0xe/0x10 [obdclass]
 [<ffffffffc0d240b0>] cl_sb_fini+0x70/0x1b0 [lustre]
 [<ffffffffc0cf2688>] ll_put_super+0x528/0x8a0 [lustre]
 [<ffffffff8a88ab83>] ? fsnotify_destroy_marks+0x13/0xc0
 [<ffffffff8a753f8d>] ? call_rcu_sched+0x1d/0x20
 [<ffffffffc0d1945c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
 [<ffffffff8a85efdb>] ? destroy_inode+0x3b/0x60
 [<ffffffff8a85f115>] ? evict+0x115/0x180
 [<ffffffff8a85f1be>] ? dispose_list+0x3e/0x50
 [<ffffffff8a85fe7b>] ? evict_inodes+0xdb/0x140
 [<ffffffff8a84433d>] generic_shutdown_super+0x6d/0x100
 [<ffffffff8a844732>] kill_anon_super+0x12/0x20
 [<ffffffffc07c88d5>] lustre_kill_super+0x45/0x50 [obdclass]
 [<ffffffff8a844aee>] deactivate_locked_super+0x4e/0x70
 [<ffffffff8a845276>] deactivate_super+0x46/0x60
 [<ffffffff8a86387f>] cleanup_mnt+0x3f/0x80
 [<ffffffff8a863912>] __cleanup_mnt+0x12/0x20
 [<ffffffff8a6be90b>] task_work_run+0xbb/0xe0
 [<ffffffff8a62bc65>] do_notify_resume+0xa5/0xc0
 [<ffffffff8ad76134>] int_signal+0x12/0x17
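
For context, the assertion that fires here is the refcount check in lu_device_fini() in lu_object.c. A minimal sketch of that check, paraphrased from the Lustre source rather than quoted verbatim:

/* lu_object.c (paraphrased sketch, not the exact upstream code):
 * a lu_device must have no outstanding references left by the time
 * it is finalized; anything still pinning the device at umount
 * leaves ld_ref non-zero and trips this LASSERTF. */
void lu_device_fini(struct lu_device *d)
{
        lu_ref_fini(&d->ld_reference);
        LASSERTF(atomic_read(&d->ld_ref) == 0,
                 "Refcount is %d\n", atomic_read(&d->ld_ref));
}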

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-dom test_150 - trevis-70vm1 crashed during sanity-dom test_150



 Comments   
Comment by Andreas Dilger [ 10/Oct/19 ]

+1 on master: https://testing.whamcloud.com/test_sets/53c01e8a-eb12-11e9-a197-52540065bddc
Seen six other times in the past 4 weeks, so definitely not very common.

Comment by Chris Horn [ 22/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sessions/30ed80fc-0d55-4032-8ed6-b287376bedce

Comment by Bruno Faccini (Inactive) [ 31/Oct/19 ]

+1 on master @https://testing.whamcloud.com/test_sessions/431dfbf0-778f-46c2-84d5-e61058e669d6

Comment by Olaf Faaland [ 10/Dec/19 ]

Maybe +1 on master @https://testing.whamcloud.com/test_sets/8b37d52e-ecfc-11e9-a197-52540065bddc, in that Maloo reported that sanity-dom test_150 crashed. However, I don't see any links to console logs etc., so I cannot verify what actually happened on the nodes to confirm a match.

Comment by Andreas Dilger [ 11/Dec/19 ]

ofaaland, the console logs for the test session are linked on the right side of the orange test "info box" at the top; they are no longer listed on a per-subtest basis.

Selecting "Show" for "console.trevis-70vm1.log" to view the console log, and searching for "sanity test 150" brings me to the failed subtest, which indeed shows that this is the same issue:

LustreError: 6344:0:(lu_object.c:1265:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
LustreError: 6344:0:(lu_object.c:1265:lu_device_fini()) LBUG

Comment by Alexander Zarochentsev (Inactive) [ 20/Jan/20 ]

Another hit: https://testing.whamcloud.com/test_sets/99598020-3974-11ea-b1e8-52540065bddc

Comment by Mikhail Pershin [ 28/Jan/20 ]

Though the crash happened in test 150, this is not an issue with that test itself; the remount_client merely reveals an object with a non-released reference. Log analysis shows that:

[19380.752462] Lustre: DEBUG MARKER: == sanity test 131e: test read hitting hole ========================================================== 11:00:54 (1563706854)
<<<<  skipped  >>>>
[19384.237133] Lustre: DEBUG MARKER: == sanity test 150: truncate/append tests ============================================================ 11:00:57 (1563706857)
[19384.517120] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) header@ffff9ff5a57472a0[0x0, 3, [0x100000000:0x90ac:0x0] hash]{
[19384.519479] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) ....lovsub@ffff9ff5a5747300[65536]
[19384.521385] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) ....osc@ffff9ff5b9c95a00id: 0x0:37036 idx: 0 gen: 0 kms_valid: 1 kms 2097152 rc: 0 force_sync: 0 min_xid: 0 size: 2097152 mtime: 1563706854 atime: 0 ctime: 1563706854 blocks: 1
[19384.525535] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) } header@ffff9ff5a57472a0
[19384.527762] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) header@ffff9ff5a57471f8[0x0, 1, [0x2000470d1:0xe6:0x0] hash]{
[19384.530080] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) ....lovsub@ffff9ff5a5747258[0]
[19384.531918] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) ....mdc@ffff9ff5b9c95b40id: 0x2000470d1:230 idx: 0 gen: 0 kms_valid: 1 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 1563706854 atime: 1563706854 ctime: 1563706854 blocks: 2
[19384.536024] LustreError: 19240:0:(lovsub_dev.c:94:lovsub_device_free()) } header@ffff9ff5a57471f8
[19384.540415] LustreError: 19240:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

The object with the extra reference is from test 131e: the ctime/mtime indicate that, and so does the object structure (0 bytes in the DOM part and a 2MB OSC component), which is exactly what test 131e creates. Nevertheless, I have no idea where the reference was leaked or why; this needs more investigation.
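
For reference, the count that the assertion checks is maintained by lu_device_get()/lu_device_put() in lu_object.c; every live lu_object pins its device through this pair. A paraphrased sketch (debug-build lu_ref tracking omitted):

/* lu_object.c (paraphrased sketch): taking a reference pins the
 * device; dropping the last one is what allows lu_device_fini()
 * to pass its ld_ref == 0 assertion. */
void lu_device_get(struct lu_device *d)
{
        atomic_inc(&d->ld_ref);
}

void lu_device_put(struct lu_device *d)
{
        LASSERT(atomic_read(&d->ld_ref) > 0);
        atomic_dec(&d->ld_ref);
}

If the leftover object from test 131e is never released, its final lu_device_put() is never called, so ld_ref is still 1 when umount reaches lu_device_fini().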

Comment by Alex Zhuravlev [ 28/Jan/20 ]

https://testing.whamcloud.com/test_sessions/33af08a7-bf95-423a-b01e-9eb7e9bba066
