[LU-4595] lod_device_free() ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed Created: 06/Feb/14  Updated: 12/Sep/16  Resolved: 12/Sep/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Yang Sheng
Resolution: Cannot Reproduce Votes: 0
Labels: lod, mdt

Issue Links:
Related
is related to LU-5713 Interop 2.5<->2.7 sanity-lfsck test_8... Resolved
Severity: 3
Rank (Obsolete): 12555

 Description   

Running racer against today's master (2.5.55-4-gb6a1b94) on a single node with MDSCOUNT=4 and OSTCOUNT=2, I see these LBUGs during umount.

This loop reproduced the LBUG after 3 iterations:

cd ~/lustre-release
export MDSCOUNT=4
export MOUNT_2=y
for ((i = 0; i < 10; i++)); do
  echo -e "\n\n\n########### $i $(date) ############\n\n\n"
  llmount.sh                             # format and mount the test filesystem
  sh lustre/tests/racer.sh               # run the racer stress test
  umount /mnt/lustre /mnt/lustre2        # unmount both clients
  umount /mnt/mds{1..4} /mnt/ost{1..2}   # unmount the MDTs and OSTs; the LBUG fires here
  llmountcleanup.sh
done
Lustre: DEBUG MARKER: == racer test complete, duration 314 sec == 10:26:08 (1391703968)
Lustre: Unmounted lustre-client
Lustre: Unmounted lustre-client
Lustre: Failing over lustre-MDT0000
Lustre: server umount lustre-MDT0000 complete
LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: Communicating with 0@lo, operation mds_disconnect failed with -107.
Lustre: Failing over lustre-MDT0001
Lustre: server umount lustre-MDT0001 complete
Lustre: Failing over lustre-MDT0002
LustreError: 3307:0:(lod_dev.c:711:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: 
LustreError: 27074:0:(mdt_handler.c:4256:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
LustreError: 27074:0:(mdt_handler.c:4256:mdt_fini()) LBUG
Pid: 27074, comm: umount

Call Trace:
 [<ffffffffa0968895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0968e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0b55cdf>] mdt_device_fini+0xd5f/0xda0 [mdt]
 [<ffffffffa0e2dee6>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa0e532b3>] class_cleanup+0x573/0xd30 [obdclass]
 [<ffffffffa0e2b836>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0e54fda>] class_process_config+0x156a/0x1ad0 [obdclass]
 [<ffffffffa0e4d2b3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
 [<ffffffffa0e556b9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
 [<ffffffffa0e2b836>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0e8ea19>] server_put_super+0x8e9/0xe40 [obdclass]
 [<ffffffff81184c3b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81184d26>] kill_anon_super+0x16/0x60
 [<ffffffffa0e57576>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff811854c7>] deactivate_super+0x57/0x80
 [<ffffffff811a375f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811a41cb>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 27074, comm: umount Not tainted 2.6.32-358.18.1.el6.lustre.x86_64 #1
Call Trace:
 [<ffffffff8150f018>] ? panic+0xa7/0x16f
 [<ffffffffa0968eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0b55cdf>] ? mdt_device_fini+0xd5f/0xda0 [mdt]
 [<ffffffffa0e2dee6>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa0e532b3>] ? class_cleanup+0x573/0xd30 [obdclass]
 [<ffffffffa0e2b836>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0e54fda>] ? class_process_config+0x156a/0x1ad0 [obdclass]
 [<ffffffffa0e4d2b3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
 [<ffffffffa0e556b9>] ? class_manual_cleanup+0x179/0x6f0 [obdclass]
 [<ffffffffa0e2b836>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0e8ea19>] ? server_put_super+0x8e9/0xe40 [obdclass]
 [<ffffffff81184c3b>] ? generic_shutdown_super+0x5b/0xe0
 [<ffffffff81184d26>] ? kill_anon_super+0x16/0x60
 [<ffffffffa0e57576>] ? lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff811854c7>] ? deactivate_super+0x57/0x80
 [<ffffffff811a375f>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811a41cb>] ? sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
crash> bt
PID: 27074  TASK: ffff880196477500  CPU: 1   COMMAND: "umount"
 #0 [ffff8801beb939b0] machine_kexec at ffffffff81035d6b
 #1 [ffff8801beb93a10] crash_kexec at ffffffff810c0e22
 #2 [ffff8801beb93ae0] panic at ffffffff8150f01f
 #3 [ffff8801beb93b60] lbug_with_loc at ffffffffa0968eeb [libcfs]
 #4 [ffff8801beb93b80] mdt_device_fini at ffffffffa0b55cdf [mdt]
 #5 [ffff8801beb93bf0] class_cleanup at ffffffffa0e532b3 [obdclass]
 #6 [ffff8801beb93c70] class_process_config at ffffffffa0e54fda [obdclass]
 #7 [ffff8801beb93d00] class_manual_cleanup at ffffffffa0e556b9 [obdclass]
 #8 [ffff8801beb93dc0] server_put_super at ffffffffa0e8ea19 [obdclass]
 #9 [ffff8801beb93e30] generic_shutdown_super at ffffffff81184c3b
#10 [ffff8801beb93e50] kill_anon_super at ffffffff81184d26
#11 [ffff8801beb93e70] lustre_kill_super at ffffffffa0e57576 [obdclass]
#12 [ffff8801beb93e90] deactivate_super at ffffffff811854c7
#13 [ffff8801beb93eb0] mntput_no_expire at ffffffff811a375f
#14 [ffff8801beb93ee0] sys_umount at ffffffff811a41cb
#15 [ffff8801beb93f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007ff7634689a7  RSP: 00007fff84d7d120  RFLAGS: 00010202
    RAX: 00000000000000a6  RBX: ffffffff8100b072  RCX: 00007ff763d55009
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007ff765c4cb90
    RBP: 00007ff765c4cb70   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00007ff765c4cbf0
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
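
For context, the failed check is the generic lu_device reference-count assertion: each lu_device carries a counter in ld_ref, and the teardown paths (lod_device_free(), mdt_fini(), and friends) insist it has dropped to zero before the device is freed, so a single reference leaked anywhere during shutdown is enough to trip the LASSERT and panic the node via LBUG. A minimal sketch of the pattern follows; it is illustrative only, not the actual Lustre source, and example_device_free() is a hypothetical stand-in for the real free/fini callbacks.

/* Illustrative sketch only; the header paths and the helper name
 * below are assumptions, not the real Lustre code. */
#include <linux/atomic.h>
#include <libcfs/libcfs.h>   /* LASSERT(), LBUG() */
#include <lu_object.h>       /* struct lu_device, with atomic_t ld_ref */

static void example_device_free(struct lu_device *lu)
{
        /* Every reference taken on the device must have been dropped
         * by now.  If not, this is exactly the
         * "ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed" line in
         * the console log above, followed by LBUG and, as in this
         * report, a kernel panic. */
        LASSERT(atomic_read(&lu->ld_ref) == 0);

        /* ... actually release the device here ... */
}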


 Comments   
Comment by Yang Sheng [ 11/Aug/14 ]

I hit this issue when running 2.6 conf-sanity test_24a on a RHEL7 kernel.

[11703.385741] LustreError: Skipped 1 previous similar message
[11707.178453] LustreError: 12463:0:(mdt_handler.c:4379:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
[11707.178821] LustreError: 12463:0:(mdt_handler.c:4379:mdt_fini()) LBUG
[11707.180103] LustreError: 10181:0:(mdd_device.c:1158:mdd_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: 
[11707.180465] LustreError: 10181:0:(mdd_device.c:1158:mdd_device_free()) LBUG
[11707.181712] Kernel panic - not syncing: LBUG
[11707.182035] CPU: 0 PID: 12463 Comm: umount Tainted: GF       W  O--------------   3.10.0-123.el7.x86_64 #1
[11707.182035] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11707.182035]  ffffffffa038ac8d 00000000b55e544c ffff88001dc99b08 ffffffff815e19ba
[11707.182035]  ffff88001dc99b88 ffffffff815db549 ffffffff00000008 ffff88001dc99b98
[11707.182035]  ffff88001dc99b38 00000000b55e544c ffffffffa0d39703 ffff88001f7b6660
[11707.182035] Call Trace:
[11707.182035]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[11707.182035]  [<ffffffff815db549>] panic+0xd8/0x1e7
[11707.182035]  [<ffffffffa0365e6b>] lbug_with_loc+0xab/0xc0 [libcfs]
[11707.182035]  [<ffffffffa0ce2221>] mdt_device_fini+0xe61/0xe70 [mdt]
[11707.182035]  [<ffffffffa04d397f>] class_cleanup+0x8ef/0xcc0 [obdclass]
[11707.182035]  [<ffffffffa04d97f8>] class_process_config+0x1898/0x29e0 [obdclass]
[11707.182035]  [<ffffffffa0376047>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11707.182035]  [<ffffffffa0370914>] ? libcfs_log_return+0x24/0x30 [libcfs]
[11707.182035]  [<ffffffffa04daa2f>] class_manual_cleanup+0xef/0x6b0 [obdclass]
[11707.182035]  [<ffffffffa0515e6b>] server_put_super+0x86b/0xe30 [obdclass]
[11707.182035]  [<ffffffff811b1fd6>] generic_shutdown_super+0x56/0xe0
[11707.182035]  [<ffffffff811b2242>] kill_anon_super+0x12/0x20
[11707.188420]  [<ffffffffa04ddda2>] lustre_kill_super+0x32/0x50 [obdclass]
[11707.188420]  [<ffffffff811b265d>] deactivate_locked_super+0x3d/0x60
[11707.188420]  [<ffffffff811b26c6>] deactivate_super+0x46/0x60
[11707.188420]  [<ffffffff811cf455>] mntput_no_expire+0xc5/0x120
[11707.188420]  [<ffffffff811d058f>] SyS_umount+0x9f/0x3c0
[11707.188420]  [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[11707.188420] Shutting down cpus with NMI
[11707.188420] drm_kms_helper: panic occurred, switching back to text console

Comment by James Nunez (Inactive) [ 08/Dec/15 ]

Looks like we hit this in master:
2015-12-07 18:10:15 - https://testing.hpdd.intel.com/test_sets/19d4b21c-9d41-11e5-a4d7-5254006e85c2

Comment by Yang Sheng [ 12/Sep/16 ]

I haven't encountered it in a long time. Closing it for now.
