[LU-5713] Interop 2.5<->2.7 sanity-lfsck test_8: (mdt_handler.c:4378:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed Created: 07/Oct/14  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

server: lustre-master build # 2671
client: 2.5.3


Issue Links:
Related
is related to LU-4595 lod_device_free()) ASSERTION( atomic_... Closed
Severity: 3
Rank (Obsolete): 16027

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/34941bc6-4732-11e4-8a80-5254006e85c2.

The sub-test test_8 failed with the following error:

test failed to respond and timed out

MDS console

09:49:10:Lustre: DEBUG MARKER: == sanity-lfsck test 8: LFSCK state machine == 15:48:59 (1411832939)
09:49:10:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
09:49:10:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
09:49:10:LustreError: 24584:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
09:49:10:LustreError: 24584:0:(qsd_reint.c:54:qsd_reint_completion()) Skipped 2 previous similar messages
09:49:10:LustreError: 25070:0:(ldlm_lib.c:2106:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
09:49:10:Lustre: 24587:0:(ldlm_lib.c:1781:target_recovery_overseer()) recovery is aborted, evict exports in recovery
09:49:10:LustreError: 24567:0:(osp_precreate.c:725:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5
09:49:10:LustreError: 24569:0:(osp_precreate.c:466:osp_precreate_send()) lustre-OST0001-osc-MDT0000: can't precreate: rc = -5
09:49:10:LustreError: 24569:0:(osp_precreate.c:976:osp_precreate_thread()) lustre-OST0001-osc-MDT0000: cannot precreate objects: rc = -5
09:49:10:Lustre: *** cfs_fail_loc=1602, val=0***
09:49:10:LustreError: 2658:0:(client.c:1075:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88007be6f400 x1480397462816312/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.6.53@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
09:49:10:LustreError: 2658:0:(client.c:1075:ptlrpc_import_delay_req()) Skipped 12 previous similar messages
09:49:10:LustreError: 2643:0:(lod_dev.c:913:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: 
09:49:10:LustreError: 25070:0:(mdt_handler.c:4378:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
09:49:10:LustreError: 25070:0:(mdt_handler.c:4378:mdt_fini()) LBUG
09:49:10:Pid: 25070, comm: umount
09:49:10:
09:49:10:Call Trace:
09:49:10: [<ffffffffa0483895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
09:49:10: [<ffffffffa0483e97>] lbug_with_loc+0x47/0xb0 [libcfs]
09:49:10: [<ffffffffa0ecbb9f>] mdt_device_fini+0xc8f/0xcd0 [mdt]
09:49:10: [<ffffffffa05b95d6>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
09:49:10: [<ffffffffa05d8f52>] class_cleanup+0x562/0xd20 [obdclass]
09:49:10: [<ffffffffa05b6f16>] ? class_name2dev+0x56/0xe0 [obdclass]
09:49:10: [<ffffffffa05dac6a>] class_process_config+0x155a/0x1ac0 [obdclass]
09:49:10: [<ffffffffa05d39b5>] ? lustre_cfg_new+0x4f5/0x6f0 [obdclass]
09:49:10: [<ffffffffa05db347>] class_manual_cleanup+0x177/0x700 [obdclass]
09:49:10: [<ffffffffa05b6f16>] ? class_name2dev+0x56/0xe0 [obdclass]
09:49:10: [<ffffffffa0617a57>] server_put_super+0xb37/0xe50 [obdclass]
09:49:10: [<ffffffff8118b63b>] generic_shutdown_super+0x5b/0xe0
09:49:10: [<ffffffff8118b726>] kill_anon_super+0x16/0x60
09:49:10: [<ffffffffa05dd246>] lustre_kill_super+0x36/0x60 [obdclass]
09:49:10: [<ffffffff8118bec7>] deactivate_super+0x57/0x80
09:49:10: [<ffffffff811ab8cf>] mntput_no_expire+0xbf/0x110
09:49:10: [<ffffffff811ac41b>] sys_umount+0x7b/0x3a0
09:49:10: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
09:49:10:
09:49:10:LustreError: 2643:0:(lod_dev.c:913:lod_device_free()) LBUG
09:49:10:Pid: 2643, comm: obd_zombid

Info required for matching: sanity-lfsck 8
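
For context, the assertion in the traces above is the standard LU device teardown invariant: by the time mdt_fini()/lod_device_free() run, every reference on the device must already have been dropped, otherwise the LASSERT fires and the server LBUGs. A minimal standalone sketch of that pattern follows (illustrative only: the struct layout and function below are simplified stand-ins, not the actual Lustre kernel code):

#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Simplified stand-in for struct lu_device: only the refcount matters here. */
struct lu_device {
        atomic_int ld_ref;
};

/* Models the check that fails in the logs: teardown asserts that all
 * references were dropped. A reference leaked while recovery is being
 * aborted leaves ld_ref > 0, and on a real server the LASSERT becomes
 * an LBUG that panics the node. */
static void device_fini(struct lu_device *d)
{
        assert(atomic_load(&d->ld_ref) == 0);
}

int main(void)
{
        struct lu_device d = { .ld_ref = 1 };   /* reference taken at setup */

        atomic_fetch_sub(&d.ld_ref, 1);         /* balanced put: fini passes */
        device_fini(&d);
        printf("teardown ok\n");
        return 0;
}

In the failed run the count is still nonzero at unmount time, so both mdt_fini() (umount thread) and lod_device_free() (obd_zombid thread) hit this same check.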



 Comments   
Comment by Andreas Dilger [ 10/Feb/15 ]

Hit this again: https://testing.hpdd.intel.com/test_sets/856f4938-aea4-11e4-98a3-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 27/Feb/15 ]

Andreas,
I also had one failure in one of my patch's auto-test sessions (https://testing.hpdd.intel.com/test_sets/6e9169d6-a179-11e4-9db6-5254006e85c2) that was automatically (??) linked to this ticket, but after taking an in-depth look at its logs I did not find the "(mdt_handler.c:4378:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 )" assert addressed in this ticket, but rather a "sanity-lfsck test_4: '(7) unexpected status'" failure (tracked in LU-6147) with a further timeout as a consequence in sanity-lfsck test_5.
Looking at your own reported failed session, it seems very similar, and thus much more related to LU-6147.

Comment by Sarah Liu [ 01/Sep/15 ]

Hit this error when running sanity test_65j with 2.8 server (DNE mode) and 2.5.3 client:

MDS

Lustre: DEBUG MARKER: == sanity test 65j: set default striping on root directory (bug 6367)=== 19:58:05 (1441076285)
LustreError: 46370:0:(client.c:1138:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880821fb8cc0 x1511074876772396/t0(0) o1000->lustre-MDT0001-osp-MDT0000@0@lo:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 46370:0:(client.c:1138:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
LustreError: 46370:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000402:0x1:0x0]: rc = -5
LustreError: 46370:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742850#00000000: rc = -5
Lustre: lustre-MDT0000: Not available for connect from 10.2.4.56@tcp (stopping)
Lustre: Skipped 10 previous similar messages
LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation obd_ping to node 0@lo failed: rc = -107
Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
LustreError: Skipped 7 previous similar messages
Lustre: lustre-MDT0000: Not available for connect from 10.2.4.56@tcp (stopping)
Lustre: Skipped 18 previous similar messages
Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 3. Is it stuck?
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) header@ffff880400337c80[0x0, 1, [0x1:0x0:0x0] hash exist]{
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....local_storage@ffff880400337cd0
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....osd-ldiskfs@ffff8804002f4e40osd-ldiskfs-object@ffff8804002f4e40(i:ffff88041678d6e0:81/3187989141)[plain]
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) } header@ffff880400337c80
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) header@ffff8804358af840[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....local_storage@ffff8804358af890
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....osd-ldiskfs@ffff8804012a5200osd-ldiskfs-object@ffff8804012a5200(i:ffff880401c1b560:79/3187989106)[plain]
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) } header@ffff8804358af840
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) header@ffff8804002476c0[0x0, 1, [0x200000003:0x2:0x0] hash exist]{
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....local_storage@ffff880400247710
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....osd-ldiskfs@ffff880400f2ea80osd-ldiskfs-object@ffff880400f2ea80(i:ffff88042a53b350:80/3187989107)[plain]
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) } header@ffff8804002476c0
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) header@ffff88040004a5c0[0x0, 1, [0xa:0x0:0x0] hash exist]{
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....local_storage@ffff88040004a610
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....osd-ldiskfs@ffff88040031d800osd-ldiskfs-object@ffff88040031d800(i:ffff8804166a3110:82/3187989175)[plain]
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) } header@ffff88040004a5c0
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) header@ffff88040031dec0[0x1, 1, [0x200000001:0x1017:0x0] hash exist]{
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....local_storage@ffff88040031df10
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) ....osd-ldiskfs@ffff88040002f840osd-ldiskfs-object@ffff88040002f840(i:ffff8804220777a0:12/1262305354)[plain]
LustreError: 4752:0:(osp_dev.c:1255:osp_device_free()) } header@ffff88040031dec0

LustreError: 46370:0:(mdt_handler.c:4275:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
LustreError: 46370:0:(mdt_handler.c:4275:mdt_fini()) LBUG

[remaining console output garbled: a truncated osp_device_free() header dump interleaved with the node's reboot messages]
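
The repeated header@... blocks in this log are the LU object cache printing each object still referenced at device-free time, just before the refcount assertion trips. A rough standalone sketch of that dump-then-assert pattern (the structure, fields, and output format here are simplified assumptions, not the real osp_device_free() code):

#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for a cached LU object header. */
struct lu_object_header {
        int         loh_ref;    /* references still held */
        const char *loh_fid;    /* object FID, as a string here */
};

/* Dump every still-referenced object, then assert the cache is empty.
 * This mirrors the log pattern above: each leaked object is printed
 * (header address, refcount, FID), and only then does teardown reach
 * the fatal "ld_ref == 0" check. */
static void device_free(struct lu_object_header *objs, int nr)
{
        int i, leaked = 0;

        for (i = 0; i < nr; i++) {
                if (objs[i].loh_ref > 0) {
                        printf("header@%p[ref=%d, %s]\n",
                               (void *)&objs[i], objs[i].loh_ref,
                               objs[i].loh_fid);
                        leaked++;
                }
        }
        assert(leaked == 0);    /* the LBUG seen in the traces */
}

int main(void)
{
        struct lu_object_header objs[] = {
                { 0, "[0x200000003:0x0:0x0]" },
                { 1, "[0x200000001:0x1017:0x0]" },  /* leaked reference */
        };

        /* Running this aborts on the leaked object, mirroring the LBUG. */
        device_free(objs, 2);
        return 0;
}
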
Comment by Frank Heckes (Inactive) [ 18/Nov/15 ]

The error also occurs during soak testing of build '20151116' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151116) when unmounting MDTs:

<0>LustreError: 6529:0:(lod_dev.c:1570:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff8804006dc000
<0>LustreError: 6529:0:(lod_dev.c:1570:lod_device_free()) LBUG
<4>Pid: 6529, comm: obd_zombid
<4>
<4>Call Trace:
<0>LustreError: 13065:0:(mdt_handler.c:4295:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
<0>LustreError: 13065:0:(mdt_handler.c:4295:mdt_fini()) LBUG
<4>Pid: 13065, comm: umount
<4>
<4>Call Trace:
<4> [<ffffffffa08a5875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa08a5e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa08a5875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa08a5e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa13d7ba1>] lod_device_free+0x2c1/0x330 [lod]
<4> [<ffffffffa1308b79>] mdt_device_fini+0xed9/0xf40 [mdt]
<4> [<ffffffffa09d00d6>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
<4> [<ffffffffa09ea1f2>] class_cleanup+0x572/0xd20 [obdclass]
<4> [<ffffffffa09e3fed>] class_decref+0x3dd/0x4c0 [obdclass]
<4> [<ffffffffa09cadb6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa09ec876>] class_process_config+0x1ed6/0x2830 [obdclass]
<4> [<ffffffffa08b16c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffffa09ceffc>] obd_zombie_impexp_cull+0x61c/0xac0 [obdclass]
<4> [<ffffffffa09ed68f>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
<4> [<ffffffffa09cadb6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa0a2558c>] server_put_super+0xa0c/0xed0 [obdclass]
<4> [<ffffffffa09cf505>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
<4> [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
<4> [<ffffffff81190c66>] kill_anon_super+0x16/0x60
<4> [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa09f0546>] lustre_kill_super+0x36/0x60 [obdclass]
<4> [<ffffffff81191407>] deactivate_super+0x57/0x80
<4> [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
<4> [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
<4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
<4>
<4> [<ffffffffa09cf4a0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 13065, comm: umount Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.gb64632c.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81529c9c>] ? panic+0xa7/0x16f
<4> [<ffffffffa08a5ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa1308b79>] ? mdt_device_fini+0xed9/0xf40 [mdt]
<4> [<ffffffffa09d00d6>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
<4> [<ffffffffa09ea1f2>] ? class_cleanup+0x572/0xd20 [obdclass]
<4> [<ffffffffa09cadb6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa09ec876>] ? class_process_config+0x1ed6/0x2830 [obdclass]
<4> [<ffffffffa08b16c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffffa09ed68f>] ? class_manual_cleanup+0x4bf/0x8e0 [obdclass]
<4> [<ffffffffa09cadb6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa0a2558c>] ? server_put_super+0xa0c/0xed0 [obdclass]
<4> [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
<4> [<ffffffff81190b7b>] ? generic_shutdown_super+0x5b/0xe0
<4> [<ffffffff81190c66>] ? kill_anon_super+0x16/0x60
<4> [<ffffffffa09f0546>] ? lustre_kill_super+0x36/0x60 [obdclass]
<4> [<ffffffff81191407>] ? deactivate_super+0x57/0x80
<4> [<ffffffff811b10df>] ? mntput_no_expire+0xbf/0x110
<4> [<ffffffff811b1c2b>] ? sys_umount+0x7b/0x3a0
<4> [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b

The crash dump file is available via the cluster lola head node: lhn:/scratch/crashdumps/lu-5713/lola-8-127.0.0.1-2015-11-18-04:11:26
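
The interleaved traces above show two threads tearing down at once: umount inside mdt_device_fini() and the obd_zombid reaper inside lod_device_free() via obd_zombie_impexp_cull(). One plausible way such a pattern trips a fini-time refcount assertion (an assumption for illustration, not a confirmed root cause) is an asynchronous cleanup thread that has not yet dropped its reference when the synchronous fini runs, as in this standalone sketch:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int ld_ref = 1;   /* reference still held by the async reaper */

/* Stand-in for an obd_zombid-style reaper: finishes its cleanup and drops
 * its device reference some time later. */
static void *reaper(void *arg)
{
        (void)arg;
        usleep(100 * 1000);             /* cleanup work still in flight */
        atomic_fetch_sub(&ld_ref, 1);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, reaper, NULL);

        /* Stand-in for mdt_fini() on the umount path. Without the join
         * below, fini can run while the reaper still holds its reference
         * and the assertion fires -- the LBUG pattern in the traces.
         * Waiting for outstanding async cleanup is what makes it safe. */
        pthread_join(t, NULL);
        assert(atomic_load(&ld_ref) == 0);
        return 0;
}
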

Comment by James Nunez (Inactive) [ 16/Dec/15 ]

We hit this LBUG on master (not interop testing):
2015-12-15 19:41:33 - https://testing.hpdd.intel.com/test_sets/7ff1ea68-a392-11e5-9b3d-5254006e85c2

Comment by Andreas Dilger [ 17/Apr/17 ]

Close old issue.
