Details
Bug
Resolution: Duplicate
Critical
Lustre 2.3.0
None
SWL Hyperion/LLNL
3
4375
Description
The MDS crashed and dumped; attempting to locate the dump at this time.
Message from MDS:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
This looks like a possible dup of ORI-577; however, that bug was supposed to have been fixed.
The MDS did not dump a stack; it was configured with panic_on_lbug.
Will attempt to reproduce.
Activity
Liang, Oleg, what about http://review.whamcloud.com/4136? That patch didn't land to b2_3. Was that intended to fix the MDS crash, or is it a secondary problem that doesn't need to be fixed for 2.3.0?
Liang, it probably makes sense to add a CERROR() to see whether we hit this path.
I've posted another patch for this: http://review.whamcloud.com/#change,4197
It should fix something, but I'm not sure whether it fixes this bug.
I found something suspicious in mdd_rename(), but I'm not an expert in this code, so please check this for me:
        /* Remove old target object
         * For tobj is remote case cmm layer has processed
         * and set tobj to NULL then. So when tobj is NOT NULL,
         * it must be local one.
         */
        if (tobj && mdd_object_exists(mdd_tobj)) {
                mdd_write_lock(env, mdd_tobj, MOR_TGT_CHILD);
                if (mdd_is_dead_obj(mdd_tobj)) {
                        mdd_write_unlock(env, mdd_tobj);
                        /* shld not be dead, something is wrong */
                        CERROR("tobj is dead, something is wrong\n");
                        rc = -EINVAL;
                        goto cleanup;
                }
                mdo_ref_del(env, mdd_tobj, handle);
                /* Remove dot reference. */
                if (is_dir)
                        mdo_ref_del(env, mdd_tobj, handle);
                la->la_valid = LA_CTIME;
                rc = mdd_attr_check_set_internal(env, mdd_tobj, la, handle, 0);
                if (rc)
                        GOTO(fixup_tpobj, rc);
                rc = mdd_finish_unlink(env, mdd_tobj, ma, handle);
                mdd_write_unlock(env, mdd_tobj);
                if (rc)
                        GOTO(fixup_tpobj, rc);
If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, the code tries to revert the changes by re-inserting @mdd_tobj into @mdd_tpobj, but without restoring the refcount of @mdd_tobj that mdo_ref_del() already dropped:
fixup_tpobj:
        if (rc) {
                rc2 = __mdd_index_delete(env, mdd_tpobj, tname, is_dir, handle, BYPASS_CAPA);
                if (rc2)
                        CWARN("tp obj fix error %d\n",rc2);
                if (mdd_tobj && mdd_object_exists(mdd_tobj) && !mdd_is_dead_obj(mdd_tobj)) {
                        rc2 = __mdd_index_insert(env, mdd_tpobj, mdo2fid(mdd_tobj), tname, is_dir, handle, BYPASS_CAPA);
                        if (rc2)
                                CWARN("tp obj fix error %d\n",rc2);
                }
        }
So if everything is reverted, the name points at the target object again but the refcount on the target object will be one too low.
Is this analysis correct?
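To make the suspected sequence easier to reason about, here is a minimal user-space sketch of it. This is not Lustre code: every name in it (toy_inode, toy_dir, toy_ref_del, toy_rename_over, toy_unlink, toy_demo) is hypothetical, the directory holds a single name, and the "later step fails" flag simply stands in for a failure of mdd_attr_check_set_internal() or mdd_finish_unlink(). It only models the link count, assuming the error path re-inserts the name without restoring the reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an inode: only the link count matters here. */
struct toy_inode {
	unsigned int nlink;
};

/* Toy stand-in for a directory holding a single name. */
struct toy_dir {
	struct toy_inode *entry;	/* NULL when the name is absent */
};

/* Models the invariant behind the LBUG; returns false instead of asserting. */
static bool toy_ref_del(struct toy_inode *ino)
{
	if (ino->nlink == 0)
		return false;	/* real code: ASSERTION( inode->i_nlink > 0 ) */
	ino->nlink--;
	return true;
}

/*
 * Hypothetical model of renaming over an existing target: the target's
 * link is dropped first; if a later step fails, the name is re-inserted.
 * fix_refcount selects whether the error path also restores the link.
 */
static int toy_rename_over(struct toy_dir *dir, struct toy_inode *tobj,
			   bool later_step_fails, bool fix_refcount)
{
	dir->entry = NULL;		/* old target name removed */
	toy_ref_del(tobj);		/* link count dropped */
	if (later_step_fails) {
		dir->entry = tobj;	/* fixup: name re-inserted */
		if (fix_refcount)
			tobj->nlink++;	/* the step the analysis says is missing */
		return -1;
	}
	return 0;
}

/* Unlink the name; returns false where the real code would LBUG. */
static bool toy_unlink(struct toy_dir *dir)
{
	struct toy_inode *ino;

	assert(dir != NULL);
	ino = dir->entry;
	if (ino == NULL)
		return false;
	dir->entry = NULL;
	return toy_ref_del(ino);
}

/* Failed rename followed by unlink: 1 if the invariant is violated, else 0. */
static int toy_demo(bool fix_refcount)
{
	struct toy_inode target = { .nlink = 1 };
	struct toy_dir dir = { .entry = &target };

	toy_rename_over(&dir, &target, true, fix_refcount);
	return toy_unlink(&dir) ? 0 : 1;
}
```

In this model, toy_demo(false) returns 1 because the later unlink finds a link count of zero, while toy_demo(true) returns 0. Whether the real error path can actually be reached in the crashes reported here is a separate question that the sketch does not answer.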
Attached is the Lustre debug dump log. It is not very useful for this LBUG, but it seems to contain some LNet errors. Liang, could you please have a look?
Hit this again while running SWL, backtrace:
PID: 4891 TASK: ffff88016dedaaa0 CPU: 13 COMMAND: "mdt03_014"
#0 [ffff88016e693918] machine_kexec at ffffffff8103281b
#1 [ffff88016e693978] crash_kexec at ffffffff810ba792
#2 [ffff88016e693a48] panic at ffffffff814fd591
#3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
#4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
#5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
#6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
#7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
#8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
#9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
#10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
#11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
#12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
#13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
#14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
#15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
#16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a
Check the last comment posted by Cliff on LU-1948, we hit this again while running SWL.
MDS crashed, panic stack:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
2012-09-27 21:56:35
2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-09-27 21:56:35 rnel: LustreErro [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
2012-09-27 21:56:35 r: 5611:0:(osd_h [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 andler.c:2343:os [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 d_object_ref_del [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 ()) ASSERTION( i [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 node->i_nlink > [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 0 ) failed:
2012-09-27 21:56:35 Sep [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 27 21:56:35 hyp [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 erion-rst6 kerne [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 l: LustreError: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 5611:0:(osd_hand [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 ler.c:2343:osd_o [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:35 bject_ref_del()) [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:35 LBUG
2012-09-27 21:56:35 LBUG
2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-09-27 21:56:35
2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1
2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffff814fd58a>] ? panic+0xa0/0x168
2012-09-27 21:56:35 rnel: Kernel pan [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2012-09-27 21:56:35 ic - not syncing [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 : LBUG
2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20
vmcore is on brent in ~/cliffw/lu1948/erofs
Hongchao
If you think that it should be then please set inspectors on the patch.
Thanks
Peter
I think the problem fixed by review-4136 has existed for a long time (since 2.1), so it is probably not the cause of the crash here, but we should at least land it on master.