[LU-1951] SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: Lustre 2.4.0
Affects Version/s: Lustre 2.3.0
Labels:
None
Environment:
SWL Hyperion/LLNL

Severity:
3
Rank (Obsolete):
4375

Description

The MDS crashed and wrote a crash dump; attempting to locate the dump at this time.
Message from MDS:

2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG

This looks like a possible duplicate of ORI-577; however, that bug was supposed to have been fixed.

The MDS did not dump a stack; it was configured with panic_on_lbug.
Will attempt to reproduce.

Attachments


dump1.out.gz
670 kB
04/Oct/12 4:17 PM

Activity


Hongchao Zhang added a comment - 16/Apr/13 2:38 AM

Duplicate of LU-3022.


Bob Glossman (Inactive) added a comment - 29/Oct/12 2:58 PM

Patch for master: http://review.whamcloud.com/4405
This is a port of http://review.whamcloud.com/#change,4197 . The context is too different for the change to be cherry-picked directly.


Liang Zhen (Inactive) added a comment - 18/Oct/12 10:53 PM

I think the problem fixed by review-4136 has existed for a long time (since 2.1), so it is probably not the cause of the crash here, but we should at least land it to master.


Andreas Dilger added a comment - 18/Oct/12 12:33 PM

Liang, Oleg, what about http://review.whamcloud.com/4136? That patch didn't land to b2_3. Was that intended to fix the MDS crash, or is it a secondary problem that doesn't need to be fixed for 2.3.0?


Peter Jones added a comment - 08/Oct/12 11:23 PM

Dropping priority because this landed for 2.3.


Alex Zhuravlev added a comment - 05/Oct/12 12:14 PM

Liang, it probably makes sense to add a CERROR() to see whether we hit this path.


Liang Zhen (Inactive) added a comment - 05/Oct/12 8:29 AM

I've posted another patch for this: http://review.whamcloud.com/#change,4197
It should fix a real problem, but I am not sure whether it fixes this bug.


Liang Zhen (Inactive) added a comment - 05/Oct/12 5:33 AM - edited

I found something suspicious in mdd_rename(), but I'm not an expert in this code, so please check this for me:

        /* Remove old target object
         * For tobj is remote case cmm layer has processed
         * and set tobj to NULL then. So when tobj is NOT NULL,
         * it must be local one.
         */
        if (tobj && mdd_object_exists(mdd_tobj)) {
                mdd_write_lock(env, mdd_tobj, MOR_TGT_CHILD);
                if (mdd_is_dead_obj(mdd_tobj)) {
                        mdd_write_unlock(env, mdd_tobj);
                        /* shld not be dead, something is wrong */
                        CERROR("tobj is dead, something is wrong\n");
                        rc = -EINVAL;
                        goto cleanup;
                }
                mdo_ref_del(env, mdd_tobj, handle);

                /* Remove dot reference. */
                if (is_dir)
                        mdo_ref_del(env, mdd_tobj, handle);

                la->la_valid = LA_CTIME;
                rc = mdd_attr_check_set_internal(env, mdd_tobj, la, handle, 0);
                if (rc)
                        GOTO(fixup_tpobj, rc);

                rc = mdd_finish_unlink(env, mdd_tobj, ma, handle);
                mdd_write_unlock(env, mdd_tobj);
                if (rc)
                        GOTO(fixup_tpobj, rc);

If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, the code tries to revert the changes by re-inserting @mdd_tobj into @mdd_tpobj, but it does not fix up the refcount of @mdd_tobj:

fixup_tpobj:
        if (rc) {
                rc2 = __mdd_index_delete(env, mdd_tpobj, tname, is_dir, handle,
                                         BYPASS_CAPA);
                if (rc2)
                        CWARN("tp obj fix error %d\n",rc2);

                if (mdd_tobj && mdd_object_exists(mdd_tobj) &&
                    !mdd_is_dead_obj(mdd_tobj)) {
                        rc2 = __mdd_index_insert(env, mdd_tpobj,
                                         mdo2fid(mdd_tobj), tname,
                                         is_dir, handle,
                                         BYPASS_CAPA);

                        if (rc2)
                                CWARN("tp obj fix error %d\n",rc2);
                }
        }

So if everything else is reverted, the refcount on the target object will still be wrong: it was decremented by mdo_ref_del() but never restored.
Is this analysis correct?
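
The concern above can be illustrated with a small toy model in C. This is only a sketch under stated assumptions, not Lustre code: every name in it (toy_obj, toy_ref_del, toy_ref_add, toy_failed_rename, toy_unlink) is hypothetical, the link count stands in for inode->i_nlink, and it assumes the target was not marked dead, so the fixup path puts the name back. It does not model locking, transactions, or the directory "dot" reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rename target; not a Lustre structure. */
struct toy_obj {
        unsigned int nlink;   /* models inode->i_nlink */
        bool         in_dir;  /* models the name entry in the parent */
};

/*
 * Models osd_object_ref_del(). Returns false where the real code would
 * hit ASSERTION( inode->i_nlink > 0 ) and LBUG; the count is unchanged.
 */
static bool toy_ref_del(struct toy_obj *o)
{
        if (o->nlink == 0)
                return false;
        o->nlink--;
        return true;
}

/* Models taking one link reference back. */
static void toy_ref_add(struct toy_obj *o)
{
        o->nlink++;
}

/*
 * Models a rename over an existing target that fails after the target's
 * link was dropped. restore_ref selects whether the rollback also
 * re-adds the reference (the quoted fixup path does not).
 */
static void toy_failed_rename(struct toy_obj *tgt, bool restore_ref)
{
        assert(tgt->in_dir);       /* the target name must exist */

        tgt->in_dir = false;       /* name no longer points at the target */
        (void)toy_ref_del(tgt);    /* like mdo_ref_del() on the target */

        /* ... a later step fails here and control jumps to the fixup ... */

        tgt->in_dir = true;        /* like __mdd_index_insert() in fixup */
        if (restore_ref)
                toy_ref_add(tgt);  /* the step the analysis says is missing */
}

/* Models a later unlink of the re-inserted name. */
static bool toy_unlink(struct toy_obj *tgt)
{
        tgt->in_dir = false;
        return toy_ref_del(tgt);
}
```

In this model, a regular file with nlink == 1 that goes through toy_failed_rename(tgt, false) ends up with a name in the directory but nlink == 0, and the next toy_unlink() returns false, which corresponds to the assertion in osd_object_ref_del(). With restore_ref set to true, the count is back to 1 and the later unlink succeeds. Whether the real crash took this path is not established in this ticket.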


Di Wang added a comment - 04/Oct/12 4:17 PM

Attached the Lustre debug dump log. It is not very useful for this LBUG, but it seems to contain some LNet errors. Liang, could you please have a look?


Cliff White (Inactive) added a comment - 04/Oct/12 11:25 AM

Hit this again while running SWL, backtrace:

PID: 4891   TASK: ffff88016dedaaa0  CPU: 13  COMMAND: "mdt03_014"
 #0 [ffff88016e693918] machine_kexec at ffffffff8103281b
 #1 [ffff88016e693978] crash_kexec at ffffffff810ba792
 #2 [ffff88016e693a48] panic at ffffffff814fd591
 #3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
 #4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
 #5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
 #6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
 #7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
 #8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
 #9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
#10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
#11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
#12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
#13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
#14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
#15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
#16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a


People

Assignee: Hongchao Zhang

Reporter: Cliff White (Inactive)

Votes: 0

Watchers: 12

Dates

Created: 16/Sep/12 2:47 PM

Updated: 16/Apr/13 2:38 AM

Resolved: 16/Apr/13 2:38 AM