[LU-1951] SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0
    • Labels: None
    • Environment: SWL Hyperion/LLNL
    • Severity: 3
    • Rank: 4375

    Description

      The MDS crashed and dumped; attempting to locate the dump at this time.
      Message from the MDS:

      2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
      2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG

      This looks like a possible duplicate of ORI-577; however, that bug was supposed to have been fixed.

      The MDS did not dump a stack; it was configured with panic_on_lbug.
      Will attempt to reproduce.

        Activity

          hongchao.zhang Hongchao Zhang added a comment -

          Duplicate of LU-3022.

          bogl Bob Glossman (Inactive) added a comment -

          Patch for master: http://review.whamcloud.com/4405
          This is a port of http://review.whamcloud.com/#change,4197. The context is too different for it to be cherry-picked directly.

          liang Liang Zhen (Inactive) added a comment -

          I think the problem fixed by review-4136 has existed for a long time (since 2.1), so it is probably not the cause of the crash here, but we should at least land it on master.

          adilger Andreas Dilger added a comment -

          Liang, Oleg, what about http://review.whamcloud.com/4136? That patch didn't land on b2_3. Was it intended to fix the MDS crash, or is it a secondary problem that doesn't need to be fixed for 2.3.0?
          pjones Peter Jones added a comment -

          Dropping priority because the patch landed for 2.3.

          bzzz Alex Zhuravlev added a comment -

          Liang, it probably makes sense to add a CERROR() to see whether we hit this path.
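
As an illustration of the kind of instrumentation suggested here, the following is a minimal user-space sketch. Everything in it is an assumption made for the example: `toy_log()`, `TOY_CERROR()`, `toy_fixup_hits` and `toy_operation()` are hypothetical stand-ins, not Lustre code, and the comment does not say which path was meant. The idea is only that a log call placed in an error branch, prefixed with file, line and function, shows whether that branch ever runs.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical counter: how many times the instrumented branch ran. */
static unsigned int toy_fixup_hits;

/* User-space stand-in for a CERROR()-style logger. It prefixes the
 * message with file, line and function, similar in spirit to the
 * "(osd_handler.c:2343:osd_object_ref_del())" prefix in the MDS log. */
static void toy_log(const char *file, int line, const char *func,
                    const char *fmt, ...)
{
        va_list args;

        assert(fmt != NULL);
        fprintf(stderr, "ToyError: (%s:%d:%s()) ", file, line, func);
        va_start(args, fmt);
        vfprintf(stderr, fmt, args);
        va_end(args);
}

#define TOY_CERROR(...) toy_log(__FILE__, __LINE__, __func__, __VA_ARGS__)

/* Hypothetical operation whose error branch we want to observe. */
static int toy_operation(int rc)
{
        if (rc != 0) {
                toy_fixup_hits++;
                TOY_CERROR("entered fixup path, rc = %d\n", rc);
        }
        return rc;
}
```

Calling `toy_operation()` with a non-zero `rc` writes one line to stderr and increments `toy_fixup_hits`; a zero `rc` leaves both untouched.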

          liang Liang Zhen (Inactive) added a comment -

          I've posted another patch for this: http://review.whamcloud.com/#change,4197
          It should fix something, but I'm not sure whether it fixes this bug.
          liang Liang Zhen (Inactive) added a comment (edited) -

          I found something suspicious in mdd_rename(), but I'm not an expert in this code, so please check it for me:

                  /* Remove old target object
                   * For tobj is remote case cmm layer has processed
                   * and set tobj to NULL then. So when tobj is NOT NULL,
                   * it must be local one.
                   */
                  if (tobj && mdd_object_exists(mdd_tobj)) {
                          mdd_write_lock(env, mdd_tobj, MOR_TGT_CHILD);
                          if (mdd_is_dead_obj(mdd_tobj)) {
                                  mdd_write_unlock(env, mdd_tobj);
                                  /* shld not be dead, something is wrong */
                                  CERROR("tobj is dead, something is wrong\n");
                                  rc = -EINVAL;
                                  goto cleanup;
                          }
                          mdo_ref_del(env, mdd_tobj, handle);
          
                          /* Remove dot reference. */
                          if (is_dir)
                                  mdo_ref_del(env, mdd_tobj, handle);
          
                          la->la_valid = LA_CTIME;
                          rc = mdd_attr_check_set_internal(env, mdd_tobj, la, handle, 0);
                          if (rc)
                                  GOTO(fixup_tpobj, rc);
          
                          rc = mdd_finish_unlink(env, mdd_tobj, ma, handle);
                          mdd_write_unlock(env, mdd_tobj);
                          if (rc)
                                  GOTO(fixup_tpobj, rc);
          
          

          If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, the code tries to revert the changes by re-inserting @mdd_tobj into @mdd_tpobj, but without fixing the refcount of @mdd_tobj:

          fixup_tpobj:
                  if (rc) {
                          rc2 = __mdd_index_delete(env, mdd_tpobj, tname, is_dir, handle,
                                                   BYPASS_CAPA);
                          if (rc2)
                                  CWARN("tp obj fix error %d\n",rc2);
          
                          if (mdd_tobj && mdd_object_exists(mdd_tobj) &&
                              !mdd_is_dead_obj(mdd_tobj)) {
                                  rc2 = __mdd_index_insert(env, mdd_tpobj,
                                                   mdo2fid(mdd_tobj), tname,
                                                   is_dir, handle,
                                                   BYPASS_CAPA);
          
                                  if (rc2)
                                          CWARN("tp obj fix error %d\n",rc2);
                          }
                  }
          
          

          So if everything is reverted, the refcount on the target object will be wrong.
          Is this analysis correct?
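
The concern raised in this comment can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct toy_inode`, `toy_ref_del()`, `toy_ref_add()` and `toy_rename_over()` are hypothetical stand-ins, not Lustre functions, and the model ignores locking, transactions and the directory "dot" reference. It shows how a link count that is dropped before a failing step, and not restored during rollback, leaves a later unlink with nothing to decrement, which is the condition that `ASSERTION( inode->i_nlink > 0 )` checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; only the link count is modelled. */
struct toy_inode {
        unsigned int nlink;
};

enum {
        TOY_OK         = 0,
        TOY_ERR_NOLINK = -1,    /* nothing left to decrement */
        TOY_ERR_STEP   = -2,    /* a later step of the operation failed */
};

/* Models the check in osd_object_ref_del(): returns false in the case
 * where the real code would hit the i_nlink > 0 assertion. */
static bool toy_ref_del(struct toy_inode *inode)
{
        assert(inode != NULL);
        if (inode->nlink == 0)
                return false;
        inode->nlink--;
        return true;
}

static void toy_ref_add(struct toy_inode *inode)
{
        assert(inode != NULL);
        inode->nlink++;
}

/* Models renaming over an existing target: drop the target's link count,
 * then run a step that may fail. On failure the name is conceptually
 * re-inserted; restore_ref selects whether the count is restored too. */
static int toy_rename_over(struct toy_inode *tobj, bool step_fails,
                           bool restore_ref)
{
        assert(tobj != NULL);
        if (!toy_ref_del(tobj))
                return TOY_ERR_NOLINK;

        if (step_fails) {
                if (restore_ref)
                        toy_ref_add(tobj);      /* rollback that also fixes the count */
                return TOY_ERR_STEP;
        }
        return TOY_OK;
}
```

With `restore_ref` set to false, a failed `toy_rename_over()` on an object with `nlink == 1` leaves the count at 0 while the name still exists, so the next `toy_ref_del()` returns false. With `restore_ref` set to true, the count goes back to 1 and a later unlink succeeds.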

          di.wang Di Wang added a comment -

          Here is the Lustre debug dump log. It is not much use for this LBUG, but there seem to be some LNet errors in it. Liang, could you please have a look?

          cliffw Cliff White (Inactive) added a comment -

          Hit this again while running SWL. Backtrace:

          PID: 4891   TASK: ffff88016dedaaa0  CPU: 13  COMMAND: "mdt03_014"
           #0 [ffff88016e693918] machine_kexec at ffffffff8103281b
           #1 [ffff88016e693978] crash_kexec at ffffffff810ba792
           #2 [ffff88016e693a48] panic at ffffffff814fd591
           #3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
           #4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
           #5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
           #6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
           #7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
           #8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
           #9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
          #10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
          #11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
          #12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
          #13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
          #14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
          #15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
          #16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a

          People

            Assignee: hongchao.zhang Hongchao Zhang
            Reporter: cliffw Cliff White (Inactive)
            Votes: 0
            Watchers: 12
