  Lustre / LU-1951

SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Lustre 2.4.0
    • Lustre 2.3.0
    • None
    • Environment: SWL Hyperion/LLNL
    • 3
    • 4375

    Description

      MDS crash dumped; attempting to locate the dump at this time.
      Message from MDS:

      2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
      2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG

      This looks like a possible duplicate of ORI-577; however, that bug was supposed to have been fixed.

      The MDS did not dump a stack; it was configured with panic_on_lbug.
      Will attempt to replicate.

      Attachments

        Activity

          liang Liang Zhen (Inactive) added a comment (edited)

          I found something suspicious in mdd_rename(), but I'm not an expert in this area, so please check this for me:

                  /* Remove old target object
                   * For tobj is remote case cmm layer has processed
                   * and set tobj to NULL then. So when tobj is NOT NULL,
                   * it must be local one.
                   */
                  if (tobj && mdd_object_exists(mdd_tobj)) {
                          mdd_write_lock(env, mdd_tobj, MOR_TGT_CHILD);
                          if (mdd_is_dead_obj(mdd_tobj)) {
                                  mdd_write_unlock(env, mdd_tobj);
                                  /* shld not be dead, something is wrong */
                                  CERROR("tobj is dead, something is wrong\n");
                                  rc = -EINVAL;
                                  goto cleanup;
                          }
                          mdo_ref_del(env, mdd_tobj, handle);
          
                          /* Remove dot reference. */
                          if (is_dir)
                                  mdo_ref_del(env, mdd_tobj, handle);
          
                          la->la_valid = LA_CTIME;
                          rc = mdd_attr_check_set_internal(env, mdd_tobj, la, handle, 0);
                          if (rc)
                                  GOTO(fixup_tpobj, rc);
          
                          rc = mdd_finish_unlink(env, mdd_tobj, ma, handle);
                          mdd_write_unlock(env, mdd_tobj);
                          if (rc)
                                  GOTO(fixup_tpobj, rc);
          
          

          If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, it will try to revert the changes by re-inserting @mdd_tobj into @mdd_tpobj again, without fixing the refcount of @mdd_tobj:

          fixup_tpobj:
                  if (rc) {
                          rc2 = __mdd_index_delete(env, mdd_tpobj, tname, is_dir, handle,
                                                   BYPASS_CAPA);
                          if (rc2)
                                  CWARN("tp obj fix error %d\n",rc2);
          
                          if (mdd_tobj && mdd_object_exists(mdd_tobj) &&
                              !mdd_is_dead_obj(mdd_tobj)) {
                                  rc2 = __mdd_index_insert(env, mdd_tpobj,
                                                   mdo2fid(mdd_tobj), tname,
                                                   is_dir, handle,
                                                   BYPASS_CAPA);
          
                                  if (rc2)
                                          CWARN("tp obj fix error %d\n",rc2);
                          }
                  }
          
          

          So if everything gets reverted, the refcount on the target object will be wrong.
          Is this analysis correct?


          di.wang Di Wang (Inactive) added a comment

          The Lustre debug dump log is attached. It is not very useful for this LBUG, but it does seem to show some LNet errors. Liang, could you please have a look?

          Hit this again while running SWL, backtrace:

          PID: 4891   TASK: ffff88016dedaaa0  CPU: 13  COMMAND: "mdt03_014"
           #0 [ffff88016e693918] machine_kexec at ffffffff8103281b
           #1 [ffff88016e693978] crash_kexec at ffffffff810ba792
           #2 [ffff88016e693a48] panic at ffffffff814fd591
           #3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
           #4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
           #5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
           #6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
           #7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
           #8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
           #9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
          #10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
          #11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
          #12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
          #13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
          #14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
          #15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
          #16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a
          
          cliffw Cliff White (Inactive) added a comment

          See the last comment posted by Cliff on LU-1948; we hit this again while running SWL.

          liang Liang Zhen (Inactive) added a comment

          MDS crashed, panic stack:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Call Trace:
          2012-09-27 21:56:35 [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1
          2012-09-27 21:56:35 Call Trace:
          2012-09-27 21:56:35 [<ffffffff814fd58a>] ? panic+0xa0/0x168
          2012-09-27 21:56:35 [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20

          
          

          vmcore is on brent in ~/cliffw/lu1948/erofs

          cliffw Cliff White (Inactive) added a comment
          pjones Peter Jones added a comment -

          Hongchao

          If you think that it should be then please set inspectors on the patch.

          Thanks

          Peter

          pjones Peter Jones added a comment -

          No it is not.


          Is the patch (http://review.whamcloud.com/#change,4041) included in this build (24)?

          hongchao.zhang Hongchao Zhang added a comment

          /var/log/messages from evicted client

          Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1348812749/real 1348812749]  req@ffff8802116cec00 x1414295663929201/t0(0) o250->MGC192.168.127.6@o2ib@192.168.127.6@o2ib:26/25 lens 400/544 e 0 to 1 dl 1348812804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
          Sep 27 23:14:59 hyperion787 kernel: Lustre: Evicted from MGS (at MGC192.168.127.6@o2ib_0) after server handle changed from 0x5e1536afc64a7a2d to 0x9107144105607180
          Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Reactivating import
          Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 192.168.127.6@o2ib)
          Sep 27 23:16:21 hyperion787 kernel: Lustre: lustre-MDT0000-mdc-ffff88033ba61400: Connection restored to lustre-MDT0000 (at 192.168.127.6@o2ib1)
          
          cliffw Cliff White (Inactive) added a comment

          Also, some issues when recovering the MDS after the dump:

          Lustre: lustre-MDT0000: used disk, loading
          Lustre: 4204:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 8624944131
          Lustre: 4140:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
          Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 91 clients reconnect
          LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
          LustreError: 4216:0:(mdt_recovery.c:497:mdt_txn_stop_cb()) Replay transno 8624998654 failed: rc -39
          LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
          Lustre: lustre-MDT0000: disconnecting 1 stale clients
          Lustre: lustre-MDT0000: Recovery over after 1:22, of 91 clients 90 recovered and 1 was evicted.
          Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000f_UUID now active, resetting orphans
          Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST001b_UUID now active, resetting orphans
          Lustre: Skipped 5 previous similar messages
          Lustre: Skipped 45 previous similar messages
          
          cliffw Cliff White (Inactive) added a comment

          Looks like we still have this issue with build 24:

          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          Sep 27 21:56:35 hyperion-rst6 kernel: Pid: 5611, comm: mdt00_015
          Sep 27 21:56:35 hyperion-rst6 kernel:
          Sep 27 21:56:35 hyperion-rst6 kernel: Call Trace:
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
          Sep 27 21:56:35 hyperion-rst6 kernel:
          Sep 27 21:56:35 hyperion-rst6 kernel: Kernel panic - not syncing: LBUG
          
          cliffw Cliff White (Inactive) added a comment

          People

            hongchao.zhang Hongchao Zhang
            cliffw Cliff White (Inactive)
            Votes: 0
            Watchers: 12

            Dates

              Created:
              Updated:
              Resolved: