[LU-1951] SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: Created: 16/Sep/12 Updated: 16/Apr/13 Resolved: 16/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | SWL Hyperion/LLNL |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 4375 |
| Description |
|
MDS crash dumped; attempting to locate the dump at this time.
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
This looks like a possible dup of ORI-577; however, that bug was supposed to have been fixed. The MDS did not dump a stack; it was configured with panic_on_lbug. |
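For context, a minimal stand-in (toy code, not the actual osd_handler.c implementation) of what the failed assertion guards: osd_object_ref_del() is expected to drop one hard-link reference from the backing inode, so i_nlink must still be positive when it is called; hitting the assertion means some earlier path already dropped the last reference.

#include <assert.h>
#include <stdio.h>

struct toy_inode {
        int i_nlink;                    /* hard-link count, as in a kernel inode */
};

/* simplified analogue of osd_object_ref_del(): drop one link reference */
static void toy_object_ref_del(struct toy_inode *inode)
{
        assert(inode->i_nlink > 0);     /* the ASSERTION that fires in this ticket */
        inode->i_nlink--;
}

int main(void)
{
        struct toy_inode file = { .i_nlink = 1 };

        toy_object_ref_del(&file);      /* fine: 1 -> 0 */
        printf("nlink now %d\n", file.i_nlink);
        toy_object_ref_del(&file);      /* a second, unexpected drop: assertion fires */
        return 0;
}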
| Comments |
| Comment by Cliff White (Inactive) [ 16/Sep/12 ] |
|
Okay, I have the vmcore. Here is the LBUG stack:
Pid: 5503, comm: mdt01_012
Call Trace:
Kernel panic - not syncing: LBUG |
| Comment by Peter Jones [ 17/Sep/12 ] |
|
Hongchao, could you please comment on this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 17/Sep/12 ] |
|
This issue is different from ORI-577, in which it is a local object (say, an llog file) that triggers the assertion. This issue could be caused by an LDLM problem, since the MDT uses LDLM locks to protect parallel directory operations. Hi Cliff, |
| Comment by Cliff White (Inactive) [ 17/Sep/12 ] |
|
The patch does not build. I have a vmcore from the crash - would that be useful? |
| Comment by Hongchao Zhang [ 18/Sep/12 ] |
|
okay, thanks! |
| Comment by Cliff White (Inactive) [ 18/Sep/12 ] |
|
The core dump is on brent.whamcloud.com ~/cliffw/lu1951/vmcore-lu1951.gz |
| Comment by Hongchao Zhang [ 19/Sep/12 ] |
|
The possible patch is tracked at http://review.whamcloud.com/#change,4041. Hi Cliff, could you please test with this patch? Thanks! |
| Comment by nasf (Inactive) [ 21/Sep/12 ] |
|
We have found memory corruption in |
| Comment by Peter Jones [ 22/Sep/12 ] |
|
Dropping priority as this only occurred once in five days of testing |
| Comment by Cliff White (Inactive) [ 28/Sep/12 ] |
|
Looks like we still have this issue with build 24:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
Sep 27 21:56:35 hyperion-rst6 kernel: Pid: 5611, comm: mdt00_015
Sep 27 21:56:35 hyperion-rst6 kernel:
Sep 27 21:56:35 hyperion-rst6 kernel: Call Trace:
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 27 21:56:35 hyperion-rst6 kernel:
Sep 27 21:56:35 hyperion-rst6 kernel: Kernel panic - not syncing: LBUG |
| Comment by Cliff White (Inactive) [ 28/Sep/12 ] |
|
Also, some issues when recovering the MDS after the dump:
Lustre: lustre-MDT0000: used disk, loading
Lustre: 4204:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 8624944131
Lustre: 4140:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 91 clients reconnect
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
LustreError: 4216:0:(mdt_recovery.c:497:mdt_txn_stop_cb()) Replay transno 8624998654 failed: rc -39
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: lustre-MDT0000: Recovery over after 1:22, of 91 clients 90 recovered and 1 was evicted.
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000f_UUID now active, resetting orphans
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST001b_UUID now active, resetting orphans
Lustre: Skipped 5 previous similar messages
Lustre: Skipped 45 previous similar messages |
| Comment by Cliff White (Inactive) [ 28/Sep/12 ] |
|
/var/log/messages from the evicted client:
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1348812749/real 1348812749] req@ffff8802116cec00 x1414295663929201/t0(0) o250->MGC192.168.127.6@o2ib@192.168.127.6@o2ib:26/25 lens 400/544 e 0 to 1 dl 1348812804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
Sep 27 23:14:59 hyperion787 kernel: Lustre: Evicted from MGS (at MGC192.168.127.6@o2ib_0) after server handle changed from 0x5e1536afc64a7a2d to 0x9107144105607180
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Reactivating import
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 192.168.127.6@o2ib)
Sep 27 23:16:21 hyperion787 kernel: Lustre: lustre-MDT0000-mdc-ffff88033ba61400: Connection restored to lustre-MDT0000 (at 192.168.127.6@o2ib1) |
| Comment by Hongchao Zhang [ 28/Sep/12 ] |
|
Is the patch (http://review.whamcloud.com/#change,4041) included in this build (24)? |
| Comment by Peter Jones [ 28/Sep/12 ] |
|
No it is not. |
| Comment by Peter Jones [ 28/Sep/12 ] |
|
Hongchao, if you think that it should be, then please set inspectors on the patch. Thanks, Peter |
| Comment by Cliff White (Inactive) [ 29/Sep/12 ] |
|
MDS crashed, panic stack: vmcore is on brent in ~/cliffw/lu1948/erofs |
| Comment by Liang Zhen (Inactive) [ 29/Sep/12 ] |
|
Check the last comment posted by Cliff on |
| Comment by Cliff White (Inactive) [ 04/Oct/12 ] |
|
Hit this again while running SWL, backtrace:
PID: 4891 TASK: ffff88016dedaaa0 CPU: 13 COMMAND: "mdt03_014"
#0 [ffff88016e693918] machine_kexec at ffffffff8103281b
#1 [ffff88016e693978] crash_kexec at ffffffff810ba792
#2 [ffff88016e693a48] panic at ffffffff814fd591
#3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
#4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
#5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
#6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
#7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
#8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
#9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
#10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
#11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
#12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
#13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
#14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
#15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
#16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a
|
| Comment by Di Wang [ 04/Oct/12 ] |
|
The Lustre debug dump log is attached, though it is not much use for this LBUG. But it seems there are some LNet errors. Liang, could you please have a look? |
| Comment by Liang Zhen (Inactive) [ 05/Oct/12 ] |
|
I found something suspicious in mdd_rename(), but I'm not an expert on this, so please check this for me:
/* Remove old target object
* For tobj is remote case cmm layer has processed
* and set tobj to NULL then. So when tobj is NOT NULL,
* it must be local one.
*/
if (tobj && mdd_object_exists(mdd_tobj)) {
mdd_write_lock(env, mdd_tobj, MOR_TGT_CHILD);
if (mdd_is_dead_obj(mdd_tobj)) {
mdd_write_unlock(env, mdd_tobj);
/* shld not be dead, something is wrong */
CERROR("tobj is dead, something is wrong\n");
rc = -EINVAL;
goto cleanup;
}
mdo_ref_del(env, mdd_tobj, handle);
/* Remove dot reference. */
if (is_dir)
mdo_ref_del(env, mdd_tobj, handle);
la->la_valid = LA_CTIME;
rc = mdd_attr_check_set_internal(env, mdd_tobj, la, handle, 0);
if (rc)
GOTO(fixup_tpobj, rc);
rc = mdd_finish_unlink(env, mdd_tobj, ma, handle);
mdd_write_unlock(env, mdd_tobj);
if (rc)
GOTO(fixup_tpobj, rc);
If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, it will try to revert the changes by re-inserting @mdd_tobj into @mdd_tpobj again without fixing the refcount of @mdd_tobj:
fixup_tpobj:
if (rc) {
rc2 = __mdd_index_delete(env, mdd_tpobj, tname, is_dir, handle,
BYPASS_CAPA);
if (rc2)
CWARN("tp obj fix error %d\n",rc2);
if (mdd_tobj && mdd_object_exists(mdd_tobj) &&
!mdd_is_dead_obj(mdd_tobj)) {
rc2 = __mdd_index_insert(env, mdd_tpobj,
mdo2fid(mdd_tobj), tname,
is_dir, handle,
BYPASS_CAPA);
if (rc2)
CWARN("tp obj fix error %d\n",rc2);
}
}
So if everything gets reverted, the refcount on the target object will be wrong. |
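A minimal, self-contained sketch of that inconsistency (toy code, not the Lustre implementation): the rename path has already dropped the target object's link reference(s) when a later step fails, but the revert only re-inserts the name and never adds the references back, so the next unlink of that name trips the ASSERTION( inode->i_nlink > 0 ) check.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_obj {
        int  nlink;                     /* link count kept by the OSD layer */
        bool name_in_parent;            /* is the name still visible in the directory? */
};

/* analogue of osd_object_ref_del(): drop one link reference */
static void ref_del(struct toy_obj *o)
{
        assert(o->nlink > 0);           /* the assertion from this ticket */
        o->nlink--;
}

int main(void)
{
        struct toy_obj tobj = { .nlink = 1, .name_in_parent = true };

        ref_del(&tobj);                 /* rename drops the old target's reference */
        /* a later step fails; the revert restores only the name
         * (the __mdd_index_insert() in fixup_tpobj), not the reference */
        tobj.name_in_parent = true;
        /* nlink is still 0, so a subsequent unlink of that name dies here */
        if (tobj.name_in_parent)
                ref_del(&tobj);         /* assertion fires */
        printf("unreachable\n");
        return 0;
}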
| Comment by Liang Zhen (Inactive) [ 05/Oct/12 ] |
|
I've posted another patch for this: http://review.whamcloud.com/#change,4197 |
| Comment by Alex Zhuravlev [ 05/Oct/12 ] |
|
Liang, it probably makes sense to add a CERROR() to see whether we hit this path. |
| Comment by Peter Jones [ 08/Oct/12 ] |
|
Dropping priority because the fix landed for 2.3. |
| Comment by Andreas Dilger [ 18/Oct/12 ] |
|
Liang, Oleg, what about http://review.whamcloud.com/4136? That patch didn't land to b2_3. Was that intended to fix the MDS crash, or is it a secondary problem that doesn't need to be fixed for 2.3.0? |
| Comment by Liang Zhen (Inactive) [ 18/Oct/12 ] |
|
I think the problem fixed by review-4136 has existed for a long time (since 2.1), so it's probably not the reason for the crash here, but we should land it on master at least. |
| Comment by Bob Glossman (Inactive) [ 29/Oct/12 ] |
|
patch for master: http://review.whamcloud.com/4405 |
| Comment by Hongchao Zhang [ 16/Apr/13 ] |
|
duplicate of |