[LU-1951] SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: Created: 16/Sep/12  Updated: 16/Apr/13  Resolved: 16/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None
Environment:

SWL Hyperion/LLNL


Attachments: File dump1.out.gz    
Severity: 3
Rank (Obsolete): 4375

 Description   

MDS crash dumped, attempting to locate dump at this time.
Message from MDS:

2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG

This looks like a possible duplicate of ORI-577; however, that bug was supposed to have been fixed.

The MDS did not dump a stack; it was configured with panic_on_lbug.
Will attempt to replicate.
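
For reference, a minimal, simplified sketch of the guard that fires at osd_handler.c:2343 (an illustration only, not the verbatim osd-ldiskfs source): the OSD drops one link count on the backing MDT inode during unlink, and if i_nlink is already zero a further decrement would underflow, so the code asserts instead.

        #include <linux/fs.h>        /* struct inode, drop_nlink(), mark_inode_dirty() */
        #include <libcfs/libcfs.h>   /* LASSERT() */

        /* Illustrative only: roughly what the failing check amounts to. */
        static void osd_ref_del_sketch(struct inode *inode)
        {
                /* the ASSERTION reported in the console log */
                LASSERT(inode->i_nlink > 0);

                drop_nlink(inode);          /* decrement i_nlink on the inode */
                mark_inode_dirty(inode);    /* schedule the inode for writeback */
        }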



 Comments   
Comment by Cliff White (Inactive) [ 16/Sep/12 ]

Okay, I have the vmcore; here is the LBUG stack.

Pid: 5503, comm: mdt01_012

Call Trace:
[<ffffffffa0397905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0397f17>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa100c6a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
[<ffffffffa0ef109d>] mdo_ref_del+0xad/0xb0 [mdd]
[<ffffffffa0ef6715>] mdd_unlink+0x815/0xd40 [mdd]
[<ffffffffa08a61e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
[<ffffffffa06a7037>] cml_unlink+0x97/0x200 [cmm]
[<ffffffffa0f7b454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
[<ffffffffa0f78151>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0f719aa>] mdt_reint_internal+0x50a/0x810 [mdt]
[<ffffffffa0f71cf4>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0f65802>] mdt_handle_common+0x922/0x1740 [mdt]
[<ffffffffa0f666f5>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa08b499d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
[<ffffffffa08abf37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
[<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff810533f3>] ? __wake_up+0x53/0x70
[<ffffffffa08b5f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
[<ffffffffa08b5430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa08b5430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
[<ffffffffa08b5430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG

Comment by Peter Jones [ 17/Sep/12 ]

Hongchao

Could you please comment on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 17/Sep/12 ]

This issue is different from ORI-577: in that case it was a local object (e.g. an llog file) that triggered the assertion, whereas here it is a normal file/object being unlinked.

This issue could be caused by an LDLM problem, since the MDT uses LDLM locks for parallel directory (PDO) locking.

Hi Cliff,
could you please test it with the debug patch http://review.whamcloud.com/#change,4009,
which disables "MDD_DISABLE_PDO_LOCK" in mdd/mdd_internal.h, so we can check whether the problem is related to LDLM or not? Thanks!
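
For context, a hypothetical sketch (illustrative names only, not the actual mdd/mdd_internal.h contents) of how a compile-time switch such as MDD_DISABLE_PDO_LOCK can gate the parallel-directory (PDO) locking helpers, so that a test build can show whether the crash depends on the directory-locking path:

        /* Illustrative only: when the switch is defined, the PDO lock helpers
         * compile away to no-ops for the test build. */
        #ifdef MDD_DISABLE_PDO_LOCK
        # define mdd_pdo_write_lock(env, obj, name)   (NULL)            /* no lock taken */
        # define mdd_pdo_write_unlock(env, obj, lh)   do { } while (0)  /* nothing to release */
        #endif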

Comment by Cliff White (Inactive) [ 17/Sep/12 ]

The patch does not build. I have a vmcore from the crash - would that be useful?

Comment by Hongchao Zhang [ 18/Sep/12 ]

okay, thanks!
I'm also trying to create an updated debug patch to collect some logs to help track this issue.

Comment by Cliff White (Inactive) [ 18/Sep/12 ]

The core dump is on brent.whamcloud.com ~/cliffw/lu1951/vmcore-lu1951.gz

Comment by Hongchao Zhang [ 19/Sep/12 ]

The possible patch is tracked at http://review.whamcloud.com/#change,4041

Hi Cliff, could you please test it with this patch? Thanks!

Comment by nasf (Inactive) [ 21/Sep/12 ]

We have found memory corruption in LU-1976: the corrupted OI index node can cause accesses to unknown RAM areas, and in that case some other objects/inodes may be affected as well. Although there is no direct evidence that LU-1951 is a duplicate of LU-1976, I lean toward treating it as one. Since the patch for LU-1976 is ready, we can wait for results with that patch to see whether LU-1951 can still be reproduced.

Comment by Peter Jones [ 22/Sep/12 ]

Dropping priority as this only occurred once in five days of testing

Comment by Cliff White (Inactive) [ 28/Sep/12 ]

Looks like we still have this issue with build 24:

Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
Sep 27 21:56:35 hyperion-rst6 kernel: Pid: 5611, comm: mdt00_015
Sep 27 21:56:35 hyperion-rst6 kernel:
Sep 27 21:56:35 hyperion-rst6 kernel: Call Trace:
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 27 21:56:35 hyperion-rst6 kernel:
Sep 27 21:56:35 hyperion-rst6 kernel: Kernel panic - not syncing: LBUG
Comment by Cliff White (Inactive) [ 28/Sep/12 ]

Also, some issues when recovering the MDS after the dump:

Lustre: lustre-MDT0000: used disk, loading
Lustre: 4204:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 8624944131
Lustre: 4140:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 91 clients reconnect
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
LustreError: 4216:0:(mdt_recovery.c:497:mdt_txn_stop_cb()) Replay transno 8624998654 failed: rc -39
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: lustre-MDT0000: Recovery over after 1:22, of 91 clients 90 recovered and 1 was evicted.
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000f_UUID now active, resetting orphans
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST001b_UUID now active, resetting orphans
Lustre: Skipped 5 previous similar messages
Lustre: Skipped 45 previous similar messages
Comment by Cliff White (Inactive) [ 28/Sep/12 ]

/var/log/messages from evicted client

Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1348812749/real 1348812749]  req@ffff8802116cec00 x1414295663929201/t0(0) o250->MGC192.168.127.6@o2ib@192.168.127.6@o2ib:26/25 lens 400/544 e 0 to 1 dl 1348812804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
Sep 27 23:14:59 hyperion787 kernel: Lustre: Evicted from MGS (at MGC192.168.127.6@o2ib_0) after server handle changed from 0x5e1536afc64a7a2d to 0x9107144105607180
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Reactivating import
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 192.168.127.6@o2ib)
Sep 27 23:16:21 hyperion787 kernel: Lustre: lustre-MDT0000-mdc-ffff88033ba61400: Connection restored to lustre-MDT0000 (at 192.168.127.6@o2ib1)
Comment by Hongchao Zhang [ 28/Sep/12 ]

Is the patch (http://review.whamcloud.com/#change,4041) included in this build (24)?

Comment by Peter Jones [ 28/Sep/12 ]

No, it is not.

Comment by Peter Jones [ 28/Sep/12 ]

Hongchao

If you think that it should be, then please set inspectors on the patch.

Thanks

Peter

Comment by Cliff White (Inactive) [ 29/Sep/12 ]

MDS crashed, panic stack:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
2012-09-27 21:56:35
2012-09-27 21:56:35 Call Trace:
2012-09-27 21:56:35 [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-09-27 21:56:35
2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1
2012-09-27 21:56:35 Call Trace:
2012-09-27 21:56:35 [<ffffffff814fd58a>] ? panic+0xa0/0x168
2012-09-27 21:56:35 [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20


vmcore is on brent in ~/cliffw/lu1948/erofs

Comment by Liang Zhen (Inactive) [ 29/Sep/12 ]

Check the last comment posted by Cliff on LU-1948; we hit this again while running SWL.

Comment by Cliff White (Inactive) [ 04/Oct/12 ]

Hit this again while running SWL, backtrace:

PID: 4891   TASK: ffff88016dedaaa0  CPU: 13  COMMAND: "mdt03_014"
 #0 [ffff88016e693918] machine_kexec at ffffffff8103281b
 #1 [ffff88016e693978] crash_kexec at ffffffff810ba792
 #2 [ffff88016e693a48] panic at ffffffff814fd591
 #3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
 #4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
 #5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
 #6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
 #7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
 #8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
 #9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
#10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
#11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
#12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
#13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
#14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
#15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
#16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a
Comment by Di Wang [ 04/Oct/12 ]

The Lustre debug dump log is attached. It is not of much use for this LBUG, but it seems there are some LNet errors. Liang, could you please have a look?

Comment by Liang Zhen (Inactive) [ 05/Oct/12 ]

I found something suspicious in mdd_rename(), but I am not an expert on this code, so please check it for me:

        /* Remove old target object
         * For tobj is remote case cmm layer has processed
         * and set tobj to NULL then. So when tobj is NOT NULL,
         * it must be local one.
         */
        if (tobj && mdd_object_exists(mdd_tobj)) {
                mdd_write_lock(env, mdd_tobj, MOR_TGT_CHILD);
                if (mdd_is_dead_obj(mdd_tobj)) {
                        mdd_write_unlock(env, mdd_tobj);
                        /* shld not be dead, something is wrong */
                        CERROR("tobj is dead, something is wrong\n");
                        rc = -EINVAL;
                        goto cleanup;
                }
                mdo_ref_del(env, mdd_tobj, handle);

                /* Remove dot reference. */
                if (is_dir)
                        mdo_ref_del(env, mdd_tobj, handle);

                la->la_valid = LA_CTIME;
                rc = mdd_attr_check_set_internal(env, mdd_tobj, la, handle, 0);
                if (rc)
                        GOTO(fixup_tpobj, rc);

                rc = mdd_finish_unlink(env, mdd_tobj, ma, handle);
                mdd_write_unlock(env, mdd_tobj);
                if (rc)
                        GOTO(fixup_tpobj, rc);

If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, the error path tries to revert the changes by re-inserting @mdd_tobj into @mdd_tpobj, but without fixing the refcount of @mdd_tobj:

fixup_tpobj:
        if (rc) {
                rc2 = __mdd_index_delete(env, mdd_tpobj, tname, is_dir, handle,
                                         BYPASS_CAPA);
                if (rc2)
                        CWARN("tp obj fix error %d\n",rc2);

                if (mdd_tobj && mdd_object_exists(mdd_tobj) &&
                    !mdd_is_dead_obj(mdd_tobj)) {
                        rc2 = __mdd_index_insert(env, mdd_tpobj,
                                         mdo2fid(mdd_tobj), tname,
                                         is_dir, handle,
                                         BYPASS_CAPA);

                        if (rc2)
                                CWARN("tp obj fix error %d\n",rc2);
                }
        }

So if everything gets reverted, the refcount on the target object will be wrong.
Is this analysis correct?
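
To illustrate the point, a minimal sketch of the compensation that appears to be missing, reusing the helper names from the excerpt above (a sketch under that assumption only; the actual fix in change 4197 may be structured differently):

        /* fixup_tpobj, sketched: if the target object is kept reachable by
         * re-inserting its name, the link reference(s) dropped by mdo_ref_del()
         * earlier would also need to be restored, otherwise a later unlink finds
         * i_nlink == 0 and trips the osd_object_ref_del() assertion. */
        if (mdd_tobj && mdd_object_exists(mdd_tobj) &&
            !mdd_is_dead_obj(mdd_tobj)) {
                rc2 = __mdd_index_insert(env, mdd_tpobj, mdo2fid(mdd_tobj),
                                         tname, is_dir, handle, BYPASS_CAPA);
                if (rc2)
                        CWARN("tp obj fix error %d\n", rc2);

                /* restore the reference removed before the failure ... */
                mdo_ref_add(env, mdd_tobj, handle);
                /* ... and the extra "." reference if the target is a directory */
                if (is_dir)
                        mdo_ref_add(env, mdd_tobj, handle);
        }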

Comment by Liang Zhen (Inactive) [ 05/Oct/12 ]

I have posted another patch for this: http://review.whamcloud.com/#change,4197
It should fix something, but I am not sure whether it fixes this particular bug.

Comment by Alex Zhuravlev [ 05/Oct/12 ]

Liang, it probably makes sense to add a CERROR() to see whether we hit this path.
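
For instance, a minimal sketch of such instrumentation in the fixup_tpobj branch, assuming the same variable names as the excerpt above:

        /* Sketch only: confirm at runtime whether the revert path is ever taken.
         * DFID/PFID are the standard Lustre FID format helpers. */
        CERROR("reverted failed rename, target "DFID" may now have a stale "
               "link count\n", PFID(mdo2fid(mdd_tobj)));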

Comment by Peter Jones [ 08/Oct/12 ]

Dropping priority because the patch landed for 2.3.

Comment by Andreas Dilger [ 18/Oct/12 ]

Liang, Oleg, what about http://review.whamcloud.com/4136? That patch didn't land to b2_3. Was that intended to fix the MDS crash, or is it a secondary problem that doesn't need to be fixed for 2.3.0?

Comment by Liang Zhen (Inactive) [ 18/Oct/12 ]

I think the problem fixed by review 4136 has existed for a long time (since 2.1), so it is probably not the cause of the crash here, but we should land it to master at least.

Comment by Bob Glossman (Inactive) [ 29/Oct/12 ]

Patch for master: http://review.whamcloud.com/4405
This is a port of http://review.whamcloud.com/#change,4197; the context is too different for it to be cherry-picked directly.

Comment by Hongchao Zhang [ 16/Apr/13 ]

duplicate of LU-3022
