Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
- Affects Version: Lustre 2.3.0
- Fix Version: None
- Environment: SWL Hyperion/LLNL
- Severity: 3
- 4375
Description
The MDS crashed and produced a crash dump; we are attempting to locate the dump at this time.
Message from MDS:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
This looks like a possible duplicate of ORI-577; however, that bug was supposed to have been fixed.
The MDS did not dump a stack; it was configured with panic_on_lbug.
Will attempt to replicate.
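For context, the failed assertion means osd_object_ref_del() was asked to drop a link on an inode whose link count had already reached zero, i.e. an earlier unbalanced decrement underflowed nlink. A minimal userspace sketch of the pattern (an illustration with invented names, not the actual osd_handler.c code):

    /* Sketch only: models ASSERTION( inode->i_nlink > 0 ) in
     * osd_object_ref_del(); not the actual Lustre source. */
    #include <assert.h>

    struct inode {
            int i_nlink;    /* hard-link count */
    };

    static void osd_object_ref_del_sketch(struct inode *inode)
    {
            /* The LBUG fires when a link is dropped on an inode whose
             * link count is already zero. */
            assert(inode->i_nlink > 0);
            inode->i_nlink--;
    }

    int main(void)
    {
            struct inode victim = { .i_nlink = 1 };
            osd_object_ref_del_sketch(&victim);  /* ok: 1 -> 0 */
            osd_object_ref_del_sketch(&victim);  /* asserts: nlink already 0 */
            return 0;
    }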
Attachments
Activity
The Lustre debug dump log is attached. It is not much use for this LBUG, but there appear to be some LNet errors. Liang, could you please have a look?
Hit this again while running SWL; backtrace:
PID: 4891 TASK: ffff88016dedaaa0 CPU: 13 COMMAND: "mdt03_014"
#0 [ffff88016e693918] machine_kexec at ffffffff8103281b
#1 [ffff88016e693978] crash_kexec at ffffffff810ba792
#2 [ffff88016e693a48] panic at ffffffff814fd591
#3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
#4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
#5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
#6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
#7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
#8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
#9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
#10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
#11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
#12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
#13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
#14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
#15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
#16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a
See the last comment posted by Cliff on LU-1948: we hit this again while running SWL.
MDS crashed, panic stack:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
2012-09-27 21:56:35 Call Trace:
2012-09-27 21:56:35 [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1
2012-09-27 21:56:35 Call Trace:
2012-09-27 21:56:35 [<ffffffff814fd58a>] ? panic+0xa0/0x168
2012-09-27 21:56:35 [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20
vmcore is on brent in ~/cliffw/lu1948/erofs
Hongchao,
If you think that it should be, then please set inspectors on the patch.
Thanks,
Peter
Is the patch (http://review.whamcloud.com/#change,4041) included in this build (24)?
/var/log/messages from evicted client
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1348812749/real 1348812749] req@ffff8802116cec00 x1414295663929201/t0(0) o250->MGC192.168.127.6@o2ib@192.168.127.6@o2ib:26/25 lens 400/544 e 0 to 1 dl 1348812804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
Sep 27 23:14:59 hyperion787 kernel: Lustre: Evicted from MGS (at MGC192.168.127.6@o2ib_0) after server handle changed from 0x5e1536afc64a7a2d to 0x9107144105607180
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Reactivating import
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 192.168.127.6@o2ib)
Sep 27 23:16:21 hyperion787 kernel: Lustre: lustre-MDT0000-mdc-ffff88033ba61400: Connection restored to lustre-MDT0000 (at 192.168.127.6@o2ib1)
Also, some issues when recovering the MDS after the dump:
Lustre: lustre-MDT0000: used disk, loading
Lustre: 4204:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 8624944131
Lustre: 4140:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 91 clients reconnect
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
LustreError: 4216:0:(mdt_recovery.c:497:mdt_txn_stop_cb()) Replay transno 8624998654 failed: rc -39
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: lustre-MDT0000: Recovery over after 1:22, of 91 clients 90 recovered and 1 was evicted.
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000f_UUID now active, resetting orphans
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST001b_UUID now active, resetting orphans
Lustre: Skipped 5 previous similar messages
Lustre: Skipped 45 previous similar messages
Looks like we still have this issue with build 24:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
Sep 27 21:56:35 hyperion-rst6 kernel: Pid: 5611, comm: mdt00_015
Sep 27 21:56:35 hyperion-rst6 kernel: Call Trace:
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 27 21:56:35 hyperion-rst6 kernel: Kernel panic - not syncing: LBUG
I found something suspicious in mdd_rename(), but I'm not an expert on this code, so please check it for me:
If mdd_attr_check_set_internal() or mdd_finish_unlink() fails, mdd_rename() tries to revert its changes by re-inserting @mdd_tobj into @mdd_tpobj again, without fixing the refcount of @mdd_tobj:
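Roughly, as a simplified userspace model of the suspected sequence (the helper names here are hypothetical stand-ins for the mdd index-insert/delete and ref-add/del operations, not the actual Lustre code):

    /* Models the suspected mdd_rename() error path: the target's
     * directory entry is restored on failure, but the nlink dropped
     * earlier is never re-added. Sketch only, under that assumption. */
    #include <assert.h>
    #include <stdbool.h>

    struct object {
            int  nlink;     /* link count of @mdd_tobj */
            bool in_dir;    /* whether @mdd_tobj has an entry in @mdd_tpobj */
    };

    static void index_delete(struct object *o) { o->in_dir = false; }
    static void index_insert(struct object *o) { o->in_dir = true;  }
    static void ref_del(struct object *o) { assert(o->nlink > 0); o->nlink--; }

    /* rename onto an existing target, with a late failure injected after
     * the target's entry and one of its links were already dropped */
    static void mdd_rename_sketch(struct object *tobj, bool late_failure)
    {
            index_delete(tobj);     /* drop target's entry from @mdd_tpobj */
            ref_del(tobj);          /* nlink-- on @mdd_tobj */
            if (late_failure) {
                    /* revert path: re-insert the entry, but there is no
                     * matching ref_add(tobj), so nlink stays one too low */
                    index_insert(tobj);
            }
    }

    int main(void)
    {
            struct object tobj = { .nlink = 1, .in_dir = true };
            mdd_rename_sketch(&tobj, true);
            /* a later unlink of the re-inserted entry now trips the analog
             * of ASSERTION( inode->i_nlink > 0 ) in osd_object_ref_del() */
            ref_del(&tobj);         /* assert fires: nlink already 0 */
            return 0;
    }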
So if everything is reverted, the refcount on the target object will be wrong.
Is this analysis correct?