Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
- Affects Version: Lustre 2.3.0
- Fix Version: None
- Environment: SWL Hyperion/LLNL
- Severity: 3
- 4375
Description
The MDS crashed and dumped; we are attempting to locate the dump at this time.
Message from MDS:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
This looks like a possible duplicate of ORI-577; however, that bug was supposed to have been fixed.
The MDS did not dump a stack; it was configured with panic_on_lbug.
We will attempt to replicate.
Attachments
Activity
See the last comment posted by Cliff on LU-1948; we hit this again while running SWL.
MDS crashed, panic stack:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
2012-09-27 21:56:35
2012-09-27 21:56:35 Call Trace:
2012-09-27 21:56:35 [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1
2012-09-27 21:56:35 Call Trace:
2012-09-27 21:56:35 [<ffffffff814fd58a>] ? panic+0xa0/0x168
2012-09-27 21:56:35 [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2012-09-27 21:56:35 [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20
vmcore is on brent in ~/cliffw/lu1948/erofs
Hongchao
If you think that it should be then please set inspectors on the patch.
Thanks
Peter
Is the patch (http://review.whamcloud.com/#change,4041) included in this build (24)?
/var/log/messages from evicted client
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1348812749/real 1348812749] req@ffff8802116cec00 x1414295663929201/t0(0) o250->MGC192.168.127.6@o2ib@192.168.127.6@o2ib:26/25 lens 400/544 e 0 to 1 dl 1348812804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
Sep 27 23:14:59 hyperion787 kernel: Lustre: Evicted from MGS (at MGC192.168.127.6@o2ib_0) after server handle changed from 0x5e1536afc64a7a2d to 0x9107144105607180
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Reactivating import
Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 192.168.127.6@o2ib)
Sep 27 23:16:21 hyperion787 kernel: Lustre: lustre-MDT0000-mdc-ffff88033ba61400: Connection restored to lustre-MDT0000 (at 192.168.127.6@o2ib1)
Also, some issues when recovering the MDS after the dump:
Lustre: lustre-MDT0000: used disk, loading
Lustre: 4204:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 8624944131
Lustre: 4140:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 91 clients reconnect
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
LustreError: 4216:0:(mdt_recovery.c:497:mdt_txn_stop_cb()) Replay transno 8624998654 failed: rc -39
LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: lustre-MDT0000: Recovery over after 1:22, of 91 clients 90 recovered and 1 was evicted.
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000f_UUID now active, resetting orphans
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST001b_UUID now active, resetting orphans
Lustre: Skipped 5 previous similar messages
Lustre: Skipped 45 previous similar messages
Looks like we still have this issue with build 24:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
Sep 27 21:56:35 hyperion-rst6 kernel: Pid: 5611, comm: mdt00_015
Sep 27 21:56:35 hyperion-rst6 kernel: Call Trace:
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Sep 27 21:56:35 hyperion-rst6 kernel: Kernel panic - not syncing: LBUG
We found memory corruption in LU-1976: the crashed OI index node may cause accesses to unknown RAM areas, and in such a case some objects/inodes may be affected as well. Although there is no direct evidence that LU-1951 is a duplicate of LU-1976, I am inclined to think it is. Since the patch for LU-1976 is ready, we can wait for new test results with that patch to check whether LU-1951 can still be reproduced.
Hit this again while running SWL, backtrace:
PID: 4891  TASK: ffff88016dedaaa0  CPU: 13  COMMAND: "mdt03_014"
 #0 [ffff88016e693918] machine_kexec at ffffffff8103281b
 #1 [ffff88016e693978] crash_kexec at ffffffff810ba792
 #2 [ffff88016e693a48] panic at ffffffff814fd591
 #3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
 #4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
 #5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
 #6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
 #7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
 #8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
 #9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
#10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
#11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
#12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
#13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
#14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
#15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
#16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a