Jan 30 09:10:48 ehyperion-rst6 LDAPOTP-AUTH[1760]: root@hyperion244.llnl.gov as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl mark mds1 has failed over 3 times, and counting...");echo XXRETCODE:$?'
Jan 30 09:10:48 ehyperion-rst6 kernel: Lustre: DEBUG MARKER: mds1 has failed over 3 times, and counting...
Jan 30 09:10:48 ehyperion-rst6 kernel: Lustre: 1468:0:(ldlm_lib.c:2112:target_queue_recovery_request()) Next recovery transno: 30068878164, current: 30068430667, replaying
Jan 30 09:10:48 ehyperion-rst6 kernel: Lustre: 1468:0:(ldlm_lib.c:2112:target_queue_recovery_request()) Skipped 4 previous similar messages
Jan 30 09:10:53 ehyperion-rst6 kernel: Lustre: 1468:0:(ldlm_lib.c:909:target_handle_connect()) lustre-MDT0000: connection from ac1367b6-25ba-a204-fafb-e63f3325899f@192.168.115.144@o2ib1 recovering/t30068876723 exp ffff8802d11b6800 cur 1327943453 last 1327943404
Jan 30 09:10:53 ehyperion-rst6 kernel: Lustre: 1468:0:(ldlm_lib.c:909:target_handle_connect()) Skipped 13 previous similar messages
Jan 30 09:11:05 ehyperion-rst6 kernel: Lustre: 1468:0:(ldlm_lib.c:2112:target_queue_recovery_request()) Next recovery transno: 30068878164, current: 30068814506, replaying
Jan 30 09:11:05 ehyperion-rst6 kernel: Lustre: 1468:0:(ldlm_lib.c:2112:target_queue_recovery_request()) Skipped 77 previous similar messages
Jan 30 09:11:18 ehyperion-rst6 kernel: Lustre: lustre-MDT0000: Recovery over after 1:09, of 106 clients 106 recovered and 0 were evicted.
Jan 30 09:11:18 ehyperion-rst6 kernel: Lustre: 1466:0:(mds_lov.c:1026:mds_notify()) MDS mdd_obd-lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0000_UUID
Jan 30 09:11:18 ehyperion-rst6 kernel: Lustre: 1466:0:(mds_lov.c:1026:mds_notify()) Skipped 21 previous similar messages
Jan 30 09:11:22 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0014_UUID now active, resetting orphans
Jan 30 09:11:22 ehyperion-rst6 kernel: Lustre: Skipped 15 previous similar messages
Jan 30 09:11:22 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0002_UUID now active, resetting orphans
Jan 30 09:11:22 ehyperion-rst6 kernel: Lustre: Skipped 51 previous similar messages
Jan 30 09:11:24 ehyperion-rst6 kernel: Lustre: 1466:0:(mdd_orphans.c:447:orph_key_test_and_del()) Found orphan! Delete it
Jan 30 09:11:24 ehyperion-rst6 kernel: Lustre: 1466:0:(mdd_orphans.c:447:orph_key_test_and_del()) Skipped 3788 previous similar messages
Jan 30 09:11:24 ehyperion-rst6 kernel: LustreError: 1466:0:(osd_handler.c:1966:osd_declare_object_destroy()) ASSERTION(!lu_object_is_dying(dt->do_lu.lo_header)) failed
Jan 30 09:11:24 ehyperion-rst6 kernel: LustreError: 1466:0:(osd_handler.c:1966:osd_declare_object_destroy()) ASSERTION(!lu_object_is_dying(dt->do_lu.lo_header)) failed
Jan 30 09:11:24 ehyperion-rst6 kernel: LustreError: 1466:0:(osd_handler.c:1966:osd_declare_object_destroy()) LBUG
Jan 30 09:11:24 ehyperion-rst6 kernel: LustreError: 1466:0:(osd_handler.c:1966:osd_declare_object_destroy()) LBUG
Jan 30 09:11:24 ehyperion-rst6 kernel: Pid: 1466, comm: tgt_recov
Jan 30 09:11:24 ehyperion-rst6 kernel:
Jan 30 09:11:24 ehyperion-rst6 kernel: Call Trace:
Jan 30 09:11:24 ehyperion-rst6 kernel: [] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] lbug_with_loc+0x75/0xe0 [libcfs]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] libcfs_assertion_failed+0x66/0x70 [libcfs]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] osd_declare_object_destroy+0x1e1/0x210 [osd_ldiskfs]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? dt_declare_ref_del+0x47/0xd0 [mdd]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] mdd_declare_object_kill+0x7b/0x110 [mdd]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] __mdd_orphan_cleanup+0x625/0xdc0 [mdd]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] mdd_recovery_complete+0x188/0x590 [mdd]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? cmm_key_init+0x59/0x190 [cmm]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] cmm_recovery_complete+0x3d/0x100 [cmm]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? lu_context_init+0xab/0x240 [obdclass]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] mdt_postrecov+0x4e/0x130 [mdt]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] mdt_obd_postrecov+0xd7/0x100 [mdt]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] target_recovery_thread+0xab2/0x1020 [ptlrpc]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? target_recovery_thread+0x0/0x1020 [ptlrpc]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] child_rip+0xa/0x20
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? target_recovery_thread+0x0/0x1020 [ptlrpc]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? target_recovery_thread+0x0/0x1020 [ptlrpc]
Jan 30 09:11:24 ehyperion-rst6 kernel: [] ? child_rip+0x0/0x20
Jan 30 09:11:24 ehyperion-rst6 kernel:
Jan 30 09:11:24 ehyperion-rst6 kernel: LustreError: dumping log to /tmp/lustre-log.1327943484.1466
Jan 30 09:14:32 ehyperion-rst6 kernel: INFO: task tgt_recov:1466 blocked for more than 120 seconds.
Jan 30 09:14:32 ehyperion-rst6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 09:14:32 ehyperion-rst6 kernel: tgt_recov D 0000000000000005 0 1466 2 0x00000000
Jan 30 09:14:32 ehyperion-rst6 kernel: ffff8802d568f9e0 0000000000000046 0000000000000000 ffffffffa03eedaa
Jan 30 09:14:32 ehyperion-rst6 kernel: ffffffffa0aaa500 ffff8802d568f9a0 ffffffffa0aaa500 ffffffffa0aab440
Jan 30 09:14:32 ehyperion-rst6 kernel: ffff88032a687ab8 ffff8802d568ffd8 000000000000f4e8 ffff88032a687ab8
Jan 30 09:14:32 ehyperion-rst6 kernel: Call Trace:
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? libcfs_run_lbug_upcall+0x8a/0x100 [libcfs]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? default_wake_function+0x0/0x20
Jan 30 09:14:32 ehyperion-rst6 kernel: [] lbug_with_loc+0xad/0xe0 [libcfs]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] libcfs_assertion_failed+0x66/0x70 [libcfs]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] osd_declare_object_destroy+0x1e1/0x210 [osd_ldiskfs]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? dt_declare_ref_del+0x47/0xd0 [mdd]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] mdd_declare_object_kill+0x7b/0x110 [mdd]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] __mdd_orphan_cleanup+0x625/0xdc0 [mdd]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] mdd_recovery_complete+0x188/0x590 [mdd]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? cmm_key_init+0x59/0x190 [cmm]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] cmm_recovery_complete+0x3d/0x100 [cmm]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? lu_context_init+0xab/0x240 [obdclass]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] mdt_postrecov+0x4e/0x130 [mdt]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] mdt_obd_postrecov+0xd7/0x100 [mdt]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] target_recovery_thread+0xab2/0x1020 [ptlrpc]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? target_recovery_thread+0x0/0x1020 [ptlrpc]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] child_rip+0xa/0x20
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? target_recovery_thread+0x0/0x1020 [ptlrpc]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? target_recovery_thread+0x0/0x1020 [ptlrpc]
Jan 30 09:14:32 ehyperion-rst6 kernel: [] ? child_rip+0x0/0x20
Jan 30 09:14:58 ehyperion-rst6 LDAPOTP-AUTH[1997]: root@hyperion244.llnl.gov as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1853 DURATION=86400 PERIOD=700");echo XXRETCODE:$?'