Details
-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
Lustre 2.9.0
-
Soak test cluster, tip of lustre-master lustre: 2.8.59_79_gb8811a0
-
3
-
9223372036854775807
Description
After previous failover, MDT0003 is failed back to lola-11:
Nov 8 22:10:14 lola-10 kernel: Lustre: server umount soaked-MDT0003 complete
MDS failover fault induced, lola-11 fails MDT0003 to lola-10:
016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO Failing over soaked-MDT0003 ... 2016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0003 on lola-10 ...
Recovery starts on MDS: Nov 8 22:25:29 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Task block on lola-10:
ov 8 22:29:17 lola-10 kernel: INFO: task mdt_out01_007:6305 blocked for more than 120 seconds. Nov 8 22:29:17 lola-10 kernel: Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1 Nov 8 22:29:17 lola-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 8 22:29:17 lola-10 kernel: mdt_out01_007 D 0000000000000018 0 6305 2 0x00000080 Nov 8 22:29:17 lola-10 kernel: ffff8808104b7b40 0000000000000046 0000000000000000 ffff8808104b7d00 Nov 8 22:29:17 lola-10 kernel: ffff8804337ea800 ffff880824300118 000014206fe1f7c1 ffff880824300118 Nov 8 22:29:17 lola-10 kernel: ffff8808104b7b40 00000001014d3a26 ffff880834add068 ffff8808104b7fd8 Nov 8 22:29:17 lola-10 kernel: Call Trace: Nov 8 22:29:17 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass] Nov 8 22:29:17 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Nov 8 22:29:17 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass] Nov 8 22:29:17 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Nov 8 22:29:17 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Nov 8 22:29:17 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs] Nov 8 22:29:17 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50 Nov 8 22:29:17 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Nov 8 22:29:17 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Nov 8 22:29:17 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Nov 8 22:29:17 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Nov 8 22:29:17 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Nov 8 22:29:18 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Second blocked task:
Nov 8 22:29:32 lola-10 kernel: Call Trace: Nov 8 22:29:32 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass] Nov 8 22:29:32 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20 Nov 8 22:29:32 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass] Nov 8 22:29:32 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80 Nov 8 22:29:32 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90 Nov 8 22:29:32 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs] Nov 8 22:29:32 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50 Nov 8 22:29:32 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0 Nov 8 22:29:32 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] Nov 8 22:29:32 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0 Nov 8 22:29:32 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Nov 8 22:29:32 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0 Nov 8 22:29:32 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Recovery timer competes, but recovery does not:
mdt.soaked-MDT0003.recovery_status= status: RECOVERING recovery_start: 1478672734 time_remaining: 0 connected_clients: 21/21 req_replay_clients: 4 lock_repay_clients: 5 completed_clients: 16 evicted_clients: 0 replayed_requests: 7 queued_requests: 4 next_transno: 77309522798
System finally notices:
Nov 9 08:22:35 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 588:42, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Attached lustre-log dumps from lola-10