Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.7.0
-
lustre-master build #2770
-
3
-
16931
Description
Many recovery tests have started to fail because recovery is unexpectedly aborted by the hard timeout.
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/722432b2-80fa-11e4-9c9a-5254006e85c2.
The sub-test test_4k failed with the following error:
onyx-35vm1.onyx.hpdd.intel.com evicted
MDS dmesg
Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0000_UUID (at 10.2.4.141@tcp), waiting for all 6 known clients (0 recovered, 5 in progress, and 0 evicted) to recover in 0:25
Lustre: Skipped 90 previous similar messages
INFO: task tgt_recov:2119 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tgt_recov D 0000000000000000 0 2119 2 0x00000080
ffff88006fb2fda0 0000000000000046 0000000000000000 ffff880002316880
ffff88006fb2fd10 ffffffff81030b59 ffff88006fb2fd20 ffffffff810554f8
ffff88006faad058 ffff88006fb2ffd8 000000000000fbc8 ffff88006faad058
Call Trace:
[<ffffffff81030b59>] ? native_smp_send_reschedule+0x49/0x60
[<ffffffff810554f8>] ? resched_task+0x68/0x80
[<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa080d9c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
[<ffffffffa080ef2d>] target_recovery_overseer+0xad/0x2d0 [ptlrpc]
[<ffffffffa080d610>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
[<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
[<ffffffffa0815f34>] target_recovery_thread+0x6e4/0x1a20 [ptlrpc]
[<ffffffff81061d12>] ? default_wake_function+0x12/0x20
[<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: 2119:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 2119:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 2119:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted: req@ffff880079bf6980 x1487142925659804/t0(38654705688) o36->c0baea22-119d-b8af-1550-c0592a66b0c4@10.2.4.138@tcp:277/0 lens 520/0 e 0 to 0 dl 1418252677 ref 1 fl Complete:/4/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Recovery over after 3:00, of 6 clients 0 recovered and 6 were evicted.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-vbr test_4k: @@@@@@ FAIL: onyx-35vm1.onyx.hpdd.intel.com evicted