[LU-6084] Tests are failed due to 'recovery is aborted by hard timeout' Created: 06/Jan/15 Updated: 11/Mar/15 Resolved: 09/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: | lustre-master build #2770 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 16931 |
| Description |
|
Many recovery tests have started to fail because of an unexpected recovery abort due to the hard timeout. This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/722432b2-80fa-11e4-9c9a-5254006e85c2. The sub-test test_4k failed with the following error: onyx-35vm1.onyx.hpdd.intel.com evicted

MDS dmesg:

Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0000_UUID (at 10.2.4.141@tcp), waiting for all 6 known clients (0 recovered, 5 in progress, and 0 evicted) to recover in 0:25
Lustre: Skipped 90 previous similar messages
INFO: task tgt_recov:2119 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tgt_recov D 0000000000000000 0 2119 2 0x00000080
ffff88006fb2fda0 0000000000000046 0000000000000000 ffff880002316880
ffff88006fb2fd10 ffffffff81030b59 ffff88006fb2fd20 ffffffff810554f8
ffff88006faad058 ffff88006fb2ffd8 000000000000fbc8 ffff88006faad058
Call Trace:
[<ffffffff81030b59>] ? native_smp_send_reschedule+0x49/0x60
[<ffffffff810554f8>] ? resched_task+0x68/0x80
[<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa080d9c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
[<ffffffffa080ef2d>] target_recovery_overseer+0xad/0x2d0 [ptlrpc]
[<ffffffffa080d610>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
[<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
[<ffffffffa0815f34>] target_recovery_thread+0x6e4/0x1a20 [ptlrpc]
[<ffffffff81061d12>] ? default_wake_function+0x12/0x20
[<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: 2119:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 2119:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 2119:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted: req@ffff880079bf6980 x1487142925659804/t0(38654705688) o36->c0baea22-119d-b8af-1550-c0592a66b0c4@10.2.4.138@tcp:277/0 lens 520/0 e 0 to 0 dl 1418252677 ref 1 fl Complete:/4/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Recovery over after 3:00, of 6 clients 0 recovered and 6 were evicted.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-vbr test_4k: @@@@@@ FAIL: onyx-35vm1.onyx.hpdd.intel.com evicted
|
| Comments |
| Comment by Oleg Drokin [ 06/Jan/15 ] |
|
So the backtrace has nothing to do with this bug. It's just a known harmless message due to recovery taking kind of long. |
| Comment by Jodi Levi (Inactive) [ 13/Jan/15 ] |
|
Mike, |
| Comment by Mikhail Pershin [ 14/Jan/15 ] |
|
15:08:55:Lustre: 2135:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout

That means the test does not work as expected. The test (like many others in replay-vbr) removes one client during recovery, simulating a missing client. It expects recovery to time out softly and continue with VBR and the remaining clients, then checks the result - whether or not there was an eviction. But what we see here is that recovery was simply aborted by the hard timeout, so all clients are evicted regardless of whether they could have proceeded with recovery. Therefore the replay-vbr tests are broken in that sense; we must set the hard timeout to a bigger value to let recovery happen as expected. I also suspect that we have the same situation as with the replay-dual tests and the patch for that issue. I know Yu Jian made related changes to replay-dual; the same should be done for replay-vbr, along with disabling the hard timeout or raising it to a bigger value. |
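For clarity, here is a minimal sketch of the two timeout paths the replay-vbr tests depend on. This is not the actual ldlm_lib.c code; the names and structure are assumed for illustration only. The soft timeout evicts only the stale, never-reconnected exports and lets recovery continue with VBR for the rest, while the hard timeout aborts recovery entirely.

```c
/* Illustrative only: these names do not exist in Lustre. */
#include <stdbool.h>
#include <time.h>

struct recovery_state {
	time_t start;       /* when recovery began */
	time_t soft_limit;  /* per-phase (soft) timeout in seconds */
	time_t hard_limit;  /* hard recovery timeout in seconds */
	int    stale;       /* clients that never reconnected */
};

/* Returns true when recovery must be aborted outright. */
bool recovery_check_timeouts(struct recovery_state *rs, time_t now)
{
	if (now - rs->start > rs->hard_limit)
		return true;    /* hard timeout: abort, evict everyone */

	if (now - rs->start > rs->soft_limit && rs->stale > 0)
		rs->stale = 0;  /* soft timeout: drop only stale exports,
				 * continue with VBR for the others */
	return false;
}
```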
| Comment by Gerrit Updater [ 18/Jan/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13447 |
| Comment by Mikhail Pershin [ 19/Jan/15 ] |
|
The reason is commit df89c74a320278acac7466a83393af6abd99932b. The commit itself is correct, but the extend_recovery_timer() code originally limited the recovery timeout to recovery_time_hard. Previously that was not a problem, because recovery was not aborted by the hard timeout, and the timer could be extended several times during recovery even with the hard timeout limit. The patch changes the extend_recovery_timer() logic to limit each extension to (hard timeout / 3), so the timer can expire three times during a single recovery without reaching the hard limit, and recovery can finish softly with VBR. One thing to think about: of these three recovery steps, the first is the most important and is expected to last the longest because of client reconnections; the other two, request replay and lock replay, usually complete quite fast, and the timer is needed there only for the case where some already connected clients have died. Therefore we might use a (hard_timeout / 2) limit instead, considering that half of the recovery time can be spent waiting for reconnections and the other half on recovery completion. That would reduce evictions of clients that miss the connection stage. |
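As a rough illustration of the clamping described above (a sketch under assumed names, not the real extend_recovery_timer()), each extension of the recovery timer is capped at time_hard / 3, so the three recovery phases together stay below the hard limit; a divisor of 2 would give the connection phase a larger share.

```c
#include <stdio.h>
#include <time.h>

/* Illustrative structure; field names are assumed for this sketch. */
struct recovery_timer {
	time_t start;      /* recovery start time */
	time_t end;        /* current expiration time */
	time_t time_hard;  /* hard recovery limit in seconds */
};

/*
 * Extend the timer by "extra" seconds, but cap a single extension at
 * time_hard / divisor and never move the deadline past the hard limit.
 * With divisor == 3, three soft expirations (connection, request replay,
 * lock replay) fit under the hard budget.
 */
void extend_recovery_timer_sketch(struct recovery_timer *rt, time_t extra,
				  int divisor)
{
	time_t max_step = rt->time_hard / divisor;

	if (extra > max_step)
		extra = max_step;
	rt->end += extra;
	if (rt->end > rt->start + rt->time_hard)
		rt->end = rt->start + rt->time_hard;
}

int main(void)
{
	struct recovery_timer rt = { .start = 0, .end = 0, .time_hard = 900 };

	/* Three generous extension requests still leave the deadline at or
	 * below the 900 s hard limit when capped at time_hard / 3. */
	for (int i = 0; i < 3; i++)
		extend_recovery_timer_sketch(&rt, 600, 3);
	printf("deadline after 3 extensions: %ld s (hard limit 900 s)\n",
	       (long)rt.end);
	return 0;
}
```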
| Comment by James A Simmons [ 20/Jan/15 ] |
|
Will this be needed along with |
| Comment by Mikhail Pershin [ 21/Jan/15 ] |
|
James, yes, I think so. |
| Comment by Gerrit Updater [ 24/Jan/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13520 |
| Comment by Mikhail Pershin [ 24/Jan/15 ] |
|
The second patch covers another issue that causes the request timeout to grow, and the recovery time to grow as a result. Usually the server does not return the request processing time to the client for recovery requests, so as not to pollute the AT (adaptive timeouts) history with wrong data. But besides the normal reply there can be an early reply, and it passes the server processing time to the client as well, causing the same issue. |
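The idea behind the fix, sketched with assumed names (this is not the actual ptlrpc code): whether the server is building a normal reply or an early reply, it should not report a service time for a request that is being replayed during recovery, so the client's adaptive-timeout history is not fed recovery-inflated values.

```c
#include <stdbool.h>

/* Illustrative reply descriptor; fields are assumed for this sketch. */
struct reply_sketch {
	bool is_replay;             /* request replayed during recovery? */
	unsigned int service_time;  /* seconds reported back to the client */
};

/*
 * Called for both normal and early replies in this sketch.  Recovery
 * (replay) requests report no service time, so the client has nothing
 * to add to its AT history for them.
 */
void set_reply_service_time(struct reply_sketch *rep, unsigned int measured)
{
	rep->service_time = rep->is_replay ? 0 : measured;
}
```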
| Comment by Mikhail Pershin [ 27/Jan/15 ] |
|
The first patch is abandoned; in fact the check for recovery_time_hard was correct, and the problem is not there. The second patch has been updated and should resolve the issue. It prevents the AT stats from being updated with data from recovery requests on both the server and the client. |
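On the client side the same guard can be sketched like this (again with assumed names, not the actual client code): replies that belong to replayed requests, or that carry no service time, simply do not update the adaptive-timeout statistics.

```c
#include <stdbool.h>

/* Minimal stand-in for an adaptive-timeout history entry. */
struct at_entry_sketch {
	unsigned int worst;  /* largest service time seen recently */
};

/* Skip AT updates for recovery traffic; keep a simple max otherwise. */
void at_update_from_reply(struct at_entry_sketch *at, bool req_was_replay,
			  unsigned int reported_service_time)
{
	if (req_was_replay || reported_service_time == 0)
		return;
	if (reported_service_time > at->worst)
		at->worst = reported_service_time;
}
```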
| Comment by Gerrit Updater [ 09/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13520/ |
| Comment by Peter Jones [ 09/Feb/15 ] |
|
Landed for 2.7 |
| Comment by Gerrit Updater [ 09/Feb/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13685 |