[LU-6084] Tests are failed due to 'recovery is aborted by hard timeout' Created: 06/Jan/15  Updated: 11/Mar/15  Resolved: 09/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: HB
Environment:

lustre-master build #2770


Issue Links:
Related
is related to LU-4119 recovery time hard doesn't limit reco... Resolved
is related to LU-6057 replay-dual test_9 failed - post-fail... Resolved
is related to LU-6238 replay-dual test 10 fails with "FAIL:... Resolved
Severity: 3
Rank (Obsolete): 16931

 Description   

Many recovery tests have started to fail because of an unexpected recovery abort due to the hard timeout.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/722432b2-80fa-11e4-9c9a-5254006e85c2.

The sub-test test_4k failed with the following error:

onyx-35vm1.onyx.hpdd.intel.com evicted

MDS dmesg

Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0000_UUID (at 10.2.4.141@tcp), waiting for all 6 known clients (0 recovered, 5 in progress, and 0 evicted) to recover in 0:25
Lustre: Skipped 90 previous similar messages
INFO: task tgt_recov:2119 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tgt_recov     D 0000000000000000     0  2119      2 0x00000080
 ffff88006fb2fda0 0000000000000046 0000000000000000 ffff880002316880
 ffff88006fb2fd10 ffffffff81030b59 ffff88006fb2fd20 ffffffff810554f8
 ffff88006faad058 ffff88006fb2ffd8 000000000000fbc8 ffff88006faad058
Call Trace:
 [<ffffffff81030b59>] ? native_smp_send_reschedule+0x49/0x60
 [<ffffffff810554f8>] ? resched_task+0x68/0x80
 [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa080d9c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
 [<ffffffffa080ef2d>] target_recovery_overseer+0xad/0x2d0 [ptlrpc]
 [<ffffffffa080d610>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
 [<ffffffffa0815f34>] target_recovery_thread+0x6e4/0x1a20 [ptlrpc]
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Lustre: lustre-MDT0000: disconnecting 1 stale clients
Lustre: 2119:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
Lustre: 2119:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 2119:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880079bf6980 x1487142925659804/t0(38654705688) o36->c0baea22-119d-b8af-1550-c0592a66b0c4@10.2.4.138@tcp:277/0 lens 520/0 e 0 to 0 dl 1418252677 ref 1 fl Complete:/4/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Recovery over after 3:00, of 6 clients 0 recovered and 6 were evicted.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_4k: @@@@@@ FAIL: onyx-35vm1.onyx.hpdd.intel.com evicted 


 Comments   
Comment by Oleg Drokin [ 06/Jan/15 ]

So the backtrace has nothing to do with this bug. It's just a known harmless message due to recovery taking kind of long.

Comment by Jodi Levi (Inactive) [ 13/Jan/15 ]

Mike,
Could you please comment on this one?
Thank you!

Comment by Mikhail Pershin [ 14/Jan/15 ]

15:08:55:Lustre: 2135:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
15:08:55:Lustre: 2135:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery

That means the test doesn't work as expected. The test (like many others in replay-vbr) removes one client during recovery, simulating a missing client. It expects recovery to time out and then continue with VBR and the existing clients, checking the result: was there an eviction or not. But we see here that recovery was simply aborted by the hard timeout, so all clients are evicted regardless of whether they could have proceeded with recovery.

Therefore the replay-vbr tests are broken in that sense; we must set the hard timeout to a bigger value to let recovery happen as expected. I also suspect we have the same situation as with the replay-dual tests and the patch for LU-5079, which may cause the recovery timeout to be extended when recoveries happen back to back several times, as they do in replay-vbr.sh.

I know Yu Jian made related changes to replay-dual; the same should be done for replay-vbr, along with disabling the hard timeout or raising it to a bigger value.

Comment by Gerrit Updater [ 18/Jan/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13447
Subject: LU-6084 recovery: use soft timeout to limit recovery timer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8447952e0056294e24fa1da98c6dbb0d9e149324

Comment by Mikhail Pershin [ 19/Jan/15 ]

The root cause is commit df89c74a320278acac7466a83393af6abd99932b from LU-4119 ("ldlm: abort recovery by time_hard").

That commit itself is correct, but the extend_recovery_timer() code originally capped the recovery timeout at time_hard. Previously that wasn't a problem, because recovery was never aborted by the hard timeout and the timer could be extended several times during recovery even with that cap.
After that patch, the whole recovery is limited by the hard timeout and then aborted. That means a single timer extension up to the hard timeout can consume the entire recovery time just waiting for missing clients, after which recovery is hard-aborted. Meanwhile, the timer is set at least three times during recovery, while:
1. waiting for clients to connect
2. waiting for clients to replay requests
3. waiting for lock replays

The patch changes the extend_recovery_timer() logic to cap each extension at (hard timeout / 3), so the timer can expire three times during a single recovery without reaching the hard limit, and recovery can finish softly with VBR.

There is one thing to consider: of these three recovery steps, the first is the most important and is expected to last the longest, because of client reconnections; the other two, request replay and lock replay, usually complete quite quickly, and the timer is needed there only for the case where an already-connected client has died. Therefore we might use a (hard_timeout / 2) limit instead, on the basis that half of the recovery time can be spent waiting for reconnections and the other half on recovery completion. That would reduce evictions of clients that miss the connection stage.

Comment by James A Simmons [ 20/Jan/15 ]

Will this be needed along with LU-4119 for b2_5?

Comment by Mikhail Pershin [ 21/Jan/15 ]

James, yes, I think so.

Comment by Gerrit Updater [ 24/Jan/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13520
Subject: LU-6084 ptlrpc: don't remember server stat from early reply
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d69461cb7aca7400f19bae8a244f473aaa3d3760

Comment by Mikhail Pershin [ 24/Jan/15 ]

The second patch covers another issue, one that causes the request timeout, and as a result the recovery time, to grow. Normally the server does not return the request processing time to the client for recovery requests, so as not to pollute the AT (adaptive timeout) history with bogus data. But besides the normal reply there can also be an early reply, which passes the server processing time to the client as well, causing the same issue.

Comment by Mikhail Pershin [ 27/Jan/15 ]

The first patch has been abandoned; the check for recovery_time_hard was in fact correct, and the problem is not there. The second patch has been updated and should resolve the issue. It prevents the AT stats from being updated with data from requests in recovery, on both the server and the client.

Comment by Gerrit Updater [ 09/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13520/
Subject: LU-6084 ptlrpc: prevent request timeout grow due to recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 84f813bf639a7d078e19a3cf41f7c06a3824caa9

Comment by Peter Jones [ 09/Feb/15 ]

Landed for 2.7

Comment by Gerrit Updater [ 09/Feb/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13685
Subject: LU-6084 ptlrpc: prevent request timeout grow due to recovery
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 2b1cb3b7dfa56fe1a67508b1113ee64453b5bab3

Generated at Sat Feb 10 01:57:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.