[LU-4119] recovery time hard doesn't limit recovery duration Created: 18/Oct/13  Updated: 11/Mar/15  Resolved: 06/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.4
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Major
Reporter: Sergey Cheremencev Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-5724 IR recovery doesn't behave properly w... Closed
Related
is related to LU-5986 conf-sanity test_83, test_84: failed ... Resolved
is related to LU-6084 Tests are failed due to 'recovery is ... Resolved
Severity: 3
Rank (Obsolete): 11111

 Description   

Firstly, I think there is a bug in extend_recovery_timer:

        if (to > obd->obd_recovery_time_hard)
                to = obd->obd_recovery_time_hard;
        if (obd->obd_recovery_timeout < to ||
            obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                obd->obd_recovery_timeout = to;
                cfs_timer_arm(&obd->obd_recovery_timer,
                              cfs_time_shift(drt));
        }     

When "to"(recovery_timeout) will be limited by obd_recovery_time_hard, timer will be armed to (now+duration) whereas it must be armed to (recovery_start + to). I suppose following:

        if (obd->obd_recovery_timeout < to ||
            obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                obd->obd_recovery_timeout = to;
                end = obd->obd_recovery_start + to;
                cfs_timer_arm(&obd->obd_recovery_timer, end);

But even if upper problem will be fixed, recovery will not be aborted when recovery_timeout >= time_hard.
Possible we should set obd_abort_recovery to 1 when recovery_time_hard is reached.

--- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
                                    int (*health_check)(struct obd_export *))
 {
 repeat:
+       if (cfs_time_current_sec() >=
+           (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
+               CWARN("recovery is aborted by hard timeout\n");
+               obd->obd_abort_recovery = 1;
+       }
+

Another problem is that server_cacl_timeout rewrites obd_recovery_time_hard, so we can't use proc interface to set recovery_time_hard.



 Comments   
Comment by Sergey Cheremencev [ 18/Oct/13 ]

Upper fix is wrong: we need to check that recovery has started:

--- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1793,6 +1793,11 @@ static int target_recovery_overseer(struct obd_device *obd,
                                    int (*health_check)(struct obd_export *))
 {
 repeat:
+       if ((obd->obd_recovery_start != 0) && (cfs_time_current_sec() >=
+             (obd->obd_recovery_start + obd->obd_recovery_time_hard))) {
+               CWARN("recovery is aborted by hard timeout\n");
+               obd->obd_abort_recovery = 1;
+       }
Comment by Sergey Cheremencev [ 31/Jan/14 ]

Patch with fix
http://review.whamcloud.com/#/c/9078/

Comment by Cliff White (Inactive) [ 23/Jun/14 ]

The patch is currently failing in our testing, and there are some review comments, is an updated patch possible?

Comment by Sergey Cheremencev [ 23/Jun/14 ]

I saw Mike Pershin set +1 and wrote comments. I replied him and also asked a question. I thought we also need Andreas Dilger review to continue process.

Sorry, but i can't find logs for failed test. When i see into this link https://maloo.whamcloud.com/test_sets/811df926-da54-11e3-a2f8-52540035b04c last test name is "test_53a".
Could you please provide link to failed test ?
Sure, i will update patch when will get failed test logs. But this week i am in PTO and could do it only next week.

Comment by Cliff White (Inactive) [ 11/Jul/14 ]

There may be some issues with getting old test logs due to the change in test systems. If you can submit an updated patch, we can retest and go from there.

Comment by Cliff White (Inactive) [ 19/Sep/14 ]

Updated patch http://review.whamcloud.com/#/c/9078/

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9078/
Subject: LU-4119 ldlm: abort recovery by time_hard
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: df89c74a320278acac7466a83393af6abd99932b

Comment by Peter Jones [ 06/Jan/15 ]

Landed for 2.7

Comment by Gerrit Updater [ 13/Jan/15 ]

James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13381
Subject: LU-4119 ldlm: abort recovery by time_hard
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: ade5834f5e1a6419cea47edcede01ca2f5ea79a0

Generated at Sat Feb 10 01:39:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.