[LU-4119] recovery time hard doesn't limit recovery duration Created: 18/Oct/13 Updated: 11/Mar/15 Resolved: 06/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.4 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergey Cheremencev | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 11111 | ||||||||||||||||||||
| Description |
|
Firstly, I think there is a bug in extend_recovery_timer: if (to > obd->obd_recovery_time_hard)
to = obd->obd_recovery_time_hard;
if (obd->obd_recovery_timeout < to ||
obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
obd->obd_recovery_timeout = to;
cfs_timer_arm(&obd->obd_recovery_timer,
cfs_time_shift(drt));
}
When "to"(recovery_timeout) will be limited by obd_recovery_time_hard, timer will be armed to (now+duration) whereas it must be armed to (recovery_start + to). I suppose following: if (obd->obd_recovery_timeout < to ||
obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
obd->obd_recovery_timeout = to;
end = obd->obd_recovery_start + to;
cfs_timer_arm(&obd->obd_recovery_timer, end);
But even if upper problem will be fixed, recovery will not be aborted when recovery_timeout >= time_hard. --- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
int (*health_check)(struct obd_export *))
{
repeat:
+ if (cfs_time_current_sec() >=
+ (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
+ CWARN("recovery is aborted by hard timeout\n");
+ obd->obd_abort_recovery = 1;
+ }
+
Another problem is that server_cacl_timeout rewrites obd_recovery_time_hard, so we can't use proc interface to set recovery_time_hard. |
| Comments |
| Comment by Sergey Cheremencev [ 18/Oct/13 ] |
|
Upper fix is wrong: we need to check that recovery has started: --- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1793,6 +1793,11 @@ static int target_recovery_overseer(struct obd_device *obd,
int (*health_check)(struct obd_export *))
{
repeat:
+ if ((obd->obd_recovery_start != 0) && (cfs_time_current_sec() >=
+ (obd->obd_recovery_start + obd->obd_recovery_time_hard))) {
+ CWARN("recovery is aborted by hard timeout\n");
+ obd->obd_abort_recovery = 1;
+ }
|
| Comment by Sergey Cheremencev [ 31/Jan/14 ] |
|
Patch with fix |
| Comment by Cliff White (Inactive) [ 23/Jun/14 ] |
|
The patch is currently failing in our testing, and there are some review comments, is an updated patch possible? |
| Comment by Sergey Cheremencev [ 23/Jun/14 ] |
|
I saw Mike Pershin set +1 and wrote comments. I replied him and also asked a question. I thought we also need Andreas Dilger review to continue process. Sorry, but i can't find logs for failed test. When i see into this link https://maloo.whamcloud.com/test_sets/811df926-da54-11e3-a2f8-52540035b04c last test name is "test_53a". |
| Comment by Cliff White (Inactive) [ 11/Jul/14 ] |
|
There may be some issues with getting old test logs due to the change in test systems. If you can submit an updated patch, we can retest and go from there. |
| Comment by Cliff White (Inactive) [ 19/Sep/14 ] |
|
Updated patch http://review.whamcloud.com/#/c/9078/ |
| Comment by Gerrit Updater [ 04/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9078/ |
| Comment by Peter Jones [ 06/Jan/15 ] |
|
Landed for 2.7 |
| Comment by Gerrit Updater [ 13/Jan/15 ] |
|
James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13381 |