Details
Bug
Resolution: Fixed
Major
Lustre 2.6.0, Lustre 2.5.4
3
11111
Description
Firstly, I think there is a bug in extend_recovery_timer:
        if (to > obd->obd_recovery_time_hard)
                to = obd->obd_recovery_time_hard;
        if (obd->obd_recovery_timeout < to ||
            obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                obd->obd_recovery_timeout = to;
                cfs_timer_arm(&obd->obd_recovery_timer,
                              cfs_time_shift(drt));
        }
When "to" (the recovery timeout) is capped by obd_recovery_time_hard, the timer is still armed for (now + duration), whereas it should be armed for (recovery_start + to). I suggest the following:
        if (obd->obd_recovery_timeout < to ||
            obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                obd->obd_recovery_timeout = to;
                end = obd->obd_recovery_start + to;
                cfs_timer_arm(&obd->obd_recovery_timer, end);
        }
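To make the difference between the two deadlines concrete, here is a minimal standalone sketch in plain C. It is not Lustre code: struct recovery_model and both helper functions are hypothetical stand-ins for the obd_device fields and timer arithmetic discussed above, and all times are plain seconds.

```c
/* Hypothetical model of the fields named in this ticket (not the real obd_device). */
struct recovery_model {
	long start;      /* models obd_recovery_start, in seconds */
	long timeout;    /* models obd_recovery_timeout, in seconds */
	long time_hard;  /* models obd_recovery_time_hard, in seconds */
};

/* Deadline as armed today: now + drt, regardless of the hard limit. */
static long deadline_current(long now, long drt)
{
	return now + drt;
}

/* Deadline as proposed: recovery start plus the capped timeout. */
static long deadline_proposed(const struct recovery_model *m, long to)
{
	if (to > m->time_hard)
		to = m->time_hard;  /* same cap as in extend_recovery_timer */
	return m->start + to;
}
```

With start = 1000, time_hard = 900, now = 1800 and drt = 300, the current arithmetic arms the timer for 2100, which is 200 seconds past the hard limit at 1900. The proposed arithmetic never returns a deadline later than start + time_hard.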
But even if the problem above is fixed, recovery will still not be aborted when recovery_timeout >= time_hard.
Perhaps we should set obd_abort_recovery to 1 when recovery_time_hard is reached.
--- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
                                    int (*health_check)(struct obd_export *))
 {
 repeat:
+       if (cfs_time_current_sec() >=
+           (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
+               CWARN("recovery is aborted by hard timeout\n");
+               obd->obd_abort_recovery = 1;
+       }
+
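The check added by the patch can be modelled outside the kernel as follows. This is only a sketch under stated assumptions: struct recovery_state and check_hard_timeout are hypothetical names, the caller passes the current time in seconds, and the real overseer loop does considerably more than this.

```c
#include <stdbool.h>

/* Hypothetical model of the recovery fields touched by the patch. */
struct recovery_state {
	long start;           /* models obd_recovery_start, in seconds */
	long time_hard;       /* models obd_recovery_time_hard, in seconds */
	bool abort_recovery;  /* models obd_abort_recovery */
};

/*
 * Set the abort flag once the hard limit has been reached.
 * Returns the current value of the flag; the flag is never cleared here.
 */
static bool check_hard_timeout(struct recovery_state *rs, long now)
{
	if (now >= rs->start + rs->time_hard)
		rs->abort_recovery = true;
	return rs->abort_recovery;
}
```

Because the comparison uses >=, the flag is set at exactly start + time_hard and stays set on every later pass through the loop.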
Another problem is that server_calc_timeout overwrites obd_recovery_time_hard, so we can't use the proc interface to set recovery_time_hard.