Details
Bug
Resolution: Fixed
Major
Lustre 2.6.0, Lustre 2.5.4
3
11111
Description
Firstly, I think there is a bug in extend_recovery_timer:
    if (to > obd->obd_recovery_time_hard)
            to = obd->obd_recovery_time_hard;
    if (obd->obd_recovery_timeout < to ||
        obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
            obd->obd_recovery_timeout = to;
            cfs_timer_arm(&obd->obd_recovery_timer,
                          cfs_time_shift(drt));
    }
When "to" (the recovery timeout) is clamped to obd_recovery_time_hard, the timer is armed for (now + duration), whereas it should be armed for (recovery_start + to). I suggest the following:
    if (obd->obd_recovery_timeout < to ||
        obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
            obd->obd_recovery_timeout = to;
            end = obd->obd_recovery_start + to;
            cfs_timer_arm(&obd->obd_recovery_timer, end);
    }
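To make the difference between the two deadlines concrete, here is a minimal standalone sketch in plain C. It is not Lustre code: struct recovery_model and both helper functions are hypothetical stand-ins for the relevant obd_device fields, all times are plain seconds, and the timer itself is reduced to the deadline value it would be armed with.

```c
#include <stdio.h>

/* Hypothetical stand-in for the obd_device fields discussed in this ticket. */
struct recovery_model {
	long start;      /* models obd_recovery_start, in seconds */
	long timeout;    /* models obd_recovery_timeout, in seconds */
	long time_hard;  /* models obd_recovery_time_hard, in seconds */
};

/*
 * Models the current logic: after clamping "to", the timer is re-armed
 * relative to "now", so the deadline can move past start + time_hard.
 * Returns the new deadline, or -1 if the timer is not re-armed.
 */
static long deadline_current(struct recovery_model *r, long now, long to,
			     long drt)
{
	if (to > r->time_hard)
		to = r->time_hard;
	if (r->timeout < to || r->timeout == r->time_hard) {
		r->timeout = to;
		return now + drt;
	}
	return -1;
}

/*
 * Models the proposed logic: the deadline is anchored at the recovery
 * start, so it can never exceed start + time_hard.
 */
static long deadline_proposed(struct recovery_model *r, long to)
{
	if (to > r->time_hard)
		to = r->time_hard;
	if (r->timeout < to || r->timeout == r->time_hard) {
		r->timeout = to;
		return r->start + to;
	}
	return -1;
}
```

With an assumed recovery start of 1000, a hard limit of 900 and a timeout that has already reached the hard limit, a request at now = 1890 with drt = 300 gives a deadline of 2190 under the current logic and 1900 under the proposed one.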
But even if the problem above is fixed, recovery will not be aborted when recovery_timeout >= time_hard.
Possibly we should set obd_abort_recovery to 1 when recovery_time_hard is reached.
--- a/lustre/ldlm/ldlm_lib.c
+++ b/lustre/ldlm/ldlm_lib.c
@@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
                                     int (*health_check)(struct obd_export *))
 {
 repeat:
+	if (cfs_time_current_sec() >=
+	    (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
+		CWARN("recovery is aborted by hard timeout\n");
+		obd->obd_abort_recovery = 1;
+	}
+
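The check in the patch can be exercised in isolation with a small model. This is only a sketch under stated assumptions: struct recovery_state and check_hard_timeout are hypothetical names, the current time is passed in as a parameter rather than read from cfs_time_current_sec(), and the CWARN call is omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the obd_device fields used by the check. */
struct recovery_state {
	long start;          /* models obd_recovery_start, in seconds */
	long time_hard;      /* models obd_recovery_time_hard, in seconds */
	int  abort_recovery; /* models obd_abort_recovery */
};

/*
 * Sets the abort flag once "now" reaches start + time_hard.
 * Returns true if the hard timeout has been reached.
 */
static bool check_hard_timeout(struct recovery_state *r, long now)
{
	if (now >= r->start + r->time_hard) {
		r->abort_recovery = 1;
		return true;
	}
	return false;
}
```

Because the comparison is >=, the flag is set exactly at start + time_hard and stays clear one second earlier.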
Another problem is that server_calc_timeout overwrites obd_recovery_time_hard, so we can't use the proc interface to set recovery_time_hard.