Details
Bug
Resolution: Fixed
Critical
None
Lustre 2.7.0
None
3
9223372036854775807
Description
After upgrading Lustre from 2.5.1 to 2.7.61 on snx11117, clients cannot mount (the client mount hangs) because recovery never ends:
[423225.578209] Lustre: snx11117-MDT0000: Denying connection for new client 3ec3f6c4-e172-39d7-383c-d4c19737f54c(at 10.9.100.9@o2ib3), waiting for 2 known clients (2 recovered, 0 in progress, and 0 evicted) to recover in 21188498:05
[423225.601237] Lustre: Skipped 41 previous similar messages
It seems "LU-3540 lod: update recovery thread" broke recovery_time_hard functionality.
check_for_recovery_ready causes an endless loop in target_recovery_overseer when the tdtd_replay_ready flag is not set:
static int check_for_recovery_ready(struct lu_target *lut)
...
	if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
		LASSERT(clnts <= obd->obd_max_recoverable_clients);
		if (clnts + obd->obd_stale_clients <
		    obd->obd_max_recoverable_clients)
			return 0;
	}
	if (lut->lut_tdtd != NULL) {
		if (!lut->lut_tdtd->tdtd_replay_ready) {
			/* Let's extend recovery timer, in case the recovery
			 * timer expired, and some clients got evicted */
			extend_recovery_timer(obd, obd->obd_recovery_timeout,
					      true);
			return 0;
		} else {
			dtrq_list_dump(lut->lut_tdtd, D_HA);
		}
	}
check_for_recovery_ready does not return 1 even though all clients have already connected and recovery has expired:
00010000:00080000:0.0:1450170133.405945:0:243397:0:(ldlm_lib.c:2081:check_for_recovery_ready()) connected 2 stale 0 max_recoverable_clients 2 abort 0 expired 1
Because the tdtd_replay_ready flag is not set, check_for_recovery_ready returns 0 and tries to extend the recovery timer (without success):
00010000:00080000:0.0:1450170133.405947:0:243397:0:(ldlm_lib.c:1745:extend_recovery_timer()) snx11117-MDT0000: recovery timer will expire in 4294905278 seconds
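A side note on the number in that log line: 4294905278 sits just below 2^32, which is what a small negative count of seconds looks like when it is printed as an unsigned 32-bit value. The sketch below only does that arithmetic. The helper name seconds_below_wrap is hypothetical, and whether the timer value really wrapped this way is an assumption, not something the log confirms.

```c
#include <stdint.h>

/* Hypothetical helper: how far a logged unsigned 32-bit value lies
 * below 2^32, i.e. the magnitude of the negative number it would
 * represent if the value had wrapped around. */
static int64_t seconds_below_wrap(uint32_t logged)
{
	return (INT64_C(1) << 32) - (int64_t)logged;
}
```

For the logged value, seconds_below_wrap(4294905278u) gives 62018, so under that assumption the timer would have been about 17 hours in the past rather than 136 years in the future.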
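IMO the lines below break the previous logic of target_recovery_overseer and recovery_time_hard. When recovery has been aborted or has expired, this block is skipped instead of returning 1, so the result depends on the tdtd_replay_ready check that follows: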
	if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
		LASSERT(clnts <= obd->obd_max_recoverable_clients);
		if (clnts + obd->obd_stale_clients <
		    obd->obd_max_recoverable_clients)
			return 0;
	}
Compare this with check_for_clients, used before LU-3540, which returns 1 as soon as recovery is aborted or has expired:
static int check_for_clients(struct obd_device *obd)
{
	unsigned int clnts = atomic_read(&obd->obd_connected_clients);

	if (obd->obd_abort_recovery || obd->obd_recovery_expired)
		return 1;
	LASSERT(clnts <= obd->obd_max_recoverable_clients);
	return (clnts + obd->obd_stale_clients ==
		obd->obd_max_recoverable_clients);
}
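To make the difference concrete, here is a minimal standalone sketch that replays the state from the debug log (connected 2, stale 0, max_recoverable_clients 2, abort 0, expired 1) through both checks. It is not Lustre code: struct mock_target, old_check, new_check and logged_state are hypothetical stand-ins, the timer extension is left out, and the final return 1 in new_check is an assumption about the part of check_for_recovery_ready that is elided above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the obd_device / lu_target fields that the
 * two checks read; these are not the real Lustre definitions. */
struct mock_target {
	unsigned int connected_clients;
	unsigned int stale_clients;
	unsigned int max_recoverable_clients;
	bool abort_recovery;
	bool recovery_expired;
	bool has_tdtd;          /* models lut->lut_tdtd != NULL */
	bool tdtd_replay_ready; /* models lut->lut_tdtd->tdtd_replay_ready */
};

/* Pre-LU-3540 logic, modelled on check_for_clients(). */
static int old_check(const struct mock_target *t)
{
	if (t->abort_recovery || t->recovery_expired)
		return 1;
	return t->connected_clients + t->stale_clients ==
	       t->max_recoverable_clients;
}

/* Post-LU-3540 logic, modelled on the quoted part of
 * check_for_recovery_ready(). */
static int new_check(const struct mock_target *t)
{
	if (!t->abort_recovery && !t->recovery_expired) {
		if (t->connected_clients + t->stale_clients <
		    t->max_recoverable_clients)
			return 0;
	}
	if (t->has_tdtd && !t->tdtd_replay_ready)
		return 0; /* the real code also extends the timer here */
	return 1; /* assumption: the elided tail reports "ready" */
}

/* State taken from the check_for_recovery_ready() debug line, with
 * tdtd_replay_ready unset as described in this report. */
static struct mock_target logged_state(void)
{
	struct mock_target t = {
		.connected_clients = 2,
		.stale_clients = 0,
		.max_recoverable_clients = 2,
		.abort_recovery = false,
		.recovery_expired = true,
		.has_tdtd = true,
		.tdtd_replay_ready = false,
	};
	return t;
}
```

With this state old_check returns 1 while new_check returns 0, and new_check keeps returning 0 on every pass until tdtd_replay_ready is set, regardless of recovery_expired. That matches the endless loop described in this report.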