[LU-14318] Add the option to limit the overall recovery time Created: 11/Jan/21 Updated: 14/Sep/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Hongchao Zhang | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Currently, the recovery time could be extended to several hours if the recovery |
| Comments |
| Comment by Hongchao Zhang [ 14/Jan/21 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41171 |
| Comment by Hongchao Zhang [ 18/Jan/21 ] |
|
Currently, if some other MDT can't be connected during the MDT recovery, the recovery process can be extended to
static int check_for_recovery_ready(struct lu_target *lut)
{
	struct obd_device *obd = lut->lut_obd;
	unsigned int clnts = atomic_read(&obd->obd_connected_clients);

	CDEBUG(D_HA,
	       "connected %d stale %d max_recoverable_clients %d abort %d expired %d\n",
	       clnts, obd->obd_stale_clients,
	       atomic_read(&obd->obd_max_recoverable_clients),
	       obd->obd_abort_recovery, obd->obd_recovery_expired);

	if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
		LASSERT(clnts <=
			atomic_read(&obd->obd_max_recoverable_clients));
		if (clnts + obd->obd_stale_clients <
		    atomic_read(&obd->obd_max_recoverable_clients))
			return 0;
	}

	if (!obd->obd_abort_recov_mdt && lut->lut_tdtd != NULL) {
		if (!lut->lut_tdtd->tdtd_replay_ready &&
		    !obd->obd_abort_recovery && !obd->obd_stopping) {
			/*
			 * Let's extend recovery timer, in case the recovery
			 * timer expired, and some clients got evicted
			 */
			/* <--- the recovery will be extended even if the timer expired */
			extend_recovery_timer(obd, obd->obd_recovery_timeout,
					      true);
			CDEBUG(D_HA,
			       "%s update recovery is not ready, extend recovery %d\n",
			       obd->obd_name, obd->obd_recovery_timeout);
			return 0;
		}
	}

	return 1;
}
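
To make the concern concrete, the following is a minimal, self-contained sketch of how an overall cap could stop the repeated extension. It is not the patch under review and does not use the real Lustre structures: `struct recovery_state`, `hard_limit`, `elapsed`, `extensions`, and `check_recovery_ready_capped()` are all hypothetical names introduced for illustration. What should actually happen once the cap is reached (for example, aborting the MDT update recovery) is left open here; the sketch simply stops waiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the recovery state (not Lustre's). */
struct recovery_state {
	unsigned int connected_clients;
	unsigned int stale_clients;
	unsigned int max_recoverable_clients;
	bool abort_recovery;
	bool recovery_expired;
	bool update_replay_ready;	/* models tdtd_replay_ready */
	long elapsed;			/* seconds since recovery started */
	long hard_limit;		/* assumed overall cap; 0 means no cap */
	unsigned int extensions;	/* how often the window was extended */
};

/* Returns 1 when recovery may proceed, 0 when it must keep waiting. */
static int check_recovery_ready_capped(struct recovery_state *rs)
{
	if (!rs->abort_recovery && !rs->recovery_expired) {
		assert(rs->connected_clients <= rs->max_recoverable_clients);
		if (rs->connected_clients + rs->stale_clients <
		    rs->max_recoverable_clients)
			return 0;
	}

	if (!rs->update_replay_ready && !rs->abort_recovery) {
		/* Assumed policy: once the cap is hit, stop extending. */
		if (rs->hard_limit > 0 && rs->elapsed >= rs->hard_limit)
			return 1;

		/* Otherwise behave like the quoted code: extend and wait. */
		rs->extensions++;
		return 0;
	}

	return 1;
}
```

With `hard_limit` set to 300 and all clients connected but the update replay not ready, a call at `elapsed = 100` extends the window and returns 0, while a call at `elapsed = 300` returns 1 without extending again.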
|
| Comment by Andreas Dilger [ 29/Jan/21 ] |
|
Hongchao, is this patch only needed for the case when the remote MDT is not available at all, or is there some other problem like

Rather than timing out recovery for a remote MDT completely, it would probably be better to keep the recovery for that MDT pending until the MDT is available again, and then do the remote recovery when the MDTs reconnect. That might only be a small (or no) difference in the normal case when all of the MDTs are available at mount, but I think this may give a very important improvement when some MDTs are unavailable.

The big improvement would be if MDT0000 and other MDTs are available at restart time, it would complete recovery with all those MDTs quickly, and not block access to files/directories that are on available MDTs. It would allow most client access to work, and only remote/striped directories would be blocked and/or time out (allow CTRL-C for client processes). This would be better for users, if they are mostly using remote directories for subtrees of the filesystem, since only the subtrees on the missing MDTs would be inaccessible.

Eventually, having mirrored entries for ROOT/ on several/all MDTs could allow the filesystem to be accessible even if MDT0000 is unavailable, but that is definitely a separate project. |
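
The idea of finishing recovery with the reachable MDTs while leaving the others pending could be modeled roughly as below. This is only an illustrative sketch: `struct mdt_peer`, `enum peer_state`, `recover_available_peers()`, and `access_blocked()` are hypothetical names, and the real recovery code paths are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-MDT recovery state; not an existing Lustre type. */
enum peer_state { PEER_PENDING, PEER_RECOVERED };

struct mdt_peer {
	unsigned int index;	/* e.g. 0 for MDT0000 */
	bool connected;		/* reachable at restart time */
	enum peer_state state;
};

/*
 * Complete recovery with every connected peer and leave the rest
 * pending instead of blocking on them. Returns the number of peers
 * that are still pending.
 */
static size_t recover_available_peers(struct mdt_peer *peers, size_t n)
{
	size_t pending = 0;

	assert(peers != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (peers[i].connected) {
			peers[i].state = PEER_RECOVERED;
		} else {
			peers[i].state = PEER_PENDING;
			pending++;
		}
	}
	return pending;
}

/* Only operations that touch a still-pending MDT are blocked. */
static bool access_blocked(const struct mdt_peer *peers, size_t n,
			   unsigned int mdt_index)
{
	for (size_t i = 0; i < n; i++) {
		if (peers[i].index == mdt_index)
			return peers[i].state == PEER_PENDING;
	}
	return true;	/* unknown MDT: treat as unavailable */
}
```

In this model, with MDT0000 and MDT0002 connected and MDT0001 down, one peer stays pending and only accesses that resolve to MDT0001 are blocked; a later reconnect would rerun the recovery step for that peer alone.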
| Comment by Hongchao Zhang [ 01/Feb/21 ] |
|
Hi, okay, I will create another patch to allow recovery to continue if some MDTs other than MDT0000 are unavailable during recovery. |
| Comment by Gerrit Updater [ 05/Feb/21 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41424 |