Lustre / LU-4119

recovery time hard doesn't limit recovery duration


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.4
    • Severity: 3
    • Rank: 11111

    Description

      Firstly, I think there is a bug in extend_recovery_timer:

              if (to > obd->obd_recovery_time_hard)
                      to = obd->obd_recovery_time_hard;
              if (obd->obd_recovery_timeout < to ||
                  obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                      obd->obd_recovery_timeout = to;
                      cfs_timer_arm(&obd->obd_recovery_timer,
                                    cfs_time_shift(drt));
              }     
      

      When "to" (the new recovery timeout) is clamped to obd_recovery_time_hard, the timer is still armed for (now + drt), whereas it should be armed for (obd_recovery_start + to). I suggest the following:

              if (obd->obd_recovery_timeout < to ||
                  obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                      obd->obd_recovery_timeout = to;
                      /* arm relative to the recovery start, not to "now" */
                      end = obd->obd_recovery_start + to;
                      cfs_timer_arm(&obd->obd_recovery_timer, end);
              }
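
      The difference between the two deadlines can be illustrated with a small standalone model. This is only a sketch: struct recovery_model and both helper functions are hypothetical names, not Lustre code, and all times are plain seconds rather than kernel timer values.

```c
#include <stdint.h>

/* Hypothetical stand-in for the recovery fields of struct obd_device.
 * All values are plain seconds; the real code uses kernel timer APIs. */
struct recovery_model {
        int64_t start;      /* models obd_recovery_start */
        int64_t timeout;    /* models obd_recovery_timeout */
        int64_t time_hard;  /* models obd_recovery_time_hard */
};

/* Deadline as armed today: relative to "now", so every extension
 * can push the end of recovery further out. */
static int64_t deadline_current(int64_t now, int64_t drt)
{
        return now + drt;
}

/* Deadline as proposed: anchored to the recovery start, so it can
 * never move past start + time_hard. */
static int64_t deadline_proposed(const struct recovery_model *r, int64_t to)
{
        if (to > r->time_hard)
                to = r->time_hard;
        return r->start + to;
}
```

      For example, with start = 1000 and time_hard = 900, an extension requested at now = 1800 with drt = 300 arms the current timer for 2100, which is 200 seconds past the hard limit of 1900. The anchored variant yields 1900.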
      

      But even if the problem above is fixed, recovery will not be aborted when obd_recovery_timeout >= obd_recovery_time_hard.
      Perhaps we should set obd_abort_recovery to 1 when recovery_time_hard is reached:

      --- a/lustre/ldlm/ldlm_lib.c
      +++ b/lustre/ldlm/ldlm_lib.c
      @@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
                                          int (*health_check)(struct obd_export *))
       {
       repeat:
      +       if (cfs_time_current_sec() >=
      +           (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
      +               CWARN("recovery is aborted by hard timeout\n");
      +               obd->obd_abort_recovery = 1;
      +       }
      +
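
      The condition added by the patch can be read in isolation as a simple predicate. The sketch below uses a hypothetical helper name and plain integer seconds; in the patch itself the current time comes from cfs_time_current_sec() and the other two values from the obd device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the check in the patch: true once the
 * current time has reached recovery start plus the hard limit. */
static bool hard_timeout_reached(int64_t now, int64_t recovery_start,
                                 int64_t time_hard)
{
        return now >= recovery_start + time_hard;
}
```

      Because the comparison is >=, the abort flag would be set on the first pass through the overseer loop at or after the hard deadline, independent of how often the soft timeout has been extended.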
      

      Another problem is that server_calc_timeout overwrites obd_recovery_time_hard, so the proc interface cannot be used to set recovery_time_hard.

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: scherementsev Sergey Cheremencev