Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4119

recovery time hard doesn't limit recovery duration

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0, Lustre 2.5.4
    • Lustre 2.6.0, Lustre 2.5.4
    • 3
    • 11111

    Description

      Firstly, I think there is a bug in extend_recovery_timer:

              if (to > obd->obd_recovery_time_hard)
                      to = obd->obd_recovery_time_hard;
              if (obd->obd_recovery_timeout < to ||
                  obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                      obd->obd_recovery_timeout = to;
                      cfs_timer_arm(&obd->obd_recovery_timer,
                                    cfs_time_shift(drt));
              }     
      

      When "to"(recovery_timeout) will be limited by obd_recovery_time_hard, timer will be armed to (now+duration) whereas it must be armed to (recovery_start + to). I suppose following:

              if (obd->obd_recovery_timeout < to ||
                  obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                      obd->obd_recovery_timeout = to;
                      end = obd->obd_recovery_start + to;
                      cfs_timer_arm(&obd->obd_recovery_timer, end);
      

      But even if upper problem will be fixed, recovery will not be aborted when recovery_timeout >= time_hard.
      Possible we should set obd_abort_recovery to 1 when recovery_time_hard is reached.

      --- a/lustre/ldlm/ldlm_lib.c
      +++ b/lustre/ldlm/ldlm_lib.c
      @@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
                                          int (*health_check)(struct obd_export *))
       {
       repeat:
      +       if (cfs_time_current_sec() >=
      +           (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
      +               CWARN("recovery is aborted by hard timeout\n");
      +               obd->obd_abort_recovery = 1;
      +       }
      +
      

      Another problem is that server_cacl_timeout rewrites obd_recovery_time_hard, so we can't use proc interface to set recovery_time_hard.

      Attachments

        Issue Links

          Activity

            [LU-4119] recovery time hard doesn't limit recovery duration
            yujian Jian Yu made changes -
            Fix Version/s New: Lustre 2.5.4 [ 11190 ]
            Labels Original: 22pl mq414 patch New: patch
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-8 [ JFC-8 ]
            yujian Jian Yu made changes -
            Link New: This issue is duplicated by LU-5724 [ LU-5724 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LU-6084 [ LU-6084 ]
            yujian Jian Yu made changes -
            Affects Version/s New: Lustre 2.5.4 [ 11190 ]
            Labels Original: patch New: 22pl mq414 patch

            James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13381
            Subject: LU-4119 ldlm: abort recovery by time_hard
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: ade5834f5e1a6419cea47edcede01ca2f5ea79a0

            gerrit Gerrit Updater added a comment - James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13381 Subject: LU-4119 ldlm: abort recovery by time_hard Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: ade5834f5e1a6419cea47edcede01ca2f5ea79a0
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.7.0 [ 10631 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.7

            pjones Peter Jones added a comment - Landed for 2.7
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-5986 [ LU-5986 ]

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9078/
            Subject: LU-4119 ldlm: abort recovery by time_hard
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: df89c74a320278acac7466a83393af6abd99932b

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9078/ Subject: LU-4119 ldlm: abort recovery by time_hard Project: fs/lustre-release Branch: master Current Patch Set: Commit: df89c74a320278acac7466a83393af6abd99932b

            People

              bogl Bob Glossman (Inactive)
              scherementsev Sergey Cheremencev
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: