Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4119

recovery time hard doesn't limit recovery duration

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.7.0, Lustre 2.5.4
    • Lustre 2.6.0, Lustre 2.5.4
    • 3
    • 11111

    Description

      Firstly, I think there is a bug in extend_recovery_timer:

              if (to > obd->obd_recovery_time_hard)
                      to = obd->obd_recovery_time_hard;
              if (obd->obd_recovery_timeout < to ||
                  obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                      obd->obd_recovery_timeout = to;
                      cfs_timer_arm(&obd->obd_recovery_timer,
                                    cfs_time_shift(drt));
              }     
      

      When "to"(recovery_timeout) will be limited by obd_recovery_time_hard, timer will be armed to (now+duration) whereas it must be armed to (recovery_start + to). I suppose following:

              if (obd->obd_recovery_timeout < to ||
                  obd->obd_recovery_timeout == obd->obd_recovery_time_hard) {
                      obd->obd_recovery_timeout = to;
                      end = obd->obd_recovery_start + to;
                      cfs_timer_arm(&obd->obd_recovery_timer, end);
      

      But even if upper problem will be fixed, recovery will not be aborted when recovery_timeout >= time_hard.
      Possible we should set obd_abort_recovery to 1 when recovery_time_hard is reached.

      --- a/lustre/ldlm/ldlm_lib.c
      +++ b/lustre/ldlm/ldlm_lib.c
      @@ -1793,6 +1793,12 @@ static int target_recovery_overseer(struct obd_device *obd,
                                          int (*health_check)(struct obd_export *))
       {
       repeat:
      +       if (cfs_time_current_sec() >=
      +           (obd->obd_recovery_start + obd->obd_recovery_time_hard)) {
      +               CWARN("recovery is aborted by hard timeout\n");
      +               obd->obd_abort_recovery = 1;
      +       }
      +
      

      Another problem is that server_cacl_timeout rewrites obd_recovery_time_hard, so we can't use proc interface to set recovery_time_hard.

      Attachments

        Issue Links

          Activity

            [LU-4119] recovery time hard doesn't limit recovery duration

            James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13381
            Subject: LU-4119 ldlm: abort recovery by time_hard
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: ade5834f5e1a6419cea47edcede01ca2f5ea79a0

            gerrit Gerrit Updater added a comment - James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13381 Subject: LU-4119 ldlm: abort recovery by time_hard Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: ade5834f5e1a6419cea47edcede01ca2f5ea79a0
            pjones Peter Jones added a comment -

            Landed for 2.7

            pjones Peter Jones added a comment - Landed for 2.7

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9078/
            Subject: LU-4119 ldlm: abort recovery by time_hard
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: df89c74a320278acac7466a83393af6abd99932b

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9078/ Subject: LU-4119 ldlm: abort recovery by time_hard Project: fs/lustre-release Branch: master Current Patch Set: Commit: df89c74a320278acac7466a83393af6abd99932b
            cliffw Cliff White (Inactive) added a comment - Updated patch http://review.whamcloud.com/#/c/9078/

            There may be some issues with getting old test logs due to the change in test systems. If you can submit an updated patch, we can retest and go from there.

            cliffw Cliff White (Inactive) added a comment - There may be some issues with getting old test logs due to the change in test systems. If you can submit an updated patch, we can retest and go from there.

            I saw Mike Pershin set +1 and wrote comments. I replied him and also asked a question. I thought we also need Andreas Dilger review to continue process.

            Sorry, but i can't find logs for failed test. When i see into this link https://maloo.whamcloud.com/test_sets/811df926-da54-11e3-a2f8-52540035b04c last test name is "test_53a".
            Could you please provide link to failed test ?
            Sure, i will update patch when will get failed test logs. But this week i am in PTO and could do it only next week.

            scherementsev Sergey Cheremencev added a comment - I saw Mike Pershin set +1 and wrote comments. I replied him and also asked a question. I thought we also need Andreas Dilger review to continue process. Sorry, but i can't find logs for failed test. When i see into this link https://maloo.whamcloud.com/test_sets/811df926-da54-11e3-a2f8-52540035b04c last test name is "test_53a". Could you please provide link to failed test ? Sure, i will update patch when will get failed test logs. But this week i am in PTO and could do it only next week.

            The patch is currently failing in our testing, and there are some review comments, is an updated patch possible?

            cliffw Cliff White (Inactive) added a comment - The patch is currently failing in our testing, and there are some review comments, is an updated patch possible?
            scherementsev Sergey Cheremencev added a comment - Patch with fix http://review.whamcloud.com/#/c/9078/

            Upper fix is wrong: we need to check that recovery has started:

            --- a/lustre/ldlm/ldlm_lib.c
            +++ b/lustre/ldlm/ldlm_lib.c
            @@ -1793,6 +1793,11 @@ static int target_recovery_overseer(struct obd_device *obd,
                                                int (*health_check)(struct obd_export *))
             {
             repeat:
            +       if ((obd->obd_recovery_start != 0) && (cfs_time_current_sec() >=
            +             (obd->obd_recovery_start + obd->obd_recovery_time_hard))) {
            +               CWARN("recovery is aborted by hard timeout\n");
            +               obd->obd_abort_recovery = 1;
            +       }
            scherementsev Sergey Cheremencev added a comment - Upper fix is wrong: we need to check that recovery has started: --- a/lustre/ldlm/ldlm_lib.c +++ b/lustre/ldlm/ldlm_lib.c @@ -1793,6 +1793,11 @@ static int target_recovery_overseer(struct obd_device *obd, int (*health_check)(struct obd_export *)) { repeat: + if ((obd->obd_recovery_start != 0) && (cfs_time_current_sec() >= + (obd->obd_recovery_start + obd->obd_recovery_time_hard))) { + CWARN("recovery is aborted by hard timeout\n"); + obd->obd_abort_recovery = 1; + }

            People

              bogl Bob Glossman (Inactive)
              scherementsev Sergey Cheremencev
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: