  1. Lustre
  2. LU-7588

endless recovery on lustre 2.7

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.7.0
    • None
    • 3

    Description

      After upgrading Lustre from 2.5.1 to 2.7.61 on snx11117, clients cannot mount the filesystem (the client mount hangs) because recovery on the MDT never completes:

      [423225.578209] Lustre: snx11117-MDT0000: Denying connection for new client 3ec3f6c4-e172-39d7-383c-d4c19737f54c(at 10.9.100.9@o2ib3), waiting for 2 known clients (2 recovered, 0 in progress, and 0 evicted) to recover in 21188498:05
      [423225.601237] Lustre: Skipped 41 previous similar messages

      It seems that "LU-3540 lod: update recovery thread" broke the recovery_time_hard functionality.
      When the tdtd_replay_ready flag is not set, check_for_recovery_ready keeps returning 0, which causes an endless loop in target_recovery_overseer:

      static int check_for_recovery_ready(struct lu_target *lut)
      ...
              if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
                      LASSERT(clnts <= obd->obd_max_recoverable_clients);
                      if (clnts + obd->obd_stale_clients <
                          obd->obd_max_recoverable_clients)
                              return 0;
              }

              if (lut->lut_tdtd != NULL) {
                      if (!lut->lut_tdtd->tdtd_replay_ready) {
                              /* Let's extend recovery timer, in case the recovery
                               * timer expired, and some clients got evicted */
                              extend_recovery_timer(obd, obd->obd_recovery_timeout,
                                                    true);
                              return 0;
                      } else {
                              dtrq_list_dump(lut->lut_tdtd, D_HA);
                      }
              }
      

      check_for_recovery_ready does not return 1 even though all clients have already connected and the recovery timer has expired:

      00010000:00080000:0.0:1450170133.405945:0:243397:0:(ldlm_lib.c:2081:check_for_recovery_ready()) connected 2 stale 0 max_recoverable_clients 2 abort 0 expired 1
      

      Because the tdtd_replay_ready flag is not set, check_for_recovery_ready returns 0 and tries to extend the recovery timer (without success):

      00010000:00080000:0.0:1450170133.405947:0:243397:0:(ldlm_lib.c:1745:extend_recovery_timer()) snx11117-MDT0000: recovery timer will expire in 4294905278 seconds
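
      The value 4294905278 in the line above looks like a negative number of seconds printed as an unsigned 32-bit integer. The sketch below only checks that arithmetic. It assumes (this is not confirmed by the logs) that the remaining time is computed as a signed difference and then printed as unsigned; the helper names are hypothetical and are not Lustre functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reinterpret a signed number of seconds as an
 * unsigned 32-bit value (conversion is modulo 2^32). */
static uint32_t as_unsigned_seconds(int64_t seconds_left)
{
	return (uint32_t)seconds_left;
}

/* Hypothetical helper: how far below zero a wrapped value is,
 * that is 2^32 - value. */
static int64_t wrapped_deficit(uint32_t value)
{
	return (int64_t)UINT32_MAX + 1 - (int64_t)value;
}

/* Self-check against the number seen in the debug log. */
static void check_logged_value(void)
{
	assert(wrapped_deficit(4294905278u) == 62018);
	assert(as_unsigned_seconds(-62018) == 4294905278u);
}
```

      Under that assumption the timer deadline is 62018 seconds (roughly 17 hours) in the past, which would be consistent with the timer having expired long ago and the extension having no effect.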

      In my opinion, the lines below break the previous logic of target_recovery_overseer and recovery_time_hard:

              if (!obd->obd_abort_recovery && !obd->obd_recovery_expired) {
                      LASSERT(clnts <= obd->obd_max_recoverable_clients);
                      if (clnts + obd->obd_stale_clients <
                          obd->obd_max_recoverable_clients)
                              return 0;
              }

      Compare this with check_for_clients, which was used before LU-3540 and returns 1 as soon as recovery is aborted or expired:

      static int check_for_clients(struct obd_device *obd)
      {
             unsigned int clnts = atomic_read(&obd->obd_connected_clients);
      
             if (obd->obd_abort_recovery || obd->obd_recovery_expired)
                     return 1;
             LASSERT(clnts <= obd->obd_max_recoverable_clients);
             return (clnts + obd->obd_stale_clients ==
                     obd->obd_max_recoverable_clients);
      }
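
      To make the difference concrete, here is a minimal standalone model of the two predicates. It is a sketch, not the Lustre code: struct mock_target, old_check and new_check are hypothetical names, the timer extension and dtrq_list_dump call are left out, and the final "return 1" in new_check is an assumption because the quoted excerpt of check_for_recovery_ready is truncated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the obd_device / lu_target fields involved. */
struct mock_target {
	unsigned int connected_clients;
	unsigned int stale_clients;
	unsigned int max_recoverable_clients;
	bool abort_recovery;
	bool recovery_expired;
	bool has_tdtd;           /* models lut->lut_tdtd != NULL */
	bool tdtd_replay_ready;
};

/* Modelled on check_for_clients() (before LU-3540). */
static int old_check(const struct mock_target *t)
{
	if (t->abort_recovery || t->recovery_expired)
		return 1;
	assert(t->connected_clients <= t->max_recoverable_clients);
	return t->connected_clients + t->stale_clients ==
	       t->max_recoverable_clients;
}

/* Modelled on the quoted part of check_for_recovery_ready(). */
static int new_check(const struct mock_target *t)
{
	if (!t->abort_recovery && !t->recovery_expired) {
		assert(t->connected_clients <= t->max_recoverable_clients);
		if (t->connected_clients + t->stale_clients <
		    t->max_recoverable_clients)
			return 0;
	}

	/* The real code also extends the recovery timer on this path. */
	if (t->has_tdtd && !t->tdtd_replay_ready)
		return 0;

	return 1; /* assumed: the excerpt does not show the tail */
}
```

      With the state from the debug log (connected 2, stale 0, max_recoverable_clients 2, abort 0, expired 1) and tdtd_replay_ready unset, old_check returns 1 while new_check returns 0, so an expired recovery timer no longer ends the wait.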
      

          People

            di.wang Di Wang
            scherementsev Sergey Cheremencev