Lustre / LU-1166

recovery never finished

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.2
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Environment: 2.1.0 + minimal backporting from 2.2
    • Severity: 3
    • Rank: 4669

    Description

      While testing, we hit a situation where recovery never finished and the recovery time exceeded the hard recovery timer.

      00010000:00080000:20.0:1330709620.108824:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
      00010000:00000400:20.0:1330709690.108951:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:20.0:1330709690.120847:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
      00010000:00000400:1.0:1330709760.120868:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:1.0:1330709760.132776:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
      00010000:00000400:1.0:1330709830.131858:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:1.0:1330709830.143745:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
      00010000:00000400:1.0:1330709900.142871:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:1.0:1330709900.154725:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 40 seconds
      00010000:00000400:1.0:1330709940.153865:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:1.0:1330709940.165727:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
      00010000:00000400:13.0:1330709940.165827:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:13.0:1330709940.177697:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
      00010000:00000400:1.0:1330709940.178088:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:1.0:1330709940.189941:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
      00010000:00000400:13.0:1330709940.190014:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:13.0:1330709940.201864:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
      00010000:00000400:1.0:1330709940.202082:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
      00010000:00080000:1.0:1330709940.213933:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
      00010000:00000400:1.0:1330709940.214821:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports

      ...

      After analyzing the logs, the hang appears to be in the wait inside the target_recovery_overseer() function when it is called with check_for_clients() as its argument.
      The hang looks like a result of using this condition:
             if (obd->obd_no_conn == 0 &&
                 obd->obd_connected_clients + obd->obd_stale_clients ==
                 obd->obd_max_recoverable_clients)

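      For an MDT, obd_no_conn is only cleared by post-recovery, once at least one OST is connected and the config llog has been processed.
      But mdt_postrecov cannot be called because recovery has not finished, so the condition above can never become true and the wait never ends.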
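
      The circular dependency described above can be sketched in a few lines of C. This is a minimal illustration, not the Lustre source: struct obd_sketch, clients_check_sketch() and postrecov_sketch() are hypothetical names, and the assumption that post-recovery is the only place obd_no_conn becomes 0 is taken from the description in this ticket.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few obd_device fields the check reads. */
struct obd_sketch {
    int obd_no_conn;                 /* non-zero until post-recovery runs (assumed) */
    int obd_connected_clients;
    int obd_stale_clients;
    int obd_max_recoverable_clients;
};

/* Same shape as the condition quoted in the description. */
static bool clients_check_sketch(const struct obd_sketch *obd)
{
    return obd->obd_no_conn == 0 &&
           obd->obd_connected_clients + obd->obd_stale_clients ==
           obd->obd_max_recoverable_clients;
}

/* Hypothetical post-recovery step: the only place the flag is cleared. */
static void postrecov_sketch(struct obd_sketch *obd)
{
    obd->obd_no_conn = 0;
}
```

      With obd_no_conn still non-zero, clients_check_sketch() returns false even when every client is accounted for (for example 3 connected + 1 stale out of 4). It only returns true after postrecov_sketch() has run, which in the reported scenario cannot happen until the wait itself completes.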

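      The second issue in this area is the reset_recovery_timer() function.
      If there is a race and reset_recovery_timer() is called at the moment recovery should finish, but before the timer has fired, we set '0' as the next timer interval, and a negative number on the next pass. In the log below the negative value is printed as an unsigned number (4294967294 is -2 as an unsigned 32-bit value):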

      00010000:00080000:1.0:1330709942.007696:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 4294967294 seconds
      00010000:00080000:1.1:1330709942.007794:0:9:0:(ldlm_lib.c:1887:target_recovery_expired()) snxs4-MDT0000: recovery timed out; 36 clients are still in recovery after 902s (49 clients connected)
      00010000:00000400:13.0:1330709942.007802:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
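
      The 4294967294 in the log above is what a remaining time of -2 seconds looks like once it is treated as an unsigned 32-bit value. The sketch below reproduces that arithmetic and shows one possible guard. Both functions are hypothetical and are not taken from ldlm_lib.c, and the clamp is only an illustration, not necessarily the fix that was applied.

```c
#include <stdint.h>

/* Naive remaining time: the signed difference is stored as unsigned,
 * so a deadline that has already passed wraps around modulo 2^32. */
static uint32_t remaining_naive(int64_t deadline, int64_t now)
{
    return (uint32_t)(deadline - now);
}

/* Guarded variant (assumption): never report less than 0 seconds. */
static uint32_t remaining_clamped(int64_t deadline, int64_t now)
{
    return deadline > now ? (uint32_t)(deadline - now) : 0;
}
```

      For a deadline two seconds in the past, remaining_naive(100, 102) yields 4294967294, matching the value in the log, while remaining_clamped(100, 102) yields 0.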

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Alexey Lyashkov (shadow)