  1. Lustre
  2. LU-1352

spurious recovery timer resets

    Details

    • Severity:
      3
    • Rank (Obsolete):
      4031

      Description

      A production MDS in recovery seemed to restart the recovery timer more often than we would expect. Here is a paraphrased transcript of the console log:

      Apr 25 12:21:28 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 5:00
      Apr 25 12:21:28 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 5:00
      Apr 25 12:36:03 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:03 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 193 clients in recovery for 0:24
      (repeats once)                                                                 
      Apr 25 12:36:29 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:29 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:30 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:32 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:33 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      Apr 25 12:36:33 roc-mds1 kernel: Lustre: ls4-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 15:00
      (repeats about ten times over ten minutes)                                     
      Apr 25 12:51:33 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:51:34 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:51:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:51:35 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      Apr 25 12:55:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 11:15
      Apr 25 13:05:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 1:15
      Apr 25 13:06:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:36 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:37 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:39 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:39 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      Apr 25 13:15:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 6:19
      Apr 25 13:21:39 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:21:40 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:21:41 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
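
      As a cross-check on the transcript above, the gap between each "Starting recovery timer" message and the first following "recovery is timed out" message can be compared with the announced duration. The sketch below is only timestamp arithmetic; the helper names hms_to_seconds() and timer_overrun() are made up for illustration and are not part of Lustre.

```c
#include <assert.h>

/* Seconds since midnight for an HH:MM:SS console timestamp. */
static int hms_to_seconds(int h, int m, int s)
{
    return h * 3600 + m * 60 + s;
}

/*
 * How much longer (in seconds) a timer ran than it was announced to run.
 *   start     - time of the "Starting recovery timer" message
 *   announced - duration printed in that message, in seconds
 *   expired   - time of the first following "recovery is timed out" message
 * Assumes both timestamps fall on the same day.
 */
static int timer_overrun(int start, int announced, int expired)
{
    assert(expired >= start);
    return (expired - start) - announced;
}
```

      Applied to the transcript, the timer announced as 5:00 at 12:21:28 did not report a timeout until 12:36:03, an overrun of 575 seconds (14:35 elapsed). The 15:00 timers started at 12:36:33, 12:51:35, and 13:06:39 each expired exactly 15:00 later and were then restarted.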
      

      This raises several questions.

      • Is it unexpected to have multiple "Starting recovery timer" messages, or is this just the normal extension of the recovery timeout due to new client connections?
      • Should it be possible to start the recovery timer twice in the same second, as at 12:21:28?
      • How many times should "recovery is timed out, evict stale exports" appear? (Looking at target_recovery_overseer(), it seems it should be at most twice.)
      • In check_and_start_recovery_timer(), is the cfs_timer_is_armed() check sufficient to avoid race conditions, or does it also need to check the obd_device recovery flags?
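
      The last question is about a check-then-act pattern: if "is the timer armed?" is tested separately from arming it, two callers could both see "not armed" and both start the timer. The following is a minimal sketch of one way to close that window, not Lustre's actual code. The struct recovery_state, its fields, and try_start_recovery_timer() are hypothetical stand-ins, and a userspace pthread mutex is assumed in place of whatever locking the kernel code uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the target's recovery state (not obd_device). */
struct recovery_state {
    pthread_mutex_t lock;
    bool recovering;   /* recovery window is still open */
    bool timer_armed;  /* a recovery timer is currently pending */
    int  starts;       /* number of times the timer has been started */
};

/*
 * Start the timer only if recovery is in progress and no timer is pending.
 * The checks and the update are done under one lock, so two concurrent
 * callers cannot both observe "not armed" and both start the timer.
 * Returns true if this call started the timer.
 */
static bool try_start_recovery_timer(struct recovery_state *rs)
{
    bool started = false;

    assert(rs != NULL);
    pthread_mutex_lock(&rs->lock);
    if (rs->recovering && !rs->timer_armed) {
        rs->timer_armed = true;
        rs->starts++;
        started = true;
    }
    pthread_mutex_unlock(&rs->lock);
    return started;
}
```

      In this sketch a second call while the timer is pending is a no-op, and a call after recovery has ended never arms the timer. Whether the real function needs the equivalent of the recovering check is exactly what this ticket asks.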

      Let's also think about how to improve the code comments, manual, and log messages to make the recovery process more transparent.

            People

            • Assignee:
              hongchao.zhang Hongchao Zhang
              Reporter:
              nedbass Ned Bass