    Description

      A production MDS in recovery seemed to restart the recovery timer more often than we would expect. Here is a paraphrased transcript of the console log:

      Apr 25 12:21:28 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 5:00
      Apr 25 12:21:28 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 5:00
      Apr 25 12:36:03 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:03 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 193 clients in recovery for 0:24
      (repeats once)                                                                 
      Apr 25 12:36:29 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:29 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:30 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:32 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:36:33 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      Apr 25 12:36:33 roc-mds1 kernel: Lustre: ls4-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 15:00
      (repeats about ten times over ten minutes)                                     
      Apr 25 12:51:33 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:51:34 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:51:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 12:51:35 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      Apr 25 12:55:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 11:15
      Apr 25 13:05:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 1:15
      Apr 25 13:06:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:36 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:37 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:39 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:06:39 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      Apr 25 13:15:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 6:19
      Apr 25 13:21:39 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:21:40 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
      Apr 25 13:21:41 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
      

      This raises several questions.

      • Is it unexpected to have multiple "Starting recovery timer" messages, or is this just normal extension of recovery timeout due to new client connections?
      • Should it be possible to start the recovery timer twice in the same second, as at 12:21:28?
      • How many times should "recovery is timed out, evict stale exports" appear? (Looking at target_recovery_overseer(), it seems it should be at most twice.)
      • In check_and_start_recovery_timer(), is the cfs_timer_is_armed() check sufficient to avoid race conditions, or does it also need to check the obd_device recovery flags? (See the sketch of the suspected race below.)
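
      To make the race question concrete, here is a minimal userspace sketch. It is not the Lustre code: the names timer_armed, obd_lock, racy_start() and locked_start() are hypothetical stand-ins. It only models the suspected window in check_and_start_recovery_timer(): if two service threads test "is the timer armed?" before either one arms it, both print the start message, which would match the duplicate 12:21:28 lines.

      /*
       * Illustrative sketch only -- NOT the Lustre code.  Models a
       * check-then-arm race on a shared flag; build with -pthread.
       */
      #include <pthread.h>
      #include <sched.h>
      #include <stdbool.h>
      #include <stdio.h>

      static bool timer_armed;        /* stands in for the timer state */
      static pthread_mutex_t obd_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Racy variant: the check and the arming are not one atomic step. */
      static void *racy_start(void *arg)
      {
              if (!timer_armed) {             /* both threads may see "false"... */
                      sched_yield();          /* widen the window for the demo   */
                      timer_armed = true;     /* ...and both "start" the timer   */
                      printf("thread %ld: Starting recovery timer\n", (long)arg);
              }
              return NULL;
      }

      /* Safer variant: check and state change share one critical section. */
      static void *locked_start(void *arg)
      {
              pthread_mutex_lock(&obd_lock);
              if (!timer_armed) {
                      timer_armed = true;
                      printf("thread %ld: Starting recovery timer (locked)\n",
                             (long)arg);
              }
              pthread_mutex_unlock(&obd_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t t[2];
              long i;

              /* The race may or may not trigger on any given run. */
              for (i = 0; i < 2; i++)
                      pthread_create(&t[i], NULL, racy_start, (void *)i);
              for (i = 0; i < 2; i++)
                      pthread_join(t[i], NULL);

              timer_armed = false;            /* reset, then the locked version */
              for (i = 0; i < 2; i++)
                      pthread_create(&t[i], NULL, locked_start, (void *)i);
              for (i = 0; i < 2; i++)
                      pthread_join(t[i], NULL);
              return 0;
      }

      This is only an analogy for the check-then-arm pattern; presumably the in-kernel equivalent would be to perform the cfs_timer_is_armed() test and the arming (and/or the check of the obd_device recovery flags) under a single lock, or to re-check after taking it, rather than relying on the timer state alone.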

      Let's also think about how to improve the code comments, manual, and log messages to make the recovery process more transparent.

        Activity

          [LU-1352] spurious recovery timer resets

          Andreas Dilger added a comment - Closing old 2.1 bug. Problem was fixed in 2.2 and fix was backported to 2.1.

          Bruno Faccini added a comment - We also hit the same issue at the TGCC site. What is not clear to me is whether the http://review.whamcloud.com/#change,2719 patch is for b2_1 only and just a backport of the new recovery timer algorithms as implemented in b2_3/master.

          Christopher Morrone added a comment - If you have a second patch that is needed, please submit it to gerrit now as well. We can give it some testing now on our end.

          Hongchao Zhang added a comment - status update: the first patch is tracked at http://review.whamcloud.com/#change,2719, and the second patch builds on the first one; I will attach it once the first is merged.

          Ned Bass added a comment - Hi Hongchao, that approach sounds reasonable to me. Thanks.

          Hongchao Zhang added a comment - this patch is part of the patches landed on master, and it is mainly based on the implementation of IR (Imperative Recovery). The patches http://review.whamcloud.com/1620 and http://review.whamcloud.com/1722 fix the problem in http://review.whamcloud.com/1292, while http://review.whamcloud.com/2674 is a separate patch. How about creating two patches for the backport?

          Christopher Morrone added a comment - I think that when porting patches to another branch the patches need to remain separate. The commit messages should be mostly the same as well (especially the bug reference number and first line summary). It will be very difficult to make sense of what is in a branch using "git log --oneline" if we squash multiple commits into one on another branch.

          Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#change,2719

          Hongchao Zhang added a comment - status update: the patch is under test; there are still some problems in it, and I hope to attach it before the weekend.

          Hongchao Zhang added a comment - status update:

          the related tickets are:
          ORNL-28: http://review.whamcloud.com/1292, http://review.whamcloud.com/1293, http://review.whamcloud.com/1620
          LU-889: http://review.whamcloud.com/1722

          the pending patch will try to backport these patches to b2_1

          People

            Hongchao Zhang
            Ned Bass