[LU-1352] spurious recovery timer resets Created: 27/Apr/12 Updated: 03/Apr/16 Resolved: 03/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ned Bass | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | |
| Severity: | 3 |
| Rank (Obsolete): | 4031 |
| Description |
|
A production MDS in recovery seemed to restart the recovery timer more often than we would expect. Here is a paraphrased transcript of the console log:

Apr 25 12:21:28 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 5:00
Apr 25 12:21:28 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 5:00
Apr 25 12:36:03 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:36:03 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 193 clients in recovery for 0:24 (repeats once)
Apr 25 12:36:29 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:36:29 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:36:30 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:36:32 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:36:33 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
Apr 25 12:36:33 roc-mds1 kernel: Lustre: ls4-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 15:00 (repeats about ten times over ten minutes)
Apr 25 12:51:33 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:51:34 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:51:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 12:51:35 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
Apr 25 12:55:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 11:15
Apr 25 13:05:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 1:15
Apr 25 13:06:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:06:35 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:06:36 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:06:37 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:06:39 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:06:39 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00
Apr 25 13:15:20 roc-mds1 kernel: Lustre: ls5-MDT0000: Denying connection for new client ..., waiting for 100 clients in recovery for 6:19
Apr 25 13:21:39 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:21:40 roc-mds1 kernel: Lustre: ls5-MDT0000: recovery is timed out, evict stale exports
Apr 25 13:21:41 roc-mds1 kernel: Lustre: ls5-MDT0000: Starting recovery timer for 15:00

This raises several questions:

- Is it unexpected to have multiple "Starting recovery timer" messages, or is this just normal extension of the recovery timeout due to new client connections?
- Should it be possible to start the recovery timer twice in the same second, as at 12:21:28?
- How many times should "recovery is timed out, evict stale exports" appear? (Looking at target_recovery_overseer(), it seems it should be at most twice.)
- In check_and_start_recovery_timer(), is the cfs_timer_is_armed() check sufficient to avoid race conditions, or does it need to also check the obd_device recovery flags?
Let's also think about how to improve the code comments, manual, and log messages to make the recovery process more transparent. |
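On the log-message point, here is a minimal sketch of one way the console line could distinguish a fresh timer start from an extension. This is illustrative only, not the actual Lustre code: the "%s: Starting recovery timer for %d:%.02d" format string is the one quoted in the comments below, while the print_recovery_timer_msg() helper and its extend flag are assumptions made for this example.

```c
/*
 * Illustrative sketch only -- not the Lustre implementation.  A transcript
 * like the one above would be easier to read if the console line said
 * whether the timer was freshly started or merely extended.  The helper
 * name and the 'extend' flag are hypothetical.
 */
#include <stdio.h>

static void print_recovery_timer_msg(const char *obd_name, int timeout_sec,
                                     int extend)
{
	/* mm:ss formatting matches the existing "%d:%.02d" console message */
	printf("%s: %s recovery timer for %d:%.02d\n", obd_name,
	       extend ? "Extending" : "Starting",
	       timeout_sec / 60, timeout_sec % 60);
}

int main(void)
{
	print_recovery_timer_msg("ls5-MDT0000", 300, 0);  /* fresh start:  5:00 */
	print_recovery_timer_msg("ls5-MDT0000", 900, 1);  /* extension:   15:00 */
	return 0;
}
```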
| Comments |
| Comment by Peter Jones [ 29/Apr/12 ] |
|
Hongchao

Could you please look into this one? Thanks

Peter |
| Comment by Hongchao Zhang [ 02/May/12 ] |
|
In 2.1.1-4chaos there could be a bug in reset_recovery_timer() around the "if (extend && (duration > left))" check: when the recovery timer expires, this function is called by target_recovery_overseer() (in our master branch this has already been fixed).

- Is it unexpected to have multiple "Starting recovery timer" messages, or is this just normal extension of the recovery timeout due to new client connections?
It is not an extension of the recovery timer; there is a chance for target_handle_connect()/target_start_and_reset_recovery_timer() to restart it.

- Should it be possible to start the recovery timer twice in the same second, as at 12:21:28?
There is a race in check_and_start_recovery_timer() (in our master branch it has been fixed), around:
LCONSOLE_INFO("%s: Starting recovery timer for %d:%.02d\n",

- How many times should "recovery is timed out, evict stale exports" appear? (Looking at target_recovery_overseer(), it seems it should be at most twice.)
It will continue to appear as long as there are still healthy recovering clients.

- In check_and_start_recovery_timer(), is the cfs_timer_is_armed() check sufficient to avoid race conditions, or does it need to also check the obd_device recovery flags?
Yes, it needs to also check them; the cfs_timer_is_armed() check alone isn't sufficient. |
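To make the race just described concrete, here is a minimal standalone sketch. It is not the real check_and_start_recovery_timer()/cfs_timer_is_armed() code; the timer_armed flag, obd_lock mutex, recovering flag, and connect_thread() function are all hypothetical stand-ins. The point is that two connect threads which only test an "armed" flag before starting the timer can both see it unarmed and both print "Starting recovery timer", matching the duplicate 12:21:28 lines; re-checking the armed state and the recovery state under a lock closes the window.

```c
/*
 * Minimal race sketch -- NOT the Lustre code.  It mimics the pattern where
 * the timer-start path only asks "is the timer already armed?" before
 * arming it.  Two connect threads can both see "not armed" and both start
 * the timer, producing two "Starting recovery timer" lines in the same
 * second.  All names below are hypothetical stand-ins.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int timer_armed;                        /* stand-in for cfs_timer_is_armed() */
pthread_mutex_t obd_lock = PTHREAD_MUTEX_INITIALIZER;
int recovering = 1;                     /* stand-in for the obd recovery flag */

static void *connect_thread(void *arg)
{
	(void)arg;

	/* Racy version, as in the 2.1.1-era path: check without a lock. */
	if (!timer_armed) {
		usleep(1000);           /* widen the race window for the demo */
		printf("Lustre: ls5-MDT0000: Starting recovery timer for 5:00\n");
		timer_armed = 1;
	}

	/*
	 * Fixed version: re-check the armed state *and* the recovery flag
	 * while holding the lock, so only one thread can win:
	 *
	 *	pthread_mutex_lock(&obd_lock);
	 *	if (!timer_armed && recovering) {
	 *		timer_armed = 1;
	 *		printf("Lustre: ...: Starting recovery timer for 5:00\n");
	 *	}
	 *	pthread_mutex_unlock(&obd_lock);
	 */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, connect_thread, NULL);
	pthread_create(&t2, NULL, connect_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
```

Built with gcc -pthread, the racy version will often print the message twice; moving the check-and-set under the lock, as in the commented-out variant, makes it print at most once.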
| Comment by Hongchao Zhang [ 08/May/12 ] |
|
status update: the related tickets are: the pending patch will try to backport these patches to b2_1 |
| Comment by Hongchao Zhang [ 10/May/12 ] |
|
status update: |
| Comment by Hongchao Zhang [ 11/May/12 ] |
|
the patch is tracked at http://review.whamcloud.com/#change,2719 |
| Comment by Christopher Morrone [ 11/May/12 ] |
|
I think that when porting patches to another branch the patches need to remain separate. The commit messages should be mostly the same as well (especially the bug reference number and first line summary). It will be very difficult to make sense of what is in a branch using "git log --oneline" if we squash multiple commits into one on another branch. |
| Comment by Hongchao Zhang [ 15/May/12 ] |
|
This patch is part of the patches landed on master, and it is mainly based on the implementation of IR (Imperative Recovery). |
| Comment by Ned Bass [ 13/Jun/12 ] |
|
Hi Hongchao,

That approach sounds reasonable to me. Thanks |
| Comment by Hongchao Zhang [ 03/Jul/12 ] |
|
status update: |
| Comment by Christopher Morrone [ 10/Jul/12 ] |
|
If you have a second patch that is needed, please submit it to gerrit now as well. We can give it some testing now on our end. |
| Comment by Bruno Faccini (Inactive) [ 30/Oct/12 ] |
|
We also experienced the same issue at the TGCC site. What is not clear to me is whether the "http://review.whamcloud.com/#change,2719" patch is b2_1-only and just a backport of the new recovery timer algorithms as implemented in b2_3/master. |
| Comment by Andreas Dilger [ 03/Apr/16 ] |
|
Closing old 2.1 bug. Problem was fixed in 2.2 and fix was backported to 2.1. |