[LU-3844] Double recovery period in 1.8.9 after OSS failure Created: 27/Aug/13 Updated: 28/Aug/13 Resolved: 28/Aug/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Blake Caldwell | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
kernel 2.6.18-348.3.1.el5, rhel5.9, distribution-provided ofed, gni behind o2iblnd routers |
||
| Severity: | 3 |
| Rank (Obsolete): | 9949 |
| Description |
|
We frequently observe that the recovery process repeats itself after the recovery timer expires. Often it reaches the first timer expiration because a client has died, so it then goes through the whole recovery period a second time. In this example, recovery took 60 min rather than the better case of 30 min. Does this fall under the case implied by the wording 'Will be in recovery for at least 30:00'? During the failure several OSSs had to be rebooted. In an earlier case, only the MDS failed, and recovery finished in less than 30 min because all clients reconnected. I'm sure there are cases where recovery takes just 30 min, but now, when it goes to 30 min, it mostly also goes on to 60 min.
obd_timeout is set to 600
ldlm timeout is set to 200 |
| Comments |
| Comment by James Nunez (Inactive) [ 27/Aug/13 ] |
|
Hongchao, Would you please comment on this one? Thanks, |
| Comment by Hongchao Zhang [ 28/Aug/13 ] |
|
Hi Blake,

the extra recovery period is caused by VBR (version-based recovery). In target_recovery_check_and_stop, after the first recovery period (3*obd_timeout = 1800s = 30m) has expired, obd_device->obd_version_recov is set and the recovery timer is reset, so recovery proceeds for another period with version checking:

int target_recovery_check_and_stop(struct obd_device *obd)
{
        int abort_recovery = 0;

        if (obd->obd_stopping || !obd->obd_recovering)
                return 1;

        spin_lock_bh(&obd->obd_processing_task_lock);
        abort_recovery = obd->obd_abort_recovery;
        obd->obd_abort_recovery = 0;
        spin_unlock_bh(&obd->obd_processing_task_lock);
        if (!abort_recovery)
                return 0;

        /** check if fs version-capable */
        if (target_fs_version_capable(obd)) {
                class_handle_stale_exports(obd);
        } else {
                CWARN("Versions are not supported by ldiskfs, VBR is OFF\n");
                class_disconnect_stale_exports(obd, exp_flags_from_obd(obd));
        }

        /* VBR: no clients are remained to replay, stop recovery */
        spin_lock_bh(&obd->obd_processing_task_lock);
        if (obd->obd_recovering && obd->obd_recoverable_clients == 0) {
                spin_unlock_bh(&obd->obd_processing_task_lock);
                target_stop_recovery(obd, 0);
                return 1;
        }

        /* always check versions now */
        obd->obd_version_recov = 1;
        cfs_waitq_signal(&obd->obd_next_transno_waitq);
        spin_unlock_bh(&obd->obd_processing_task_lock);

        /* reset timer, recovery will proceed with versions now */
        reset_recovery_timer(obd, OBD_RECOVERY_TIME_SOFT, 1);
        return 0;
}
|
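For reference, the timing described in the comment above can be sketched as a small standalone model. This is not Lustre code: the function names are hypothetical, and it assumes that the window restarted through OBD_RECOVERY_TIME_SOFT is also 3*obd_timeout, which would match the 60 min total reported in the description but is not confirmed in this ticket. It also assumes the timer is extended only once.

```c
#include <assert.h>

/* Illustrative model only; none of these function names exist in Lustre. */

/* First recovery window: 3 * obd_timeout, as stated in the ticket. */
static unsigned int first_window_s(unsigned int obd_timeout_s)
{
        return 3U * obd_timeout_s;
}

/*
 * ASSUMPTION: the window restarted with OBD_RECOVERY_TIME_SOFT is also
 * 3 * obd_timeout. This matches the observed 60 min total, but the
 * ticket does not state the value of that macro.
 */
static unsigned int vbr_window_s(unsigned int obd_timeout_s)
{
        return 3U * obd_timeout_s;
}

/*
 * Upper bound on recovery time once the first timer has expired.
 * clients_left models obd_recoverable_clients at that point: if it is
 * zero, recovery stops; otherwise the timer is reset once for VBR.
 */
static unsigned int worst_case_recovery_s(unsigned int obd_timeout_s,
                                          unsigned int clients_left)
{
        assert(obd_timeout_s > 0);

        if (clients_left == 0)
                return first_window_s(obd_timeout_s);

        return first_window_s(obd_timeout_s) + vbr_window_s(obd_timeout_s);
}
```

With obd_timeout = 600 as in this report, worst_case_recovery_s(600, 0) gives 1800 (30 min) and worst_case_recovery_s(600, 1) gives 3600 (60 min).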
| Comment by Blake Caldwell [ 28/Aug/13 ] |
|
Thanks much for the explanation! |
| Comment by James Nunez (Inactive) [ 28/Aug/13 ] |
|
Blake, Is there anything else we need to do under this ticket or should we close it? Thanks, |
| Comment by Blake Caldwell [ 28/Aug/13 ] |
|
This can be closed. |