[LU-8407] Recovery timer hangs at zero on DNE MDTs Created: 15/Jul/16  Updated: 03/Mar/17  Resolved: 08/Sep/16

| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
The recovery timer is poorly behaved and pretty confusing to Lustre admins. We have long had the odd behavior that the recovery timer counts down to zero and then starts all over again. I think that behavior was in support of older clients that didn't support the new recovery semantics. Can we kill that off finally? Or maybe allow users to configure a mode where older clients aren't permitted, allowing a single reasonable countdown?

With DNE MDTs, recovery is even more screwy. The timer counts down to zero twice (at least twice...), and then it sits there forever if any single other MDT is not up. While somewhere in the console logs it says something wishy-washy about this maybe being DNE related, we really need Lustre to do better. Lustre should clearly state somewhere that things are hung waiting on another MDT to start up.

Other newer developers have already been confused about recovery on our testbed. If they have been confused, then it is pretty certain that this is going to cause trouble for our admins on production systems.
| Comments |
| Comment by Oleg Drokin [ 18/Jul/16 ] |
So we have two problems on our hands, I guess. The "recovery timer restarts" is totally unexpected - the "old clients" (IR-incapable?) would just be accounted for, and so the system would start with a bigger timeout from the get-go. Can you please file a separate ticket about this with some samples of what goes on?

As for DNE in particular - currently we cannot guarantee filesystem consistency when one of the MDTs is down, so we will wait for them indefinitely. Similarly, MDTs are never evicted, no matter what, for the same reason (see the sketch below). So I guess once the other ticket is filed, we can concentrate here on more clearly indicating that recovery is waiting until all MDTs have rejoined?
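For context, here is a minimal sketch of the policy described above: a stale client can be evicted so recovery can finish without it, but a peer MDT is waited for indefinitely. All names here are hypothetical and for illustration only; this is not the actual Lustre eviction path (real code distinguishes MDT-MDT connections via connect flags on the export).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical export descriptor; real Lustre tracks this state on
 * struct obd_export and its connection flags. */
struct export {
	bool is_mdt_peer;   /* connection from another MDT, not a client */
	bool replay_stale;  /* missed the recovery window */
};

/* Policy sketched in the comment above: a late client may be evicted so
 * recovery can complete, but an MDT peer is never evicted, because its
 * update logs are required to keep the namespace consistent. */
static bool may_evict_during_recovery(const struct export *exp)
{
	if (exp->is_mdt_peer)
		return false;  /* wait indefinitely for other MDTs */
	return exp->replay_stale;
}

int main(void)
{
	struct export stale_client = { .is_mdt_peer = false, .replay_stale = true };
	struct export peer_mdt     = { .is_mdt_peer = true,  .replay_stale = true };

	printf("stale client evictable: %d\n", may_evict_during_recovery(&stale_client));
	printf("peer MDT evictable:     %d\n", may_evict_during_recovery(&peer_mdt));
	return 0;
}
```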
| Comment by Christopher Morrone [ 18/Jul/16 ] |
Yes, when the test cluster is available, I'll gather more information and open a new ticket for the timer restarts. We can focus on the DNE MDTs in this ticket.
| Comment by Peter Jones [ 18/Jul/16 ] |
Fan Yong, could you please advise? Thanks, Peter
| Comment by Di Wang [ 03/Aug/16 ] |
In particular, in DNE recovery, before an MDT starts recovering, it first tries to get the recovery update logs from all of the other MDTs.
| Comment by nasf (Inactive) [ 04/Aug/16 ] |
Currently, DNE recovery depends on the update logs on the MDTs. If some MDT does not start, the recovery logic cannot get the update logs from that MDT and so cannot proceed. Unlike a client-side recovery failure, a cross-MDT recovery failure may cause namespace inconsistency. Because we do not want to expose an inconsistent namespace to clients, we make recovery wait until the related update logs are available.

As for the suggestion of using LFSCK to handle recovery trouble, that would be the last resort for the worst case. Currently, the namespace LFSCK does not understand the update logs; it just scans the namespace and fixes inconsistencies based on its non-global knowledge. For example, a normal cross-MDT rename can be described as "delete name entry 'a' from 'dir_B', and insert it into 'dir_C' with name 'd'" (see the sketch below). From the namespace LFSCK's view, it only guarantees that neither 'dir_B' nor 'dir_C' contains dangling name entries, and that the target object is referenced by either the old name entry 'a' or the new name entry 'd', with a matching linkEA. That may differ from the expected cross-MDT rename result, but the namespace is consistent.

So unless the related MDT has hit hardware trouble and the update logs cannot be retrieved, it is better to try to complete normal DNE recovery; otherwise, users face either a possibly inconsistent namespace or the partial loss of some earlier cross-MDT operations. Anyway, I will make a patch to make the console message more accurate and clear, to avoid confusion.
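To make the rename example concrete, here is a minimal sketch of why replay needs logs from both MDTs. The record types and layout are hypothetical simplifications for illustration, not the actual Lustre llog update-record format:

```c
#include <stdio.h>

/* Hypothetical, simplified update-log records for the cross-MDT rename
 * described above; real Lustre update records carry much more state. */
enum update_op { OP_DEL_ENTRY, OP_INS_ENTRY };

struct update_rec {
	enum update_op op;
	const char *dir;   /* directory the name entry lives in */
	const char *name;  /* name entry being removed or inserted */
	int mdt;           /* MDT that executes this sub-operation */
};

int main(void)
{
	/* rename dir_B/a -> dir_C/d, where dir_B and dir_C live on
	 * different MDTs: one distributed operation, two sub-updates */
	struct update_rec rename_ops[] = {
		{ OP_DEL_ENTRY, "dir_B", "a", 0 },  /* executed on MDT0000 */
		{ OP_INS_ENTRY, "dir_C", "d", 1 },  /* executed on MDT0001 */
	};

	/* To replay the rename atomically after a crash, the recovering
	 * MDT must see BOTH records; if MDT0001 is down, the log holding
	 * the insert is unreachable and recovery has to wait. */
	for (int i = 0; i < 2; i++)
		printf("MDT%04d: %s '%s' in %s\n", rename_ops[i].mdt,
		       rename_ops[i].op == OP_DEL_ENTRY ? "delete" : "insert",
		       rename_ops[i].name, rename_ops[i].dir);
	return 0;
}
```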
| Comment by Christopher Morrone [ 04/Aug/16 ] |
It is more than the console that needs work, though. In /proc, the recovery status counts down to zero and then hangs there forever. If recovery can't complete until all MDTs are in contact, then the countdown probably should not start until all MDTs have established contact.
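For reference, the countdown described here is the time_remaining field of the per-target recovery_status proc file; below is a minimal C sketch that dumps it. The target name lustre-MDT0000 is an assumption - substitute your own, or use lctl get_param mdt.*.recovery_status instead.

```c
#include <stdio.h>

/* Illustrative path; the file is per-target, e.g.
 * /proc/fs/lustre/mdt/<fsname>-MDT0000/recovery_status */
#define RECOVERY_STATUS "/proc/fs/lustre/mdt/lustre-MDT0000/recovery_status"

int main(void)
{
	char line[256];
	FILE *f = fopen(RECOVERY_STATUS, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* Dump the status fields; during the hang described above, the
	 * status stays RECOVERING while time_remaining sits at 0. */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}
```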
| Comment by Di Wang [ 04/Aug/16 ] |
Right now, the recovery timer starts when the failed MDT starts (or when it receives the first connect request) and runs until the MDT is fully functional, i.e. ready to receive new requests. But with DNE, recovery is actually divided into two phases: 1. collecting the update logs from the other MDTs, and 2. the normal request replay. So yes, it makes sense to differentiate the phases here. How about we add collecting_update_logs_time under /proc to indicate how long this phase has taken (and maybe a few more items to indicate how many or which MDTs are left), and time_remaining could start counting only after the update logs have been gathered (see the sketch below)?
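A minimal sketch of the proposed split, under the assumption stated above. Names such as recovery_state and mdts_reporting are hypothetical and follow the suggested /proc entry, not any existing Lustre code: the replay countdown only starts once the update logs from all MDTs are in, while a separate counter tracks the log-collection phase.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical recovery-state tracking for the two DNE phases. */
struct recovery_state {
	time_t start_time;          /* when the failed MDT restarted */
	time_t logs_complete_time;  /* 0 until all update logs are gathered */
	int    mdts_total;
	int    mdts_reporting;      /* peer MDTs whose update logs we have */
	int    replay_timeout;      /* seconds for the replay countdown */
};

/* Phase 1: how long we have been (or were) collecting update logs. */
static long collecting_update_logs_time(const struct recovery_state *rs)
{
	time_t end = rs->logs_complete_time ? rs->logs_complete_time
					    : time(NULL);
	return (long)(end - rs->start_time);
}

/* Phase 2: the countdown only starts once every MDT's logs are in, so
 * it can no longer "hang at zero" while waiting on a peer MDT. */
static long time_remaining(const struct recovery_state *rs)
{
	if (rs->mdts_reporting < rs->mdts_total)
		return -1;  /* still in phase 1: report "waiting for MDTs" */
	return rs->replay_timeout -
	       (long)(time(NULL) - rs->logs_complete_time);
}

int main(void)
{
	struct recovery_state rs = {
		.start_time = time(NULL) - 30,
		.logs_complete_time = 0,   /* phase 1 still in progress */
		.mdts_total = 4,
		.mdts_reporting = 3,
		.replay_timeout = 300,
	};

	printf("collecting_update_logs_time: %lds\n",
	       collecting_update_logs_time(&rs));
	printf("time_remaining: %ld (-1 = waiting for MDTs)\n",
	       time_remaining(&rs));
	return 0;
}
```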
| Comment by Gerrit Updater [ 05/Aug/16 ] |
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21759 |
| Comment by Gerrit Updater [ 08/Sep/16 ] |
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21759/ |
| Comment by Peter Jones [ 08/Sep/16 ] |
Landed for 2.9 |