
LU-8407: Recovery timer hangs at zero on DNE MDTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.8.0
    • Severity: 3

    Description

      The recovery timer is poorly behaved, and pretty confusing to Lustre admins.

      We have long had the odd behavior that the recovery timer counts down to zero and then starts all over again. I think that behavior was in support of older clients that didn't support new recovery semantics. Can we kill that off finally? Or maybe allow users to configure a mode where older clients aren't permitted, allowing a single reasonable countdown?

      With DNE MDTs, recovery is even more screwy. The timer counts down to zero twice (at least twice...), and then it sits there forever if any other MDT is not up. Somewhere in the console logs there is a wishy-washy message about this maybe being DNE related, but Lustre really needs to do better.

      Lustre should clearly state somewhere that things are hung waiting on another MDT to start up.

      Other newer developers have already been confused about recovery on our testbed. If they have been confused, then it is pretty certain that this is going to cause trouble for our admins on production systems.
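
      For context, the countdown in question is the time_remaining field of each MDT's recovery_status file (also visible via "lctl get_param mdt.*.recovery_status"). Below is a minimal userspace sketch of watching it, assuming the traditional /proc location and the usual field names; both the example path and the field names may vary between Lustre versions.

      /*
       * Illustrative userspace sketch, not part of Lustre: print the "status"
       * and "time_remaining" lines of an MDT's recovery_status file, which is
       * roughly what an admin watches during recovery.  The default path below
       * is only an example; "lctl get_param -n mdt.*.recovery_status" shows
       * the same data on a real system.
       */
      #include <stdio.h>
      #include <string.h>

      int main(int argc, char **argv)
      {
              const char *path = argc > 1 ? argv[1] :
                      "/proc/fs/lustre/mdt/lustre-MDT0000/recovery_status";
              char line[256];
              FILE *fp = fopen(path, "r");

              if (fp == NULL) {
                      perror(path);
                      return 1;
              }
              while (fgets(line, sizeof(line), fp) != NULL) {
                      if (strncmp(line, "status:", 7) == 0 ||
                          strncmp(line, "time_remaining:", 15) == 0)
                              fputs(line, stdout);
              }
              fclose(fp);
              return 0;
      }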

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21759/
            Subject: LU-8407 recovery: more clear message about recovery failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6dd0be19a97945db5da61ecdf845087b936805fa


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21759
            Subject: LU-8407 recovery: more clear message about recovery failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7cd6fa75df0c05b2fadbad8cddf391a8897332a0


            di.wang Di Wang (Inactive) added a comment -

            Right now, the recovery timer runs from when the failed MDT starts (or when it receives the first connect request) until the MDT is fully functional, i.e. ready to accept new requests. With DNE, however, recovery is actually divided into two phases:

            1. collecting the update logs, and
            2. replaying requests, either from client replay requests or from the update logs.

            So yes, it makes sense to differentiate the phases here. How about we add a collecting_update_logs_time entry under /proc to indicate how long this phase has taken (maybe with a few more items to indicate how many, or which, MDTs are still missing), and have time_remaining start counting down only after the update logs have been collected?
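
            A rough sketch of what such an entry could look like, written in the style of Lustre's existing lprocfs seq_file handlers, is shown below. This is only an illustration of the proposal: the obd_update_log_start/obd_update_log_done fields are placeholders for whatever the DNE recovery code actually tracks, not real struct obd_device members.

            /*
             * Hypothetical sketch only, assuming the Lustre kernel build
             * environment: a "collecting_update_logs_time" entry in the style
             * of Lustre's lprocfs seq_file handlers.  The obd_update_log_*
             * fields are placeholders, not real struct obd_device members.
             */
            #include <linux/seq_file.h>
            #include <linux/timekeeping.h>      /* ktime_get_real_seconds() */
            #include <obd.h>                    /* struct obd_device */
            #include <lprocfs_status.h>         /* LPROC_SEQ_FOPS_RO() */

            static int mdt_collect_update_logs_time_seq_show(struct seq_file *m,
                                                             void *data)
            {
                    struct obd_device *obd = m->private;
                    time64_t start = obd->obd_update_log_start;   /* placeholder */
                    time64_t done  = obd->obd_update_log_done;    /* placeholder */

                    if (start == 0)
                            seq_puts(m, "0 (update log collection not started)\n");
                    else if (done != 0)
                            seq_printf(m, "%lld\n", (long long)(done - start));
                    else
                            seq_printf(m, "%lld (still collecting)\n",
                                       (long long)(ktime_get_real_seconds() - start));
                    return 0;
            }
            LPROC_SEQ_FOPS_RO(mdt_collect_update_logs_time);

            With something like this in place, time_remaining could stay untouched until the entry reports that log collection has finished, as suggested above.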


            morrone Christopher Morrone (Inactive) added a comment - edited

            It is more than the console that needs work, though. In proc, the recovery status counts down to zero and then hangs there forever. If recovery can't complete until all MDTs are in contact, then the countdown probably should not start until all MDTs have established contact.

            yong.fan nasf (Inactive) added a comment - edited

            Currently, DNE recovery depends on the update logs on the MDTs. If some MDT does not start, the recovery logic cannot get the update logs from that MDT and therefore cannot proceed. Unlike a client-side recovery failure, a cross-MDT recovery failure may cause namespace inconsistency. Because we do not want to expose an inconsistent namespace to clients, recovery waits until the related update logs are available.

            As for the suggestion of using LFSCK to handle recovery trouble, that is the last resort for the worst case. Currently, the namespace LFSCK does not understand the update logs; it just scans the namespace and fixes inconsistencies based on its non-global knowledge. For example, a cross-MDT rename is normally described as "delete name entry 'a' from 'dir_B', and insert it into 'dir_C' with name 'd'". From the namespace LFSCK's point of view, it only guarantees that neither 'dir_B' nor 'dir_C' contains a dangling name entry, and that the target object is referenced by either the old name entry 'a' or the new name entry 'd' and matches the related linkEA. That may differ from the expected cross-MDT rename result, but the namespace is consistent.

            So unless the related MDT has hit hardware trouble and the update logs cannot be retrieved, it is better to let the normal DNE recovery complete; otherwise, users have to accept either a possibly inconsistent namespace or the partial loss of some earlier cross-MDT operations.

            Anyway, I will make a patch to make the console message more accurate and clear, to avoid confusion.
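
            To illustrate the kind of message being proposed, a hedged sketch follows. It is not the change that eventually landed in http://review.whamcloud.com/21759; the helper function and its missing_mdt parameter are invented for the example.

            /*
             * Hypothetical helper, not the landed patch: emit an explicit
             * console message while recovery is blocked waiting for another
             * MDT's update logs.  LCONSOLE_WARN() is libcfs' console logging
             * macro; the function and its missing_mdt argument are
             * illustrative only.
             */
            #include <libcfs/libcfs.h>          /* LCONSOLE_WARN() */
            #include <obd.h>                    /* struct obd_device */

            static void report_recovery_blocked(struct obd_device *obd,
                                                const char *missing_mdt)
            {
                    LCONSOLE_WARN("%s: recovery cannot complete: still waiting "
                                  "for update logs from %s; start that MDT, or "
                                  "abort recovery manually if it is permanently "
                                  "lost\n", obd->obd_name, missing_mdt);
            }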


            di.wang Di Wang (Inactive) added a comment -

            In particular, in DNE recovery, before one MDT starts recovering it first tries to get the recovery update logs from all of the other MDTs. If one MDT is not up at that moment, it waits forever, until the admin steps in to abort the recovery; so the inconsistency is known.
            A better approach might be that, instead of waiting, it marks the system read-only and triggers LFSCK automatically to check and fix the consistency, but we are not there yet. Yes, it should provide clearer information to indicate which MDT is not up.

            pjones Peter Jones added a comment -

            Fan Yong

            Could you please advise?

            Thanks

            Peter


            morrone Christopher Morrone (Inactive) added a comment - edited

            Yes, when the test cluster is available, I'll get more information on the restarts and open a new ticket for the timer restarts. We can focus on the DNE MDTs in this ticket.

            green Oleg Drokin added a comment -

            So we have two problems on our hands, I guess.

            The "recovery timer restarts" is totally unexpected - the "old clients" (IR incapable?) would just be accounted for and so the system would start with a bigger timeout from the get go. Can you please file a separate ticket about this with some samples of what goes on?

            As for DNE in particular - currently we cannot guarantee FS consistency when one of the MDTs is down, so we will wait for them indefinitely. Similarly, MDTs are never evicted, no matter what, for the same reason.
            If you feel like it, you can force MDT eviction/aborted recovery manually and bear the consequences.

            So I guess once the other ticket is filed, we can concentrate here on more clearly indicating that recovery is waiting until all MDTs have rejoined?


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: