
LU-8407: Recovery timer hangs at zero on DNE MDTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.8.0
    • Severity: 3

    Description

      The recovery timer is poorly behaved, and pretty confusing to Lustre admins.

      We have long had the odd behavior that the recovery timer counts down to zero and then starts all over again. I think that behavior was in support of older clients that didn't support new recovery semantics. Can we kill that off finally? Or maybe allow users to configure a mode where older clients aren't permitted, allowing a single reasonable countdown?

      With DNE MDTs, recovery is even more screwy. The timer counts down to zero twice (at least twice...), and then it sits there forever if any other MDT is not up. Somewhere in the console logs there is a wishy-washy message about this maybe being DNE related, but Lustre really needs to do better.

      Lustre should clearly state somewhere that things are hung waiting on another MDT to start up.

      Other newer developers have already been confused about recovery on our testbed. If they have been confused, then it is pretty certain that this is going to cause trouble for our admins on production systems.
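
      For context, the countdown in question is the time_remaining field of each MDT's recovery_status file (also visible via "lctl get_param mdt.*.recovery_status"). Below is a minimal userspace sketch of watching it, assuming the traditional /proc location and the usual field names; both the example path and the field names may vary between Lustre versions.

      /*
       * Illustrative userspace sketch, not part of Lustre: print the "status"
       * and "time_remaining" lines of an MDT's recovery_status file, which is
       * roughly what an admin watches during recovery.  The default path below
       * is only an example; "lctl get_param -n mdt.*.recovery_status" shows
       * the same data on a real system.
       */
      #include <stdio.h>
      #include <string.h>

      int main(int argc, char **argv)
      {
              const char *path = argc > 1 ? argv[1] :
                      "/proc/fs/lustre/mdt/lustre-MDT0000/recovery_status";
              char line[256];
              FILE *fp = fopen(path, "r");

              if (fp == NULL) {
                      perror(path);
                      return 1;
              }
              while (fgets(line, sizeof(line), fp) != NULL) {
                      if (strncmp(line, "status:", 7) == 0 ||
                          strncmp(line, "time_remaining:", 15) == 0)
                              fputs(line, stdout);
              }
              fclose(fp);
              return 0;
      }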

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21759/
            Subject: LU-8407 recovery: more clear message about recovery failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6dd0be19a97945db5da61ecdf845087b936805fa


            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21759
            Subject: LU-8407 recovery: more clear message about recovery failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7cd6fa75df0c05b2fadbad8cddf391a8897332a0


            di.wang Di Wang (Inactive) added a comment -

            Right now, the recovery timer runs from when the failed MDT starts (or when it receives the first connect request) until the MDT is fully functional, i.e. ready to accept new requests. With DNE, however, recovery is actually divided into two phases:

            1. collecting the update logs, and
            2. replaying requests, either from client replay requests or from the update logs.

            So yes, it makes sense to differentiate the phases here. How about we add a collecting_update_logs_time entry under /proc to indicate how long this phase has taken (maybe with a few more items to indicate how many, or which, MDTs are still missing), and have time_remaining start counting down only after the update logs have been collected?
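
            A rough sketch of what such an entry could look like, written in the style of Lustre's existing lprocfs seq_file handlers, is shown below. This is only an illustration of the proposal: the obd_update_log_start/obd_update_log_done fields are placeholders for whatever the DNE recovery code actually tracks, not real struct obd_device members.

            /*
             * Hypothetical sketch only, assuming the Lustre kernel build
             * environment: a "collecting_update_logs_time" entry in the style
             * of Lustre's lprocfs seq_file handlers.  The obd_update_log_*
             * fields are placeholders, not real struct obd_device members.
             */
            #include <linux/seq_file.h>
            #include <linux/timekeeping.h>      /* ktime_get_real_seconds() */
            #include <obd.h>                    /* struct obd_device */
            #include <lprocfs_status.h>         /* LPROC_SEQ_FOPS_RO() */

            static int mdt_collect_update_logs_time_seq_show(struct seq_file *m,
                                                             void *data)
            {
                    struct obd_device *obd = m->private;
                    time64_t start = obd->obd_update_log_start;   /* placeholder */
                    time64_t done  = obd->obd_update_log_done;    /* placeholder */

                    if (start == 0)
                            seq_puts(m, "0 (update log collection not started)\n");
                    else if (done != 0)
                            seq_printf(m, "%lld\n", (long long)(done - start));
                    else
                            seq_printf(m, "%lld (still collecting)\n",
                                       (long long)(ktime_get_real_seconds() - start));
                    return 0;
            }
            LPROC_SEQ_FOPS_RO(mdt_collect_update_logs_time);

            With something like this in place, time_remaining could stay untouched until the entry reports that log collection has finished, as suggested above.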


            morrone Christopher Morrone (Inactive) added a comment - edited

            It is more than the console that needs work, though. In proc, the recovery status counts down to zero and then hangs there forever. If recovery can't complete until all MDTs are in contact, then the countdown probably should not start until all MDTs have established contact.

            yong.fan nasf (Inactive) added a comment - edited

            Currently, DNE recovery depends on the update logs on the MDTs. If some MDT does not start, the recovery logic cannot get the update logs from that MDT and therefore cannot proceed. Unlike a client-side recovery failure, a cross-MDT recovery failure may cause namespace inconsistency. Because we do not want to expose an inconsistent namespace to clients, recovery waits until the related update logs are available.

            As for the suggestion of using LFSCK to handle recovery trouble, that is the last resort for the worst case. Currently, the namespace LFSCK does not understand the update logs; it just scans the namespace and fixes inconsistencies based on its non-global knowledge. For example, a cross-MDT rename is normally described as "delete name entry 'a' from 'dir_B', and insert it into 'dir_C' with name 'd'". From the namespace LFSCK's point of view, it only guarantees that neither 'dir_B' nor 'dir_C' contains a dangling name entry, and that the target object is referenced by either the old name entry 'a' or the new name entry 'd' and matches the related linkEA. That may differ from the expected cross-MDT rename result, but the namespace is consistent.

            So unless the related MDT has hit hardware trouble and the update logs cannot be retrieved, it is better to let the normal DNE recovery complete; otherwise, users have to accept either a possibly inconsistent namespace or the partial loss of some earlier cross-MDT operations.

            Anyway, I will make a patch to make the console message more accurate and clear, to avoid confusion.
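
            To illustrate the kind of message being proposed, a hedged sketch follows. It is not the change that eventually landed in http://review.whamcloud.com/21759; the helper function and its missing_mdt parameter are invented for the example.

            /*
             * Hypothetical helper, not the landed patch: emit an explicit
             * console message while recovery is blocked waiting for another
             * MDT's update logs.  LCONSOLE_WARN() is libcfs' console logging
             * macro; the function and its missing_mdt argument are
             * illustrative only.
             */
            #include <libcfs/libcfs.h>          /* LCONSOLE_WARN() */
            #include <obd.h>                    /* struct obd_device */

            static void report_recovery_blocked(struct obd_device *obd,
                                                const char *missing_mdt)
            {
                    LCONSOLE_WARN("%s: recovery cannot complete: still waiting "
                                  "for update logs from %s; start that MDT, or "
                                  "abort recovery manually if it is permanently "
                                  "lost\n", obd->obd_name, missing_mdt);
            }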


            di.wang Di Wang (Inactive) added a comment -

            In particular, in DNE recovery, before one MDT starts recovering it first tries to get the recovery update logs from all of the other MDTs. If one MDT is not up at that moment, it waits forever, until the admin steps in to abort the recovery; so the inconsistency is known.
            A better approach might be that, instead of waiting, it marks the system read-only and triggers LFSCK automatically to check and fix the consistency, but we are not there yet. Yes, it should provide clearer information to indicate which MDT is not up.

            pjones Peter Jones added a comment -

            Fan Yong

            Could you please advise?

            Thanks

            Peter


            morrone Christopher Morrone (Inactive) added a comment - edited

            Yes, when the test cluster is available, I'll get more information on the restarts and open a new ticket for the timer restarts. We can focus on the DNE MDTs in this ticket.

            green Oleg Drokin added a comment -

            So we have two problems on our hands, I guess.

            The "recovery timer restarts" is totally unexpected - the "old clients" (IR incapable?) would just be accounted for and so the system would start with a bigger timeout from the get go. Can you please file a separate ticket about this with some samples of what goes on?

            As for DNE in particular - currently we cannot guarantee FS consistency when one of the MDTs is down, so we will wait for them indefinitely. Similarly, MDTs are never evicted, no matter what, for the same reason.
            If you feel like it, you can force MDT eviction/aborted recovery manually and bear the consequences.

            So I guess once the other ticket is filed, we can concentrate here on more clearly indicating that recovery is waiting until all MDTs have rejoined?


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: