Recovery timer hangs at zero on DNE MDTs

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.9.0
    • Lustre 2.8.0
    • 3

    Description

      The recovery timer is poorly behaved, and pretty confusing to Lustre admins.

      We have long had the odd behavior that the recovery timer counts down to zero and then starts all over again. I think that behavior was there to support older clients that didn't support the new recovery semantics. Can we finally kill that off? Or maybe allow users to configure a mode where older clients aren't permitted, so that there is a single reasonable countdown?

      With DNE MDTs, recovery is even more screwy. The timer counts down to zero at least twice, and then it sits there forever if any other MDT is not up. Somewhere in the console logs there is a wishy-washy message suggesting this may be DNE related, but we really need Lustre to do better.

      Lustre should clearly state somewhere that things are hung waiting on another MDT to start up.
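
      To illustrate the kind of check admins currently have to do by hand, here is a minimal sketch that flags a target whose status still says it is recovering while the timer reads zero. It assumes the status text is a set of "key: value" lines with `status` and `time_remaining` fields; those field names, the `RECOVERING` value, and the helper names are assumptions for illustration, and the sample text is made up rather than captured from a real system.

```python
def parse_status(text):
    """Parse 'key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def recovery_looks_hung(text):
    """Return True if the target is still recovering but the timer is at zero.

    Field names ('status', 'time_remaining') are assumed, not confirmed
    by this ticket.
    """
    fields = parse_status(text)
    if fields.get("status") != "RECOVERING":
        return False
    try:
        remaining = int(fields.get("time_remaining", ""))
    except ValueError:
        # Missing or non-numeric timer: do not guess.
        return False
    return remaining <= 0


# Hypothetical sample input, not real console or proc output.
SAMPLE = """\
status: RECOVERING
time_remaining: 0
"""

if __name__ == "__main__":
    print(recovery_looks_hung(SAMPLE))
```

      A check like this can only say that the timer has expired while recovery is still in progress. It cannot say which MDT is being waited on, which is exactly the information Lustre itself needs to report.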

      Newer developers have already been confused by recovery on our testbed. If they are confused, it is pretty certain that this will cause trouble for our admins on production systems.

          Activity

            [LU-8407] Recovery timer hangs at zero on DNE MDTs
            mdiep Minh Diep made changes -
            Link Original: This issue is related to JFC-17 [ JFC-17 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to JFC-20 [ JFC-20 ]
            mdiep Minh Diep made changes -
            Link Original: This issue is related to JFC-24 [ JFC-24 ]
            mdiep Minh Diep made changes -
            Link Original: This issue is related to LDEV-341 [ LDEV-341 ]
            mdiep Minh Diep made changes -
            Link New: This issue is blocked by LDEV-342 [ LDEV-342 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to JFC-24 [ JFC-24 ]
            ofaaland Olaf Faaland made changes -
            Labels Original: llnl topllnl New: llnl
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-10 [ JFC-10 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LDEV-341 [ LDEV-341 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-17 [ JFC-17 ]

            People

              yong.fan nasf (Inactive)
              morrone Christopher Morrone (Inactive)