Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.8.0
- 3
Description
The recovery timer is poorly behaved and pretty confusing to Lustre admins.
We have long had the odd behavior that the recovery timer counts down to zero and then starts all over again. I think that behavior was there to support older clients that didn't support the new recovery semantics. Can we finally kill that off? Or maybe allow users to configure a mode where older clients aren't permitted, so that there is a single, reasonable countdown?
With DNE MDTs, recovery is even more screwy. The timer counts down to zero at least twice, and then it sits there forever if any other MDT is not up. Somewhere in the console logs there is a vague message suggesting that this may be DNE related, but we really need Lustre to do better.
Lustre should clearly state somewhere that things are hung waiting on another MDT to start up.
Newer developers have already been confused by recovery on our testbed. If they have been confused, then it is pretty certain that this is going to cause trouble for our admins on production systems.
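To make the request concrete, here is a rough sketch of the kind of unambiguous, one-line summary an admin would want to see. It is only an illustration and not Lustre code. It assumes the recovery status is available as plain "key: value" text with a "status" field and a "time_remaining" field in seconds. The function names and the message wording are hypothetical, and the sketch cannot actually tell which MDT is being waited on.

```python
def parse_status(text):
    """Parse 'key: value' lines into a dict (input format is an assumption)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_recovery(text):
    """Return a one-line, admin-facing summary of a recovery status dump."""
    fields = parse_status(text)
    status = fields.get("status", "UNKNOWN")
    if status != "RECOVERING":
        return f"recovery status: {status}"
    try:
        remaining = int(fields.get("time_remaining", ""))
    except ValueError:
        return "recovering; time remaining unknown"
    if remaining <= 0:
        # Timer expired but recovery has not finished: the confusing case
        # described in this ticket. The cause is a guess, so say "possibly".
        return ("recovering; timer expired, still waiting "
                "(possibly on another MDT that is not up)")
    return f"recovering; {remaining}s remaining"
```

For example, describe_recovery("status: RECOVERING\ntime_remaining: 0") reports the expired-timer case in words instead of leaving the admin to interpret a zero or a restarted countdown.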
Issue Links
- is related to: LU-6994 MDT recovery timer goes negative, recovery never ends (Resolved)