Improvement
Resolution: Unresolved
Medium
The "lfs quota" operations are blocked when an OST is offline. I see this regularly with my home Lustre system, where the OSTs are on HDDs that spin down to save energy. The "lfs quota" operation fetches the quota usage information directly from each MDT and OST on request, and the command does not return until the HDDs have spun up, which shows that the QMT keeps no cache of the usage information.
Offline OSTs are expected to be more common with FLR-ECRO deployments, and as far as possible filesystem operations should not block in this case.
For proper operation when an OST is (temporarily) offline, it would be useful for QMT0000 to maintain a (lazy) cache of the quota usage of each target. This has two benefits:
- it maintains filesystem accessibility for "lfs quota" commands when an OST is undergoing failover or is permanently offline.
- it allows estimating how much quota usage was on that OST before it went offline for purposes of future quota grants (i.e. the OST's quota usage is not just "lost" when it is inaccessible).
In most cases the OST quota usage does not need to be fully up to date, since quota enforcement is itself not exact. If the OST is online, then the QMT and client should prefer the quota usage reported by the actual target. If the OST is offline, then the cached quota usage should be "good enough", since the space used on that OST cannot change while it is offline (excluding an OST reformat or e2fsck correcting errors).
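The lookup policy described above could be sketched roughly as follows. This is only an illustration in Python, not Lustre code: `total_usage`, `query_target`, the `cache` dictionary, and the target names are all hypothetical, and `query_target` is assumed to return `None` for an offline target instead of blocking.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical callback: usage for (target, quota ID), or None if the target is offline.
QueryFn = Callable[[str, int], Optional[int]]
# Hypothetical QMT-side cache: (target, quota ID) -> last known usage.
UsageCache = Dict[Tuple[str, int], int]


def total_usage(targets: Iterable[str], qid: int, query_target: QueryFn,
                cache: UsageCache) -> Tuple[int, List[str], List[str]]:
    """Sum usage for one quota ID, preferring live values over cached ones.

    Returns (total, targets answered from cache, targets with no data at all).
    """
    total = 0
    from_cache: List[str] = []
    unknown: List[str] = []
    for target in targets:
        live = query_target(target, qid)
        if live is not None:
            # Target is online: use its answer and refresh the cache.
            cache[(target, qid)] = live
            total += live
        elif (target, qid) in cache:
            # Target is offline: the last known value is "good enough".
            total += cache[(target, qid)]
            from_cache.append(target)
        else:
            # Offline and never seen: report it instead of blocking.
            unknown.append(target)
    return total, from_cache, unknown
```

Returning the list of targets answered from the cache lets the caller mark the result as approximate rather than silently mixing live and cached values.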
This means that a cache of the OST usage, written asynchronously to QMT0000 for each ID, should be enough. The cache does not need to be updated synchronously or recovered after a failure, since it can be refreshed the next time the OST is online, or when the OST is permanently removed from the filesystem.
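A lazy cache of this kind might look roughly like the sketch below. It is a Python illustration under stated assumptions, not a proposed implementation: the class `LazyUsageCache` and its methods are hypothetical names, the cache is in-memory only, and dropping entries on permanent removal is one reading of the last sentence above.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class LazyUsageCache:
    """Hypothetical per-target usage cache kept on the quota master."""

    def __init__(self) -> None:
        # (target, quota ID) -> last usage applied to the cache.
        self.usage: Dict[Tuple[str, int], int] = {}
        # Reports received but not yet applied; losing these is acceptable.
        self.pending: Deque[Tuple[str, int, int]] = deque()

    def report(self, target: str, qid: int, usage: int) -> None:
        """Queue a usage report from an online target without blocking it."""
        self.pending.append((target, qid, usage))

    def flush(self) -> int:
        """Apply queued reports (e.g. from a periodic task); return the count."""
        applied = 0
        while self.pending:
            target, qid, usage = self.pending.popleft()
            self.usage[(target, qid)] = usage
            applied += 1
        return applied

    def drop_target(self, target: str) -> None:
        """Forget all entries for a target permanently removed from the filesystem."""
        for key in [k for k in self.usage if k[0] == target]:
            del self.usage[key]
```

Because nothing waits on `flush()`, a crash before it runs only loses reports that the target can send again the next time it is online.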