Details
Bug
Resolution: Fixed
Critical
Lustre 2.5.4
None
RHEL-6.6, lustre-2.5.4
2
Description
We had 4 OSTs that we deactivated because of an imbalance in utilization that was causing ENOSPC errors for our users. We identified a file that was consuming a significant amount of space and deleted it while the OSTs were deactivated. The file is no longer seen in the directory structure (the MDS processed the request), but the objects on the OSTs were not marked as free. After re-activating the OSTs, the llog does not appear to have been flushed, which is what should free those objects.
At this time, some users are not able to run jobs because they cannot allocate any space.
We understand how this is supposed to work, but as the user in LU-4295 pointed out, it does not work that way.
Please advise.
Mike, there are two separate problems:
1) The current method for OST space balancing is to deactivate the OSP and then migrate files (or let users do this gradually), so that the deactivated OST is not used for new objects. However, since 2.4, deactivating the OSP also prevents the MDS from destroying the objects of unlinked files, so space is never released on the OST, which confuses users. This issue will be addressed by LU-4825, which adds a new method for disabling object allocation on an OST without fully deactivating the OSP, so that the MDS can still process object destroys.
2) When the deactivated OSP is reactivated, even after restarting the OST, it does not process the unlink llogs (and presumably the setattr logs, but that is harder to check) until the MDS is stopped and restarted. The MDS should begin processing the recovery llogs after the OSP has been reactivated. That is what this bug is for.
Even though LU-4825 will reduce the number of cases in which an OSP needs to be deactivated (i.e. not for space balancing anymore), there are other times when this still needs to be done (e.g. an OST offline for maintenance or similar), so recovery llog processing still needs to work.
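To make the two problems easier to follow, here is a toy Python model of the behaviour described in this comment. It is only an illustration under stated assumptions, not Lustre code: the classes `ToyOST` and `ToyOSP`, the `allow_create` switch, and the `replay_llog` argument are all hypothetical names invented for this sketch. It shows that a destroy logged while the OSP is deactivated frees no space, that reactivation without llog replay (the reported behaviour) still frees nothing, and that disabling only object allocation (the LU-4825 idea) lets destroys go through.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOST:
    """Hypothetical stand-in for an OST: maps object id -> size."""
    objects: dict = field(default_factory=dict)

    def used(self) -> int:
        return sum(self.objects.values())


@dataclass
class ToyOSP:
    """Hypothetical stand-in for the MDS-side OSP device of one OST."""
    ost: ToyOST
    active: bool = True
    allow_create: bool = True  # assumed name for an LU-4825-style switch
    unlink_llog: list = field(default_factory=list)

    def create(self, obj_id: str, size: int) -> None:
        if not (self.active and self.allow_create):
            raise RuntimeError("object allocation disabled on this OST")
        self.ost.objects[obj_id] = size

    def unlink(self, obj_id: str) -> None:
        # The destroy is always logged; it reaches the OST only if active.
        self.unlink_llog.append(obj_id)
        if self.active:
            self.process_llog()

    def process_llog(self) -> None:
        # Replay every pending destroy record against the OST.
        while self.unlink_llog:
            self.ost.objects.pop(self.unlink_llog.pop(0), None)

    def activate(self, replay_llog: bool) -> None:
        self.active = True
        if replay_llog:
            self.process_llog()


def demo() -> dict:
    results = {}

    # Problem 1: unlink while the OSP is deactivated frees no space.
    osp = ToyOSP(ToyOST())
    osp.create("big", 100)
    osp.active = False
    osp.unlink("big")
    results["deactivated"] = osp.ost.used()

    # Problem 2 (reported behaviour): reactivation without llog replay.
    osp.activate(replay_llog=False)
    results["reactivated_no_replay"] = osp.ost.used()

    # Expected behaviour: the pending records are replayed on reactivation.
    osp.activate(replay_llog=True)
    results["reactivated_with_replay"] = osp.ost.used()

    # LU-4825 idea: block new objects only, so destroys still go through.
    osp2 = ToyOSP(ToyOST())
    osp2.create("big", 100)
    osp2.allow_create = False
    osp2.unlink("big")
    results["create_disabled"] = osp2.ost.used()
    return results


if __name__ == "__main__":
    for name, used in demo().items():
        print(name, used)
```

In this model the only difference between the reported and the expected behaviour is whether `activate` drains the pending unlink records, which is the part this ticket asks to have fixed.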