Considering that we already have problems with quotas being spread across OSTs, I think that spreading quotas across all of the clients could be even worse. If each client in a 32000-node system needs 32MB of quota to generate good-sized RPCs, that means 1TB of quota would be needed. Even with only 1MB of quota per client, this would be 32GB of quota consumed just to generate a single RPC per client.
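For reference, the arithmetic behind those figures (assuming the 32,000 clients mentioned above and decimal units):

    32,000 clients x 32 MB/client = 1,024,000 MB ≈ 1 TB
    32,000 clients x  1 MB/client =    32,000 MB ≈ 32 GB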
Right, but to implement accurate quota enforcement for chgrp and cached writes, I think that's probably the only way we have. It's worth noting that these reserved quotas can be reclaimed when the server is short of quota (usage approaching the limit), and an inactive client (no user/group writes from that client) should end up with zero reservation.
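A minimal sketch of the reclaim policy described above, as it might look on the quota master; the threshold logic and the name quota_needs_reclaim are assumptions for illustration, not existing Lustre code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: decide whether reserved-but-unused quota should be
 * pulled back from clients for a given ID. "reserved" is the total grant
 * currently parked on clients; idle clients would be reclaimed first and
 * should eventually end up with zero reservation.
 */
static bool quota_needs_reclaim(uint64_t usage, uint64_t reserved,
				uint64_t limit)
{
	/* reclaim once usage plus outstanding reservations nears the limit */
	return usage + reserved >= limit;
}
```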
I was thinking that the quota/grant acquire could be done by enqueueing a DLM lock on the quota resource FID, with the quota/grant returned to the client in the LVB data; the client then keeps this LVB updated as quota/grant is consumed. When the lock is cancelled, any remaining quota/grant is returned with the lock.
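To make that flow concrete, here is a minimal client-side sketch of an LVB-carried grant; the struct layout and helper names are invented for illustration and do not reflect the actual lquota LVB format:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical LVB contents for a per-ID quota lock (illustrative only). */
struct quota_grant_lvb {
	uint64_t qgl_id;	/* uid/gid the grant applies to */
	uint64_t qgl_granted;	/* quota handed out with the lock */
	uint64_t qgl_consumed;	/* amount consumed locally so far */
};

/* Consume from the cached grant; a real client would re-acquire from the
 * master instead of failing outright when the grant is exhausted. */
static int qgl_consume(struct quota_grant_lvb *lvb, uint64_t bytes)
{
	if (lvb->qgl_consumed + bytes > lvb->qgl_granted)
		return -EDQUOT;
	lvb->qgl_consumed += bytes;
	return 0;
}

/* On lock cancel, the unused remainder is what flows back to the master. */
static uint64_t qgl_unused(const struct quota_grant_lvb *lvb)
{
	return lvb->qgl_granted - lvb->qgl_consumed;
}
```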
My plan was to use a single lock for all IDs (not a per-ID lock), and that lock would never be revoked; I just want to use its existing scalable glimpse mechanism to reclaim 'grant' or to notify when a limit is set or cleared.
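A sketch of how the glimpse callback on that single, never-revoked lock might dispatch the two kinds of notifications; the event enum and helper prototypes are hypothetical, chosen only to illustrate reusing the glimpse mechanism:

```c
#include <stdint.h>

/* Hypothetical glimpse payload on the single global quota lock. */
enum qgl_event {
	QGL_RECLAIM_GRANT,	/* master wants part of a client's grant back */
	QGL_LIMIT_SET,		/* a limit was set for some ID */
	QGL_LIMIT_CLEARED,	/* a limit was cleared for some ID */
};

struct qgl_glimpse_desc {
	enum qgl_event	qgd_event;
	uint64_t	qgd_id;		/* affected uid/gid */
	uint64_t	qgd_amount;	/* grant to release, if any */
};

/* Assumed helpers, implemented elsewhere in this sketch. */
int qgl_release_grant(uint64_t id, uint64_t amount);
int qgl_refresh_limit(uint64_t id);

/* Client-side glimpse callback: the lock itself is never revoked,
 * the client only reacts to the packed event. */
static int qgl_glimpse_cb(const struct qgl_glimpse_desc *desc)
{
	switch (desc->qgd_event) {
	case QGL_RECLAIM_GRANT:
		return qgl_release_grant(desc->qgd_id, desc->qgd_amount);
	case QGL_LIMIT_SET:
	case QGL_LIMIT_CLEARED:
		return qgl_refresh_limit(desc->qgd_id);
	}
	return 0;
}
```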
The MDS would need to track the total reserved quota for the setattr operations, not just check each one. It would "consume" quota locally (from the quota master) for the new user/group for each operation, and that quota would need to be logged in the setattr llog and transferred to the OSTs along with the setattr operations. I don't think the MDS would need to query the OSTs for their quota limits at all, but rather acquire its own quota. If there is a separate mechanism to reclaim space from OSTs, then that would happen in the background.
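A rough sketch of what carrying the reserved quota in the setattr llog could look like; the record layout and field names are invented for illustration and are not the actual on-disk setattr record format:

```c
#include <stdint.h>

/*
 * Illustrative only: a setattr record extended with the quota the MDS
 * reserved from the master for the new owner/group, so the OSTs could
 * apply the ownership change without acquiring it again themselves.
 */
struct setattr_quota_rec {
	uint64_t sqr_oid;	/* object the chown/chgrp applies to */
	uint32_t sqr_uid;	/* new owner */
	uint32_t sqr_gid;	/* new group */
	uint64_t sqr_reserved;	/* blocks reserved from the quota master */
};
```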
I think the major drawback of this method is that it increases quota imbalance unnecessarily: when a setattr is processed on the MDT, it acquires a large amount of quota limit from the quota master, and then, shortly afterwards when the setattr is synced to the OSTs, the OSTs have to acquire that limit back. (If the OSTs used the limit packed in the setattr log directly, it would introduce more complexity in limit syncing between master and slaves.) If the OSTs acquire the limit in the first place, this kind of thrashing can be avoided.
It also requires changing the quota slave to be aware of the setattr log; it would need to scan the setattr log on quota reintegration or on rebalancing.
Another thing worth mentioning is that limit reclaim on the OSTs does happen in the background, but the setattr has to wait for the rebalancing to finish (to acquire the limit for the MDT), so the MDT needs to handle this carefully to avoid blocking MDT service threads. Also, the MDT needs to glimpse the OST objects to learn the currently used blocks before the setattr. Having all of this work handled by the client looks better to me.
It is true that there could be a race condition if the file size is growing quickly while the ownership is being changed, but that is no different from the quota races we have today for regular writes.
Yes, as I mentioned in the proposal, I think we can use this opportunity to solve the current problem. It looks to me that both approaches require a lot of development effort, so my opinion is that we should choose the one that solves the problem better, while at the same time the framework could be reused for other purposes.
BTW, this race window doesn't look that short to me.
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33682
Subject: Revert "
LU-5152quota: enforce block quota for chgrp"Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 3f3e9312be341981060ec1b9912e1b93645c94a8