Proposal to fix LU-5152 and others

== Introduction ==

There is a known quota issue (LU-5152) which can be described briefly as: if a user belongs to two groups, quota is not enforced properly when the user changes a file's group from one to the other. This is a long-standing issue present since the day Lustre quota was introduced, but it did not attract much attention until recently, when several customers reported it. I think it is now time to offer a feasible solution.

The hard part of the problem is enforcing block quota for the chgrp operation, because chgrp is a two-step operation: the group is changed on the MDT object first, and the change to the OST objects is deferred to the OSP sync thread. We therefore must reserve enough quota on the OSTs before the chgrp is applied to the MDT object; otherwise it would be very hard to roll back if we hit -EDQUOT on the later OST chgrp. The reserved quota must be held until the OST chgrp is done, otherwise the space could be consumed by others during this window and quota could be exceeded in the end.

Making the reservation from the client is a better choice than making it from the MDT, because it reduces the workload on the MDT and avoids cascading timeouts. More importantly, there are other advantages which can be regarded as additional incentives for this project:

1. Another long-standing quota issue (cached writes can exceed quota) can be fixed at the same time. Similar to grant, the reserve mechanism can be used to enforce quota properly on cached writes.
2. Most of the grant code can be reused. Instead of implementing everything from scratch, the current stable grant code can be reused for quota reserve/consume on OST/client writes.
3. A reliable reclaim mechanism has to be implemented to reclaim and rebalance quota reservations. A side benefit of this reclaim framework is that it can be reused for LDLM lock reclaim (see LU-7266) and grant reclaim. By extending the resource types, it could even be used to implement a whole-filesystem snapshot barrier (for both MDTs and OSTs).

== Rudimentary Design ==

Some rough ideas on how this should be implemented are listed below. The term 'grant' is used to refer to reserved grant or quota, and 'limit' is used to refer to the quota limit allocated to a quota slave.

I. Leverage the grant code to enforce quota over cached writes.

Just like grant, the client can reserve quota to make sure cached writes won't exceed the quota limit. Some changes are required to achieve this:

- A protocol change to the write RPC: a few more sets of 'grant' numbers/flags should be packed in the OST write RPC to support more kinds of 'grant': the original grant, user quota, group quota, project quota, etc. (A purely illustrative sketch of such per-type counters follows this list.)
- There will be two levels of quota reclaim:
  1. The quota master reclaims limit from slaves; we call this master reclaim. This is what we have in the current quota framework.
  2. A quota slave reclaims grant from clients; we call this slave reclaim. This is the new piece we need to implement for reserving quota on the client. Slave reclaim is triggered when a quota slave is running short of grant; it can also be triggered when the slave receives a master reclaim request.
- The client needs to fetch the quota global index (to know which IDs are enforced) from the slave on reconnect, and the slave needs to notify the client when quota is enabled/disabled or a limit is set/cleared for a certain ID. This mechanism will be very similar to the global index sync between slave and master.
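The sketch below is purely illustrative and assumes nothing about the final wire format; all names (qtype, quota_grant, grant_admit_cached_write) are hypothetical, not existing Lustre symbols. It only shows the idea of keeping one grant counter per quota type (plus the original space grant) on the client, so that a cached write is admitted only if every enforced counter has room, mirroring today's grant check.

    /* Hypothetical per-ID grant bookkeeping on the client side;
     * not the actual Lustre data structures. */
    #include <stdbool.h>
    #include <stdint.h>

    enum qtype { GRANT_SPACE, GRANT_USR, GRANT_GRP, GRANT_PRJ, GRANT_NR };

    struct quota_grant {
            uint64_t qg_reserved;   /* grant acquired from the OST/slave */
            uint64_t qg_consumed;   /* grant pinned by cached dirty pages */
            bool     qg_enforced;   /* is this kind of grant/quota enforced? */
    };

    /* A cached write of 'bytes' is admitted only if every enforced counter
     * (space grant, user/group/project quota) has enough room; otherwise
     * the client must acquire more grant or fall back to sync I/O. */
    static bool grant_admit_cached_write(struct quota_grant g[GRANT_NR],
                                         uint64_t bytes)
    {
            int i;

            for (i = 0; i < GRANT_NR; i++) {
                    if (!g[i].qg_enforced)
                            continue;
                    if (g[i].qg_consumed + bytes > g[i].qg_reserved)
                            return false;
            }
            for (i = 0; i < GRANT_NR; i++)
                    if (g[i].qg_enforced)
                            g[i].qg_consumed += bytes;
            return true;
    }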
II. New work flow of chgrp by a non-privileged user.

A chgrp by a non-privileged user can be regarded as a special kind of cached write. The step-by-step description of a chgrp (by a non-privileged user, with quota enabled on the target group) will look like:

- The client tries to reserve enough grant for each OST object before sending the setattr RPC to the MDT. This requires a new RPC to acquire grant from the slave explicitly. The new RPC replies with three kinds of status when used for acquiring grant:
  1. success: the reserved grant is packed in the reply;
  2. -EDQUOT: the ID is over quota, the acquire failed;
  3. -EINPROGRESS: the server is in reclaim; the client will retry until the server replies -EDQUOT or success.
  This new RPC can also be used by the client to report how much grant it holds to the slave on recovery (or should that be packed in obd_connect_data and carried by the connect RPC?).
  Note: To avoid reserve & reclaim thrashing over multiple OST objects, the client should try to preserve the required grant (i.e. make it non-reclaimable) for all involved OSTs before sending the reserve RPCs.
- Once the slave receives the reserve RPC, it checks whether there is enough available grant. If yes, it replies with the requested amount; if not, and there is no chance to reclaim grant or acquire limit, it replies -EDQUOT; otherwise it replies -EINPROGRESS and tries to get more grant in the following order:
  1. Reclaim grant from other clients (slave reclaim).
  2. Acquire limit from the master, which may trigger master reclaim, and even slave reclaim on other slaves.
- On the client, if any request fails to acquire grant, reject the chgrp and return -EDQUOT; otherwise, continue the chgrp.
- Send the setattr RPC to the MDT with 'consumed grant' and 'version' packed; on the MDT side, these two numbers are saved in the OSP setattr log.
  Note: The version is used to detect whether the whole setattr operation spans an OST reboot. It is maintained on the server; each server start generates a new version (the server start time can be used for simplicity). Once a client reconnects to the server, it resyncs its grant state with the server and saves the server's version for later chgrp operations.
- The OSP change log is processed, and 'consumed' and 'version' are sent to the OSTs via setattr RPCs. Once the chgrp is committed on an OST, the OST checks whether the version from the setattr matches its current version: if it matches, 'consumed' is subtracted on the server side; otherwise a server reboot has happened, the grant state has already been rebuilt on the server, and 'consumed' should be ignored. (A minimal sketch of this version check is shown below.)
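The following is a minimal sketch of the OST-side check described above, under the assumption that the reboot 'version' is simply the server start time; the structure and function names are hypothetical, not existing Lustre symbols.

    #include <stdint.h>

    /* Hypothetical per-ID grant state kept on the OST (quota slave). */
    struct slave_grant_state {
            uint64_t sg_version;    /* regenerated at each server start */
            uint64_t sg_granted;    /* grant currently handed out to clients */
    };

    /*
     * Called after the deferred chgrp from the OSP setattr log has been
     * committed on the OST.  'consumed' and 'version' are the values the
     * client packed into the original MDT setattr RPC.
     */
    static void slave_apply_consumed_grant(struct slave_grant_state *sg,
                                           uint64_t consumed, uint64_t version)
    {
            if (version != sg->sg_version) {
                    /* A server reboot happened between the MDT setattr and
                     * the OST setattr; the grant state was already rebuilt
                     * from the clients on reconnect, so 'consumed' must be
                     * ignored to avoid double accounting. */
                    return;
            }

            /* Normal case: the reserved grant was really consumed by the
             * chgrp, release it from the outstanding granted amount. */
            if (consumed > sg->sg_granted)
                    consumed = sg->sg_granted;      /* clamp to avoid underflow */
            sg->sg_granted -= consumed;
    }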
III. Grant reclaim & rebalance

The server needs to notify clients to release grant when it is short of available grant; it also needs to notify clients when quota is enabled/disabled or a limit is set/cleared for a certain ID. To save development effort, this notification mechanism can be implemented by each client holding an artificial global lock on the server, with the server sending glimpse callbacks to notify clients. The same mechanism can also be reused for LDLM lock reclaim or other similar things.

Grant reclaim & rebalance needs to be designed carefully to minimize imbalance and avoid reserve/reclaim thrashing. Following are some basic ideas:

1. A minimal acquire/release size MIN_GRANT (a full RPC size?) to avoid overly fragmented acquire/release.
2. The initial grant for a client should be zero (or MIN_GRANT?), so that grant is not allocated to inactive clients.
3. The client should not try to predict how much grant it will use; let the server decide how much to grant according to the available quantity and whether reclaim is in progress. The server should always try to grant back as much as possible (limited by the client's max_dirty_mb).
4. Slave reclaim should be triggered when the slave fails to grant MIN_GRANT to a client. The last reclaim time and result from each client are kept, so the server knows whether any grant is reclaimable.
5. A minimal reclaim interval could be used to avoid excessive reclaim requests.
6. If there is no reclaimable grant and the slave is still short of grant, acquire more limit from the master.
7. When a client receives a reclaim request, an active client should retain at least MIN_GRANT to avoid sync writes, while an inactive client should relinquish everything it owns.

IV. Misc considerations

- Performance

The client will turn to sync write mode if it has no grant reserved. That should happen only when approaching the quota limit (same as today's behavior) or when grant is imbalanced across clients, so I think we can expect no performance regression in the common cases. For the specific case of chgrp by a non-privileged user (with quota enabled on the target group), it may require several additional RPC round-trips (acquire grant -> slave reclaim -> acquire limit -> master reclaim -> slave reclaim) to reserve quota before sending the setattr RPC; that looks to me like an inevitable compromise for quota correctness.

- Recovery & Resend

Once a client reconnects to a rebooted server, it reports its current grant to the server so the server can rebuild its grant state. Grant acquire & release requests are non-idempotent (because they currently transfer delta values), so server and client can get out of sync if a request is lost or executed twice on the receiver side (because of resend). The problem could be solved by ensuring resend on the sender side and detecting resend on the receiver side, but such a method looks a bit heavy for syncing just a simple number (the grant). It is worth noting that both sides (client & server) would need the complete resend mechanism, because a grant release can be initiated from the server side.

I came up with a much simpler solution: turn the grant acquire & release requests into idempotent ones by transferring total values instead of delta values. Of course, that requires some sort of version to address the out-of-order request issue. The algorithm can be described as follows (a minimal C sketch is shown after this description):

Server maintains two values:
# tot_acq : total acquired grant;
# local_tot_rel : local copy of the total consumed/released grant;

Client maintains two values:
# tot_rel : total consumed/released grant;
# local_tot_acq : local copy of the total acquired grant;

On grant acquire:
# server: tot_acq += delta_acq; send 'tot_acq' to client;
# client: if (tot_acq > local_tot_acq) { local_tot_acq = tot_acq; }

On grant release:
# client: tot_rel += delta_rel; send 'tot_rel' to server;
# server: if (tot_rel > local_tot_rel) { add 'tot_rel' to the pending list; pass 'tot_rel' to the commit callback; }
# commit callback: {
      mark the 'tot_rel' as committed;
      search the pending list for the largest committed value: 'max_committed';
      if (no uncommitted value smaller than 'max_committed') {
          remove all items <= 'max_committed' from the pending list;
          local_tot_rel = max_committed;
      }
  }

Calculate total granted:
# client: client_granted = local_tot_acq - tot_rel
# server: server_granted = tot_acq - local_tot_rel

We can see that 'tot_acq' & 'tot_rel' are implicitly used as a kind of version to detect out-of-order requests, and this guarantees that client_granted is less than or equal to server_granted at any given time. In this manner, client and server can be resynced by any successful acquire/release request, and no bidirectional resend mechanism is required.
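Below is a minimal, self-contained sketch of the server-side release handling just described; the names and the list implementation are hypothetical and only meant to show how the pending list and commit callback keep local_tot_rel monotonic and consistent with what has actually been committed to disk.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct pending_rel {
            uint64_t            pr_tot_rel;   /* total released grant reported by client */
            bool                pr_committed; /* has the transaction committed to disk? */
            struct pending_rel *pr_next;      /* kept sorted by pr_tot_rel, ascending */
    };

    struct grant_srv_state {
            uint64_t            gs_tot_acq;        /* total acquired grant */
            uint64_t            gs_local_tot_rel;  /* committed total released grant */
            struct pending_rel *gs_pending;        /* sorted list of in-flight releases */
    };

    /* Server received a release request carrying the client's 'tot_rel'. */
    static struct pending_rel *grant_release_recv(struct grant_srv_state *gs,
                                                  uint64_t tot_rel)
    {
            struct pending_rel **pp = &gs->gs_pending;
            struct pending_rel  *pr;

            if (tot_rel <= gs->gs_local_tot_rel)
                    return NULL;    /* stale or resent request, safe to ignore */

            pr = calloc(1, sizeof(*pr));
            if (pr == NULL)
                    return NULL;
            pr->pr_tot_rel = tot_rel;

            /* Insert in ascending order of pr_tot_rel. */
            while (*pp != NULL && (*pp)->pr_tot_rel < tot_rel)
                    pp = &(*pp)->pr_next;
            pr->pr_next = *pp;
            *pp = pr;
            return pr;      /* passed to the commit callback below */
    }

    /* Commit callback: advance local_tot_rel to the largest committed value
     * that has no uncommitted value below it. */
    static void grant_release_commit(struct grant_srv_state *gs,
                                     struct pending_rel *pr)
    {
            uint64_t max_committed = gs->gs_local_tot_rel;

            pr->pr_committed = true;

            while (gs->gs_pending != NULL && gs->gs_pending->pr_committed) {
                    struct pending_rel *head = gs->gs_pending;

                    max_committed = head->pr_tot_rel;
                    gs->gs_pending = head->pr_next;
                    free(head);
            }
            gs->gs_local_tot_rel = max_committed;
    }

    /* server_granted = tot_acq - local_tot_rel */
    static uint64_t grant_srv_granted(const struct grant_srv_state *gs)
    {
            return gs->gs_tot_acq - gs->gs_local_tot_rel;
    }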
- Interoperability

Several protocol changes will be introduced:

1. The write/connect RPCs pack more 'grant' numbers;
2. The setattr RPC packs 'consumed grant' & 'version' (a purely illustrative sketch of these additions is shown at the end of this proposal);
3. The OSP setattr log record stores 'consumed grant' & 'version';
4. A new RPC to acquire grant explicitly;
5. Each client always holds a global lock for each server.

- Limitations

1. If an OST reboots after the setattr is done on the MDT but before the setattr is committed on the OST, the 'consumed grant' in the OSP log will be lost, so the user may exceed quota in this timeframe. I don't think it's necessary to introduce extra complexity to address this corner case.
2. For the ldiskfs backend, a rewrite actually doesn't consume any space, but the client always has to reserve grant because it isn't aware of the on-disk information. The same limitation is present in the current grant system.
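Purely as an illustration of protocol changes 2 and 3 above, the sketch below shows how 'consumed grant' and 'version' could be carried in the setattr request and recorded in the OSP setattr log; the structure and field names are hypothetical and do not reflect the actual Lustre wire or llog formats.

    #include <stdint.h>

    /* Hypothetical additions to the setattr request body (change 2). */
    struct setattr_quota_info {
            uint64_t sqi_consumed;  /* grant consumed by this chgrp, per OST object */
            uint64_t sqi_version;   /* slave version (e.g. server start time) the
                                     * reservation was made against */
    };

    /* Hypothetical OSP setattr log record (change 3): the same two values
     * must survive until the deferred OST chgrp is committed, so the OST
     * can compare the version and subtract the consumed grant as described
     * in section II. */
    struct osp_setattr_rec {
            uint32_t osr_uid;       /* new owner */
            uint32_t osr_gid;       /* new group */
            uint64_t osr_consumed;  /* copied from sqi_consumed */
            uint64_t osr_version;   /* copied from sqi_version */
    };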