Proposal to fix LU-5152 and others

== Introduction ==

There is a known quota issue (LU-5152) which can be described briefly as: if a user belongs to two groups, quota is not enforced properly when the user changes a file's group from one to the other. This is a long-standing issue present since the day Lustre quota was introduced, but it did not attract much attention until recently, when several customers reported it. I think it is now time to offer a feasible solution.

The hard part of the problem is enforcing block quota for the chgrp operation, because chgrp is a two-step operation: the group is changed on the MDT object first, and the change to the OST objects is deferred to the OSP sync thread. We therefore must reserve enough quota on the OSTs before the chgrp is applied to the MDT object; otherwise it would be very hard to roll back if we hit -EDQUOT on the later OST chgrp. The reserved quota must be held until the OST chgrp is done, otherwise the space could be consumed by others during this window and quota could be exceeded in the end.

Making the reservation from the client is a better choice than making it from the MDT, because it reduces the workload on the MDT and avoids cascading timeouts. More importantly, there are other advantages which can be regarded as additional incentives for this project:

1. Another long-standing quota issue (cached writes can exceed quota) can be fixed at the same time. Similar to grant, the reserve mechanism can be used to enforce quota properly on cached writes.
2. Most of the grant code can be reused. Instead of implementing everything from scratch, the current stable grant code can be reused for quota reserve/consume on OST/client writes.
3. A reliable reclaim mechanism has to be implemented to reclaim and rebalance quota reservations. A side benefit of this reclaim framework is that it can be reused for LDLM lock reclaim (see LU-7266) and grant reclaim. By extending the resource types, it could even be used to implement a whole-filesystem snapshot barrier (for both MDTs and OSTs).

== Rudimentary Design ==

Some rough ideas on how this should be implemented are listed below. The term 'grant' is used to refer to reserved grant or quota, and 'limit' is used to refer to the quota limit allocated to a quota slave.

I. Leverage the grant code to enforce quota over cached writes.

Just like grant, the client can reserve quota to make sure cached writes won't exceed the quota limit. Some changes are required to achieve this:

- A protocol change to the write RPC: a few more sets of 'grant' numbers/flags should be packed in the OST write RPC to support more kinds of 'grant': the original grant, user quota, group quota, project quota, etc. (A purely illustrative sketch of such per-type counters follows this list.)
- There will be two levels of quota reclaim:
  1. The quota master reclaims limit from slaves; we call this master reclaim. This is what we have in the current quota framework.
  2. A quota slave reclaims grant from clients; we call this slave reclaim. This is the new piece we need to implement for reserving quota on the client. Slave reclaim is triggered when a quota slave is running short of grant; it can also be triggered when the slave receives a master reclaim request.
- The client needs to fetch the quota global index (to know which IDs are enforced) from the slave on reconnect, and the slave needs to notify the client when quota is enabled/disabled or a limit is set/cleared for a certain ID. This mechanism will be very similar to the global index sync between slave and master.
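The sketch below is purely illustrative and assumes nothing about the final wire format; all names (qtype, quota_grant, grant_admit_cached_write) are hypothetical, not existing Lustre symbols. It only shows the idea of keeping one grant counter per quota type (plus the original space grant) on the client, so that a cached write is admitted only if every enforced counter has room, mirroring today's grant check.

    /* Hypothetical per-ID grant bookkeeping on the client side;
     * not the actual Lustre data structures. */
    #include <stdbool.h>
    #include <stdint.h>

    enum qtype { GRANT_SPACE, GRANT_USR, GRANT_GRP, GRANT_PRJ, GRANT_NR };

    struct quota_grant {
            uint64_t qg_reserved;   /* grant acquired from the OST/slave */
            uint64_t qg_consumed;   /* grant pinned by cached dirty pages */
            bool     qg_enforced;   /* is this kind of grant/quota enforced? */
    };

    /* A cached write of 'bytes' is admitted only if every enforced counter
     * (space grant, user/group/project quota) has enough room; otherwise
     * the client must acquire more grant or fall back to sync I/O. */
    static bool grant_admit_cached_write(struct quota_grant g[GRANT_NR],
                                         uint64_t bytes)
    {
            int i;

            for (i = 0; i < GRANT_NR; i++) {
                    if (!g[i].qg_enforced)
                            continue;
                    if (g[i].qg_consumed + bytes > g[i].qg_reserved)
                            return false;
            }
            for (i = 0; i < GRANT_NR; i++)
                    if (g[i].qg_enforced)
                            g[i].qg_consumed += bytes;
            return true;
    }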
II. New work flow of chgrp by a non-privileged user.

A chgrp by a non-privileged user can be regarded as a special kind of cached write. The step-by-step description of a chgrp (by a non-privileged user, with quota enabled on the target group) will look like:

- The client tries to reserve enough grant for each OST object before sending the setattr RPC to the MDT. This requires a new RPC to acquire grant from the slave explicitly. The new RPC replies with three kinds of status when used for acquiring grant:
  1. success: the reserved grant is packed in the reply;
  2. -EDQUOT: the ID is over quota, the acquire failed;
  3. -EINPROGRESS: the server is in reclaim; the client will retry until the server replies -EDQUOT or success.
  This new RPC can also be used by the client to report how much grant it holds to the slave on recovery (or should that be packed in obd_connect_data and carried by the connect RPC?).
  Note: To avoid reserve & reclaim thrashing over multiple OST objects, the client should try to preserve the required grant (i.e. make it non-reclaimable) for all involved OSTs before sending the reserve RPCs.
- Once the slave receives the reserve RPC, it checks whether there is enough available grant. If yes, it replies with the requested amount; if not, and there is no chance to reclaim grant or acquire limit, it replies -EDQUOT; otherwise it replies -EINPROGRESS and tries to get more grant in the following order:
  1. Reclaim grant from other clients (slave reclaim).
  2. Acquire limit from the master, which may trigger master reclaim, and even slave reclaim on other slaves.
- On the client, if any request fails to acquire grant, reject the chgrp and return -EDQUOT; otherwise, continue the chgrp.
- Send the setattr RPC to the MDT with 'consumed grant' and 'version' packed; on the MDT side, these two numbers are saved in the OSP setattr log.
  Note: The version is used to detect whether the whole setattr operation spans an OST reboot. It is maintained on the server; each server start generates a new version (the server start time can be used for simplicity). Once a client reconnects to the server, it resyncs its grant state with the server and saves the server's version for later chgrp operations.
- The OSP change log is processed, and 'consumed' and 'version' are sent to the OSTs via setattr RPCs. Once the chgrp is committed on an OST, the OST checks whether the version from the setattr matches its current version: if it matches, 'consumed' is subtracted on the server side; otherwise a server reboot has happened, the grant state has already been rebuilt on the server, and 'consumed' should be ignored. (A minimal sketch of this version check is shown below.)
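The following is a minimal sketch of the OST-side check described above, under the assumption that the reboot 'version' is simply the server start time; the structure and function names are hypothetical, not existing Lustre symbols.

    #include <stdint.h>

    /* Hypothetical per-ID grant state kept on the OST (quota slave). */
    struct slave_grant_state {
            uint64_t sg_version;    /* regenerated at each server start */
            uint64_t sg_granted;    /* grant currently handed out to clients */
    };

    /*
     * Called after the deferred chgrp from the OSP setattr log has been
     * committed on the OST.  'consumed' and 'version' are the values the
     * client packed into the original MDT setattr RPC.
     */
    static void slave_apply_consumed_grant(struct slave_grant_state *sg,
                                           uint64_t consumed, uint64_t version)
    {
            if (version != sg->sg_version) {
                    /* A server reboot happened between the MDT setattr and
                     * the OST setattr; the grant state was already rebuilt
                     * from the clients on reconnect, so 'consumed' must be
                     * ignored to avoid double accounting. */
                    return;
            }

            /* Normal case: the reserved grant was really consumed by the
             * chgrp, release it from the outstanding granted amount. */
            if (consumed > sg->sg_granted)
                    consumed = sg->sg_granted;      /* clamp to avoid underflow */
            sg->sg_granted -= consumed;
    }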
III. Grant reclaim & rebalance

The server needs to notify clients to release grant when it is short of available grant; it also needs to notify clients when quota is enabled/disabled or a limit is set/cleared for a certain ID. To save development effort, this notification mechanism can be implemented by each client holding an artificial global lock on the server, with the server sending glimpse callbacks to notify clients. The same mechanism can also be reused for LDLM lock reclaim or other similar things.

Grant reclaim & rebalance needs to be designed carefully to minimize imbalance and avoid reserve/reclaim thrashing. Following are some basic ideas:

1. A minimal acquire/release size MIN_GRANT (a full RPC size?) to avoid overly fragmented acquire/release.
2. The initial grant for a client should be zero (or MIN_GRANT?), so that grant is not allocated to inactive clients.
3. The client should not try to predict how much grant it will use; let the server decide how much to grant according to the available quantity and whether reclaim is in progress. The server should always try to grant back as much as possible (limited by the client's max_dirty_mb).
4. Slave reclaim should be triggered when the slave fails to grant MIN_GRANT to a client. The last reclaim time and result from each client are kept, so the server knows whether any grant is reclaimable.
5. A minimal reclaim interval could be used to avoid excessive reclaim requests.
6. If there is no reclaimable grant and the slave is still short of grant, acquire more limit from the master.
7. When a client receives a reclaim request, an active client should retain at least MIN_GRANT to avoid sync writes, while an inactive client should relinquish everything it owns.

IV. Misc considerations

- Performance

The client will turn to sync write mode if it has no grant reserved. That should happen only when approaching the quota limit (same as today's behavior) or when grant is imbalanced across clients, so I think we can expect no performance regression in the common cases. For the specific case of chgrp by a non-privileged user (with quota enabled on the target group), it may require several additional RPC round-trips (acquire grant -> slave reclaim -> acquire limit -> master reclaim -> slave reclaim) to reserve quota before sending the setattr RPC; that looks to me like an inevitable compromise for quota correctness.

- Recovery & Resend

Once a client reconnects to a rebooted server, it reports its current grant to the server so the server can rebuild its grant state. Grant acquire & release requests are non-idempotent (because they currently transfer delta values), so server and client can get out of sync if a request is lost or executed twice on the receiver side (because of resend). The problem could be solved by ensuring resend on the sender side and detecting resend on the receiver side, but such a method looks a bit heavy for syncing just a simple number (the grant). It is worth noting that both sides (client & server) would need the complete resend mechanism, because a grant release can be initiated from the server side.

I came up with a much simpler solution: turn the grant acquire & release requests into idempotent ones by transferring total values instead of delta values. Of course, that requires some sort of version to address the out-of-order request issue. The algorithm can be described as follows (a minimal C sketch is shown after this description):

Server maintains two values:
# tot_acq : total acquired grant;
# local_tot_rel : local copy of the total consumed/released grant;

Client maintains two values:
# tot_rel : total consumed/released grant;
# local_tot_acq : local copy of the total acquired grant;

On grant acquire:
# server: tot_acq += delta_acq; send 'tot_acq' to client;
# client: if (tot_acq > local_tot_acq) { local_tot_acq = tot_acq; }

On grant release:
# client: tot_rel += delta_rel; send 'tot_rel' to server;
# server: if (tot_rel > local_tot_rel) { add 'tot_rel' to the pending list; pass 'tot_rel' to the commit callback; }
# commit callback: {
      mark the 'tot_rel' as committed;
      search the pending list for the largest committed value: 'max_committed';
      if (no uncommitted value smaller than 'max_committed') {
          remove all items <= 'max_committed' from the pending list;
          local_tot_rel = max_committed;
      }
  }

Calculate total granted:
# client: client_granted = local_tot_acq - tot_rel
# server: server_granted = tot_acq - local_tot_rel

We can see that 'tot_acq' & 'tot_rel' are implicitly used as a kind of version to detect out-of-order requests, and this guarantees that client_granted is less than or equal to server_granted at any given time. In this manner, client and server can be resynced by any successful acquire/release request, and no bidirectional resend mechanism is required.
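Below is a minimal, self-contained sketch of the server-side release handling just described; the names and the list implementation are hypothetical and only meant to show how the pending list and commit callback keep local_tot_rel monotonic and consistent with what has actually been committed to disk.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct pending_rel {
            uint64_t            pr_tot_rel;   /* total released grant reported by client */
            bool                pr_committed; /* has the transaction committed to disk? */
            struct pending_rel *pr_next;      /* kept sorted by pr_tot_rel, ascending */
    };

    struct grant_srv_state {
            uint64_t            gs_tot_acq;        /* total acquired grant */
            uint64_t            gs_local_tot_rel;  /* committed total released grant */
            struct pending_rel *gs_pending;        /* sorted list of in-flight releases */
    };

    /* Server received a release request carrying the client's 'tot_rel'. */
    static struct pending_rel *grant_release_recv(struct grant_srv_state *gs,
                                                  uint64_t tot_rel)
    {
            struct pending_rel **pp = &gs->gs_pending;
            struct pending_rel  *pr;

            if (tot_rel <= gs->gs_local_tot_rel)
                    return NULL;    /* stale or resent request, safe to ignore */

            pr = calloc(1, sizeof(*pr));
            if (pr == NULL)
                    return NULL;
            pr->pr_tot_rel = tot_rel;

            /* Insert in ascending order of pr_tot_rel. */
            while (*pp != NULL && (*pp)->pr_tot_rel < tot_rel)
                    pp = &(*pp)->pr_next;
            pr->pr_next = *pp;
            *pp = pr;
            return pr;      /* passed to the commit callback below */
    }

    /* Commit callback: advance local_tot_rel to the largest committed value
     * that has no uncommitted value below it. */
    static void grant_release_commit(struct grant_srv_state *gs,
                                     struct pending_rel *pr)
    {
            uint64_t max_committed = gs->gs_local_tot_rel;

            pr->pr_committed = true;

            while (gs->gs_pending != NULL && gs->gs_pending->pr_committed) {
                    struct pending_rel *head = gs->gs_pending;

                    max_committed = head->pr_tot_rel;
                    gs->gs_pending = head->pr_next;
                    free(head);
            }
            gs->gs_local_tot_rel = max_committed;
    }

    /* server_granted = tot_acq - local_tot_rel */
    static uint64_t grant_srv_granted(const struct grant_srv_state *gs)
    {
            return gs->gs_tot_acq - gs->gs_local_tot_rel;
    }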
- Interoperability

Several protocol changes will be introduced:

1. The write/connect RPCs pack more 'grant' numbers;
2. The setattr RPC packs 'consumed grant' & 'version' (a purely illustrative sketch of these additions is shown at the end of this proposal);
3. The OSP setattr log record stores 'consumed grant' & 'version';
4. A new RPC to acquire grant explicitly;
5. Each client always holds a global lock for each server.

- Limitations

1. If an OST reboots after the setattr is done on the MDT but before the setattr is committed on the OST, the 'consumed grant' in the OSP log will be lost, so the user may exceed quota in this timeframe. I don't think it's necessary to introduce extra complexity to address this corner case.
2. For the ldiskfs backend, a rewrite actually doesn't consume any space, but the client always has to reserve grant because it isn't aware of the on-disk information. The same limitation is present in the current grant system.
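Purely as an illustration of protocol changes 2 and 3 above, the sketch below shows how 'consumed grant' and 'version' could be carried in the setattr request and recorded in the OSP setattr log; the structure and field names are hypothetical and do not reflect the actual Lustre wire or llog formats.

    #include <stdint.h>

    /* Hypothetical additions to the setattr request body (change 2). */
    struct setattr_quota_info {
            uint64_t sqi_consumed;  /* grant consumed by this chgrp, per OST object */
            uint64_t sqi_version;   /* slave version (e.g. server start time) the
                                     * reservation was made against */
    };

    /* Hypothetical OSP setattr log record (change 3): the same two values
     * must survive until the deferred OST chgrp is committed, so the OST
     * can compare the version and subtract the consumed grant as described
     * in section II. */
    struct osp_setattr_rec {
            uint32_t osr_uid;       /* new owner */
            uint32_t osr_gid;       /* new group */
            uint64_t osr_consumed;  /* copied from sqi_consumed */
            uint64_t osr_version;   /* copied from sqi_version */
    };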