********************************
* Quota Protocol Specification *
********************************

The intent of this document is to define the new quota protocol to be used in orion.

******************
* 0. Motivations *
******************

The quota architecture used in lustre 2.x is inherited from lustre 1.4. As such, it relies on the LVFS interface and is thus not portable to back-end filesystems other than ldiskfs. Sadly, several new features introduced since lustre 1.6 do not get along with quota at all. For instance, enabling quota is still done on a per-target basis despite the addition of a management node to handle the lustre configuration. Another example is online OST addition, which requires running a full quotacheck and re-executing all setquota commands in order to properly propagate the quota configuration to the new OSTs. All quota commands (i.e. setquota, quota on/off, ...) as well as quota recovery require all targets to be up and running. It is indeed very common to see messages like "Only X/Y OSTs are active, abort quota recovery" in the log, and running a quota command with some unreachable OSTs can have unpredictable effects.

The goal of the new quota architecture described in this document is to address all the shortcomings of the current implementation. As a result, the key points of the new quota infrastructure are the following:

- natively developed on top of the OSD API, and not LVFS.
- leverage the proven scalability of the LDLM for quota communications.
- quota commands can be run with missing slaves.
- quota recovery can complete while some slaves are missing.
- efficient handling of OST/MDT addition.
- quota enforcement on/off is managed globally for the whole filesystem.
- add support for DNE.
- allow per-pool quota in the future (i.e. setting a specific quota limit for each UID/GID in a given pool of targets).
- allow per-directory quota in the future (i.e. limiting the size of a directory).

*******************************
* 1. Quota Space Distribution *
*******************************

Some of the major problems listed above are related to the master not tracking the quota space distribution. Indeed, the quota master stores in the administrative quota files how much quota space is granted globally to all the slaves, but it is not aware of the per-slave quota allocation which is tracked in the operational quota files on the slaves.

To address this problem, the new quota architecture is going to track the per-slave quota consumption on the master. In addition to the usual global index used to store the global soft/hard limit as well as the grace time, the quota master now also maintains one index file per slave. This index file records how much quota space is allocated to the slave for each identifier. Each slave has one such index file per quota type (i.e. one for user and one for group). The slave index includes records for all IDs which have a quota limit enforced. Thanks to the new mechanism described in the next section, slaves can read their personal index file from the master and store a copy on disk. That said, this copy is just used as a cache which is invalidated when a slave is no longer synchronized with the master. In practice, slaves always try to fetch the whole index from the master and overwrite the local copy.

Besides, a pool identifier is also added to the quota infrastructure. Although we only support the default pool (i.e. identifier equal to 0 for both data and metadata) for the time being, this field will allow us to implement per-pool quota in the future without changing the quota protocol again. As a result, the quota tree on the master would look like the following:

└─/
  ├─ROOT
  ├─last_rcvd
  ├─...
  └─quota
    └─0                       /* pool ID, 0 is the default and only one supported for now */
      ├─0                     /* USRQUOTA */
      | ├─admin_quotafile.usr /* global index for user quota */
      | ├─data
      | | ├─0000              /* slave index for OST0000 */
      | | ├─0001              /* slave index for OST0001 */
      | | ├─0002              /* slave index for OST0002 */
      | | ├─0003              /* slave index for OST0003 */
      | | └─0004              /* slave index for OST0004 */
      | └─metadata
      |   ├─0000              /* slave index for MDT0000 */
      |   └─0001              /* slave index for MDT0001 */
      └─1                     /* GRPQUOTA */
        ├─admin_quotafile.grp /* global index for group quota */
        ├─data
        | ├─0000              /* slave index for OST0000 */
        | ├─0001              /* slave index for OST0001 */
        | ├─0002              /* slave index for OST0002 */
        | ├─0003              /* slave index for OST0003 */
        | └─0004              /* slave index for OST0004 */
        └─metadata
          ├─0000              /* slave index for MDT0000 */
          └─0001              /* slave index for MDT0001 */

As far as ldiskfs is concerned, the administrative quota files remain the same and represent the global indexes on disk. Slave index files would be new IAM directories created and populated as slaves connect to the master. For ZFS, new ZAPs will be created to store both the global and per-slave indexes.

A slave index file can be identified uniquely thanks to the following structures:

struct quota_pool {
        __u32 qp_pool_id;    /* only poolid 0 is supported */
        __u16 qp_quota_type; /* USRQUOTA or GRPQUOTA */
        __u16 qp_pool_type;  /* data or metadata */
};

struct slave_idx_file {
        struct quota_pool qi_pool;
        __u32             qi_tgt_idx; /* target index */
        __u32             qi_pad0;
};

It is worth noting that the size of the slave_idx_file structure is 16 bytes, like a FID.
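As a quick illustration, here is a small standalone C sketch that mirrors these two structures, fills in the identifier of a slave index (user quota for OST0002 in the default data pool) and checks the size claim above. The QP_* constants, the mirrored typedefs and the example itself are illustrative placeholders, not actual Lustre symbols.

/*
 * Minimal user-space sketch (not Lustre code): mirrors the structures above
 * to illustrate how a slave index file is identified. The QP_* constants
 * are hypothetical placeholders, not taken from the Lustre tree.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t __u32;
typedef uint16_t __u16;

struct quota_pool {
        __u32 qp_pool_id;    /* only poolid 0 is supported */
        __u16 qp_quota_type; /* USRQUOTA or GRPQUOTA */
        __u16 qp_pool_type;  /* data or metadata */
};

struct slave_idx_file {
        struct quota_pool qi_pool;
        __u32             qi_tgt_idx; /* target index */
        __u32             qi_pad0;
};

/* hypothetical constants for the sketch */
enum { QP_USRQUOTA = 0, QP_GRPQUOTA = 1 };
enum { QP_POOL_DATA = 0, QP_POOL_METADATA = 1 };

int main(void)
{
        /* identify the user-quota slave index of OST0002 in the default pool */
        struct slave_idx_file idx = {
                .qi_pool = { .qp_pool_id    = 0,
                             .qp_quota_type = QP_USRQUOTA,
                             .qp_pool_type  = QP_POOL_DATA },
                .qi_tgt_idx = 2,
        };

        /* the identifier is FID-sized: 8-byte quota_pool + index + padding */
        printf("quota_pool: %zu bytes, slave_idx_file: %zu bytes\n",
               sizeof(struct quota_pool), sizeof(idx));
        return 0;
}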
*********************
* 2. Index Transfer *
*********************

Slaves need a way to fetch the per-slave index stored on the master efficiently. To do this, a new generic mechanism for reading index files over the network is going to be introduced. Such a mechanism could be useful in other areas as well:

* to export the global quota index maintained by the master to clients. This way, we could implement repquota in a very efficient manner.
* to handle directory split in DNE.

The way it works is that the "server" serializes the index into a byte stream which is sent to the "client" (i.e. a quota slave in this case) via a bulk transfer. The client then deserializes the stream and inserts all of the records into an index object locally. Similarly to readdir, the wire format is independent of the on-disk index format. However, unlike readdir, the wire format isn't tied to a specific entry/page format and can be different for each index. Another difference with readdir is that entries are grouped in a container of a fixed size (i.e. 4KB) which doesn't vary with the page size. As a result, depending on the index file (quota slave index, quota global index, ...), different record and container types could be specified.

As far as the quota slave index is concerned, a record would typically have the following format when transferred over the wire:

union quota_id {
        struct lu_fid qid_fid;
        obd_uid       qid_uid;
        obd_gid       qid_gid;
};

struct quota_slv_rec { /* 24 bytes */
        union quota_id qsr_id;
        __u64          qsr_space;
};

The ID structure is a union of all the possible identifier types that can/could be used with quota:

* quota_id::qid_uid is for user quota;
* quota_id::qid_gid is for group quota;
* quota_id::qid_fid is a FID which can be used in the future for per-directory quota (i.e. limiting the size of a directory, the size being du -s dir).

The qsr_space field stores how much space is already granted to the slave for the quota_slv_rec::qsr_id identifier.

As mentioned earlier, records of the slave index are grouped in a 4KB container which is the smallest transfer unit. The record containers are of the following type:

struct quota_slv_container {
        /* 16-byte header */
        __u32    qsc_magic;
        obd_flag qsc_flags;
        __u8     qsc_ver;
        __u8     qsc_nr;
        __u16    qsc_pad0;
        __u32    qsc_pad1;
        /* record array */
        struct quota_slv_rec qsc_recs[0];
};

The header is composed of a magic number, some flags which aren't used for now, a version number in case we need to change this format in the future, the number of records stored in this container and some padding fields to reach the size of 16 bytes; it is followed by the array of records. As a result, a 4KB container can store up to (4096 - 16) / 24 = 170 entries. This means that a 1MB bulk transfer will be able to carry up to 256 * 170 = 43,520 records.
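The arithmetic above can be checked with a small standalone program that mirrors the wire structures (lu_fid and the obd_* typedefs below are simplified stand-ins for the Lustre definitions, not the real ones):

/*
 * Back-of-the-envelope check of the container capacity (standalone C,
 * mirroring the wire structures above).
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t __u64;
typedef uint32_t __u32;
typedef uint16_t __u16;
typedef uint8_t  __u8;
typedef __u32 obd_uid;
typedef __u32 obd_gid;
typedef __u32 obd_flag;

struct lu_fid { __u64 f_seq; __u32 f_oid; __u32 f_ver; };

union quota_id {
        struct lu_fid qid_fid;
        obd_uid       qid_uid;
        obd_gid       qid_gid;
};

struct quota_slv_rec {          /* 24 bytes */
        union quota_id qsr_id;
        __u64          qsr_space;
};

struct quota_slv_container {    /* 16-byte header */
        __u32    qsc_magic;
        obd_flag qsc_flags;
        __u8     qsc_ver;
        __u8     qsc_nr;
        __u16    qsc_pad0;
        __u32    qsc_pad1;
        struct quota_slv_rec qsc_recs[];  /* flexible array ([0] in the spec) */
};

int main(void)
{
        const size_t container = 4096;               /* fixed transfer unit */
        const size_t hdr = sizeof(struct quota_slv_container);
        const size_t rec = sizeof(struct quota_slv_rec);
        const size_t per_container = (container - hdr) / rec;
        const size_t per_bulk = (1024 * 1024 / container) * per_container;

        printf("%zu records per 4KB container, %zu per 1MB bulk\n",
               per_container, per_bulk);             /* 170 and 43520 */
        return 0;
}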
A new RPC type, namely OBD_FETCH_IDX, is going to be introduced to deal with the index network transfer. The format of this new RPC will be as follows:

struct req_format OBD_FETCH_READ =
        DEFINE_REQ_FMT0("OBD_FETCH_READ", idx_read_client, idx_read_server);

static const struct req_msg_field *idx_read_server[] = {
        &RMF_PTLRPC_BODY,
};

static const struct req_msg_field *idx_read_client[] = {
        &RMF_PTLRPC_BODY,
        &RMF_IDX_TR_INFO,
};

struct req_msg_field RMF_IDX_TR_INFO =
        DEFINE_MSGF("idx_tr_info", 0, sizeof(struct idx_tr_info),
                    lustre_swab_idx_tr_info, NULL);

/* Index transfer information */
struct idx_tr_info {
        /* index type */
        __u32 iti_type;
        /* number of 4KB containers */
        __u16 iti_count;
        __u16 iti_pad0;
        /* index to start with */
        __u64 iti_start;
        /* index identifier within a given index type */
        union {
                struct lu_fid         iti_fid;
                struct slave_idx_file iti_quota;
        } u;
};

iti_type defines the index type, which can be set to:

* IDX_QUOTA_SLV: quota slave index file managed by the master;
* IDX_QUOTA_GLB: global index file exported by the quota master, useful for repquota in the future;
* IDX_ACCT_SLV: per-uid/gid accounting information exported by the slave, useful for repquota;
* IDX_DIR: directory index.

Specific record and container structures are associated with each index type. As mentioned above, the quota_slv_container & quota_slv_rec structures are the ones used in conjunction with IDX_QUOTA_SLV. As for IDX_DIR, we would naturally rely on lu_dirpage for the container & lu_dirent for the record.

iti_count is the number of containers that the client would like to read. If multiple bulk transfers are required to read the whole index, iti_start is used to specify the offset index to start with. The last field is a union used to identify the index file to access. For IDX_DIR, a standard FID is used. As for IDX_QUOTA_SLV, a slave_idx_file structure (defined in the previous section) is used to uniquely identify the slave index file.
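To illustrate how the request is filled by a quota slave, here is a standalone sketch where an OST prepares idx_tr_info to read its user-quota slave index in 1MB chunks. The IDX_QUOTA_SLV / QP_* values, the helper name and the 256-container choice are assumptions made for the example only.

/* Standalone sketch; types mirrored from the structures above, IDX_* and
 * QP_* values are illustrative placeholders. */
#include <stdint.h>
#include <string.h>

typedef uint64_t __u64;
typedef uint32_t __u32;
typedef uint16_t __u16;

struct lu_fid { __u64 f_seq; __u32 f_oid; __u32 f_ver; };

struct quota_pool {
        __u32 qp_pool_id;
        __u16 qp_quota_type;
        __u16 qp_pool_type;
};

struct slave_idx_file {
        struct quota_pool qi_pool;
        __u32             qi_tgt_idx;
        __u32             qi_pad0;
};

struct idx_tr_info {
        __u32 iti_type;   /* which index is being read */
        __u16 iti_count;  /* number of 4KB containers requested */
        __u16 iti_pad0;
        __u64 iti_start;  /* container offset to start from */
        union {
                struct lu_fid         iti_fid;
                struct slave_idx_file iti_quota;
        } u;
};

enum { IDX_QUOTA_SLV = 1 };                  /* placeholder value */
enum { QP_USRQUOTA = 0, QP_POOL_DATA = 0 };  /* placeholder values */

/* Fill the request body for one 1MB bulk (256 containers) starting at
 * container offset 'start', reading the user-quota index of OST 'ost_idx'. */
static void idx_fetch_prep(struct idx_tr_info *iti, __u32 ost_idx, __u64 start)
{
        memset(iti, 0, sizeof(*iti));
        iti->iti_type  = IDX_QUOTA_SLV;
        iti->iti_count = 256;            /* 1MB worth of 4KB containers */
        iti->iti_start = start;
        iti->u.iti_quota.qi_pool.qp_pool_id    = 0;  /* default pool only */
        iti->u.iti_quota.qi_pool.qp_quota_type = QP_USRQUOTA;
        iti->u.iti_quota.qi_pool.qp_pool_type  = QP_POOL_DATA;
        iti->u.iti_quota.qi_tgt_idx            = ost_idx;
}

int main(void)
{
        struct idx_tr_info iti;

        /* first bulk of the user-quota index of OST0003 */
        idx_fetch_prep(&iti, 3, 0);
        return 0;
}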
***************************
* 3. Master-Slave Locking *
***************************

As explained in the local enforcement design document, quota slaves (MDTs & OSTs) now set up a real connection to the quota master - although the master can only run on MDT0 for now - instead of using the reverse import. All quota RPCs between the slaves and the master will now use this new connection.

i. Quota Locks
--------------

Thanks to this new connection, slaves are now capable of enqueueing locks. Hence new quota locks are going to be introduced in the architecture and those locks will be the groundwork for all quota communications.

The quota locks actually belong to a new class of lock - namely the target resource lock - used to manage resources (e.g. quota/grant space, locks, permission to send RPCs, ...) allocated by a "server" to a "client" (a client could be a target connecting to another target here). A new lock type is going to be introduced to manage this new class of lock and quota (like grant) is just one component of this new infrastructure. A new LDLM namespace to manage this new type of lock will be exported by each target and a dedicated range of LDLM resource IDs (i.e. ldlm_resource) will be allocated to each resource type (i.e. grant, quota, permission to send RPC, ...). As a result, here is how the namespace will be organized:

* ldlm_res_id::name[0] is used to encode the resource type, which can be:

enum ldlm_tgt_res_type {
        /* target lock bits associated with resource types */
        TGT_RES_GRANT    = 0x0001, /* manage grant space allocation */
        TGT_RES_RPC_LOCK = 0x0002, /* permission to send RPCs & own locks */
        TGT_RES_QUOTA    = 0x0004, /* manage quota enforcement */
};

* ldlm_res_id::name[1] is used for different purposes, depending on the resource type:
  - quota would store a quota_pool structure in ldlm_res_id::name[1] to have a different LDLM resource for each target pool. As mentioned above, we only support quota on the default pool (assumed to be pool ID 0 for both data & metadata) for now, so we would just use 4 different resources for the time being:
    * inode limit enforcement for user;
    * inode limit enforcement for group;
    * block limit enforcement for user;
    * block limit enforcement for group.
  - grant would use two different resources:
    * one for "block" space granted to clients, used for data writeback;
    * one for "inode" space granted to clients which can be used in the future for the metadata writeback cache.
  - locks & RPCs would just use one single resource ID.

* ldlm_res_id::name[2,3] is used by quota to store a quota identifier (namely a quota_id structure), allowing slaves to enqueue a quota lock on a given ID (see the sketch at the end of this subsection).

A new ldlm wire policy will also be defined for this new type of lock without increasing the size of the ldlm_wire_policy_data_t union:

typedef union {
        struct ldlm_extent     l_extent;
        struct ldlm_flock_wire l_flock;
        struct ldlm_inodebits  l_inodebits;
+       union ldlm_tgt_wire    l_tgt;
} ldlm_wire_policy_data_t;

union ldlm_tgt_wire {
        struct ldlm_grant ltw_grant;
        struct ldlm_quota ltw_quota;
        struct ldlm_rpc   ltw_rpc;
};

More information can be found in the design document dedicated to the target resource lock. Let's now focus on the structures associated with the quota locks. There are actually two types of quota locks:

- a global lock called the quota index lock which is used to distribute the list of IDs with quota enforced;
- per-ID quota locks which must be acquired by slaves in order to hold unused quota space for a given ID.
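To make the resource naming concrete, here is a minimal standalone sketch of how such a resource ID could be packed for a per-ID quota lock (block quota for user 500 in the default pool). ldlm_res_id is mirrored as an array of four __u64 names; the QP_* values and the packing helper are illustrative only, not the final implementation.

/* Standalone sketch of the resource ID layout described above. */
#include <stdint.h>
#include <string.h>

typedef uint64_t __u64;
typedef uint32_t __u32;
typedef uint16_t __u16;
typedef __u32 obd_uid;
typedef __u32 obd_gid;

struct lu_fid { __u64 f_seq; __u32 f_oid; __u32 f_ver; };

struct ldlm_res_id { __u64 name[4]; };

struct quota_pool {
        __u32 qp_pool_id;    /* only poolid 0 is supported */
        __u16 qp_quota_type; /* USRQUOTA or GRPQUOTA */
        __u16 qp_pool_type;  /* data or metadata */
};

union quota_id {
        struct lu_fid qid_fid;
        obd_uid       qid_uid;
        obd_gid       qid_gid;
};

enum ldlm_tgt_res_type { TGT_RES_GRANT = 0x0001, TGT_RES_RPC_LOCK = 0x0002,
                         TGT_RES_QUOTA = 0x0004 };
enum { QP_USRQUOTA = 0, QP_POOL_DATA = 0 };      /* illustrative values */

static void quota_res_id_pack(struct ldlm_res_id *res,
                              const struct quota_pool *pool,
                              const union quota_id *qid)
{
        memset(res, 0, sizeof(*res));
        res->name[0] = TGT_RES_QUOTA;
        /* quota_pool is 8 bytes and fits in name[1] */
        memcpy(&res->name[1], pool, sizeof(*pool));
        /* quota_id is 16 bytes (FID-sized) and fits in name[2..3] */
        memcpy(&res->name[2], qid, sizeof(*qid));
}

int main(void)
{
        struct quota_pool pool = { .qp_pool_id    = 0,
                                   .qp_quota_type = QP_USRQUOTA,
                                   .qp_pool_type  = QP_POOL_DATA };
        union quota_id qid = { .qid_uid = 500 };
        struct ldlm_res_id res;

        quota_res_id_pack(&res, &pool, &qid);
        return 0;
}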
ii. Quota Index Lock
--------------------

The quota index lock guarantees that the list of IDs with quota enforced is in sync between the master and the slave. There is one such lock for each quota type, namely user and group (new quota types can be added in the future as done by Fujitsu). The quota index lock is enqueued by the slaves when they are notified by the MGS that quota enforcement is now enabled (as a reminder, space accounting is always active and enforcement is now enabled/disabled globally through the configuration log). Similarly, the index lock is released when enforcement is turned off. If quota enforcement is only enabled for users (resp. groups), then slaves would only enqueue the user (resp. group) index quota lock.

Once the index lock is acquired, the slave can fetch the list of IDs (i.e. the slave index in fact, which also includes the amount of quota space granted for each ID) via a bulk transfer (see section 2 for more information about this mechanism). Moreover, the slave is notified via glimpse callback of any changes made to the list by setquota commands.

iii. Per-ID Locks
-----------------

In addition to the quota index lock, one lock per identifier is also added to manage the allocation of the available quota space. This per-ID lock is used to query, grant and cancel quota on that ID. The rule is that slaves must first acquire the per-ID lock in order to hold unused quota space for a given ID. In practice, the per-ID lock is granted along with QUOTA_DQACQ/REL RPCs. New request buffers will be added to the format of those RPCs to pack a lock enqueue request. Moreover, the master can claim quota space back from slaves by sending either:

* a glimpse callback to ask the slave to release a fraction of the unused quota space, if any;
* a blocking callback to ask the slave to release all the unused quota space and to drop the ID lock.

This mechanism replaces the qunit broadcast which blindly sends an OST_QUOTA_ADJUST_QUNIT RPC to every single OST. This new approach leverages the proven scalability of the LDLM and is more selective since it sends callbacks only to slaves which are likely to hold unused quota space. Besides, while the quota index lock can't be dropped without going through a reintegration cycle (see next section), per-ID locks are added to the LRU list and can be released like any other locks.

iv. Structures associated with the Quota Locks
----------------------------------------------

As noted in section 3.i, a new ldlm_quota structure is going to be introduced to manage quota locks (both index & ID locks). It should be noted that the new structure is passed by the slaves on lock enqueue, but also sent by the master along with a glimpse callback. This ldlm_quota structure is going to be defined as follows:

struct ldlm_quota {
        /* below fields are used on LDLM_ENQUEUE */
        __u32          lq_idx;
        /* below fields are used on LDLM_GL_CALLBACK */
        obd_flag       lq_flags;
        obd_size       lq_qunit;
        union quota_id lq_id;
};

Here is how all those fields will be used:

* lq_idx stores the target index.
* lq_flags informs the slave of the purpose of the glimpse request. For the quota index lock, it is either an ID addition or deletion to/from the list. As for an ID lock, the master might just query the current usage or ask the slave to release some unused space (according to the new qunit value).
* lq_qunit: this field is set to the new qunit value when a glimpse callback is sent on a per-ID lock.
* lq_id: this field stores the ID which is subject to the change when a glimpse callback is sent on a quota index lock.

Moreover, when glimpsing an ID lock, a new LVB structure specific to quota is packed in the glimpse reply:

struct ldlm_quota_lvb {
        obd_size lvb_usage;
        obd_size lvb_count;
        __u64    lvb_pad0;
        __u64    lvb_pad1;
        __u64    lvb_pad2;
};

Those fields are used as follows:

* lvb_usage reports the current disk usage on the slave;
* lvb_count is the amount of quota space released by the slave.
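As an illustration of the glimpse exchange on a per-ID lock, the sketch below shows how a slave could fill this LVB in its glimpse reply: it reports its on-disk usage and releases the unused space above the new qunit. The release policy (keep at most one qunit of spare space) and the helper name are assumptions made for the example; the actual algorithm may differ.

/* Standalone sketch; obd_size is assumed to be a __u64 here. */
#include <stdint.h>

typedef uint64_t __u64;
typedef __u64 obd_size;

struct ldlm_quota_lvb {
        obd_size lvb_usage;  /* current disk usage on the slave */
        obd_size lvb_count;  /* quota space released by the slave */
        __u64    lvb_pad0;
        __u64    lvb_pad1;
        __u64    lvb_pad2;
};

/* Fill the LVB returned in the glimpse reply for a per-ID lock.
 * 'granted' is the space currently owned by the slave for this ID,
 * 'usage' the on-disk usage and 'new_qunit' the qunit value packed in
 * ldlm_quota::lq_qunit by the master. Returns the new granted value. */
static obd_size qsd_glimpse_fill_lvb(struct ldlm_quota_lvb *lvb,
                                     obd_size granted, obd_size usage,
                                     obd_size new_qunit)
{
        obd_size unused  = granted > usage ? granted - usage : 0;
        obd_size release = unused > new_qunit ? unused - new_qunit : 0;

        lvb->lvb_usage = usage;
        lvb->lvb_count = release;     /* space handed back to the master */
        return granted - release;
}

int main(void)
{
        struct ldlm_quota_lvb lvb;

        /* slave owns 4MB, uses 1MB, master shrinks qunit to 1MB:
         * 2MB are handed back, 1MB of spare space is kept */
        qsd_glimpse_fill_lvb(&lvb, 4 << 20, 1 << 20, 1 << 20);
        return 0;
}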
****************************
* 4. Slave (re)Integration *
****************************

The slave reintegration cycle replaces the existing quota recovery. Unlike the current quota recovery, this new process takes place between a master & a slave and does not require the involvement of any other slaves to complete.

i. (re)Integration Procedure
----------------------------

A quota slave is considered connected once it has completed the following procedure:

* Step #1: Enqueue the CR quota index lock:
  - for an MDT, against the following LDLM resource ID:
    [TGT_RES_QUOTA, [poolid=0, type=USR|GRP, pooltype=MD]]
  - as for an OST:
    [TGT_RES_QUOTA, [poolid=0, type=USR|GRP, pooltype=DT]]
  Once the index lock is acquired, the slave will be notified of IDs added/removed to/from the enforced list via glimpse callback.

* Step #2: The slave fetches its private index from the master. This is done by sending an OBD_FETCH_IDX RPC with idx_tr_info::u.iti_quota=[poolid=0, type=USR|GRP, pooltype=MD|DT, tgt_idx=idx]. This index includes the list of IDs with quota enforced as well as how much quota space is granted to this slave for each ID. A copy of this slave index file is stored on disk by the slave.

* Step #3: The slave parses the slave index file and re-acquires quota space. The slave compares how much quota space it owns for each ID with the current on-disk usage and proceeds as follows (a minimal sketch of this decision follows the procedure):
  - if current usage == granted, then there is no need to send any QUOTA_DQACQ RPC to the master.
  - if current usage != granted, then a QUOTA_DQACQ RPC is sent to the master to report the usage. This QUOTA_DQACQ RPC includes a lock enqueue request in case the master decides to grant more space back.
  Those acquire RPCs issued during reintegration carry a special flag to inform the master that the slave is reporting usage and thus cannot fail the request with EDQUOT. This way, only locks for active IDs are re-acquired. The reason why the QUOTA_DQACQ RPC only reports the usage is that there might already be a quota overrun at this point. This can happen if the quota limit was changed while the slave was disconnected.

Once this process is successfully completed, the slave is considered integrated.
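A minimal sketch of the step #3 decision, under the assumption that the special "usage report" flag looks roughly like the QUOTA_FL_REPORT placeholder below (the helper name and flag value are illustrative, and the actual RPC path is not shown):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t __u64;

/* illustrative flag: "this DQACQ only reports usage, don't fail with EDQUOT" */
#define QUOTA_FL_REPORT 0x01

struct qsd_id_state {
        __u64 qis_granted;  /* space granted, as read from the slave index */
        __u64 qis_usage;    /* current on-disk usage for this ID */
};

/* Return true if a QUOTA_DQACQ RPC must be sent for this ID during
 * reintegration, and set *flags accordingly. */
static bool qsd_reint_needs_dqacq(const struct qsd_id_state *st,
                                  unsigned int *flags)
{
        if (st->qis_usage == st->qis_granted)
                return false;              /* nothing to report or re-acquire */

        /* report usage (and pack a lock enqueue request so that the master
         * can grant more space back if it decides to) */
        *flags = QUOTA_FL_REPORT;
        return true;
}

int main(void)
{
        struct qsd_id_state st = { .qis_granted = 0, .qis_usage = 512 };
        unsigned int flags = 0;

        return qsd_reint_needs_dqacq(&st, &flags) ? 0 : 1;
}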
Let's now imagine that the administrator sets a quota limit with setquota for an ID which had no quota limit previously:

* A setquota(ID,limit) quotactl is sent by the client node to the quota master.
* The master starts a transaction, modifies the global index to insert the new limit for this ID and sets the global granted space counter to 0. Then, it inserts a record with key=ID and granted space equal to 0 in each slave index file and finally stops the transaction. The transno associated with this transaction will be packed in the setquota reply later for replay.
* A glimpse with ldlm_quota::lq_id set to the ID subject to the setquota is prepared.
* A glimpse callback is issued to all the slaves which are currently connected to the quota master.
* Slaves immediately acknowledge the glimpse callback and insert a record for this ID in the local copy of the slave index file.
* When the master receives the glimpse callback reply, it replies to the client with the transno for setquota replay.
* Meanwhile, slaves which are now aware that quota is enforced for that ID send a QUOTA_DQACQ request to the master to acquire quota space up to the current on-disk usage. The purpose of this acquire RPC is to report current usage for this ID to the master as soon as possible, before any available quota space is granted to other slaves. This acquire RPC might include an ID lock enqueue request if the slave already has requests in flight waiting for quota space for this ID. Otherwise, no lock request is packed and a new DQACQ/REL RPC including a lock request will be issued by the slave next time the on-disk usage might be modified by an operation (write/truncate/create/unlink).

Let's now consider a slave that wasn't connected to the master at the time of the setquota and thus couldn't participate in the process described above. This slave will go through the reintegration cycle and realize that it does not own any space for the ID that was just added to its index file. It will thus issue a QUOTA_DQACQ request up to its on-disk usage.

Another tricky case to consider is when quota enforcement is disabled for some time while the filesystem is altered. Once quota enforcement is re-enabled, the disk usage on some slaves can be much higher than the quota space they were granted. The reintegration procedure will also take care of this case.
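The sketch below shows, for the walkthrough above, how the master could fill the ldlm_quota descriptor packed in the glimpse callback sent on the quota index lock when a new ID gets a limit. The LQF_* flag values and the helper name are illustrative placeholders; only the descriptor itself is defined in this document.

/* Standalone sketch of the glimpse descriptor prepared by the master. */
#include <stdint.h>
#include <string.h>

typedef uint64_t __u64;
typedef uint32_t __u32;
typedef __u32 obd_flag;
typedef __u64 obd_size;
typedef __u32 obd_uid;
typedef __u32 obd_gid;

struct lu_fid { __u64 f_seq; __u32 f_oid; __u32 f_ver; };

union quota_id {
        struct lu_fid qid_fid;
        obd_uid       qid_uid;
        obd_gid       qid_gid;
};

struct ldlm_quota {
        __u32          lq_idx;    /* used on LDLM_ENQUEUE */
        obd_flag       lq_flags;  /* used on LDLM_GL_CALLBACK */
        obd_size       lq_qunit;
        union quota_id lq_id;
};

enum { LQF_ID_ADD = 0x1, LQF_ID_DEL = 0x2 };  /* illustrative flag values */

/* Prepare the glimpse descriptor announcing that user 'uid' now has a
 * quota limit enforced, as in the setquota walkthrough above. */
static void qmt_glimpse_id_add(struct ldlm_quota *lq, obd_uid uid)
{
        memset(lq, 0, sizeof(*lq));
        lq->lq_flags      = LQF_ID_ADD;
        lq->lq_id.qid_uid = uid;
}

int main(void)
{
        struct ldlm_quota lq;

        qmt_glimpse_id_add(&lq, 500);  /* announce that uid 500 is now enforced */
        return 0;
}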
ii. Online OST addition
-----------------------

When a new OST is added while the filesystem is online, this OST follows the same integration procedure as described in the previous section, except that this new slave has no index file created yet. The master thus generates one on the fly from the global index file. One record with granted space set to 0 is added for each ID having quota enforced. Then this brand new index is transferred to the new OST which is now aware of the list of IDs having quota enforced, but won't acquire any space from the master yet since it is supposed to be empty.

iii. Slave Eviction
-------------------

A slave eviction can happen when a glimpse callback on a quota lock is not acknowledged in a timely manner. From the master's perspective, the slave has just gone "disconnected". This means that new IDs can be added/removed to/from the slave index without issuing glimpse callbacks. This also means that quota space granted to this slave cannot be claimed back until the slave reconnects.

From the slave's point of view, the quota locks must be re-enqueued as soon as possible. Meanwhile, the slave can continue to operate with the on-disk copy of the index although it can become stale. If one ID runs out of quota space on a disconnected slave, the requests must be failed with -EINPROGRESS until the slave is successfully reintegrated. Once the slave reconnects and completes quota recovery, all in-memory structures as well as the on-disk copy of the index are cleaned up and recreated with fresh data fetched from the master.

******************************************
* 5. Quota Space Allocation & Revocation *
******************************************

i. Space Acquisition & Release
------------------------------

As mentioned earlier, we will continue using the QUOTA_DQACQ & QUOTA_DQREL RPCs for quota space acquisition & release. The format of those RPCs will be modified to support the new generic ID type (i.e. quota_id) and also to pack a lock enqueue request. The RQF_QUOTA format will be used for both QUOTA_DQACQ/REL and will be defined as follows:

struct req_format RQF_QUOTA =
        DEFINE_REQ_FMT0("QUOTA_DQACQREL", quota_request, quota_reply);

static const struct req_msg_field *quota_request[] = {
        &RMF_PTLRPC_BODY,
        &RMF_QUOTA_BODY,
        &RMF_DLM_REQ
};

static const struct req_msg_field *quota_reply[] = {
        &RMF_PTLRPC_BODY,
        &RMF_QUOTA_BODY,
        &RMF_DLM_REP
};

struct req_msg_field RMF_QUOTA_BODY =
        DEFINE_MSGF("quota_body", 0, sizeof(struct quota_body),
                    lustre_swab_quota_body, NULL);

And a new quota_body will be defined as follows:

struct quota_body {
        struct quota_pool    qb_pool;
        __u32                qb_tgt_idx;
        union quota_id       qb_id;
        obd_flag             qb_flags;
        __u32                qb_padding;
        struct lustre_handle qb_handle;
        obd_size             qb_count;
        obd_size             qb_qunit;
        __u64                qb_pad0;
        __u64                qb_pad1;
};

The size of the new quota body is 80 bytes (assuming natural 8-byte alignment of the 16-byte quota_id union). Each field will be used as follows:

* qb_pool identifies the target pool on the master;
* qb_tgt_idx is the target index (please note that the pool type in qb_pool already determines whether this is an MDT or an OST);
* qb_id is the identifier concerned by the request;
* qb_flags packs some request flags, like:
  - whether this is quota space acquisition for existing on-disk usage or additional space requested to process incoming requests;
  - whether there is a lock enqueue request packed in this RPC.
* qb_handle is set to the lock handle of the ID lock owned by the server. This field must be set if no lock enqueue is packed in the RPC and if the slave isn't reporting initial usage.
* qb_count is the amount of space the slave would like to acquire/release/report.
* qb_qunit is set to the current qunit value, packed by the master in the reply.
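To make the use of quota_body more concrete, here is a standalone sketch of a slave preparing a QUOTA_DQACQ request body. The QB_FL_* / QP_* values and the helper name are assumptions made for the example; the real flag definitions are outside the scope of this document.

/* Standalone sketch; types mirrored from the structures above. */
#include <stdint.h>
#include <string.h>

typedef uint64_t __u64;
typedef uint32_t __u32;
typedef uint16_t __u16;
typedef __u32 obd_flag;
typedef __u64 obd_size;
typedef __u32 obd_uid;
typedef __u32 obd_gid;

struct lu_fid { __u64 f_seq; __u32 f_oid; __u32 f_ver; };
struct lustre_handle { __u64 cookie; };

struct quota_pool {
        __u32 qp_pool_id;
        __u16 qp_quota_type;
        __u16 qp_pool_type;
};

union quota_id {
        struct lu_fid qid_fid;
        obd_uid       qid_uid;
        obd_gid       qid_gid;
};

struct quota_body {
        struct quota_pool    qb_pool;
        __u32                qb_tgt_idx;
        union quota_id       qb_id;
        obd_flag             qb_flags;
        __u32                qb_padding;
        struct lustre_handle qb_handle;
        obd_size             qb_count;
        obd_size             qb_qunit;
        __u64                qb_pad0;
        __u64                qb_pad1;
};

/* illustrative flag and constant values */
enum { QB_FL_ACQ = 0x1, QB_FL_REPORT = 0x2, QB_FL_ENQUEUE = 0x4 };
enum { QP_USRQUOTA = 0, QP_POOL_DATA = 0 };

/* Acquire 'count' bytes of block space for user 'uid' on behalf of the OST
 * with index 'tgt'. 'lockh' is the handle of the per-ID lock already held,
 * if any; otherwise a lock enqueue request is packed alongside the body
 * (the DLM buffers themselves are not shown here). */
static void qsd_dqacq_prep(struct quota_body *qb, __u32 tgt, obd_uid uid,
                           obd_size count, const struct lustre_handle *lockh)
{
        memset(qb, 0, sizeof(*qb));
        qb->qb_pool.qp_pool_id    = 0;           /* default pool only */
        qb->qb_pool.qp_quota_type = QP_USRQUOTA;
        qb->qb_pool.qp_pool_type  = QP_POOL_DATA;
        qb->qb_tgt_idx            = tgt;
        qb->qb_id.qid_uid         = uid;
        qb->qb_count              = count;
        qb->qb_flags              = QB_FL_ACQ;
        if (lockh != NULL)
                qb->qb_handle = *lockh;          /* per-ID lock already held */
        else
                qb->qb_flags |= QB_FL_ENQUEUE;   /* pack a lock enqueue request */
}

int main(void)
{
        struct quota_body qb;

        /* OST0002 asks for 4MB of block space for uid 500, no lock held yet */
        qsd_dqacq_prep(&qb, 2, 500, 4 << 20, NULL);
        return 0;
}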
ii. Rebalancing
---------------

By using the lustre DLM, quota has now gained a reliable mechanism to revoke quota space granted to slaves. All connected slaves that potentially own available quota space for a given ID must have a quota lock on that ID. The master can thus issue glimpse or blocking callbacks on this ID lock in order to claim quota space back.

Glimpse callbacks are sent by the master to ask slaves to release a fraction of the unused quota space. This replaces the existing qunit broadcast mechanism which isn't scalable. The glimpse callback includes the new qunit value and slaves are supposed to release unused quota space above this new threshold. The quota space is released in the glimpse reply. At this point, the slave can also decide to release all the quota space it owns (e.g. if it has not seen any request for this ID for a while) and to pack the cancellation of the ID lock in the glimpse reply (as done with extent locks).

When the amount of available quota space on the master reaches a critical level, the master sends blocking callbacks as a last resort to get all the available quota space back. At this point, slaves will only be granted a small amount of quota space and will be asked to release the spare space as soon as the request is completed (done by granting blocking ID locks).

******************
* 6. DNE support *
******************

Inode quota will be managed in the same way as block quota. All MDTs will be considered as slaves and will acquire quota space from the master. It is worth mentioning that MDTs won't acquire block space any more and will only deal with inode usage & limits. Since ZFS does not maintain per-UID/GID inode usage, there won't be any inode quota support with the ZFS OSD.

********************************
* 7. Upgrade and Compatibility *
********************************

i. Interoperability
-------------------

It is worth noting that only the master/slave protocol is concerned by this change. As a result, the client/server protocol isn't modified and interoperability with old clients is thus preserved. Besides, orion-based servers would only understand the new quota protocol and won't be able to fall back to the old quota protocol if one of their peers hasn't been upgraded to an orion-based release yet. This means that there is no interoperability support with old servers and all the servers thus have to be upgraded together.

ii. On-disk compatibility
-------------------------

To upgrade from a non-orion-based release, a write-conf procedure will have to be done in order to generate the new MDP records in the configuration log. In addition, a 'tunefs.lustre --quota' command will have to be run against all the targets. On slaves, this command runs tune2fs which performs a quotacheck and updates some superblock fields to allow e2fsck to fix accounting information when run. On the master, this tunefs.lustre command also creates the tree detailed in section 1 and copies the operational quota files inside this tree. While the global limits and grace times won't be modified, the total amount of granted space will be reset to 0.

Once this is done, the filesystem can be started. An lctl conf_param command must be run on the MGS to enable quota. This will notify all the targets that quota enforcement is now on and slaves will start the integration procedure. Since there is no slave index created yet, a new slave index will be created automatically for each target, which will then report & acquire quota space according to the on-disk usage. Once the integration of all the slaves is completed, the upgrade is done and the filesystem should be usable again.

As for downgrade, the original operational and administrative quota files will remain untouched and can be re-used if downgrading to a non-orion-based release. A full quotacheck will be required. Moreover, all setquota commands run since the upgrade will be lost and would have to be re-executed.