Quota Local Enforcement

Section One: Quota Enforcement in OSD

Quota enforcement will be implemented in the OSD layer; enforcement is done by Lustre itself rather than by the underlying filesystem.

1.1 Quota Context

The quota context holds the information needed for quota enforcement; each quota slave has its own quota context. A pointer in osd_device (osd_device->od_qctxt) gives the OSD access to the quota context, and all quota contexts on the same node are linked into a global list (lquota_context_list) of the quota module.

struct lquota_ctxt {
        cfs_list_t         lqc_link;
        char               lqc_name[MTI_NAME_MAXLEN];
        struct obd_device *lqc_obd;      /* LQS device */
        struct dt_object  *lqc_acct_usr; /* user accounting object */
        struct dt_object  *lqc_acct_grp; /* group accounting object */
        int                lqc_type;
        /* other information, could be put in quota lock:
         * pending bytes/inodes;
         * quota unit related data;
         * in-flight dqacq/dqrel;
         */
};

lqc_link: link into the global list of the quota module;
lqc_name: the corresponding OSD name, used to identify the quota context;
lqc_obd: LQS device (see Section Two);
lqc_acct_usr: user accounting object;
lqc_acct_grp: group accounting object;
lqc_type: bitmask of which quota types are enabled:
          bit 0: user quota;
          bit 1: group quota;

1.1.1 Create Quota Context

On OSD initialization (osd_device_init0()), lquota_ctxt_create() will be called to create the quota context and set the context pointer in od_qctxt.

struct lquota_ctxt *lquota_ctxt_create(char *name, char *pool)
@name: OSD name, which is used to identify the quota context;
return value: quota context on success, appropriate error on failure;

This function creates the quota context, opens the quota accounting objects, and links the created context into the global list of the quota module.

1.1.2 Destroy Quota Context

On OSD finalization (osd_device_fini()), lquota_ctxt_destroy() will be called to free the quota context stored in od_qctxt.

void lquota_ctxt_destroy(struct lquota_ctxt *qctxt)
@qctxt: the quota context to be destroyed;
return value: none;

This function unlinks the quota context from the global list of the quota module, closes the quota accounting objects, and frees the context at the end. If the context has an LQS device attached (lqc_obd isn't NULL), the LQS device cleanup will be done as well.
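The following is a minimal sketch of where the two lifecycle calls could be hooked into the OSD, based on the description above. The helper names osd_quota_setup()/osd_quota_cleanup(), the use of od_svname for the OSD name, and the error handling are assumptions for illustration only, not part of this design.

/* Hypothetical helpers showing how an OSD could hook the quota context
 * lifecycle into osd_device_init0()/osd_device_fini(). */
static int osd_quota_setup(struct osd_device *osd, char *pool)
{
        struct lquota_ctxt *qctxt;

        /* called from osd_device_init0(): create and publish the context;
         * the field name od_svname is an assumption for the OSD name */
        qctxt = lquota_ctxt_create(osd->od_svname, pool);
        if (IS_ERR(qctxt))
                return PTR_ERR(qctxt);

        osd->od_qctxt = qctxt;
        return 0;
}

static void osd_quota_cleanup(struct osd_device *osd)
{
        /* called from osd_device_fini(): unlink the context from the
         * global list, close the accounting objects and free it (this
         * also cleans up an attached LQS device, see 1.1.2) */
        if (osd->od_qctxt != NULL) {
                lquota_ctxt_destroy(osd->od_qctxt);
                osd->od_qctxt = NULL;
        }
}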
1.2 Enforce Quota in OSD

The OSD is now responsible for the quota enforcement work. In the operation declare stage (osd_declare_xxx()), the OSD first estimates the required bytes or inodes, then calls lquota_op_begin() to verify whether there is enough quota:

- If there is enough quota, proceed with the operation;
- If there isn't enough quota, fail the operation with -EDQUOT;
- For a write operation, if a retry-able error is returned when acquiring quota from the master, return -EINPROGRESS to tell the client to retry the write;
- For other operations or other errors, fail the operation with an appropriate error.

After the operation transaction has stopped (osd_trans_stop()), lquota_op_end() will be called to do the quota enforcement post work:

- Release the pending bytes/inodes recorded in the declare stage;
- Pre-acquire more quota from, or release quota to, the master if necessary.

It's easy to get the uids/gids affected by the operation in osd_declare_xxx(), since all the declare functions have the target object (dt_object) in their parameters; however, in osd_trans_stop() we can't tell for which uids/gids quota changed during the whole transaction. To address this problem, the uid/gid information will be added to a list in the osd_thandle (ot_qcredit_list) during the declare stage, so we can process the list in osd_trans_stop() to do the quota enforcement post work.

struct lquota_credit {
        cfs_list_t q_link;   /* link in the ot_qcredit_list */
        qid_t      q_id;     /* uid or gid */
        int        q_idtype; /* USRQUOTA or GRPQUOTA */
        __u64      q_count;  /* bytes or inodes */
        int        q_type;   /* block quota or file quota */
};

lquota_credit is used to carry the uid/gid information from osd_declare_xxx() to osd_trans_stop().

int lquota_op_begin(struct lquota_ctxt *qctxt, cfs_list_t *credits, int *sync)
@qctxt: quota context;
@credits: list of required bytes/inodes (lquota_credit);
@sync: output: if @sync is set to 1 by lquota_op_begin(), the caller should turn the operation into a sync operation;
return value:
  0: there is enough quota;
  -EDQUOT: short of quota;
  -EINPROGRESS: retry-able error;
  other error: non-retryable error;

This function takes the following steps to do the enforcement work:
1. If the local limits can satisfy the required bytes/inodes, increase the pending bytes/inodes and check the qunit size; if it has already been shrunk to the minimum value, set @sync to 1 (see the comments for lquota_op_end()) and return 0;
2. Try to acquire more quota from the master;
3. - If the acquire succeeds, go to step 1;
   - If the acquire fails because of a network error, or because the master isn't able to revoke quota from other slaves in time, and this is declare_write_commit, return -EINPROGRESS to notify the client to retry the write later;
   - Otherwise, return -EDQUOT or another appropriate error.

One thing that needs to be mentioned is that changing the owner/group of objects on an OSS should always ignore quota limits, which is different from objects on the MDS; consequently, an additional parameter will be added to dt_declare_attr_set() to identify whether the current owner/group change needs to ignore quota.

The OST write code on the client side needs to be revised to handle -EINPROGRESS properly: if the client gets an -EINPROGRESS error for an OST_WRITE RPC, it will keep resending the RPC until the RPC succeeds or fails with another kind of error.

void lquota_op_end(struct lquota_ctxt *qctxt, cfs_list_t *credits)
@qctxt: quota context;
@credits: list of required bytes/inodes for the whole transaction;
return value: none;

This function does the quota enforcement post work in the following steps:
1. Process the @credits list to decrease the pending bytes/inodes;
2. Pre-acquire/release quota from/to the master if necessary;
3. Free all the lquota_credit entries created in the prior osd declare functions.

For ldiskfs, we can follow the old way to do the pre-acquire/release (check quota usage after the operation, and schedule a dqacq/dqrel RPC if the remaining local limit has already crossed the pre-acquire/release threshold). For zfs, however, the quota accounting is updated after commit, so there might be a delay before the dqacq/dqrel is sent, which could cause trouble when the dynamic qunit size is close to the minimum value. To address this problem, we check the qunit size in lquota_op_begin() and turn the operation into a sync operation if necessary.
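Below is a minimal sketch of how an OSD declare hook and osd_trans_stop() could tie lquota_op_begin(), ot_qcredit_list and lquota_op_end() together. The helper names osd_declare_quota()/osd_trans_quota_end(), the per-declare local credit list, and how the estimate reaches the helper are assumptions for illustration; the design above only fixes the two entry points and the credit list.

/* Hypothetical helper called from osd_declare_write_commit() and friends
 * once the required bytes/inodes for one id have been estimated. */
static int osd_declare_quota(struct osd_device *osd, struct osd_thandle *oh,
                             qid_t id, int idtype, __u64 count, int type,
                             int *sync)
{
        struct lquota_credit *credit;
        cfs_list_t            credits;
        int                   rc;

        CFS_INIT_LIST_HEAD(&credits);

        OBD_ALLOC_PTR(credit);
        if (credit == NULL)
                return -ENOMEM;

        credit->q_id     = id;      /* uid or gid of the target object */
        credit->q_idtype = idtype;  /* USRQUOTA or GRPQUOTA */
        credit->q_count  = count;   /* estimated bytes or inodes */
        credit->q_type   = type;    /* block or file quota */
        cfs_list_add_tail(&credit->q_link, &credits);

        /* 0, -EDQUOT, -EINPROGRESS or another error; may also ask the
         * caller to turn the operation into a sync one via *sync */
        rc = lquota_op_begin(osd->od_qctxt, &credits, sync);

        cfs_list_del(&credit->q_link);
        if (rc != 0)
                OBD_FREE_PTR(credit);   /* declare failed, drop the credit */
        else
                /* keep the credit on the transaction handle so that
                 * osd_trans_stop() can pass it to lquota_op_end() */
                cfs_list_add_tail(&credit->q_link, &oh->ot_qcredit_list);
        return rc;
}

/* In osd_trans_stop(), once the transaction has been stopped: */
static void osd_trans_quota_end(struct osd_device *osd, struct osd_thandle *oh)
{
        /* releases the pending counts, triggers pre-acquire/release and
         * frees every lquota_credit linked during the declare stage */
        lquota_op_end(osd->od_qctxt, &oh->ot_qcredit_list);
}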
Section Two: Quota Slave to Master Connections

The Lustre quota master runs on the MDT and is responsible for allocating quota to, and revoking quota from, each slave. Quota slaves run on each MDT/OST and are responsible for acquiring quota from, and releasing quota to, the quota master.

Quota slaves originally used the reverse import of the OSC on the MDS to communicate with the quota master. However, the LOV/OSC on the MDS are replaced by LOD/OSP in Orion, and the asynchronous infrastructure of LOD/OSP is not suitable for the requirements of the quota acquire/release RPCs. Hence we plan to introduce an independent target for the quota master and let the quota slaves connect to the master target by themselves; quota RPCs will then be sent over those new connections.

In the new connection scheme, quota slaves have to know the nids of the quota master in the first place. This problem will be addressed via the config log: when the MDT registers to the MGS, it writes the quota master (MDT) information into a dedicated per-filesystem quota config log, and the master information is duplicated into all the existing MDT/OST config logs at the same time. When an OST (quota slave) registers to the MGS, it traverses the quota config log to find the master information and copies it into its own log. In the end, the slave-to-master connections can be established when each MDT/OST processes its config log during mount.

Note: "register to MGS" means the first-time registration.

2.1 Quota Master Device

To manage the connections between the quota master and the slaves, a new server target device will be introduced for the quota master. It is similar to the MGS device, and we call it LQM.

The name of LQM is "lqm":

#define LUSTRE_LQM_NAME "lqm"

The LQM device type:

struct lu_device_type lqm_device_type = {
        .ldt_tags     = LU_DEVICE_DT,
        .ldt_name     = LUSTRE_LQM_NAME,
        .ldt_ops      = &lqm_device_type_ops,
        .ldt_ctx_tags = LCT_MG_THREAD
};

struct lu_device_type_operations lqm_device_type_ops = {
        .ldto_init         = lqm_type_init,
        .ldto_fini         = lqm_type_fini,
        .ldto_start        = lqm_type_start,
        .ldto_stop         = lqm_type_stop,
        .ldto_device_alloc = lqm_device_alloc,
        .ldto_device_free  = lqm_device_free,
        .ldto_device_fini  = lqm_device_fini
};

The LQM device:

struct lqm_device {
        struct dt_device       lqm_dt_dev;
        struct ptlrpc_service *lqm_service;
        struct dt_object      *lqm_admin_usr; /* user administrative object */
        struct dt_object      *lqm_admin_grp; /* group administrative object */
};

static struct obd_ops lqm_obd_ops = {
        .o_owner          = THIS_MODULE,
        .o_connect        = lqm_connect,
        .o_reconnect      = lqm_reconnect,
        .o_disconnect     = lqm_disconnect,
        .o_init_export    = lqm_init_export,
        .o_destroy_export = lqm_destroy_export,
};

The LQM service handler (to handle quota acquire/release):

int lqm_handle(struct ptlrpc_request *req)

2.2 Quota Slave Device

A new obd device will be introduced for quota slaves; it is used for managing the connection with the quota master. This device is similar to the MGC, and we call it LQS.

The name of LQS is "lqs":

#define LUSTRE_LQS_NAME "lqs"

The LQS obd ops:

struct obd_ops lqs_obd_ops = {
        .o_owner          = THIS_MODULE,
        .o_setup          = lqs_obd_setup,
        .o_precleanup     = lqs_obd_precleanup,
        .o_cleanup        = lqs_obd_cleanup,
        .o_add_conn       = client_import_add_conn,
        .o_del_conn       = client_import_del_conn,
        .o_connect        = client_connect_import,
        .o_disconnect     = client_disconnect_export,
        .o_process_config = lqs_process_config,
};
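The design above gives only the service handler prototype. The sketch below shows one way lqm_handle() could dispatch incoming quota requests, assuming the existing QUOTA_DQACQ/QUOTA_DQREL opcodes are reused over the new connections (an assumption); the lqm_dqacq()/lqm_dqrel() helpers are hypothetical placeholders and reply handling is omitted.

int lqm_handle(struct ptlrpc_request *req)
{
        __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);
        int   rc;

        switch (opc) {
        case QUOTA_DQACQ:
                /* a slave asks the master for more quota */
                rc = lqm_dqacq(req);        /* hypothetical helper */
                break;
        case QUOTA_DQREL:
                /* a slave returns unused quota to the master */
                rc = lqm_dqrel(req);        /* hypothetical helper */
                break;
        default:
                CERROR("unexpected opcode %u\n", opc);
                rc = -EPROTO;
                break;
        }

        /* reply packing/sending is omitted in this sketch */
        return rc;
}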
There will be two types of quota config records added in the quota log file: - 'add lqs': Setup LQS device; - 'quota': Enable what type of quota (user or group); 2.3.1 'add lqs' Record The 'add lqs' record consists of following llog records: - record marker start (target svname, 'add lqs'); - add uuid for all nids of MDT node (LCFG_ADD_UUID, nid, mdt node uuid); - lqs attach (LCFG_ATTACH, lqs, lqs name, lqs uuid); - lqs setup (LCFG_SETUP, lqs name, lqm name, mdt node uuid); - add uuid for failover node if failover node is specified (LCFG_ADD_UUID); - add connection if failover node is specified (LCFG_ADD_CONN); - record marker end (target svname, 'add lqs'); Assuming the mdt svname is lustre-MDT0000, and the registering target is lustre-OST0000: - the lqs name will be lustre-OST0000-lqs; - the lqs uuid will be lustre-OST0000-lqs_UUID; - the lqm uuid will be lustre-MDT0000-lqm_UUID; A new mkfs option '--quota' will be introduced, the MDT device can be formated (or tuned by tunefs.lustre) with the '--quota' option. When a MDT with '--quota' register to MGS, a flag LDD_F_SV_TYPE_LQM will be carried to indicate that a quota target will be running on the MDT node, and following changes to the config logs should be taken: - Add the 'add lqs' record in the quota config log; - Copy the 'add lqs' record in all existing MDT/OST config logs, the LQS device name & uuid should be amended during copy. When OST register to MGS, it'll traverse the quota config log and copy the 'add lqs' into it's own config log Note: When the MDT failover information is changed, the failover information in the 'add lqs' of all config files should be updated accordingly. 2.3.2 'quota' Record The 'quota' record consists of following llog records: - record marker start (poolname, 'quota'); - quota=xxx (LCFG_PARAM, devname, 'quota=xxx'); - record marker end (poolname, 'quota'); The value of quota parameter is the bitmask of what quota type should be enabled: - 0 bit: user quota; - 1 bit: group quota; The poolname can be 'metadata' or 'data', which indicates what type of quota should be enabled on metadata pool (MDT) or data data pool (OSTs). 'quota' parameter can tuned by 'lctl conf_param $FSNAME.$POOLNAME.quota=xxx', for instance, 'lctl conf_param lustre.metadata.quota=1' or 'lctl conf_param lustre.data.quota=3'. 'lctl conf_param' is responsible for adding the 'quota' record in the quota config log, and if the POOLNAME is specified as 'metadata', this record will be copied to the existing MDT config log; if the POOLNAME is 'data', the 'quota' record will be copied to all existing OST config logs. 2.4 LQM Setup and Cleanup If the MDT is formated with '--quota' option, the LQM device setup will be triggered on MDT mount (after OSD setup, and before MDT registers to MGS). LQM device uuid should be mdt svname plus '-lqm_UUID', for example, lustre-MDT0000-lqm_UUID. The LQM device cleanup is triggered on MDT umount (in mdt_fini(), after cleanup all mdt exports). 2.5 Access LQM from MDT Since the quota control RPCs (setquota) will be sent over the reverse import of the LQM, LQM should be accessiable from MDT. A LQM device pointer (mdt_lqm) will be added in the mdt_device, and on MDT device initialization (mdt_init0()), lquota_get_master() will be called to set the mdt_lqm; on MDT finalization (mdt_fini()), lquota_put_master() will be called to clear the mdt_lqm and cleanup the LQM device. 
2.4 LQM Setup and Cleanup

If the MDT is formatted with the '--quota' option, the LQM device setup will be triggered on MDT mount (after the OSD setup, and before the MDT registers to the MGS). The LQM device uuid should be the MDT svname plus '-lqm_UUID', for example, lustre-MDT0000-lqm_UUID.

The LQM device cleanup is triggered on MDT umount (in mdt_fini(), after all MDT exports have been cleaned up).

2.5 Access LQM from MDT

Since the quota control RPCs (setquota) will be sent over the reverse import of the LQM, the LQM should be accessible from the MDT. An LQM device pointer (mdt_lqm) will be added to the mdt_device; on MDT device initialization (mdt_init0()), lquota_get_master() will be called to set mdt_lqm, and on MDT finalization (mdt_fini()), lquota_put_master() will be called to clear mdt_lqm and clean up the LQM device.

struct dt_device *lquota_get_master(char *devname)
@devname: device name of the LQM;
return value: LQM device on success, appropriate error on failure;

This function finds the LQM device by @devname, opens the quota administrative objects, and returns the LQM device.

void lquota_put_master(struct dt_device *lqm_dev)

This function closes the quota administrative objects and cleans up the LQM device.

2.6 LQS Setup and Cleanup

The LQS device setup is triggered when the MDT/OST processes its config log on mount, and the LQS device cleanup is triggered on OSD finalization (osd_fini()).

2.7 Access LQS from Quota Context

On LQS setup (lqs_obd_setup()), the corresponding quota context will be found in the global context list by matching the context name, then lqc_obd will be set to the LQS device.

2.8 LQS Connect/Disconnect

The LQS connects to or disconnects from the LQM when each MDT/OST processes the 'quota' record: if any quota type is enabled, the LQS connects to the LQM; otherwise, the LQS disconnects from the LQM. The corresponding quota context can be found by matching the context name, and lqc_type should be changed accordingly.

The quota context is always created on OSD start, the LQM & LQS are set up only when the MDT is formatted with the '--quota' option, and the connection between LQS and LQM is only established when quota is enabled.

2.9 Enable/Disable Quota

Users will be able to enable/disable quota separately for the MDT (metadata) and all OSTs (data) by:
'lctl conf_param $FSNAME.metadata.quota=val', or
'lctl conf_param $FSNAME.data.quota=val'
- val = 1: only user quota enabled;
- val = 2: only group quota enabled;
- val = 3: user & group quota enabled;
- val = 0: quota disabled.
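To tie 2.8 and 2.9 together, here is a minimal sketch of how a slave could react to the 'quota=val' parameter when processing the 'quota' record. The lquota_ctxt_lookup(), lqs_connect_master() and lqs_disconnect_master() helpers and the parsing are assumptions made for illustration; the design above only fixes the behaviour (update lqc_type, connect when any quota type is enabled, disconnect otherwise).

/* Hypothetical sketch of 'quota=val' handling on a slave, e.g. invoked
 * from lqs_process_config() for the LCFG_PARAM record of 2.3.2. */
static int lqs_set_quota(struct obd_device *lqs, char *osdname, char *value)
{
        struct lquota_ctxt *qctxt;
        unsigned long       val;

        /* val is the bitmask from 2.9: bit 0 = user, bit 1 = group */
        val = simple_strtoul(value, NULL, 10);

        /* find the quota context of this target by matching its name
         * (lquota_ctxt_lookup() is a placeholder for that search) */
        qctxt = lquota_ctxt_lookup(osdname);
        if (qctxt == NULL)
                return -ENOENT;

        qctxt->lqc_type = (int)(val & 3);

        if (qctxt->lqc_type != 0)
                /* some quota type is enabled: connect the LQS to the LQM */
                return lqs_connect_master(lqs);

        /* quota fully disabled: drop the connection to the LQM */
        return lqs_disconnect_master(lqs);
}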