[LU-6382] quota : inconsistence between master & slave Created: 18/Mar/15  Updated: 08/Feb/23  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3, Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: JS Landry Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

We are running Lustre 2.5.3 on all our servers, with ZFS 0.6.3 on the OSSs and ldiskfs/ext4 on the MDS (all 18 servers run CentOS 6.5).
The client nodes are running Lustre 2.4.3 on CentOS 6.6.


Issue Links:
Related
is related to LU-4404 sanity-quota test_0: FAIL: SLOW IO fo... Closed
Epic/Theme: Quota, zfs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have a quota problem on one of our OSTs.

Here are the error logs:

LustreError: 11-0: lustre1-MDT0000-lwp-OST0008: Communicating with 10.225.8.3@o2ib, operation ldlm_enqueue failed with -3.

LustreError: 12476:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -3, flags:0x9 qsd:lustre1-OST0008 qtype:grp
id:10011 enforced:1 granted:1276244380 pending:0 waiting:128 req:1 usage:1276244415 qunit:0 qtune:0 edquot:0

LustreError: 12476:0:(qsd_handler.c:767:qsd_op_begin0()) $$$ ID isn't enforced on master, it probably due to a legeal race, if this
message is showing up constantly, there could be some inconsistence between master & slave, and quota reintegration needs be
re-triggered. qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:0 req:0 usage:1276244415
qunit:0 qtune:0 edquot:0

The errors occur only on this OST, and only for that group ID.

We set the quotas with these commands:

lfs setquota -g $gid --block-softlimit 40t --block-hardlimit 40t /lustre1
lfs setquota -u $uid --inode-softlimit 1000000 --inode-hardlimit 1000000 /lustre1

For group 10011, we had disabled the quotas one or two days before the errors occurred, using:

lfs setquota -g 10011 --block-softlimit 0 --block-hardlimit 0 /lustre1
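For reference, the limits and usage that the master currently reports for that ID can be checked from a client with a plain lfs quota query (a sketch using the group ID and mount point from this report; the -v flag additionally breaks the usage down per OST, which helps spot a single slave that disagrees with the master):

lfs quota -g 10011 /lustre1
lfs quota -v -g 10011 /lustre1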

What does "quota reintegration needs be re-triggered" mean? I guess it means running an "lfs quotacheck" on the filesystem, right?

Thanks
JS



 Comments   
Comment by Etienne Aujames [ 08/Feb/23 ]

The CEA hit this issue in production on a ClusterStor Lustre version (server side, 2.12.4...).
Some users have a quota ID enforced on the OSTs (slave, QSD) but not on MDT0000 (master, QMT). When the slave quota limits are exceeded on an OST, the clients fall back from buffered IO to sync IO:

int vvp_io_write_commit(const struct lu_env *env, struct cl_io *io)
{
......
        /* out of quota, try sync write */
        if (rc == -EDQUOT && !cl_io_is_mkwrite(io)) {
                struct ll_inode_info *lli = ll_i2info(inode);

                rc = vvp_io_commit_sync(env, io, queue,
......
This causes a lot of small write IOs from those users' jobs on the OSTs and quickly increases the load on the OSSs (RAID6 parity calculations) and the disk usage (RAID6 on rotational disks with no OST write cache). The overall filesystem was very slow.
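One way to compare a slave's view with the master's is to dump the quota slave state on each OSS and query the same ID from a client (a sketch; exact parameter paths can vary between Lustre versions and backends):

lctl get_param osd-*.*.quota_slave.info
lfs quota -g <gid> /mountpoint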

This issue has been resolved by forcing a quota reintegration on the OSS:

lctl set_param osd-ldiskfs.*.quota_slave.force_reint=1
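For ZFS-backed OSTs such as the ones in the original report, the same tunable should presumably be reachable under the osd-zfs prefix (untested sketch):

lctl set_param osd-zfs.*.quota_slave.force_reint=1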