Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.4.3, Lustre 2.5.3
Labels: None
-
We are running Lustre 2.5.3 on all our servers, with ZFS 0.6.3 on the OSSes and ldiskfs/ext4 on the MDS (all 18 servers run CentOS 6.5).
The client nodes run Lustre 2.4.3 on CentOS 6.6.
Description
We have a quota problem on one of our OSTs.
Here are the error logs:
LustreError: 11-0: lustre1-MDT0000-lwp-OST0008: Communicating with 10.225.8.3@o2ib, operation ldlm_enqueue failed with -3.
LustreError: 12476:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -3, flags:0x9 qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:128 req:1 usage:1276244415 qunit:0 qtune:0 edquot:0
LustreError: 12476:0:(qsd_handler.c:767:qsd_op_begin0()) $$$ ID isn't enforced on master, it probably due to a legeal race, if this message is showing up constantly, there could be some inconsistence between master & slave, and quota reintegration needs be re-triggered. qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:0 req:0 usage:1276244415 qunit:0 qtune:0 edquot:0
The errors occur only on this OST, and only for that group ID.
We set the quotas with these commands:
lfs setquota -g $gid --block-softlimit 40t --block-hardlimit 40t /lustre1
lfs setquota -u $uid --inode-softlimit 1000000 --inode-hardlimit 1000000 /lustre1
For group 10011, we had disabled the quotas one or two days before the errors occurred, using:
lfs setquota -g 10011 --block-softlimit 0 --block-hardlimit 0 /lustre1
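As a side note (not part of the original report), whether the limit change actually propagated to every target can be checked with `lfs quota` in verbose mode; this is a sketch assuming the same group ID and mount point as above:

```shell
# Show group 10011's usage and limits, broken down per MDT/OST (-v).
# If an OST still reports a non-zero limit for the ID while the master
# reports none, master and slave are out of sync for that quota ID.
lfs quota -g 10011 -v /lustre1
```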
What does "quota reintegration needs be re-triggered" mean? I guess it means running an "lfs quotacheck" on the filesystem, right?
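For reference (a hedged sketch, not an authoritative answer to the question above): since Lustre 2.4, `lfs quotacheck` is no longer needed, because space accounting is maintained by the backend filesystem itself. Quota reintegration is the process by which a slave (OST/MDT) re-fetches the quota limits from the master (MDT0000), and it can typically be re-triggered by toggling quota enforcement from the MGS:

```shell
# On the MGS: disable, then re-enable, group and user quota enforcement
# for the OSTs of filesystem lustre1. Re-enabling makes each slave
# reintegrate its quota limits from the master.
lctl conf_param lustre1.quota.ost=none
lctl conf_param lustre1.quota.ost=ug
```

Whether this is what the error message intends here is an assumption based on the Lustre quota documentation; the exact procedure used to resolve this ticket may differ.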
Thanks
JS
Attachments
Issue Links
- is related to: LU-4404 sanity-quota test_0: FAIL: SLOW IO for quota_usr (user): 50 KB/sec (Closed)
The CEA hit this issue in production on a ClusterStor Lustre version (server side, 2.12.4...).
Some users have their quota ID enforced on the OSTs (slave, QSD) but not on MDT0000 (master, QMT). If the slave quota limits are exceeded on an OST, the clients fall back from buffered IO to sync IO.
This causes many small write IOs from those users' jobs on the OSTs and quickly increases both the load on the OSSes (RAID6 parity calculations) and the disk utilization (RAID6 on rotational disks with no OST write cache). The overall filesystem was very slow.
This issue was resolved by forcing a quota reintegration on the OSS: