[LU-6382] quota : inconsistence between master & slave Created: 18/Mar/15 Updated: 08/Feb/23 Resolved: 09/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | JS Landry | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
We are running Lustre 2.5.3 on all our servers, with ZFS 0.6.3 on the OSSes and ldiskfs/ext4 on the MDS. All 18 servers are running CentOS 6.5. |
||
| Issue Links: |
|
||||||||
| Epic/Theme: | Quota, zfs | ||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We have a quota problem on one of our OSTs. Here are the error logs:

LustreError: 11-0: lustre1-MDT0000-lwp-OST0008: Communicating with 10.225.8.3@o2ib, operation ldlm_enqueue failed with -3.
LustreError: 12476:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -3, flags:0x9 qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:128 req:1 usage:1276244415 qunit:0 qtune:0 edquot:0
LustreError: 12476:0:(qsd_handler.c:767:qsd_op_begin0()) $$$ ID isn't enforced on master, it probably due to a legeal race, if this message is showing up constantly, there could be some inconsistence between master & slave, and quota reintegration needs be re-triggered. qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:0 req:0 usage:1276244415 qunit:0 qtune:0 edquot:0

The errors occur only on this OST, and only for that group ID.

We set the quotas with these commands:

lfs setquota -g $gid --block-softlimit 40t --block-hardlimit 40t /lustre1
lfs setquota -u $uid --inode-softlimit 1000000 --inode-hardlimit 1000000 /lustre1

For group 10011, we had disabled the quotas 1 or 2 days before the errors occurred, using:

lfs setquota -g 10011 --block-softlimit 0 --block-hardlimit 0 /lustre1

What does "quota reintegration needs be re-triggered" mean? I guess it means running an "lfs quotacheck" on the filesystem, right?

Thanks |
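For reference, "quota reintegration" in the 2.4+ quota framework means the slave (the OST) re-fetching the quota settings and index from the master (the MDT); it is not the same thing as running "lfs quotacheck". A minimal sketch of how it can be re-triggered, assuming the osd-zfs backend named in the Environment (substitute osd-ldiskfs for ldiskfs-based OSTs) and that lustre1-OST0008 is the affected target:

  # On the OSS hosting lustre1-OST0008: force the slave to re-sync its
  # quota index with the quota master on the MDT
  lctl set_param osd-zfs.lustre1-OST0008.quota_slave.force_reint=1

  # On a client: cross-check the master's view of the group limit and usage
  lfs quota -g 10011 /lustre1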
| Comments |
| Comment by Etienne Aujames [ 08/Feb/23 ] |
|
The CEA hit this issue in production on a ClusterStor Lustre version (server side, 2.12.4...).

When a client gets -EDQUOT back, vvp_io_write_commit() falls back to a sync write:

int vvp_io_write_commit(const struct lu_env *env, struct cl_io *io)
{
        ......
        /* out of quota, try sync write */
        if (rc == -EDQUOT && !cl_io_is_mkwrite(io)) {
                struct ll_inode_info *lli = ll_i2info(inode);

                rc = vvp_io_commit_sync(env, io, queue,
        ......

This causes a lot of small write IOs from the affected user jobs on the OSTs and quickly increases the load on the OSS (RAID6 parity calculations) and the disk usage (RAID6 on rotational disks with no OST write cache). The overall filesystem was really slow.

The issue was resolved by forcing a quota reintegration on the OSS:

lctl set_param osd-ldiskfs.*.quota_slave.force_reint=1 |
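A hedged follow-up sketch for confirming that the forced reintegration took effect; the parameter names assume the osd-ldiskfs quota_slave layout used in the command above, and group 10011 from the original report is only an illustration:

  # On the OSS: check that the slave is connected to the quota master
  # and that reintegration has completed
  lctl get_param osd-ldiskfs.*.quota_slave.info

  # Compare the per-group accounting held by the slave with the master's view
  lctl get_param osd-ldiskfs.*.quota_slave.acct_group
  lfs quota -g 10011 /lustre1

  # The qsd_op_begin0()/DQACQ errors should stop appearing once the
  # slave and master agree again
  dmesg | grep -E 'qsd_op_begin0|DQACQ'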