Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.4.3, Lustre 2.5.3
Labels: None
-
We are running Lustre 2.5.3 on all our servers, with ZFS 0.6.3 on the OSSes and ldiskfs/ext4 on the MDS (all 18 servers run CentOS 6.5).
The client nodes run Lustre 2.4.3 on CentOS 6.6.
Description
We have a quota problem on one of our OSTs.
Here are the error logs:
LustreError: 11-0: lustre1-MDT0000-lwp-OST0008: Communicating with 10.225.8.3@o2ib, operation ldlm_enqueue failed with -3.
LustreError: 12476:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -3, flags:0x9 qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:128 req:1 usage:1276244415 qunit:0 qtune:0 edquot:0
LustreError: 12476:0:(qsd_handler.c:767:qsd_op_begin0()) $$$ ID isn't enforced on master, it probably due to a legeal race, if this message is showing up constantly, there could be some inconsistence between master & slave, and quota reintegration needs be re-triggered. qsd:lustre1-OST0008 qtype:grp id:10011 enforced:1 granted:1276244380 pending:0 waiting:0 req:0 usage:1276244415 qunit:0 qtune:0 edquot:0
The errors occur only on this OST, and only for that group ID.
We set the quotas with these commands:
lfs setquota -g $gid --block-softlimit 40t --block-hardlimit 40t /lustre1
lfs setquota -u $uid --inode-softlimit 1000000 --inode-hardlimit 1000000 /lustre1
For group 10011, we had disabled the quotas one or two days before the errors occurred, using:
lfs setquota -g 10011 --block-softlimit 0 --block-hardlimit 0 /lustre1
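As a side note (not part of the original report), whether the limit change actually propagated to every target can be checked with `lfs quota` in verbose mode; this is a sketch assuming the same group ID and mount point as above:

```shell
# Show group 10011's usage and limits, broken down per MDT/OST (-v).
# If an OST still reports a non-zero limit for the ID while the master
# reports none, master and slave are out of sync for that quota ID.
lfs quota -g 10011 -v /lustre1
```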
What does "quota reintegration needs be re-triggered" mean? I guess it means running an "lfs quotacheck" on the filesystem, right?
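For reference (a hedged sketch, not an authoritative answer to the question above): since Lustre 2.4, `lfs quotacheck` is no longer needed, because space accounting is maintained by the backend filesystem itself. Quota reintegration is the process by which a slave (OST/MDT) re-fetches the quota limits from the master (MDT0000), and it can typically be re-triggered by toggling quota enforcement from the MGS:

```shell
# On the MGS: disable, then re-enable, group and user quota enforcement
# for the OSTs of filesystem lustre1. Re-enabling makes each slave
# reintegrate its quota limits from the master.
lctl conf_param lustre1.quota.ost=none
lctl conf_param lustre1.quota.ost=ug
```

Whether this is what the error message intends here is an assumption based on the Lustre quota documentation; the exact procedure used to resolve this ticket may differ.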
Thanks
JS
Attachments
Issue Links
- is related to: LU-4404 sanity-quota test_0: FAIL: SLOW IO for quota_usr (user): 50 KB/sec (Closed)
The CEA hit this issue in production on a ClusterStor Lustre version (server side, 2.12.4...).
Some users have their quota ID enforced on the OSTs (slave, QSD) but not on MDT0000 (master, QMT). If the slave quota limits are exceeded on an OST, the clients fall back from buffered IO to sync IO.
This causes many small write IOs from those users' jobs on the OSTs and quickly increases both the load on the OSSes (RAID6 parity calculations) and the disk utilization (RAID6 on rotational disks with no OST write cache). The overall filesystem was very slow.
This issue was resolved by forcing a quota reintegration on the OSS: