LU-2289

still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Environment: Lustre 1.8.8 + LU-1720
    • Severity: 3
    • 5479

    Description

      Even after adding the patch from LU-1720, we are still seeing messages like:
      Lustre: 18271:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

      At this site, we haven't added any other patches to 1.8.8. What do these messages mean? Is it possible that some of the other patches (like LU-1438) could fix these?

      The customer hasn't noticed any functional issues, but of course that doesn't mean there aren't any. Quotas >4TB work on this system.

      Attachments

        1. kern.log-mds
          1.90 MB
        2. kern.log-mds-Aug30
          97 kB
        3. kern.log-oss
          0.2 kB
        4. kern.log-oss-Aug30
          305 kB

        Activity


          adilger Andreas Dilger added a comment - Closing old bug.

          niu Niu Yawei (Inactive) added a comment - Yes, when approaching the limit, write performance will be lower: the quota unit becomes very small near the limit, so there are many more quota acquire/release operations between the OSTs and the MDT, and, even worse, write operations sometimes have to block on the quota acquire. We did not measure performance in such an unusual case.
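          To make that effect concrete, here is a minimal sketch (plain Python, not Lustre code; the qunit sizes are assumed purely for illustration) of how a shrinking per-acquire unit multiplies the number of acquire operations needed to cover the same amount of written data:

# Illustrative only: models how the number of quota-acquire round trips grows
# as the per-acquire unit ("qunit") shrinks near the limit. The qunit sizes
# below are assumptions for the example, not values read from this system.

def acquires_needed(write_bytes, qunit_bytes):
    """Acquire round trips needed to cover write_bytes at a given qunit."""
    return -(-write_bytes // qunit_bytes)  # ceiling division

ONE_MB = 1 << 20
total_write = 512 * ONE_MB  # 512MB written under one uid/gid

for qunit in (128 * ONE_MB, 16 * ONE_MB, ONE_MB):  # far from the limit -> at the limit
    print(f"qunit = {qunit // ONE_MB:>3} MB -> "
          f"{acquires_needed(total_write, qunit)} acquire operations")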

          kitwestneat Kit Westneat (Inactive) added a comment - Is it possible that this could cause performance issues when the UID/GID is approaching the quota limit? Do you know if there have ever been performance tests run for this case? Thanks.

          niu Niu Yawei (Inactive) added a comment - Now I think it isn't a problem, after checking the large log covering all processes (the small one I checked before covered only a single process). When a uid/gid is approaching its quota limit, the qunit size is shrunk to the minimum (1MB), so each quota acquire can obtain only 1MB of limit. If many OST threads are writing to the same uid/gid at that time, the acquired limit is consumed very quickly, and some writing threads have to cycle through quota_chk_acq_common() several times until they obtain the required quota or eventually fail with -EDQUOT.
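          A hedged sketch of that cycle (simplified Python, not the real quota_chk_acq_common(); the function and parameter names are invented for illustration): a writing thread keeps re-checking and acquiring until it either has enough quota or the master answers -EDQUOT, and it starts logging once it passes 10 cycles:

# Simplified model of the retry behaviour, not Lustre source: shows how a
# thread can loop in the check/acquire cycle when the qunit has shrunk to 1MB
# and other threads are consuming the freshly acquired limit concurrently.
import errno

ONE_MB = 1 << 20

def chk_acq_sketch(need_bytes, acquire):
    """acquire() models one acquire RPC to the quota master; it returns the
    bytes effectively gained locally, or -errno.EDQUOT at the hard limit."""
    granted = 0
    cycle = 0
    while granted < need_bytes:
        cycle += 1
        rc = acquire()
        if rc == -errno.EDQUOT:
            return rc        # out of quota on the master
        granted += rc        # may be far less than 1MB if other threads raced us
        if cycle == 10:
            print("still haven't managed to acquire quota space "
                  "from the quota master after 10 retries")
    return 0

# Example: only 64KB of each 1MB grant is left for this thread, so a 1MB write
# needs 16 cycles and the warning is printed when the 10th cycle is reached.
print(chk_acq_sketch(ONE_MB, lambda: 64 * 1024))  # prints the warning, returns 0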

          niu Niu Yawei (Inactive) added a comment - Thank you, Kit. I checked the OSS log:

          00040000:00000400:6:1364224183.636169:0:1558:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
          00040000:04000000:6:1364224183.636179:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 10
          00040000:04000000:6:1364224183.636181:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636190:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768
          00040000:04000000:6:1364224183.636193:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00002000:04000000:6:1364224183.636199:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3
          00040000:04000000:6:1364224183.636205:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636209:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.
          00040000:04000000:6:1364224183.636214:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 11
          00040000:04000000:6:1364224183.636217:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636224:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768
          00040000:04000000:6:1364224183.636227:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00002000:04000000:6:1364224183.636231:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3
          00040000:04000000:6:1364224183.636239:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636243:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.
          

          It looks like there is a pending acquire request (record: 2097152), and the local limit would be able to satisfy the current write request once that pending acquire completes, so the OST should wait for the pending request. It actually did not wait, however, so quota_chk_acq_common() cycled many times until the pending request was done.

          I haven't yet seen why the OST did not wait on the pending acquire; maybe there is something wrong with the lqs_blk_rec calculation? I'll look into it further.

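          The choice being questioned can be sketched roughly as follows (hypothetical Python with invented names and made-up numbers, not the real check_cur_qunit()/quota_chk_acq_common() logic): if the acquire already in flight (blk_rec) would cover this thread's shortfall, the thread should wait on that request rather than cycle again:

# Hypothetical illustration of the wait-vs-recycle decision discussed above.
# Function names and example numbers are invented; this is not Lustre code.

def should_wait_on_pending(local_remaining, this_write, blk_rec):
    """True if the acquire already in flight (blk_rec bytes) will cover the
    part of this write that the locally held limit cannot satisfy."""
    shortfall = this_write - local_remaining
    return 0 < shortfall <= blk_rec

def next_step(local_remaining, this_write, blk_rec):
    if this_write <= local_remaining:
        return "proceed"   # local limit already covers the write
    if should_wait_on_pending(local_remaining, this_write, blk_rec):
        return "wait"      # the in-flight acquire will make enough room
    return "acquire"       # need to ask the quota master for more

# Made-up example: 512KB left locally, a 1MB write, and a 2MB acquire in flight.
print(next_step(local_remaining=512 * 1024,
                this_write=1 << 20,
                blk_rec=2 << 20))   # -> "wait"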

          kitwestneat Kit Westneat (Inactive) added a comment - Hi Niu,

          I've uploaded the logs here:
          http://eu.ddn.com:8080/lustre/LU-2289.tar.bz2

          It's about a 3-second window on each server, but it's still about 20GB uncompressed. For the message at 1364224183, it looks like the thread retries 19 times before it finally acquires the quota. Is there anything in the MDT logs that would indicate why it took so long? I didn't see anything in particular, but the logs are tough to read.


          niu Niu Yawei (Inactive) added a comment - Hi, Kit

          If you have a debug log, you can check whether the quota is eventually acquired by following the pid: find the pid in the "still haven't managed ..." message and see whether that thread keeps printing the message or eventually moves on.

          When you attach the log, I suggest extracting just the useful part; otherwise it could be too large. The useful part is the messages around "still haven't managed ..." on the OST and the corresponding messages on the MDS.

          I didn't find anything abnormal in the output of "lfs quota -u -v", thank you for providing this.

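          Following that suggestion, a small helper along these lines could pull one thread's trail out of a Lustre debug log (plain Python sketch; the script name, the lines_for_pid helper, and the assumption that the pid is the sixth colon-separated field, as in the excerpts above, are all illustrative):

# Sketch of the log triage described above: given a Lustre debug log and the
# pid from a "still haven't managed ..." line, print only that thread's lines
# so you can see whether it eventually acquires quota or keeps cycling.
# Assumes the colon-separated layout shown in the excerpts above, with the pid
# as the sixth field; adjust the index if your log format differs.
import sys

def lines_for_pid(path, pid):
    with open(path, errors="replace") as log:
        for line in log:
            fields = line.split(":", 7)
            if len(fields) > 5 and fields[5] == pid:
                yield line.rstrip("\n")

if __name__ == "__main__":
    # usage (names illustrative): python filter_pid.py <debug-log> 1558
    log_path, pid = sys.argv[1], sys.argv[2]
    for entry in lines_for_pid(log_path, pid):
        print(entry)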

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: orentas Oz Rentas (Inactive)
            Votes: 0
            Watchers: 6
