Description
Even after adding the patch from LU-1720, we are still seeing messages like:
Lustre: 18271:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
At this site, we haven't added any other patches to 1.8.8. What do these messages mean? Is it possible that some of the other patches (like LU-1438) could fix these?
The customer hasn't noticed any functional issues, but of course that doesn't mean there aren't any. Quotas >4TB work on this system.
Attachments
- kern.log-mds
- 1.90 MB
- kern.log-mds-Aug30
- 97 kB
- kern.log-oss
- 0.2 kB
- kern.log-oss-Aug30
- 305 kB
Activity
Yes, when approaching limit, write performance will be lower because quota unit is very small when approaching limit, and there will be much more quota acquire/release operations between ost & mdt, even worse, the write operation has to be blocked on the quota acquire operation sometimes. We didn't measure the performance in such unusual case.
Is it possible that this could cause performance issues when the UID/GID is approaching the quota limit? Do you know if there have ever been performance tests run for this case?
Thanks.
Now I think it isn't a problem after I checked the large log for all processes (the small one I checked before is only for single process).
When some uid/gid is approaching quota limit, the qunit size will be shrinked to the minimal size (1MB), and each quota acquire can only acquire 1MB limit, if there are many OST threads writing to the same uid/gid at this time, the acquired limit will be consumed very soon, so some writing threads have to recycle several times in the quota_chk_acq_common() until get required quota limit or fail with -EDQUOT at the end.
Thank you, Kit. I checked the OSS log:
00040000:00000400:6:1364224183.636169:0:1558:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0) 00040000:04000000:6:1364224183.636179:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 10 00040000:04000000:6:1364224183.636181:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs 00040000:04000000:6:1364224183.636190:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768 00040000:04000000:6:1364224183.636193:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs 00002000:04000000:6:1364224183.636199:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3 00040000:04000000:6:1364224183.636205:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs 00040000:04000000:6:1364224183.636209:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0. 00040000:04000000:6:1364224183.636214:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 11 00040000:04000000:6:1364224183.636217:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs 00040000:04000000:6:1364224183.636224:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768 00040000:04000000:6:1364224183.636227:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs 00002000:04000000:6:1364224183.636231:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3 00040000:04000000:6:1364224183.636239:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs 00040000:04000000:6:1364224183.636243:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.
Looks there is pending acquire request (record: 2097152), and local limit will be able to satisfy current write request when the pending acquire done, so OST should wait for the pending reqeust, but it actually didn't wait, so quota_chk_acq_common() recycled many times until the pending request done.
I didn't see why OST didn't wait on pending acquire so far, maybe there is something wrong with the lqs_blk_rec calculation? I'll look into it further.
Hi Niu,
I've uploaded the logs here:
http://eu.ddn.com:8080/lustre/LU-2289.tar.bz2
It's about a 3 second window on each server, but it's still about 20GB uncompressed. It looks like for the message at 1364224183, it tries 19 times before it finally acquires it. Is there anything in the MDT logs that would indicate why it took so long? I didn't see anything in particular, but it's tough to read.
Hi, Kit
If you have debug log, you can see if the quota is acquired eventually by checking the pid. (find the pid in the message of "still haven't managed ...", and check if this thread is constantly printing such message or it moves on eventually).
When you attach the log, I suggest you truncate out the useful part, otherwise, it could be too large. The useful part should be the messages nearby "still haven't managed ..." OST and the corresponding messages on MDS.
I didn't find anything abnormal in the output of "lfs quota -u -v", thank you for providing this.
Here are the quotas with -v:
[root@Jet:fe6 ~]# lfs quota -v -g hfipprd /lfs1
Disk quotas for group hfipprd (gid 10074):
Filesystem kbytes quota limit grace files quota limit
grace
/lfs1 0 31457280 33030144 - 0 0
0 -
lfs1-MDT0000_UUID
0 - 65536 - 0 - 0
-
lfs1-OST0000_UUID
0 - 65536 - - - -
-
lfs1-OST0001_UUID
0 - 65536 - - - -
-
lfs1-OST0002_UUID
0 - 65536 - - - -
-
lfs1-OST0003_UUID
0 - 65536 - - - -
-
lfs1-OST0004_UUID
0 - 65536 - - - -
-
lfs1-OST0005_UUID
0 - 65536 - - - -
-
lfs1-OST0006_UUID
0 - 65536 - - - -
-
lfs1-OST0007_UUID
0 - 65536 - - - -
-
lfs1-OST0008_UUID
0 - 65536 - - - -
-
lfs1-OST0009_UUID
0 - 65536 - - - -
-
lfs1-OST000a_UUID
0 - 65536 - - - -
-
lfs1-OST000b_UUID
0 - 65536 - - - -
-
lfs1-OST000c_UUID
0 - 65536 - - - -
-
lfs1-OST000d_UUID
0 - 65536 - - - -
-
lfs1-OST000e_UUID
0 - 65536 - - - -
-
lfs1-OST000f_UUID
0 - 65536 - - - -
-
lfs1-OST0010_UUID
0 - 65536 - - - -
-
lfs1-OST0011_UUID
0 - 65536 - - - -
-
lfs1-OST0012_UUID
0 - 65536 - - - -
-
lfs1-OST0013_UUID
0 - 65536 - - - -
-
lfs1-OST0014_UUID
0 - 65536 - - - -
-
lfs1-OST0015_UUID
0 - 65536 - - - -
-
lfs1-OST0016_UUID
0 - 65536 - - - -
-
lfs1-OST0017_UUID
0 - 65536 - - - -
-
lfs1-OST0018_UUID
0 - 65536 - - - -
-
lfs1-OST0019_UUID
0 - 65536 - - - -
-
lfs1-OST001a_UUID
0 - 65536 - - - -
-
lfs1-OST001b_UUID
0 - 65536 - - - -
-
lfs1-OST001c_UUID
0 - 65536 - - - -
-
lfs1-OST001d_UUID
0 - 65536 - - - -
-
lfs1-OST001e_UUID
0 - 65536 - - - -
-
lfs1-OST001f_UUID
0 - 65536 - - - -
-
lfs1-OST0020_UUID
0 - 65536 - - - -
-
lfs1-OST0021_UUID
0 - 65536 - - - -
-
lfs1-OST0022_UUID
0 - 65536 - - - -
-
lfs1-OST0023_UUID
0 - 65536 - - - -
-
lfs1-OST0024_UUID
0 - 65536 - - - -
-
lfs1-OST0025_UUID
0 - 65536 - - - -
-
lfs1-OST0026_UUID
0 - 65536 - - - -
-
lfs1-OST0027_UUID
0 - 65536 - - - -
-
lfs1-OST0028_UUID
0 - 65536 - - - -
-
lfs1-OST0029_UUID
0 - 65536 - - - -
-
lfs1-OST002a_UUID
0 - 65536 - - - -
-
lfs1-OST002b_UUID
0 - 65536 - - - -
-
lfs1-OST002c_UUID
0 - 65536 - - - -
-
lfs1-OST002d_UUID
0 - 65536 - - - -
-
lfs1-OST002e_UUID
0 - 65536 - - - -
-
lfs1-OST002f_UUID
0 - 65536 - - - -
-
lfs1-OST0030_UUID
0 - 65536 - - - -
-
lfs1-OST0031_UUID
0 - 65536 - - - -
-
lfs1-OST0032_UUID
0 - 65536 - - - -
-
lfs1-OST0033_UUID
0 - 65536 - - - -
-
lfs1-OST0034_UUID
0 - 65536 - - - -
-
lfs1-OST0035_UUID
0 - 65536 - - - -
-
lfs1-OST0036_UUID
0 - 65536 - - - -
-
lfs1-OST0037_UUID
0 - 65536 - - - -
-
lfs1-OST0038_UUID
0 - 65536 - - - -
-
lfs1-OST0039_UUID
0 - 65536 - - - -
-
lfs1-OST003a_UUID
0 - 65536 - - - -
-
lfs1-OST003b_UUID
0 - 65536 - - - -
-
lfs1-OST003c_UUID
0 - 65536 - - - -
-
lfs1-OST003d_UUID
0 - 65536 - - - -
-
lfs1-OST003e_UUID
0 - 65536 - - - -
-
lfs1-OST003f_UUID
0 - 65536 - - - -
-
lfs1-OST0040_UUID
0 - 65536 - - - -
-
lfs1-OST0041_UUID
0 - 65536 - - - -
-
lfs1-OST0042_UUID
0 - 65536 - - - -
-
lfs1-OST0043_UUID
0 - 65536 - - - -
-
lfs1-OST0044_UUID
0 - 65536 - - - -
-
lfs1-OST0045_UUID
0 - 65536 - - - -
-
lfs1-OST0046_UUID
0 - 65536 - - - -
-
lfs1-OST0047_UUID
0 - 65536 - - - -
-
[root@Jet:fe6 ~]# lfs quota -v -g hfip-gfdl /lfs1
Disk quotas for group hfip-gfdl (gid 19103):
Filesystem kbytes quota limit grace files quota limit
grace
/lfs1 19179828108 20971520000 22020096000 - 3567465
0 0 -
lfs1-MDT0000_UUID
1408244 - 1572864 - 3567465 - 0
-
lfs1-OST0000_UUID
266307944 - 266469376 - - -
- -
lfs1-OST0001_UUID
264490536 - 264634368 - - - - -
lfs1-OST0002_UUID
252475224 - 252575744 - - - - -
lfs1-OST0003_UUID
267990876 - 268173312 - - - - -
lfs1-OST0004_UUID
246162144 - 246284288 - - - - -
lfs1-OST0005_UUID
267940220 - 268042240 - - - - -
lfs1-OST0006_UUID
290532840 - 290717696 - - - - -
lfs1-OST0007_UUID
283466396 - 283639808 - - - - -
lfs1-OST0008_UUID
272318500 - 272498688 - - - - -
lfs1-OST0009_UUID
250366252 - 250478592 - - - - -
lfs1-OST000a_UUID
268765900 - 268959744 - - - - -
lfs1-OST000b_UUID
241840752 - 241958912 - - - - -
lfs1-OST000c_UUID
279475784 - 279576576 - - - - -
lfs1-OST000d_UUID
295138616 - 295305216 - - - - -
lfs1-OST000e_UUID
245769816 - 245891072 - - - - -
lfs1-OST000f_UUID
258449640 - 258605056 - - - - -
lfs1-OST0010_UUID
228665128 - 228851712 - - - - -
lfs1-OST0011_UUID
278386580 - 278528000 - - - - -
lfs1-OST0012_UUID
277253652 - 277348352 - - - - -
lfs1-OST0013_UUID
185330656 - 185466880 - - - - -
lfs1-OST0014_UUID
281571296 - 281673728 - - - - -
lfs1-OST0015_UUID
246132636 - 246284288 - - - - -
lfs1-OST0016_UUID
274774304 - 274857984 - - - - -
lfs1-OST0017_UUID
279671348 - 279838720 - - - - -
lfs1-OST0018_UUID
278511144 - 278659072 - - - - -
lfs1-OST0019_UUID
279721016 - 279838720 - - - - -
lfs1-OST001a_UUID
286078396 - 286261248 - - - - -
lfs1-OST001b_UUID
284810104 - 284950528 - - - - -
lfs1-OST001c_UUID
266730624 - 266862592 - - - - -
lfs1-OST001d_UUID
267699528 - 267780096 - - - - -
lfs1-OST001e_UUID
271926964 - 272105472 - - - - -
lfs1-OST001f_UUID
279551644 - 279707648 - - - - -
lfs1-OST0020_UUID
262360628 - 262537216 - - - - -
lfs1-OST0021_UUID
286415060 - 286523392 - - - - -
lfs1-OST0022_UUID
227815636 - 227934208 - - - - -
lfs1-OST0023_UUID
271352884 - 271450112 - - - - -
lfs1-OST0024_UUID
272460280 - 272629760 - - - - -
lfs1-OST0025_UUID
198046512 - 198180864 - - - - -
lfs1-OST0026_UUID
289555288 - 289669120 - - - - -
lfs1-OST0027_UUID
258436324 - 258605056 - - - - -
lfs1-OST0028_UUID
270941900 - 271056896 - - - - -
lfs1-OST0029_UUID
251377320 - 251527168 - - - - -
lfs1-OST002a_UUID
282350136 - 282460160 - - - - -
lfs1-OST002b_UUID
280608276 - 280756224 - - - - -
lfs1-OST002c_UUID
284707076 - 284819456 - - - - -
lfs1-OST002d_UUID
283481900 - 283639808 - - - - -
lfs1-OST002e_UUID
283875928 - 284033024 - - - - -
lfs1-OST002f_UUID
241472472 - 241565696 - - - - -
lfs1-OST0030_UUID
265680488 - 265814016 - - - - -
lfs1-OST0031_UUID
273082788 - 273154048 - - - - -
lfs1-OST0032_UUID
227469276 - 227540992 - - - - -
lfs1-OST0033_UUID
301283952 - 301465600 - - - - -
lfs1-OST0034_UUID
282200664 - 282329088 - - - - -
lfs1-OST0035_UUID
236723324 - 236847104 - - - - -
lfs1-OST0036_UUID
244762536 - 244842496 - - - - -
lfs1-OST0037_UUID
290396060 - 290586624 - - - - -
lfs1-OST0038_UUID
247838916 - 247988224 - - - - -
lfs1-OST0039_UUID
289519172 - 289669120 - - - - -
lfs1-OST003a_UUID
237240432 - 237371392 - - - - -
lfs1-OST003b_UUID
281243232 - 281411584 - - - - -
lfs1-OST003c_UUID
255128092 - 255197184 - - - - -
lfs1-OST003d_UUID
267801692 - 267911168 - - - - -
lfs1-OST003e_UUID
268842936 - 268959744 - - - - -
lfs1-OST003f_UUID
259531064 - 259653632 - - - - -
lfs1-OST0040_UUID
270812788 - 270925824 - - - - -
lfs1-OST0041_UUID
256516360 - 256638976 - - - - -
lfs1-OST0042_UUID
289071672 - 289144832 - - - - -
lfs1-OST0043_UUID
290809280 - 290979840 - - - - -
lfs1-OST0044_UUID
265673448 - 265814016 - - - - -
lfs1-OST0045_UUID
288587388 - 288751616 - - - - -
lfs1-OST0046_UUID
291939312 - 292028416 - - - - -
lfs1-OST0047_UUID
232700912 - 232783872 - - - - -
[root@Jet:fe6 ~]# lfs quota -v -g rthfip-gfdl /lfs1
Disk quotas for group rthfip-gfdl (gid 19104):
Filesystem kbytes quota limit grace files quota limit
grace
/lfs1 0 31457280 33030144 - 0 0
0 -
lfs1-MDT0000_UUID
0 - 65536 - 0 - 0
-
lfs1-OST0000_UUID
0 - 65536 - - - -
-
lfs1-OST0001_UUID
0 - 65536 - - - -
-
lfs1-OST0002_UUID
0 - 65536 - - - -
-
lfs1-OST0003_UUID
0 - 65536 - - - -
-
lfs1-OST0004_UUID
0 - 65536 - - - -
-
lfs1-OST0005_UUID
0 - 65536 - - - -
-
lfs1-OST0006_UUID
0 - 65536 - - - -
-
lfs1-OST0007_UUID
0 - 65536 - - - -
-
lfs1-OST0008_UUID
0 - 65536 - - - -
-
lfs1-OST0009_UUID
0 - 65536 - - - -
-
lfs1-OST000a_UUID
0 - 65536 - - - -
-
lfs1-OST000b_UUID
0 - 65536 - - - -
-
lfs1-OST000c_UUID
0 - 65536 - - - -
-
lfs1-OST000d_UUID
0 - 65536 - - - -
-
lfs1-OST000e_UUID
0 - 65536 - - - -
-
lfs1-OST000f_UUID
0 - 65536 - - - -
-
lfs1-OST0010_UUID
0 - 65536 - - - -
-
lfs1-OST0011_UUID
0 - 65536 - - - -
-
lfs1-OST0012_UUID
0 - 65536 - - - -
-
lfs1-OST0013_UUID
0 - 65536 - - - -
-
lfs1-OST0014_UUID
0 - 65536 - - - -
-
lfs1-OST0015_UUID
0 - 65536 - - - -
-
lfs1-OST0016_UUID
0 - 65536 - - - -
-
lfs1-OST0017_UUID
0 - 65536 - - - -
-
lfs1-OST0018_UUID
0 - 65536 - - - -
-
lfs1-OST0019_UUID
0 - 65536 - - - -
-
lfs1-OST001a_UUID
0 - 65536 - - - -
-
lfs1-OST001b_UUID
0 - 65536 - - - -
-
lfs1-OST001c_UUID
0 - 65536 - - - -
-
lfs1-OST001d_UUID
0 - 65536 - - - -
-
lfs1-OST001e_UUID
0 - 65536 - - - -
-
lfs1-OST001f_UUID
0 - 65536 - - - -
-
lfs1-OST0020_UUID
0 - 65536 - - - -
-
lfs1-OST0021_UUID
0 - 65536 - - - -
-
lfs1-OST0022_UUID
0 - 65536 - - - -
-
lfs1-OST0023_UUID
0 - 65536 - - - -
-
lfs1-OST0024_UUID
0 - 65536 - - - -
-
lfs1-OST0025_UUID
0 - 65536 - - - -
-
lfs1-OST0026_UUID
0 - 65536 - - - -
-
lfs1-OST0027_UUID
0 - 65536 - - - -
-
lfs1-OST0028_UUID
0 - 65536 - - - -
-
lfs1-OST0029_UUID
0 - 65536 - - - -
-
lfs1-OST002a_UUID
0 - 65536 - - - -
-
lfs1-OST002b_UUID
0 - 65536 - - - -
-
lfs1-OST002c_UUID
0 - 65536 - - - -
-
lfs1-OST002d_UUID
0 - 65536 - - - -
-
lfs1-OST002e_UUID
0 - 65536 - - - -
-
lfs1-OST002f_UUID
0 - 65536 - - - -
-
lfs1-OST0030_UUID
0 - 65536 - - - -
-
lfs1-OST0031_UUID
0 - 65536 - - - -
-
lfs1-OST0032_UUID
0 - 65536 - - - -
-
lfs1-OST0033_UUID
0 - 65536 - - - -
-
lfs1-OST0034_UUID
0 - 65536 - - - -
-
lfs1-OST0035_UUID
0 - 65536 - - - -
-
lfs1-OST0036_UUID
0 - 65536 - - - -
-
lfs1-OST0037_UUID
0 - 65536 - - - -
-
lfs1-OST0038_UUID
0 - 65536 - - - -
-
lfs1-OST0039_UUID
0 - 65536 - - - -
-
lfs1-OST003a_UUID
0 - 65536 - - - -
-
lfs1-OST003b_UUID
0 - 65536 - - - -
-
lfs1-OST003c_UUID
0 - 65536 - - - -
-
lfs1-OST003d_UUID
0 - 65536 - - - -
-
lfs1-OST003e_UUID
0 - 65536 - - - -
-
lfs1-OST003f_UUID
0 - 65536 - - - -
-
lfs1-OST0040_UUID
0 - 65536 - - - -
-
lfs1-OST0041_UUID
0 - 65536 - - - -
-
lfs1-OST0042_UUID
0 - 65536 - - - -
-
lfs1-OST0043_UUID
0 - 65536 - - - -
-
lfs1-OST0044_UUID
0 - 65536 - - - -
-
lfs1-OST0045_UUID
0 - 65536 - - - -
-
lfs1-OST0046_UUID
0 - 65536 - - - -
-
lfs1-OST0047_UUID
0 - 65536 - - - -
-
Hi Niu,
I was able to run with quota debug enabled. The debug logs are very large, but I think we were able to see which groups were having the logs messages printed. There doesn't appear to be anything strange about them:
Disk quotas for group hfip-gfdl (gid 19103):
Filesystem kbytes quota limit grace files quota
limit grace
/lfs1 18016033720 20971520000 22020096000 - 3544247 0 0 -
Would it be possible to see whether or not the quota space was eventually acquired? Do you think I would be able to trace it by pid? Or do different threads operate on the same quota task? Is there a transaction ID or anything? I'll look at the code too to see if I can tell.
Also I could make the logs available. They are each about ~300MB compressed, ~11GB uncompressed.
Thanks
If the quota acquire constantly fail like this, the io will not finished.
Could you run "lfs quota -v uid/gid" to see what's the quota limits of the uid/gid which is causing the problem? Thanks.
Closing old bug.