[LU-2289] still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0) Created: 06/Nov/12  Updated: 29/Sep/15  Resolved: 29/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oz Rentas Assignee: Niu Yawei (Inactive)
Resolution: Not a Bug Votes: 0
Labels: ptr
Environment:

Lustre 1.8.8 + LU-1720


Attachments: File kern.log-mds     File kern.log-mds-Aug30     Text File kern.log-oss     File kern.log-oss-Aug30    
Severity: 3
Rank (Obsolete): 5479

 Description   

Even after adding the patch from LU-1720, we are still seeing messages like:
Lustre: 18271:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

At this site, we haven't added any other patches to 1.8.8. What do these messages mean? Is it possible that some of the other patches (like LU-1438) could fix these?

The customer hasn't noticed any functional issues, but of course that doesn't mean there aren't any. Quotas >4TB work on this system.



 Comments   
Comment by Peter Jones [ 06/Nov/12 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 06/Nov/12 ]

I see a lot of the following errors in the MDS log:

Jul 12 20:34:19 lfs-mds-2-1 kernel: Lustre: 7972:0:(lproc_quota.c:453:lprocfs_quota_wr_type()) lfs2-MDT0000: quotaon failed because quota files don't exist, please run quotacheck firstly

It seems something is wrong on the MDT, so quota was not turned on for the MDT, but it is turned on for the OSTs. The OSTs therefore can't acquire quota from the MDT and report the message "still haven't managed to acquire quota space from the quota master after 10 retries".

Could you check which quota configuration was used (ug3 for both MDT & OST?) and whether rerunning "lfs quotacheck" fixes the problem? If not, I think we should collect a debug log of the quotacheck to see why the admin or local quota file wasn't created. Thanks.
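
A minimal sketch of these checks, assuming a standard Lustre 1.8 layout (the /proc paths and the /mnt/lfs2 mount point are illustrative and may differ on this system):

# On the MDS: check the configured quota type (expecting ug3 for >4TB limits).
cat /proc/fs/lustre/mds/lfs2-MDT0000/quota_type
# On an OSS: the same check for the OSTs.
cat /proc/fs/lustre/obdfilter/lfs2-OST*/quota_type

# Re-run quotacheck with all targets up; enable the quota debug mask on the
# servers beforehand and dump their debug buffers afterwards.
lctl set_param debug=+quota         # on MDS and OSS
lctl clear                          # on MDS and OSS
lfs quotacheck -ug /mnt/lfs2        # on a client
lctl dk /tmp/quotacheck-debug.log   # on MDS and OSS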

Comment by Kit Westneat (Inactive) [ 07/Nov/12 ]

Hi Niu,

Sorry, I didn't truncate the logs to the relevant dates; those are actually old messages. I will re-upload the logs with only the relevant parts.

In preparing these logs, I noticed that when the OSS was displaying the message, the MDT was having problems connecting:
Aug 30 22:28:11 lfs-mds-2-1 kernel: Lustre: 9067:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1411737505069379 sent from lfs2-OST0002-osc to NID 10.179.16.124@o2ib 0s ago has failed due to network error (15s prior to deadline).

This actually looks like:
http://jira.whamcloud.com/browse/LU-1809

I will try to get more information from the customer about when they are seeing the message and if they are running with the workaround suggested in LU-1809.

Thanks,
Kit

Comment by Kit Westneat (Inactive) [ 07/Nov/12 ]

logs only for Aug 30

Comment by Niu Yawei (Inactive) [ 07/Nov/12 ]

Thank you, Kit.

Yes, the new log shows quite a lot of connection errors; I'm not sure the filesystem was healthy at that time.

The most likely reason for an OST constantly showing "still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)" is some kind of inconsistency between the master (MDT) and the slave (OST):

  • Quota is already disabled on the master (MDT), but not disabled on the slave (OST);
  • The quota limit for a certain uid/gid has been cleared on the MDT (master), but not cleared on the slave (OST);

If quota is supposed to be disabled but writing files still triggers the "still haven't... (err=0, rc=0)" message on the OSS, you can rerun "lfs quotaoff" when all OSTs are up to make sure quota is disabled on each OST.
If some uid/gid is supposed to have a 0 limit but writing to files owned by that uid/gid triggers the message on the OSS, you can rerun "lfs setquota ..." when all OSTs are up to make sure the limit is cleared on each OST.

That's what I can think of so far.
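
A minimal sketch of these two remediations, run from a client with all OSTs up (the /mnt/lfs1 mount point and group name are placeholders):

# Case 1: quota is supposed to be disabled - rerun quotaoff so every OST sees it.
lfs quotaoff -ug /mnt/lfs1

# Case 2: a gid is supposed to have no limit - clear the limits again so every
# OST picks up the zero values.
lfs setquota -g somegroup -b 0 -B 0 -i 0 -I 0 /mnt/lfs1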

Comment by Kit Westneat (Inactive) [ 05/Dec/12 ]

We've seen this issue again without any network issues. Is there any way to debug what is going on? The error only appears intermittently, so I think it's probably some kind of sync issue. Is there a way to tell which uid/gid is causing problems, or is it better to just redo the lfs quotacheck?

Comment by Kit Westneat (Inactive) [ 12/Mar/13 ]

Hi Niu,

The customer ran lfs quotaoff, quotaon, and quotacheck. The problem reappeared fairly quickly on one filesystem, and after a few weeks on the other filesystem.

The customer asks:
What is the impact of NOT acquiring quota space?

I noticed that the quota debug level prints information about the UID/GID that is causing problems. I will try to get quota debug logs during the event. Is there anything else we can get, or any other ideas?

Thanks,
Kit
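
A sketch of how the quota debug log mentioned above might be captured on the OSS around the event, assuming standard lctl usage on 1.8 (buffer size and file name are placeholders):

lctl set_param debug_mb=256       # enlarge the kernel debug buffer (optional)
lctl set_param debug=+quota       # add quota messages to the debug mask
lctl clear                        # start from an empty buffer
# ... wait for the "still haven't managed ..." console message to appear ...
lctl dk /tmp/oss-quota-debug.log  # dump the buffer for analysis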

Comment by Niu Yawei (Inactive) [ 12/Mar/13 ]

If the quota acquire constantly fails like this, the I/O will not finish.

Could you run "lfs quota -v" for the uid/gid that is causing the problem, to see what its quota limits are? Thanks.

Comment by Kit Westneat (Inactive) [ 01/Apr/13 ]

Hi Niu,

I was able to run with quota debug enabled. The debug logs are very large, but I think we were able to see which groups were having the log messages printed. There doesn't appear to be anything strange about them:

Disk quotas for group hfip-gfdl (gid 19103):
Filesystem       kbytes        quota        limit  grace    files  quota  limit  grace
/lfs1       18016033720  20971520000  22020096000      -  3544247      0      0      -

Would it be possible to see whether or not the quota space was eventually acquired? Do you think I would be able to trace it by pid? Or do different threads operate on the same quota task? Is there a transaction ID or anything? I'll look at the code too to see if I can tell.

Also I could make the logs available. They are each about ~300MB compressed, ~11GB uncompressed.

Thanks

Comment by Kit Westneat (Inactive) [ 01/Apr/13 ]

Here are the quotas with -v:

[root@Jet:fe6 ~]# lfs quota -v -g hfipprd /lfs1
Disk quotas for group hfipprd (gid 10074):
Filesystem         kbytes     quota     limit  grace  files  quota  limit  grace
/lfs1                   0  31457280  33030144      -      0      0      0      -
lfs1-MDT0000_UUID       0         -     65536      -      0      -      0      -
lfs1-OST0000_UUID through lfs1-OST0047_UUID (all 72 OSTs report the same values):
                        0         -     65536      -      -      -      -      -

[root@Jet:fe6 ~]# lfs quota -v -g hfip-gfdl /lfs1
Disk quotas for group hfip-gfdl (gid 19103):
Filesystem             kbytes        quota        limit  grace    files  quota  limit  grace
/lfs1             19179828108  20971520000  22020096000      -  3567465      0      0      -
lfs1-MDT0000_UUID     1408244            -      1572864      -  3567465      -      0      -
lfs1-OST0000_UUID   266307944  -  266469376  -  -  -  -
lfs1-OST0001_UUID   264490536  -  264634368  -  -  -  -
lfs1-OST0002_UUID   252475224  -  252575744  -  -  -  -
lfs1-OST0003_UUID   267990876  -  268173312  -  -  -  -
lfs1-OST0004_UUID   246162144  -  246284288  -  -  -  -
lfs1-OST0005_UUID   267940220  -  268042240  -  -  -  -
lfs1-OST0006_UUID   290532840  -  290717696  -  -  -  -
lfs1-OST0007_UUID   283466396  -  283639808  -  -  -  -
lfs1-OST0008_UUID   272318500  -  272498688  -  -  -  -
lfs1-OST0009_UUID   250366252  -  250478592  -  -  -  -
lfs1-OST000a_UUID   268765900  -  268959744  -  -  -  -
lfs1-OST000b_UUID   241840752  -  241958912  -  -  -  -
lfs1-OST000c_UUID   279475784  -  279576576  -  -  -  -
lfs1-OST000d_UUID   295138616  -  295305216  -  -  -  -
lfs1-OST000e_UUID   245769816  -  245891072  -  -  -  -
lfs1-OST000f_UUID   258449640  -  258605056  -  -  -  -
lfs1-OST0010_UUID   228665128  -  228851712  -  -  -  -
lfs1-OST0011_UUID   278386580  -  278528000  -  -  -  -
lfs1-OST0012_UUID   277253652  -  277348352  -  -  -  -
lfs1-OST0013_UUID   185330656  -  185466880  -  -  -  -
lfs1-OST0014_UUID   281571296  -  281673728  -  -  -  -
lfs1-OST0015_UUID   246132636  -  246284288  -  -  -  -
lfs1-OST0016_UUID   274774304  -  274857984  -  -  -  -
lfs1-OST0017_UUID   279671348  -  279838720  -  -  -  -
lfs1-OST0018_UUID   278511144  -  278659072  -  -  -  -
lfs1-OST0019_UUID   279721016  -  279838720  -  -  -  -
lfs1-OST001a_UUID   286078396  -  286261248  -  -  -  -
lfs1-OST001b_UUID   284810104  -  284950528  -  -  -  -
lfs1-OST001c_UUID   266730624  -  266862592  -  -  -  -
lfs1-OST001d_UUID   267699528  -  267780096  -  -  -  -
lfs1-OST001e_UUID   271926964  -  272105472  -  -  -  -
lfs1-OST001f_UUID   279551644  -  279707648  -  -  -  -
lfs1-OST0020_UUID   262360628  -  262537216  -  -  -  -
lfs1-OST0021_UUID   286415060  -  286523392  -  -  -  -
lfs1-OST0022_UUID   227815636  -  227934208  -  -  -  -
lfs1-OST0023_UUID   271352884  -  271450112  -  -  -  -
lfs1-OST0024_UUID   272460280  -  272629760  -  -  -  -
lfs1-OST0025_UUID   198046512  -  198180864  -  -  -  -
lfs1-OST0026_UUID   289555288  -  289669120  -  -  -  -
lfs1-OST0027_UUID   258436324  -  258605056  -  -  -  -
lfs1-OST0028_UUID   270941900  -  271056896  -  -  -  -
lfs1-OST0029_UUID   251377320  -  251527168  -  -  -  -
lfs1-OST002a_UUID   282350136  -  282460160  -  -  -  -
lfs1-OST002b_UUID   280608276  -  280756224  -  -  -  -
lfs1-OST002c_UUID   284707076  -  284819456  -  -  -  -
lfs1-OST002d_UUID   283481900  -  283639808  -  -  -  -
lfs1-OST002e_UUID   283875928  -  284033024  -  -  -  -
lfs1-OST002f_UUID   241472472  -  241565696  -  -  -  -
lfs1-OST0030_UUID   265680488  -  265814016  -  -  -  -
lfs1-OST0031_UUID   273082788  -  273154048  -  -  -  -
lfs1-OST0032_UUID   227469276  -  227540992  -  -  -  -
lfs1-OST0033_UUID   301283952  -  301465600  -  -  -  -
lfs1-OST0034_UUID   282200664  -  282329088  -  -  -  -
lfs1-OST0035_UUID   236723324  -  236847104  -  -  -  -
lfs1-OST0036_UUID   244762536  -  244842496  -  -  -  -
lfs1-OST0037_UUID   290396060  -  290586624  -  -  -  -
lfs1-OST0038_UUID   247838916  -  247988224  -  -  -  -
lfs1-OST0039_UUID   289519172  -  289669120  -  -  -  -
lfs1-OST003a_UUID   237240432  -  237371392  -  -  -  -
lfs1-OST003b_UUID   281243232  -  281411584  -  -  -  -
lfs1-OST003c_UUID   255128092  -  255197184  -  -  -  -
lfs1-OST003d_UUID   267801692  -  267911168  -  -  -  -
lfs1-OST003e_UUID   268842936  -  268959744  -  -  -  -
lfs1-OST003f_UUID   259531064  -  259653632  -  -  -  -
lfs1-OST0040_UUID   270812788  -  270925824  -  -  -  -
lfs1-OST0041_UUID   256516360  -  256638976  -  -  -  -
lfs1-OST0042_UUID   289071672  -  289144832  -  -  -  -
lfs1-OST0043_UUID   290809280  -  290979840  -  -  -  -
lfs1-OST0044_UUID   265673448  -  265814016  -  -  -  -
lfs1-OST0045_UUID   288587388  -  288751616  -  -  -  -
lfs1-OST0046_UUID   291939312  -  292028416  -  -  -  -
lfs1-OST0047_UUID   232700912  -  232783872  -  -  -  -

[root@Jet:fe6 ~]# lfs quota -v -g rthfip-gfdl /lfs1
Disk quotas for group rthfip-gfdl (gid 19104):
Filesystem         kbytes     quota     limit  grace  files  quota  limit  grace
/lfs1                   0  31457280  33030144      -      0      0      0      -
lfs1-MDT0000_UUID       0         -     65536      -      0      -      0      -
lfs1-OST0000_UUID through lfs1-OST0047_UUID (all 72 OSTs report the same values):
                        0         -     65536      -      -      -      -      -

Comment by Niu Yawei (Inactive) [ 02/Apr/13 ]

Hi, Kit

If you have the debug log, you can see whether the quota is eventually acquired by checking the pid: find the pid in the "still haven't managed ..." message, then check whether that thread keeps printing the message or eventually moves on.
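
For example, a sketch of following one thread through the dumped debug log; the pid is the sixth colon-separated field of each line (1558 in the excerpt quoted later in this ticket), and the file name is a placeholder:

# Print everything logged by pid 1558, in order, to see whether it keeps hitting
# "still haven't managed ..." or eventually moves on.
awk -F: '$6 == 1558' oss-quota-debug.log | less

# Or count how many recheck cycles that thread went through.
grep ':1558:' oss-quota-debug.log | grep -c 'recheck quota with rc'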

When you attach the log, I suggest you extract just the useful part; otherwise it could be too large. The useful part is the messages near "still haven't managed ..." on the OST and the corresponding messages on the MDS.

I didn't find anything abnormal in the "lfs quota -v" output; thank you for providing it.

Comment by Kit Westneat (Inactive) [ 03/Apr/13 ]

Hi Niu,

I've uploaded the logs here:
http://eu.ddn.com:8080/lustre/LU-2289.tar.bz2

It's about a 3-second window on each server, but it's still about 20GB uncompressed. It looks like, for the message at 1364224183, the thread tries 19 times before it finally acquires the quota space. Is there anything in the MDT logs that would indicate why it took so long? I didn't see anything in particular, but it's tough to read.

Comment by Niu Yawei (Inactive) [ 07/Apr/13 ]

Thank you, Kit. I checked the OSS log:

00040000:00000400:6:1364224183.636169:0:1558:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
00040000:04000000:6:1364224183.636179:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 10
00040000:04000000:6:1364224183.636181:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
00040000:04000000:6:1364224183.636190:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768
00040000:04000000:6:1364224183.636193:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
00002000:04000000:6:1364224183.636199:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3
00040000:04000000:6:1364224183.636205:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
00040000:04000000:6:1364224183.636209:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.
00040000:04000000:6:1364224183.636214:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 11
00040000:04000000:6:1364224183.636217:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
00040000:04000000:6:1364224183.636224:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768
00040000:04000000:6:1364224183.636227:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
00002000:04000000:6:1364224183.636231:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3
00040000:04000000:6:1364224183.636239:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
00040000:04000000:6:1364224183.636243:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.

It looks like there is a pending acquire request (record: 2097152), and the local limit would be able to satisfy the current write request once the pending acquire completes, so the OST should wait for the pending request. It didn't actually wait, however, so quota_chk_acq_common() recycled many times until the pending request completed.

I haven't yet seen why the OST didn't wait on the pending acquire; maybe there is something wrong with the lqs_blk_rec calculation. I'll look into it further.

Comment by Niu Yawei (Inactive) [ 08/Apr/13 ]

Now I think it isn't a problem, after checking the large log covering all processes (the small one I checked before was only for a single process).

When some uid/gid is approaching its quota limit, the qunit size is shrunk to the minimal size (1MB), and each quota acquire can only obtain 1MB of limit. If many OST threads are writing to the same uid/gid at that time, the acquired limit is consumed very quickly, so some writing threads have to recycle several times in quota_chk_acq_common() until they get the required quota limit or finally fail with -EDQUOT.
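
A rough back-of-the-envelope check against the debug excerpt above (values in bytes, taken directly from the log):

# Near the limit each acquire grants at most bunit = 1048576 (1 MB), while the
# writes already pending for this gid on the OST total lqs_bwrite_pending = 6340608.
echo $(( 6340608 / 1048576 ))   # => 6: at least six 1 MB acquires are needed, so
                                # concurrent writer threads recycle in
                                # quota_chk_acq_common() until enough is granted
                                # or they hit -EDQUOT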

Comment by Kit Westneat (Inactive) [ 17/May/13 ]

Is it possible that this could cause performance issues when the UID/GID is approaching the quota limit? Do you know if there have ever been performance tests run for this case?

Thanks.

Comment by Niu Yawei (Inactive) [ 20/May/13 ]

Yes, when approaching the limit, write performance will be lower: the quota unit is very small near the limit, so there are many more quota acquire/release operations between the OSTs and the MDT, and, even worse, write operations sometimes have to block on the quota acquire. We haven't measured the performance in such an unusual case.

Comment by Andreas Dilger [ 29/Sep/15 ]

Closing old bug.
