[LU-1438] quota_chk_acq_common() still haven't managed to acquire quota Created: 24/May/12  Updated: 22/Feb/13  Resolved: 18/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: Lustre 2.3.0, Lustre 2.1.4, Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

lustre-1.8.7-wc1, RHEL5.7 for servers, RHEL6.2 for clients


Attachments: File 20120524.tgz     Text File LU-1438-debug.patch    
Severity: 3
Rank (Obsolete): 4584

 Description   

We are seeing some quota-related problems. Quota is enabled on the filesystem; the customer changed the group ownership of many large files and then ran the lfs quotacheck command.
After that, even though the group had not exceeded its quota limit, they got "disk quota exceeded" messages.

On the OSS/MDS side, the following message has shown up ever since the group ownership was changed:
(quota_interface.c:473:quota_chk_acq_common()) still haven't managed to
acquire quota space from the quota master after 20 retries (err=0, rc=0)

It seems similar to LU-428.



 Comments   
Comment by Peter Jones [ 24/May/12 ]

Niu

Could you please comment on this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 25/May/12 ]

Yes, it looks similar to LU-428, but unfortunately we couldn't reproduce it in LU-428.

Ihara, if it's reproducible, could you try to collect the debug log on both the MDS and OST with D_QUOTA enabled? Thanks.
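
For reference, a minimal sketch of collecting that log on a server, assuming the 1.8-style /proc interface; the output path is only an example:

# echo +quota > /proc/sys/lnet/debug
  ... reproduce the problem ...
# lctl dk /tmp/lustre-quota-debug.log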

Comment by Niu Yawei (Inactive) [ 25/May/12 ]

Did the customer run 'lfs setquota' to clear the limit for some user/group when they saw this problem? Is it easy to reproduce? I found a race condition in the code which could lead to such a situation, but I'm not sure it's the real root cause. Anyway, I'll post a patch to fix that race.

Comment by Shuichi Ihara (Inactive) [ 25/May/12 ]

Niu,
Let me ask the customer; even if they clear the quota with "lfs setquota", they see the same issue.
Should we test this first without debug enabled?

Comment by Niu Yawei (Inactive) [ 25/May/12 ]

Hi, Ihara

The race I mentioned is hard to trigger, so we'd better get the debug log if it can be reproduced.

Comment by Niu Yawei (Inactive) [ 28/May/12 ]

The patch to fix race in quota_chk_acq_common(): http://review.whamcloud.com/#change,2927

Comment by Shuichi Ihara (Inactive) [ 31/May/12 ]

Niu,

I uploaded the debug information collected with the debug flags turned on.
Please check /uploads/LU-1438/20120529.tar.bz2 on the Whamcloud FTP site.

Thanks
Ihara

Comment by Niu Yawei (Inactive) [ 31/May/12 ]

Thank you, Ihara. There are lots of "still haven't managed to acquire quota..." entries in the messages; however, it seems the debug log was truncated, and I can't find any of them in the debug log.

Could you try the patch above to see if it resolves the customer's problem?

Comment by Shuichi Ihara (Inactive) [ 01/Jun/12 ]

Niu,
OK, let me try the patch to see if the problem is fixed.
Does it need to be applied to both servers and clients?

Comment by Niu Yawei (Inactive) [ 01/Jun/12 ]

This is a server patch; you can apply it on the servers only.

Comment by Peter Jones [ 01/Jun/12 ]

Is this fix needed on master too?

Comment by Niu Yawei (Inactive) [ 01/Jun/12 ]

Hi Peter, yes, I'll cook up a patch for master.

Comment by Niu Yawei (Inactive) [ 01/Jun/12 ]

patch for master: http://review.whamcloud.com/#change,2996

Comment by Shuichi Ihara (Inactive) [ 01/Jun/12 ]

Niu,
Thanks, we are applying the patches.
What exactly is the problem? In what situation do we enter the infinite retry loop?

Comment by Niu Yawei (Inactive) [ 01/Jun/12 ]

This is a race in the quota code, which can be triggered when a user executes 'lfs setquota' to clear the limit for some user/group on an active system (quota enabled and ongoing I/O).

Comment by Shuichi Ihara (Inactive) [ 01/Jun/12 ]

The customer did "lfs setquota", but it was not to clear. they changed quota limit (exceed original quota limit, then up the quota size) for some of user/group. After that, we have been seeing the these messages.
So, we didn't clear quota, but during the changing operation for quota size, the quota size is cleared once, then set new quota size?

Comment by Niu Yawei (Inactive) [ 03/Jun/12 ]

Changing the limit size will not cause clear -> set operations, so it may not be caused by this race. Anyway, let's see the result first.

Comment by Shuichi Ihara (Inactive) [ 04/Jun/12 ]

So, what exactly do you mean by using "lfs setquota" to clear? Is that setting all quota limits to 0 for a user/group?

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

Niu,

We applied the patches to only one OSS of all of them, but it didn't help; the messages are still showing up.

Please investigate further and let me know if you need more information.

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

I uploaded /var/log/messages from the patched OSS to the FTP site: /uploads/LU-1438/messages.nos14.tgz
Please have a look at it.

Comment by Niu Yawei (Inactive) [ 05/Jun/12 ]

Hi, Ihara

This time the warning message is a little bit different:

Jun 5 13:27:30 nos141i kernel: Lustre: 28834:0:(quota_interface.c:473:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 51 retries (err=0, rc=-5)

The return value is -5 (-EIO) now, which means some error happened while the OST was acquiring quota from the MDS. In the previous messages, rc was always 0.

Does this OSS work properly when quota is disabled?

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

Niu,

Yes, quota has been turned off on all servers, but these messages are still showing up on the OSS.

BTW, we only applied the patch to one OSS, not to the MDS yet.

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

Niu,

The following error message keeps showing up on the MDS:

LustreError: 9831:0:(ldlm_lib.c:2123:target_handle_dqacq_callback()) dqacq failed! (rc:-5)

I posted the MDS's /var/log/messages as /uploads/LU-1438/nmd05.tgz

Comment by Niu Yawei (Inactive) [ 05/Jun/12 ]

Hmm, I'm afraid that quota isn't actually turned off on the OSS. Could you check /proc/fs/lustre/obdfilter/$OSTNAME/quota_type to see if it's really turned off?

Applying the patch to one OSS is fine.

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

Here is some status: the client indicates "quota is disabled", and there is no "u" or "g" on the MDS or OSSs, but quota_type is set to 1 on the OSSs while the MDS's quota_type is 2.

# lfs quota -u -t /nshare1
user quotas are not enabled.
# lfs quota -g -t /nshare1
group quotas are not enabled.

on OSSs
# lctl get_param obdfilter.*.quota_type
obdfilter.nshare1-OST0015.quota_type=1
obdfilter.nshare1-OST0016.quota_type=1
obdfilter.nshare1-OST0017.quota_type=1
obdfilter.nshare1-OST0018.quota_type=1
obdfilter.nshare1-OST0019.quota_type=1
obdfilter.nshare1-OST001a.quota_type=1
obdfilter.nshare1-OST001b.quota_type=1

on MDS
# cat /proc/fs/lustre/lquota/nshare3-MDT0000/quota_type
2

Comment by Shuichi Ihara (Inactive) [ 06/Jun/12 ]

Niu,

Any advice? We still can't stop the error messages on the MDS. The customer is nervous about this.

Ihara

Comment by Niu Yawei (Inactive) [ 06/Jun/12 ]

Hi, Ihara

The 'quota_type' on both the MDS and OSSs looks good; it seems no user or group quota is enabled. I don't know where the dqacq requests come from. I think we have to enable D_QUOTA on both the MDS and OSSs and collect some debug logs for analysis. Thanks.

Comment by Shuichi Ihara (Inactive) [ 07/Jun/12 ]

Niu,

We collected debug logs on the MDS and two OSSs and filed them under /uploads/LU-1438/2012-06-07/
We collected the debug log up until the following error message appeared on the OSS:
kernel: Lustre: 23619:0:(quota_interface.c:473:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 78 retries (err=67, rc=-5)

Please investigate further.

Thank you very much!

Comment by Niu Yawei (Inactive) [ 08/Jun/12 ]

Thank you, Ihara. It seems to be a race between turning off quota and acquiring quota; I will post a patch soon.

Comment by Niu Yawei (Inactive) [ 08/Jun/12 ]

patch for b18: http://review.whamcloud.com/3060

Comment by Shuichi Ihara (Inactive) [ 08/Jun/12 ]

Thank you very much for this patch and we will try this patch as well.
So the problem is that the quota slave keeps trying to acquire quota indefinitely even though quota is turned off on the master, because a check for whether quota is enabled is missing. Is that correct?
So we should apply this patch on both the MDS and OSSs, right?

Comment by Niu Yawei (Inactive) [ 08/Jun/12 ]

Yes, exactly. The change is only for the slave, but it should be applied on both the MDS and OSSs, since the MDS has a slave running on it as well.

Comment by Niu Yawei (Inactive) [ 08/Jun/12 ]

Hi, Ihara

After looking closer at the log, I realized that your case is not caused by the race. It is probably caused by some OST being offline when you disabled quota (or the quota-clearing command to some OST failing for some reason), so the local filesystem quota wasn't disabled on that OST.

To resolve the problem, you can rerun 'lfs quotaoff' to see if it helps. Thanks.
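
For example, a sketch of rerunning it from a client, assuming the /nshare1 mount point shown above and that both user and group quota should be turned off:

# lfs quotaoff -ug /nshare1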

Comment by Shuichi Ihara (Inactive) [ 08/Jun/12 ]

Niu,

Thanks for the updates. The original problem was that some groups got "disk quota exceeded" errors even though those groups had not reached their quota limit. What do you think of that problem?

The following message showed up after quota was disabled, but the primary problem was that quota enforcement didn't work correctly:
kernel: Lustre: 23619:0:(quota_interface.c:473:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 78 retries (err=67, rc=-5)

Comment by Niu Yawei (Inactive) [ 08/Jun/12 ]

The original problem was that some groups got "disk quota exceeded" errors even though those groups had not reached their quota limit. What do you think of that problem?

The original problem is that the slave can't acquire quota from the master for that group ID, so it hit "quota exceeded". My first patch might help with this. You can also ask the user to reset the quota limit for the problematic group ID ('lfs setquota 0' -> 'lfs setquota xxx') to see if that helps.
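
For illustration, a sketch of that reset sequence from a client, using the same placeholder style as elsewhere in this ticket (substitute the problematic group, the real limits, and the mount point):

# lfs setquota -g <group> -b 0 -B 0 -i 0 -I 0 <mountpoint>
# lfs setquota -g <group> -b <block-softlimit> -B <block-hardlimit> -i <inode-softlimit> -I <inode-hardlimit> <mountpoint>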

The following message showed up after quota was disabled, but the primary problem was that quota enforcement didn't work correctly:
kernel: Lustre: 23619:0:(quota_interface.c:473:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 78 retries (err=67, rc=-5)

The second problem might be caused by an inconsistency of on/off state between slave and master; you could ask the user to rerun 'lfs quotaoff' to see if it works.

Anyway, the quota design (in master & b18) requires all slaves to be online while setting quota or turning quota on/off. If the setquota or enable/disable operation fails on some slave (OST), there will be trouble; you have to rerun the command to make sure the new setting is synchronized on each slave (OST). This design drawback has been addressed in the new quota design (in Orion).

Comment by Shuichi Ihara (Inactive) [ 12/Jun/12 ]

Niu,

It seems that the patches didn't help fix the problem.
We stopped all OSSs and MDSs and restarted Lustre with the patched RPMs.
The applied patches are below:
http://review.whamcloud.com/#change,3060
http://review.whamcloud.com/#change,2927

However, a user got "quota exceeded" messages even though he hadn't reached his quota limit.
At that time, we got the following messages on the OSS:

Jun 12 18:19:36 nos131i kernel: Lustre: 29689:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
Jun 12 18:19:36 nos131i kernel: Lustre: 29689:0:(quota_interface.c:481:quota_chk_acq_common()) Skipped 10 previous similar messages

Also, not at the same time, but we are seeing a lot of the same messages on other OSSs as well.

The recent log files are filed under /uploads/LU-1438/20120613.

Please investigate further. Thank you very much again.

Comment by Niu Yawei (Inactive) [ 12/Jun/12 ]

patch for master: http://review.whamcloud.com/#change,3097

Comment by Niu Yawei (Inactive) [ 12/Jun/12 ]

Hi, Ihara

Did you try resetting the quota with 'lfs setquota' for the user?

Comment by Shuichi Ihara (Inactive) [ 12/Jun/12 ]

You mean set all limits to zero (e.g. "lfs setquota -u <user> -b 0 -B 0 -i 0 -I 0 <fsname>") to clear, then set the real quota size again?

Comment by Niu Yawei (Inactive) [ 13/Jun/12 ]

Yes. For a user who has a quota limit, reset the quota with "lfs setquota -b 0 -B 0 -i 0 -I 0 $fsname" -> "lfs setquota -b limit ...". For a user who has no quota limit but got the error message ("still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)") on the OSS, re-clear the quota limit with "lfs setquota -b limit ..." -> "lfs setquota -b 0 ...".

There isn't much information in the dmesg. Could you explain the current status of the system? Is user or group quota enabled? What's the output of "lfs quota -v -u $user $fs" for the user who has no quota limit but got the error message on the OSS? Thanks.

Comment by Shuichi Ihara (Inactive) [ 13/Jun/12 ]

Niu,

The customer did "lfs setquota" to clear the all quota size to some users.
After that, as far as we can see, messages "(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)" are gone so far.

As you mentioned, there was an inconsistency between master and slave - user is not exceeded to the global limit, but some local quota were exceeded the limit. Once the local quota on all MDS/OSSs are cleared by "lfs setquota -b 0 -B 0...", then set quota again, now the quota works normally. Is this what you said a root cause and solution to fix current situation, right?

So, for change the quota size, "1. clear quota size, 2. set quota size" is better and safe solution rather than just change the quota size by "lfs setquota -b X"?

Anyway, very appreciate you detailed analysis and explanation.

Comment by Niu Yawei (Inactive) [ 13/Jun/12 ]

To change the quota size, "clear -> set new size" isn't necessarily safer than setting the new size directly.

Whenever 'lfs setquota' changes a quota limit from zero to non-zero (or from non-zero to zero), the quota master (MDS) will notify all slaves (OSTs) to change their local filesystem quota limits. However, if some slave is offline at that time (or the notification to some slave fails for some reason), an inconsistency will arise between the master and the slaves that didn't receive the quota change notification.

Changing the quota from one non-zero limit to another non-zero limit will not trigger the quota change notification to the slaves, so it won't cause the inconsistency; of course, it can't fix an existing inconsistency either.

In the new quota design, any slave that joins later can sync the quota settings with the master automatically, so users needn't worry about offline slaves anymore.

Comment by Shuichi Ihara (Inactive) [ 25/Jun/12 ]

Hello Niu,

The problem is not fixed yet even after resetting the group quotas. Last week, we reset the quota to zero for all groups, but the setquota operation failed for a couple of groups, as shown below:

nos13: Jun 20 18:02:19 nos131i kernel: Lustre: 18627:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
nos13: Jun 20 21:40:47 nos131i kernel: Lustre: 29660:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18554:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10856
nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18575:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10857
nos14: Jun 20 12:38:41 nos141i kernel: Lustre: 28944:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

Then we set quotas for these groups again, but today we got the same messages on the OSS:

Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

I don't know why it is still failing to acquire quota space from the master. Any ideas for a workaround to avoid this issue? Or can we add debugging to address this issue and understand what is happening?

Comment by Niu Yawei (Inactive) [ 25/Jun/12 ]

Hi, Ihara

If it's easy to reproduce, we can collect the debug log on the OSS with D_TRACE & D_QUOTA enabled (echo +trace > /proc/sys/lnet/debug, echo +quota > /proc/sys/lnet/debug); then we can see where the quota acquisition returns zero. Thanks.
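
For reference, a possible collection sequence on the OSS; the 512 MB buffer size and the output path are arbitrary example values (enlarging /proc/sys/lnet/debug_mb may help with the truncation seen earlier):

# echo +trace > /proc/sys/lnet/debug
# echo +quota > /proc/sys/lnet/debug
# echo 512 > /proc/sys/lnet/debug_mb
# lctl clear
  ... reproduce the problem ...
# lctl dk /tmp/lustre-debug.log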

Comment by Shuichi Ihara (Inactive) [ 26/Jun/12 ]

Niu, I meant that you could make a debug patch to get more detailed information on the production system, e.g. which user/group exceeded the quota limit when these messages appear. We certainly didn't do any operations (set/clear quota) when the following messages showed up.

Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

We don't know how to reproduce this problem at the moment, but let me ask whether we can enable the debug flags. BTW, what is the "28760:0" in the messages above?

Comment by Niu Yawei (Inactive) [ 26/Jun/12 ]

Hi, Ihara

The debug patch would be similar to enabling D_TRACE & D_QUOTA for the debug log. If the customer can't afford the D_TRACE debug log, we can enable only D_QUOTA first to collect some debug log.

The 28760 is the PID, and the 0 is the 'extern pid' (it looks like it's always 0 for now; you can just ignore it).

Comment by Shuichi Ihara (Inactive) [ 28/Jun/12 ]

Hi Niu,

I uploaded the debug files to uploads/LU-1438/debugfile.20120628.gz
However, it's not easy to reproduce these messages, though they still show up in the system log irregularly.
The Lustre debug file doesn't contain the messages; perhaps the maximum debug buffer size (100MB) was exceeded quickly?
Any ideas for keeping track of the debug information in this situation?

Comment by Niu Yawei (Inactive) [ 28/Jun/12 ]

Hi, Ihara

Could you apply this debug patch? Then we'll see much more debug information in the syslog along with the "still can't acquire..." messages. Thanks.

Comment by Shuichi Ihara (Inactive) [ 29/Jun/12 ]

Hi Niu,

Thanks for this debug patch. I will ask the customer if we can apply it.

Comment by Shuichi Ihara (Inactive) [ 04/Jul/12 ]

Might this be related to the 32-bit quota setting limitation?

Comment by Niu Yawei (Inactive) [ 04/Jul/12 ]

Probably. Is there anything useful in the debug log?

Comment by Shuichi Ihara (Inactive) [ 04/Jul/12 ]

We haven't applied the patches yet, but will apply them during the next scheduled maintenance window.

Comment by Shuichi Ihara (Inactive) [ 18/Aug/12 ]

This might be caused by LU-1584 and LU-1720. Please close this ticket.
I will open a new ticket if we see the same issue after applying the LU-1438 and LU-1720 patches.

Comment by Peter Jones [ 18/Aug/12 ]

ok thanks Ihara
