  Lustre / LU-1438

quota_chk_acq_common() still haven't managed to acquire quota

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: Lustre 1.8.7
    • Fix Version: None
    • Environment: lustre-1.8.7-wc1, RHEL5.7 for servers, RHEL6.2 for clients
    • Severity: 3
    • 4584

    Description

      We are seeing a quota-related problem. Quota is enabled on the filesystem, and the customer changed the group ownership of many big files, then ran the lfs quotacheck command.
      After that, even though the group did not exceed its quota limit, they got "disk quota exceeded" messages.
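      For reference, the reported sequence would look roughly like the following sketch (the directory, group name, and mount point are placeholders, not from the report):

      chgrp -R newgroup /mnt/lustre/projectdir   # re-own many big files to the new group
      lfs quotacheck -ug /mnt/lustre             # rescan user/group usage on the MDS and OSTs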

      On the OSS/MDS side, the following message has shown up ever since the group ownership was changed:
      (quota_interface.c:473:quota_chk_acq_common()) still haven't managed to
      acquire quota space from the quota master after 20 retries (err=0, rc=0)

      It seems to be similar to LU-428.
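      One way to narrow down whether a slave-side (local) limit is being hit is to compare the aggregate usage with the per-OST breakdown (a sketch; the group name and mount point are placeholders):

      lfs quota -g newgroup /mnt/lustre      # aggregate usage against the global limit
      lfs quota -v -g newgroup /mnt/lustre   # -v adds the per-MDT/OST breakdown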

      Attachments

        Issue Links

          Activity


            Probably. Is there anything useful in the debug log?

            niu Niu Yawei (Inactive) added a comment

            Might this be related to the 32-bit quota setting limitation?

            ihara Shuichi Ihara (Inactive) added a comment

            Hi Niu,

            Thanks for this debug patch. I will ask the customer if we can apply it.

            ihara Shuichi Ihara (Inactive) added a comment

            Hi, Ihara

            Could you apply this debug patch? Then we'll see a lot more debug information in the syslog along with the "still can't acquire..." messages. Thanks.

            niu Niu Yawei (Inactive) added a comment

            Hi Niu,

            Uploaded debug files to uploads/LU-1438/debugfile.20120628.gz.
            However, it's not easy to reproduce these messages, though they still show up in the system log irregularly.
            The Lustre debug file doesn't contain the messages; perhaps the maximum debug buffer size (100MB) was exceeded quickly?
            Any ideas on how to keep the debug information in this situation?
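            One option might be lctl debug_daemon, which streams the kernel debug buffer to a file continuously instead of relying on the fixed-size in-memory ring (a sketch; the output path and the 1024 MB cap are assumptions, not from this thread):

            lctl debug_daemon start /var/log/lustre-debug.bin 1024   # stream the debug buffer to a file, capped at ~1 GB
            # ... reproduce the problem ...
            lctl debug_daemon stop
            lctl debug_file /var/log/lustre-debug.bin /var/log/lustre-debug.txt   # convert the binary dump to text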

            ihara Shuichi Ihara (Inactive) added a comment

            Hi, Ihara

            The debug patch amounts to enabling D_TRACE & D_QUOTA for the debug log. If the customer can't afford the D_TRACE debug log, we can enable only D_QUOTA first to collect some debug output.

            The 28760 is the pid, and the 0 is the 'extern pid' (it looks like it's always 0 for now; you can just ignore it).

            niu Niu Yawei (Inactive) added a comment

            Niu, I meant you could make a debug patch to show more detailed information on the production system, e.g. who exceeded the quota limit when this message appears. We certainly didn't do any quota operations (set/clear quota) when the following message showed up.

            Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            

            We don't know how to reproduce this problem at the moment, but let me ask whether we can enable the debug flags. BTW, what is the "28760:0" in the above message?

            ihara Shuichi Ihara (Inactive) added a comment

            Hi, Ihara

            If it's easy to reproduce, we can collect the debug log on the OSS with D_TRACE & D_QUOTA enabled (echo +trace > /proc/sys/lnet/debug; echo +quota > /proc/sys/lnet/debug); then we can see where the quota acquire returns zero. Thanks.
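            Spelled out, the collection steps might look like this sketch (the enlarged debug_mb value and the dump path are assumptions, not from this thread):

            echo 256 > /proc/sys/lnet/debug_mb   # enlarge the debug buffer so it overflows less quickly
            echo +quota > /proc/sys/lnet/debug   # enable D_QUOTA messages
            echo +trace > /proc/sys/lnet/debug   # enable D_TRACE (very verbose)
            # ... wait for the "still haven't managed to acquire" message ...
            lctl dk > /tmp/lustre-debug.txt      # dump and clear the kernel debug buffer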

            niu Niu Yawei (Inactive) added a comment

            Hello Niu,

            The problem is not fixed yet, even after resetting the group quota. Last week we reset the quota to zero for all groups, but setting the quota failed for a couple of groups, as shown below.

            nos13: Jun 20 18:02:19 nos131i kernel: Lustre: 18627:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            nos13: Jun 20 21:40:47 nos131i kernel: Lustre: 29660:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18554:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10856
            nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18575:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10857
            nos14: Jun 20 12:38:41 nos141i kernel: Lustre: 28944:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            

            Then we set quota again for these groups, but today we got the same message on the OSS.

            Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            

            I don't know why it is still failing to acquire quota space from the master. Any ideas for a workaround to avoid this issue? Or can we add debugging to track down what happens?

            ihara Shuichi Ihara (Inactive) added a comment

            To change the quota size, "clear -> set new size" isn't necessarily safer than "set new size" directly.

            Whenever 'lfs setquota' changes a quota limit from zero to non-zero (or from non-zero to zero), the quota master (MDS) notifies all slaves (OSTs) to change their local fs quota limits. However, if some slave is offline at that time (or the notification to some slave fails for some reason), an inconsistency arises between the master and the slaves that didn't receive the quota change notification.

            Changing quota from an old limit to a new limit (both non-zero values) will not trigger the quota change notification to the slaves, so it won't cause the inconsistency; of course, it can't fix an existing inconsistency either.

            In the new quota design, any slave that joins later syncs its quota settings with the master automatically, so users needn't worry about offline slaves anymore.
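            In other words, forcing the limit through zero makes the master broadcast to every slave twice, which is what re-syncs an inconsistent slave, provided all OSTs are online at that moment. A sketch (the group name, the 10 GB hard limit given in KB, and the mount point are placeholders):

            lfs setquota -g grp10856 -b 0 -B 0 -i 0 -I 0 /mnt/lustre        # clear: non-zero -> zero, slaves notified
            lfs setquota -g grp10856 -b 0 -B 10485760 -i 0 -I 0 /mnt/lustre # re-set: zero -> non-zero, slaves notified again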

            niu Niu Yawei (Inactive) added a comment

            Niu,

            The customer ran "lfs setquota" to clear all quota limits for some users.
            After that, as far as we can see, the "(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)" messages are gone so far.

            As you mentioned, there was an inconsistency between master and slaves: the user had not exceeded the global limit, but some local quotas had exceeded their limits. Once the local quotas on all MDS/OSSs were cleared with "lfs setquota -b 0 -B 0 ..." and the quotas were set again, quota now works normally. Is this the root cause you described, and the fix for the current situation?

            So, to change the quota size, is "1. clear quota, 2. set quota" a better and safer approach than just changing the size with "lfs setquota -b X"?

            Anyway, we very much appreciate your detailed analysis and explanation.

            ihara Shuichi Ihara (Inactive) added a comment

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: