[LU-1438] quota_chk_acq_common() still haven't managed to acquire quota Created: 24/May/12 Updated: 22/Feb/13 Resolved: 18/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.4, Lustre 1.8.9 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-1.8.7-wc1, RHEL5.7 for servers, RHEL6.2 for clients |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4584 |
| Description |
|
We are hitting some quota-related problems. Quota is enabled on the filesystem, and the customer changed the group ownership of many large files and then ran the lfs quotacheck command. Since the group change, the following messages have shown up on the OSS/MDS side. It seems to be close to |
| Comments |
| Comment by Peter Jones [ 24/May/12 ] |
|
Niu, could you please comment on this one? Thanks. Peter |
| Comment by Niu Yawei (Inactive) [ 25/May/12 ] |
|
Yes, it looks similar to that issue. Ihara, if it's reproducible, could you try to collect the debug log on both MDS & OST with D_QUOTA enabled? Thanks. |
| Comment by Niu Yawei (Inactive) [ 25/May/12 ] |
|
Did the customer run 'lfs setquota' to clear the limit for some user/group when they saw this problem? Is it easy to reproduce? I found a race condition in the code that could lead to such a situation, but I'm not sure it's the real root cause. Anyway, I'll post a patch to fix that race. |
| Comment by Shuichi Ihara (Inactive) [ 25/May/12 ] |
|
Niu, |
| Comment by Niu Yawei (Inactive) [ 25/May/12 ] |
|
Hi, Ihara. The race I mentioned is hard to trigger, so we'd better get the debug log if the problem can be reproduced. |
| Comment by Niu Yawei (Inactive) [ 28/May/12 ] |
|
The patch to fix race in quota_chk_acq_common(): http://review.whamcloud.com/#change,2927 |
| Comment by Shuichi Ihara (Inactive) [ 31/May/12 ] |
|
Niu, I uploaded the debug information collected with the debug flags turned on. Thanks |
| Comment by Niu Yawei (Inactive) [ 31/May/12 ] |
|
Thank you, Ihara. There are lots of "still haven't managed to acquire quota.." lines in the messages; however, the debug log seems to have been truncated, and I can't find any of them in it. Could you try the patch above to see if it resolves the customer's problem? |
| Comment by Shuichi Ihara (Inactive) [ 01/Jun/12 ] |
|
Niu, |
| Comment by Niu Yawei (Inactive) [ 01/Jun/12 ] |
|
This is a server patch, you can apply it on server only. |
| Comment by Peter Jones [ 01/Jun/12 ] |
|
Is this fix needed on master too? |
| Comment by Niu Yawei (Inactive) [ 01/Jun/12 ] |
|
Hi, Peter, yes, I'll cook a patch for master. |
| Comment by Niu Yawei (Inactive) [ 01/Jun/12 ] |
|
patch for master: http://review.whamcloud.com/#change,2996 |
| Comment by Shuichi Ihara (Inactive) [ 01/Jun/12 ] |
|
Niu, |
| Comment by Niu Yawei (Inactive) [ 01/Jun/12 ] |
|
This is a race in the quota code which can be triggered when a user executes 'lfs setquota' to clear the limit for some user/group on an active system (quota enabled, with ongoing I/O). |
| Comment by Shuichi Ihara (Inactive) [ 01/Jun/12 ] |
|
The customer did run "lfs setquota", but not to clear limits: they changed the quota limit (a group exceeded its original quota limit, so they raised the quota size) for some users/groups. After that, we have been seeing these messages. |
| Comment by Niu Yawei (Inactive) [ 03/Jun/12 ] |
|
Changing the limit size will not cause clear -> set operations, so it may not be caused by this race. Anyway, let's see the result first. |
| Comment by Shuichi Ihara (Inactive) [ 04/Jun/12 ] |
|
So, what exactly do you mean by "lfs setquota" to clear? Is that setting all quota limits to 0 for a user/group? |
| Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ] |
|
Niu, We applied the patch to only one OSS out of all of them, but it didn't help; the messages are still showing up. Please investigate further and let me know if you need more information. |
| Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ] |
|
Uploaded the patched OSS's /var/log/messages to the ftp site: /uploads/ |
| Comment by Niu Yawei (Inactive) [ 05/Jun/12 ] |
|
Hi, Ihara. This time the warning message is a little different:
Jun 5 13:27:30 nos141i kernel: Lustre: 28834:0:(quota_interface.c:473:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 51 retries (err=0, rc=-5)
The return value is -5 now, which means some error happened while the OST was acquiring quota from the MDS. In the previous messages the rc was always 0. Can this OSS work properly even without quota enabled? |
| Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ] |
|
Niu, Yes, quota has been turned off on all servers, but these messages are still showing up on the OSS. BTW, we only applied the patch to one OSS, not to the MDS yet. |
| Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ] |
|
Niu, the following error message keeps showing up on the MDS:
LustreError: 9831:0:(ldlm_lib.c:2123:target_handle_dqacq_callback()) dqacq failed! (rc:-5)
Posted the MDS's /var/log/messages as /uploads/ |
| Comment by Niu Yawei (Inactive) [ 05/Jun/12 ] |
|
Hmm, I'm afraid that quota isn't actually turned off on the OSS. Could you check /proc/fs/lustre/obdfilter/$OSTNAME/quota_type to see if it's really turned off? Applying the patch to one OSS is fine. |
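For reference, a minimal sketch of that check (the wildcarded device names are placeholders, not taken from this system):

    # On each OSS: no 'u' or 'g' in the value means local quota is really off
    cat /proc/fs/lustre/obdfilter/*/quota_type
    # On the MDS (1.8-era proc path; the exact MDS device name differs per system)
    cat /proc/fs/lustre/mds/*/quota_type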
| Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ] |
|
Something odd here.. 'lfs quota -u -t /nshare1' indicates "quota is disabled" on the client, and there is no "u" or "g" on the MDS and OSSs, but quota_type is set to 1 on the OSSs while the MDS's quota_type is 2. |
| Comment by Shuichi Ihara (Inactive) [ 06/Jun/12 ] |
|
Niu, any advice? We still can't stop the error messages on the MDS.. the customer is nervous about this.. Ihara |
| Comment by Niu Yawei (Inactive) [ 06/Jun/12 ] |
|
Hi, Ihara. The 'quota_type' on both the MDS & OSSs looks good; it seems no user or group quota is enabled. I don't know where the dqacq requests come from; I think we have to enable D_QUOTA on both MDS & OSSs and collect some debug log for analysis. Thanks. |
| Comment by Shuichi Ihara (Inactive) [ 07/Jun/12 ] |
|
Niu, we collected debug logs on the MDS and two OSSs and filed them on /uploads/ Please investigate further. Thank you very much! |
| Comment by Niu Yawei (Inactive) [ 08/Jun/12 ] |
|
Thank you, Ihara. It seems to be a race between turning quota off and acquiring quota; I will post a patch soon. |
| Comment by Niu Yawei (Inactive) [ 08/Jun/12 ] |
|
patch for b18: http://review.whamcloud.com/3060 |
| Comment by Shuichi Ihara (Inactive) [ 08/Jun/12 ] |
|
Thank you very much for this patch and we will try this patch as well. |
| Comment by Niu Yawei (Inactive) [ 08/Jun/12 ] |
|
Yes, exactly. The change is only for the slave, but it should be applied on both MDS & OSS, since the MDS has a slave running on it as well. |
| Comment by Niu Yawei (Inactive) [ 08/Jun/12 ] |
|
Hi, Ihara. After looking closer at the log, I realized that your case is not caused by the race. It is probably caused by some OST being offline when you disabled quota (or the clear-quota command failing on some OST for some reason), so the local fs quota isn't disabled on those OSTs. To resolve the problem, you can rerun 'lfs quotaoff' to see if it helps. Thanks. |
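A minimal sketch of that rerun, assuming it is issued from a client mount (the mount point /nshare1 is taken from an earlier comment and may differ):

    # Re-issue quota off for both user and group quotas so the master
    # re-propagates the 'off' state to every OST, including any that
    # missed the original request
    lfs quotaoff -ug /nshare1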
| Comment by Shuichi Ihara (Inactive) [ 08/Jun/12 ] |
|
Niu, thanks for the update. The original problem was that some groups got quota-exceeded errors even though those groups hadn't reached their quota limit; what do you think of that problem? The messages above showed up after quota was disabled, but the primary problem was that the quota limit didn't behave correctly. |
| Comment by Niu Yawei (Inactive) [ 08/Jun/12 ] |
The original problem is that the slave can't acquire quota from the master for that group id, so it reported quota exceeded. My first patch might help with this. You can also ask the user to reset the quota limit for the problematic group id by 'lfs setquota 0' -> 'lfs setquota xxx' to see if that helps.
The second problem might be caused by an inconsistency of the on/off state between slave and master; you could ask the user to rerun 'lfs quotaoff' to see if it works. Anyway, the quota design (in master & b18) requires all slaves to be online while setting quota or turning quota on/off. If setquota or enable/disable quota fails on some slave (OST), there will be trouble, and you have to rerun the command to make sure the new setting is synchronized to each slave (OST). This kind of design drawback has been addressed in the new quota design (in orion). |
| Comment by Shuichi Ihara (Inactive) [ 12/Jun/12 ] |
|
Niu, it seems that the patches didn't help fix the problem.. A user got "quota exceeded" messages even though he hadn't reached his quota limit:
Jun 12 18:19:36 nos131i kernel: Lustre: 29689:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
And, not at the same time, but we are seeing a lot of the same messages on other OSSs as well. The recent logfiles are filed on /uploads/ Please investigate further. Thank you very much again. |
| Comment by Niu Yawei (Inactive) [ 12/Jun/12 ] |
|
patch for master: http://review.whamcloud.com/#change,3097 |
| Comment by Niu Yawei (Inactive) [ 12/Jun/12 ] |
|
Hi, Ihara. Did you try resetting quota with 'lfs setquota' for that user? |
| Comment by Shuichi Ihara (Inactive) [ 12/Jun/12 ] |
|
You mean set everything to zero (e.g. "lfs setquota -u <user> -b 0 -B 0 -i 0 -I 0 <fsname>") to clear, then set the real quota size again? |
| Comment by Niu Yawei (Inactive) [ 13/Jun/12 ] |
|
Yes. For a user who has a quota limit, reset the quota by "lfs setquota -b 0 -B 0 -i 0 -I 0 $fsname" -> "lfs setquota -b limit ...". For a user who has no quota limit but got the error message (still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)) on the OSS, re-clear the quota limit by 'lfs setquota -b limit ...' -> 'lfs setquota -b 0 ...'. There isn't much information in the dmesg; could you explain the current status of the system? Is user or group quota enabled? What's the output of 'lfs quota -v -u $user $fs' for the user who has no quota limit but got the error message on the OSS? Thanks. |
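As an illustration of the reset sequence described above (the user name, limit values, and filesystem name are placeholders):

    # Clear all limits for the user, then set the real limits again so the
    # master re-notifies every OST of the new local limits
    lfs setquota -u someuser -b 0 -B 0 -i 0 -I 0 /nshare1
    lfs setquota -u someuser -b 10000000 -B 11000000 -i 100000 -I 110000 /nshare1
    # Check what the MDT and each OST report for that user afterwards
    lfs quota -v -u someuser /nshare1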
| Comment by Shuichi Ihara (Inactive) [ 13/Jun/12 ] |
|
Niu, The customer did run "lfs setquota" to clear all quota limits for some users. As you mentioned, there was an inconsistency between master and slave: the user had not exceeded the global limit, but some local quotas had exceeded their limits. Once the local quotas on all MDS/OSSs were cleared by "lfs setquota -b 0 -B 0 ..." and then the quotas were set again, quota now works normally. Is this the root cause and the fix for the current situation that you described? So, to change a quota size, is "1. clear the quota, 2. set the quota" a better and safer approach than just changing the quota size with "lfs setquota -b X"? Anyway, I really appreciate your detailed analysis and explanation. |
| Comment by Niu Yawei (Inactive) [ 13/Jun/12 ] |
|
To change a quota size, "clear -> set new size" isn't necessarily safer than setting the new size directly. Whenever 'lfs setquota' changes a quota limit from zero to non-zero (or from non-zero to zero), the quota master (MDS) notifies all slaves (OSTs) to change their local fs quota limit; however, if some slave is offline at that time (or the notification to some slave fails for some reason), an inconsistency arises between the master and the slaves that didn't receive the quota change notification. Changing a quota from an old limit to a new limit (both non-zero values) will not trigger the quota change notification to slaves, so it will not cause the inconsistency; of course, it can't fix an existing inconsistency either. In the new quota design, any slave that joins later syncs the quota settings with the master automatically, so users needn't worry about offline slaves anymore. |
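To make that behaviour concrete, a hedged example (group name, limit values, and filesystem name are placeholders):

    # non-zero -> non-zero change: only the master's limit is updated;
    # no notification is sent to the OSTs
    lfs setquota -g somegroup -b 18000000 -B 20000000 -i 90000 -I 100000 /nshare1
    # non-zero -> zero (clear), or zero -> non-zero: the master notifies every OST,
    # so any OST that is offline at that moment ends up with inconsistent local limits
    lfs setquota -g somegroup -b 0 -B 0 -i 0 -I 0 /nshare1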
| Comment by Shuichi Ihara (Inactive) [ 25/Jun/12 ] |
|
Hello Niu, The problem is not fixed yet, even after resetting the group quota. Last week we did a zero reset of quota for all groups, but setting the quota again failed for a couple of groups:
nos13: Jun 20 18:02:19 nos131i kernel: Lustre: 18627:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
nos13: Jun 20 21:40:47 nos131i kernel: Lustre: 29660:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18554:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10856
nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18575:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10857
nos14: Jun 20 12:38:41 nos141i kernel: Lustre: 28944:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
Then we set quota again for these groups, but today we got the same message on an OSS:
Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
I don't know why it is still failing to acquire quota space from the master. Any ideas for a workaround to avoid this issue? Or can we add debugging to pin down this issue and understand what happens? |
| Comment by Niu Yawei (Inactive) [ 25/Jun/12 ] |
|
Hi, Ihara. If it's easy to reproduce, we can collect the debug log on the OSS with D_TRACE & D_QUOTA enabled (echo +trace > /proc/sys/lnet/debug, echo +quota > /proc/sys/lnet/debug); then we can see where the quota acquisition returns zero. Thanks. |
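A minimal sketch of that collection on the OSS (the output file name is arbitrary):

    # Add trace and quota to the debug mask
    echo +trace > /proc/sys/lnet/debug
    echo +quota > /proc/sys/lnet/debug
    # ... reproduce the problem, then dump the kernel debug buffer for upload
    lctl dk /tmp/oss-quota-debug.log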
| Comment by Shuichi Ihara (Inactive) [ 26/Jun/12 ] |
|
Niu, I meant you could make a debug patch to show more detailed information on the production system, e.g. which user/group hit the quota limit behind these messages. Certainly, we did not perform any operations (set/clear quota) when the following message showed up:
Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
We don't know how to reproduce this problem at the moment, but let me ask whether we can enable the debug flags. BTW, what is the "28760:0" in the message above? |
| Comment by Niu Yawei (Inactive) [ 26/Jun/12 ] |
|
Hi, Ihara. The debug patch would be similar to enabling D_TRACE & D_QUOTA for the debug log. If the customer can't afford the D_TRACE debug log, we can enable only D_QUOTA first to collect some debug log. The 28760 is the pid, and the 0 is the 'extern pid' (it looks like it's always 0 for now, so you can just ignore it). |
| Comment by Shuichi Ihara (Inactive) [ 28/Jun/12 ] |
|
Hi Niu, uploaded the debug files to uploads/ |
| Comment by Niu Yawei (Inactive) [ 28/Jun/12 ] |
|
Hi, Ihara. Could you apply this debug patch? Then we'll see much more debug information in the syslog along with the "still can't acquire..." messages. Thanks. |
| Comment by Shuichi Ihara (Inactive) [ 29/Jun/12 ] |
|
Hi Niu, Thanks for this debug patch. I will ask the customer whether we can apply it. |
| Comment by Shuichi Ihara (Inactive) [ 04/Jul/12 ] |
|
This might be related to the 32-bit quota setting limitation? |
| Comment by Niu Yawei (Inactive) [ 04/Jul/12 ] |
|
Probably. Is there anything useful in the debug log? |
| Comment by Shuichi Ihara (Inactive) [ 04/Jul/12 ] |
|
We haven't applied the patches yet, but will apply them during the next scheduled maintenance window. |
| Comment by Shuichi Ihara (Inactive) [ 18/Aug/12 ] |
|
this might be caused by |
| Comment by Peter Jones [ 18/Aug/12 ] |
|
ok thanks Ihara |