still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Environment: Lustre 1.8.8 + LU-1720
    • Severity: 3
    • Rank: 5479

    Description

      Even after adding the patch from LU-1720, we are still seeing messages like:
      Lustre: 18271:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

      At this site, we haven't added any other patches to 1.8.8. What do these messages mean? Is it possible that some of the other patches (like LU-1438) could fix these?

      The customer hasn't noticed any functional issues, but of course that doesn't mean there aren't any. Quotas >4TB work on this system.

      Attachments

        1. kern.log-mds
          1.90 MB
        2. kern.log-mds-Aug30
          97 kB
        3. kern.log-oss
          0.2 kB
        4. kern.log-oss-Aug30
          305 kB

        Activity

          [LU-2289] still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

          Hi Niu,

          The customer ran lfs quotaoff, quotaon, and quotacheck. The problem reappeared fairly quickly on one filesystem, and after a few weeks on the other filesystem.

          The customer asks:
          What is the impact of NOT acquiring quota space?

          I noticed that the quota debug level prints information about the UID/GID that is causing problems. I will try to get quota debug logs during the event. Is there anything else we can get, or any other ideas?

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment
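          For collecting the quota debug logs mentioned above, a minimal sketch using the standard lctl debug interface (the flag name and output path may need adjusting for this particular 1.8 build; the log path is just a placeholder):

              # add the quota flag to the current debug mask on the affected OSS/MDS
              lctl set_param debug=+quota

              # clear the debug buffer, then wait for the "still haven't managed ..." message
              lctl clear

              # dump the kernel debug buffer once the message has appeared
              lctl dk /tmp/quota-debug.log

              # drop the extra flag again when done
              lctl set_param debug=-quota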

          We've seen this issue again without any network issues. Is there any way to debug what is going on? The error only appears intermittently, so I think it's probably some kind of sync issue. Is there a way to tell what uid/gid is causing problems, or is it better to just redo the lfs quotacheck?

          kitwestneat Kit Westneat (Inactive) added a comment
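          One way to narrow down which uid/gid is affected is to query quotas verbosely from a client; the verbose output lists the usage and limits reported by the MDT and by each OST, so a limit that differs between master and slave should stand out. A sketch, where the uid/gid 1000 and the mount point are placeholders:

              # per-OST breakdown for a user quota
              lfs quota -v -u 1000 /mnt/lustre

              # per-OST breakdown for a group quota
              lfs quota -v -g 1000 /mnt/lustre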

          Thank you, Kit.

          Yes, the new log shows quite a lot of connection errors; I'm not sure whether the filesystem was healthy at that time.

          The most likely reason for an OST constantly showing "still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)" is some kind of inconsistency between the master (MDT) and a slave (OST):

          • Quota is already disabled on the master (MDT), but not disabled on a slave (OST);
          • The quota limit for some uid/gid has been cleared on the master (MDT), but not cleared on a slave (OST).

          If quota is supposed to be disabled, but writing files still triggers the "still haven't...(err=0, rc=0)" message on the OSS, you can rerun "lfs quotaoff" when all OSTs are up to make sure quota is disabled on every OST.
          If some uid/gid is supposed to have a 0 limit, but writing to files owned by that uid/gid still triggers the message on the OSS, you can rerun "lfs setquota ..." when all OSTs are up to make sure the limit is cleared on every OST. (See the sketch after this comment.)

          That's what I can think of so far.

          niu Niu Yawei (Inactive) added a comment
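          A sketch of the two remediation steps above, run from a client when all OSTs are up (the mount point and uid are placeholders):

              # case 1: quota should be disabled everywhere - rerun quotaoff so every OST sees it
              lfs quotaoff -ug /mnt/lustre

              # case 2: a particular uid is supposed to have no limit - clear it again so every OST sees it
              lfs setquota -u 1000 -b 0 -B 0 -i 0 -I 0 /mnt/lustre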

          logs only for Aug 30

          kitwestneat Kit Westneat (Inactive) added a comment

          Hi Niu,

          Sorry, I didn't truncate the logs to the relevant dates; those are actually old messages. I will re-upload the logs with only the relevant parts.

          In preparing these logs, I noticed that when the OSS was displaying the message, the MDT was having problems connecting:
          Aug 30 22:28:11 lfs-mds-2-1 kernel: Lustre: 9067:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1411737505069379 sent from lfs2-OST0002-osc to NID 10.179.16.124@o2ib 0s ago has failed due to network error (15s prior to deadline).

          This actually looks like:
          http://jira.whamcloud.com/browse/LU-1809

          I will try to get more information from the customer about when they are seeing the message and if they are running with the workaround suggested in LU-1809.

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment

          I see a lot of the following errors in the MDS log:

          Jul 12 20:34:19 lfs-mds-2-1 kernel: Lustre: 7972:0:(lproc_quota.c:453:lprocfs_quota_wr_type()) lfs2-MDT0000: quotaon failed because quota files don't exist, please run quotacheck firstly
          

          It seems there is something wrong on the MDT, such that quota was not turned on for the MDT but is turned on for the OSTs, so the OSTs can't acquire quota from the MDT and report the message "still haven't managed to acquire quota space from the quota master after 10 retries".

          Could you check which quota configuration was used (ug3 for both MDT & OST?) and whether rerunning "lfs quotacheck" fixes your problem? If not, I think we should collect a debug log for quotacheck to see why the admin or local quota file wasn't created (see the sketch below). Thanks.

          niu Niu Yawei (Inactive) added a comment
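          A sketch of what could be checked here, assuming the 1.8 proc names (mds.*.quota_type on the MDS, obdfilter.*.quota_type on the OSSes) are present on this build; the mount point and log path are placeholders:

              # confirm the configured quota type matches on master and slaves (e.g. ug3)
              lctl get_param mds.*.quota_type          # on the MDS
              lctl get_param obdfilter.*.quota_type    # on each OSS

              # capture a debug log around quotacheck to see why the admin/local quota files were not created
              lctl set_param debug=-1                  # full debug, temporarily
              lctl clear
              lfs quotacheck -ug /mnt/lustre           # from a client
              lctl dk /tmp/quotacheck-debug.log        # on the MDS afterwards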
          pjones Peter Jones added a comment -

          Niu

          Could you please look into this one?

          Thanks

          Peter


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: orentas Oz Rentas (Inactive)
