still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Environment: Lustre 1.8.8 + LU-1720
    • Severity: 3
    • Rank: 5479

    Description

      Even after adding the patch from LU-1720, we are still seeing messages like:
      Lustre: 18271:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

      At this site, we haven't added any other patches to 1.8.8. What do these messages mean? Is it possible that some of the other patches (like LU-1438) could fix these?

      The customer hasn't noticed any functional issues, but of course that doesn't mean there aren't any. Quotas >4TB work on this system.

      Attachments

        1. kern.log-mds
          1.90 MB
        2. kern.log-mds-Aug30
          97 kB
        3. kern.log-oss
          0.2 kB
        4. kern.log-oss-Aug30
          305 kB

        Activity

          [LU-2289] still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

          Hi Niu,

          The customer ran lfs quotaoff, quotaon, and quotacheck. The problem reappeared fairly quickly on one filesystem, and after a few weeks on the other filesystem.

          The customer asks:
          What is the impact of NOT acquiring quota space?

          I noticed that the quota debug level prints information about the UID/GID that is causing problems. I will try to get quota debug logs during the event. Is there anything else we can get, or any other ideas?

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment
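          For collecting the quota debug logs mentioned above, a minimal sketch using the standard lctl debug interface (the flag name and output path may need adjusting for this particular 1.8 build; the log path is just a placeholder):

              # add the quota flag to the current debug mask on the affected OSS/MDS
              lctl set_param debug=+quota

              # clear the debug buffer, then wait for the "still haven't managed ..." message
              lctl clear

              # dump the kernel debug buffer once the message has appeared
              lctl dk /tmp/quota-debug.log

              # drop the extra flag again when done
              lctl set_param debug=-quota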

          We've seen this issue again without any network issues. Is there any way to debug what is going on? The error only appears intermittently, so I think it's probably some kind of sync issue. Is there a way to tell what uid/gid is causing problems, or is it better to just redo the lfs quotacheck?

          kitwestneat Kit Westneat (Inactive) added a comment
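          One way to narrow down which uid/gid is affected is to query quotas verbosely from a client; the verbose output lists the usage and limits reported by the MDT and by each OST, so a limit that differs between master and slave should stand out. A sketch, where the uid/gid 1000 and the mount point are placeholders:

              # per-OST breakdown for a user quota
              lfs quota -v -u 1000 /mnt/lustre

              # per-OST breakdown for a group quota
              lfs quota -v -g 1000 /mnt/lustre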

          Thank you, Kit.

          Yes, the new log shows quite a lot of connection errors; I'm not sure whether the filesystem was healthy at that time.

          The most likely reason for an OST constantly showing "still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)" is some kind of inconsistency between the master (MDT) and a slave (OST):

          • Quota is already disabled on the master (MDT), but not disabled on a slave (OST);
          • The quota limit for some uid/gid has been cleared on the master (MDT), but not cleared on a slave (OST).

          If quota is supposed to be disabled, but writing files still triggers the "still haven't...(err=0, rc=0)" message on the OSS, you can rerun "lfs quotaoff" when all OSTs are up to make sure quota is disabled on every OST.
          If some uid/gid is supposed to have a 0 limit, but writing to files owned by that uid/gid still triggers the message on the OSS, you can rerun "lfs setquota ..." when all OSTs are up to make sure the limit is cleared on every OST. (See the sketch after this comment.)

          That's what I can think of so far.

          niu Niu Yawei (Inactive) added a comment
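          A sketch of the two remediation steps above, run from a client when all OSTs are up (the mount point and uid are placeholders):

              # case 1: quota should be disabled everywhere - rerun quotaoff so every OST sees it
              lfs quotaoff -ug /mnt/lustre

              # case 2: a particular uid is supposed to have no limit - clear it again so every OST sees it
              lfs setquota -u 1000 -b 0 -B 0 -i 0 -I 0 /mnt/lustre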

          logs only for Aug 30

          kitwestneat Kit Westneat (Inactive) added a comment

          Hi Niu,

          Sorry, I didn't truncate the logs to the relevant dates; those are actually old messages. I will re-upload the logs with only the relevant parts.

          In preparing these logs, I noticed that when the OSS was displaying the message, the MDT was having problems connecting:
          Aug 30 22:28:11 lfs-mds-2-1 kernel: Lustre: 9067:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1411737505069379 sent from lfs2-OST0002-osc to NID 10.179.16.124@o2ib 0s ago has failed due to network error (15s prior to deadline).

          This actually looks like:
          http://jira.whamcloud.com/browse/LU-1809

          I will try to get more information from the customer about when they are seeing the message and if they are running with the workaround suggested in LU-1809.

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment

          I see a lot of the following errors in the MDS log:

          Jul 12 20:34:19 lfs-mds-2-1 kernel: Lustre: 7972:0:(lproc_quota.c:453:lprocfs_quota_wr_type()) lfs2-MDT0000: quotaon failed because quota files don't exist, please run quotacheck firstly
          

          It seems there is something wrong on the MDT, such that quota was not turned on for the MDT but is turned on for the OSTs, so the OSTs can't acquire quota from the MDT and report the message "still haven't managed to acquire quota space from the quota master after 10 retries".

          Could you check which quota configuration was used (ug3 for both MDT & OST?) and whether rerunning "lfs quotacheck" fixes your problem? If not, I think we should collect a debug log for quotacheck to see why the admin or local quota file wasn't created (see the sketch below). Thanks.

          niu Niu Yawei (Inactive) added a comment
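          A sketch of what could be checked here, assuming the 1.8 proc names (mds.*.quota_type on the MDS, obdfilter.*.quota_type on the OSSes) are present on this build; the mount point and log path are placeholders:

              # confirm the configured quota type matches on master and slaves (e.g. ug3)
              lctl get_param mds.*.quota_type          # on the MDS
              lctl get_param obdfilter.*.quota_type    # on each OSS

              # capture a debug log around quotacheck to see why the admin/local quota files were not created
              lctl set_param debug=-1                  # full debug, temporarily
              lctl clear
              lfs quotacheck -ug /mnt/lustre           # from a client
              lctl dk /tmp/quotacheck-debug.log        # on the MDS afterwards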
          pjones Peter Jones added a comment -

          Niu

          Could you please look into this one?

          Thanks

          Peter


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: orentas Oz Rentas (Inactive)
