  Lustre / LU-1438

quota_chk_acq_common() still haven't managed to acquire quota

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: Lustre 1.8.7
    • Fix Version: None
    • Environment: lustre-1.8.7-wc1, RHEL5.7 for servers, RHEL6.2 for clients
    • Severity: 3
    • 4584

    Description

      We are seeing a quota-related problem. Quota is enabled on the filesystem, and the customer changed the group ownership of many big files, then ran the lfs quotacheck command.
      After that, even though the group did not exceed its quota limit, they got "disk quota exceeded" messages.
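      For reference, the reported sequence would look roughly like the following sketch (the directory, group name, and mount point are placeholders, not from the report):

      chgrp -R newgroup /mnt/lustre/projectdir   # re-own many big files to the new group
      lfs quotacheck -ug /mnt/lustre             # rescan user/group usage on the MDS and OSTs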

      On the OSS/MDS side, the following message has shown up ever since the group ownership was changed:
      (quota_interface.c:473:quota_chk_acq_common()) still haven't managed to
      acquire quota space from the quota master after 20 retries (err=0, rc=0)

      It seems to be similar to LU-428.
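      One way to narrow down whether a slave-side (local) limit is being hit is to compare the aggregate usage with the per-OST breakdown (a sketch; the group name and mount point are placeholders):

      lfs quota -g newgroup /mnt/lustre      # aggregate usage against the global limit
      lfs quota -v -g newgroup /mnt/lustre   # -v adds the per-MDT/OST breakdown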

      Attachments

        Issue Links

          Activity


            Probably. Is there anything useful in the debug log?

            niu Niu Yawei (Inactive) added a comment

            Might this be related to the 32-bit quota setting limitation?

            ihara Shuichi Ihara (Inactive) added a comment

            Hi Niu,

            Thanks for this debug patch. I will ask the customer if we can apply it.

            ihara Shuichi Ihara (Inactive) added a comment

            Hi, Ihara

            Could you apply this debug patch? Then we'll see a lot more debug information in the syslog along with the "still can't acquire..." messages. Thanks.

            niu Niu Yawei (Inactive) added a comment

            Hi Niu,

            Uploaded debug files to uploads/LU-1438/debugfile.20120628.gz.
            However, it's not easy to reproduce these messages, though they still show up in the system log irregularly.
            The Lustre debug file doesn't contain the messages; perhaps the maximum debug buffer size (100MB) was exceeded quickly?
            Any ideas on how to keep the debug information in this situation?
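            One option might be lctl debug_daemon, which streams the kernel debug buffer to a file continuously instead of relying on the fixed-size in-memory ring (a sketch; the output path and the 1024 MB cap are assumptions, not from this thread):

            lctl debug_daemon start /var/log/lustre-debug.bin 1024   # stream the debug buffer to a file, capped at ~1 GB
            # ... reproduce the problem ...
            lctl debug_daemon stop
            lctl debug_file /var/log/lustre-debug.bin /var/log/lustre-debug.txt   # convert the binary dump to text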

            ihara Shuichi Ihara (Inactive) added a comment

            Hi, Ihara

            The debug patch amounts to enabling D_TRACE & D_QUOTA for the debug log. If the customer can't afford the D_TRACE debug log, we can enable only D_QUOTA first to collect some debug output.

            The 28760 is the pid, and the 0 is the 'extern pid' (it looks like it's always 0 for now; you can just ignore it).

            niu Niu Yawei (Inactive) added a comment

            Niu, I meant you could make a debug patch to show more detailed information on the production system, e.g. who exceeded the quota limit when this message appears. We certainly didn't do any quota operations (set/clear quota) when the following message showed up.

            Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            

            We don't know how to reproduce this problem at the moment, but let me ask whether we can enable the debug flags. BTW, what is the "28760:0" in the above message?

            ihara Shuichi Ihara (Inactive) added a comment

            Hi, Ihara

            If it's easy to reproduce, we can collect the debug log on the OSS with D_TRACE & D_QUOTA enabled (echo +trace > /proc/sys/lnet/debug; echo +quota > /proc/sys/lnet/debug); then we can see where the quota acquire returns zero. Thanks.
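            Spelled out, the collection steps might look like this sketch (the enlarged debug_mb value and the dump path are assumptions, not from this thread):

            echo 256 > /proc/sys/lnet/debug_mb   # enlarge the debug buffer so it overflows less quickly
            echo +quota > /proc/sys/lnet/debug   # enable D_QUOTA messages
            echo +trace > /proc/sys/lnet/debug   # enable D_TRACE (very verbose)
            # ... wait for the "still haven't managed to acquire" message ...
            lctl dk > /tmp/lustre-debug.txt      # dump and clear the kernel debug buffer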

            niu Niu Yawei (Inactive) added a comment

            Hello Niu,

            The problem is not fixed yet, even after resetting the group quota. Last week we reset the quota to zero for all groups, but setting the quota failed for a couple of groups, as shown below.

            nos13: Jun 20 18:02:19 nos131i kernel: Lustre: 18627:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            nos13: Jun 20 21:40:47 nos131i kernel: Lustre: 29660:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18554:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10856
            nos14: Jun 18 13:10:06 nos141i kernel: LustreError: 18575:0:(quota_ctl.c:260:filter_quota_ctl()) fail to create lqs during setquota operation for gid 10857
            nos14: Jun 20 12:38:41 nos141i kernel: Lustre: 28944:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            

            Then we set quota again for these groups, but today we got the same message on the OSS.

            Jun 25 03:03:59 nos141i kernel: Lustre: 28760:0:(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
            

            I don't know why it is still failing to acquire quota space from the master. Any ideas for a workaround to avoid this issue? Or can we add debugging to track down what happens?

            ihara Shuichi Ihara (Inactive) added a comment

            To change the quota size, "clear -> set new size" isn't necessarily safer than "set new size" directly.

            Whenever 'lfs setquota' changes a quota limit from zero to non-zero (or from non-zero to zero), the quota master (MDS) notifies all slaves (OSTs) to change their local fs quota limits. However, if some slave is offline at that time (or the notification to some slave fails for some reason), an inconsistency arises between the master and the slaves that didn't receive the quota change notification.

            Changing quota from an old limit to a new limit (both non-zero values) will not trigger the quota change notification to the slaves, so it won't cause the inconsistency; of course, it can't fix an existing inconsistency either.

            In the new quota design, any slave that joins later syncs its quota settings with the master automatically, so users needn't worry about offline slaves anymore.
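            In other words, forcing the limit through zero makes the master broadcast to every slave twice, which is what re-syncs an inconsistent slave, provided all OSTs are online at that moment. A sketch (the group name, the 10 GB hard limit given in KB, and the mount point are placeholders):

            lfs setquota -g grp10856 -b 0 -B 0 -i 0 -I 0 /mnt/lustre        # clear: non-zero -> zero, slaves notified
            lfs setquota -g grp10856 -b 0 -B 10485760 -i 0 -I 0 /mnt/lustre # re-set: zero -> non-zero, slaves notified again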

            niu Niu Yawei (Inactive) added a comment

            Niu,

            The customer ran "lfs setquota" to clear all quota limits for some users.
            After that, as far as we can see, the "(quota_interface.c:481:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)" messages are gone so far.

            As you mentioned, there was an inconsistency between master and slaves: the user had not exceeded the global limit, but some local quotas had exceeded their limits. Once the local quotas on all MDS/OSSs were cleared with "lfs setquota -b 0 -B 0 ..." and the quotas were set again, quota now works normally. Is this the root cause you described, and the fix for the current situation?

            So, to change the quota size, is "1. clear quota, 2. set quota" a better and safer approach than just changing the size with "lfs setquota -b X"?

            Anyway, we very much appreciate your detailed analysis and explanation.

            ihara Shuichi Ihara (Inactive) added a comment

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: