LU-2289

still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Environment: Lustre 1.8.8 + LU-1720
    • Severity: 3
    • 5479

    Description

      Even after adding the patch from LU-1720, we are still seeing messages like:
      Lustre: 18271:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

      At this site, we haven't added any other patches to 1.8.8. What do these messages mean? Is it possible that some of the other patches (like LU-1438) could fix these?

      The customer hasn't noticed any functional issues, but of course that doesn't mean there aren't any. Quotas >4TB work on this system.

      Attachments

        1. kern.log-mds
          1.90 MB
        2. kern.log-mds-Aug30
          97 kB
        3. kern.log-oss
          0.2 kB
        4. kern.log-oss-Aug30
          305 kB

        Activity


          adilger Andreas Dilger added a comment - Closing old bug.

          niu Niu Yawei (Inactive) added a comment - Yes, when approaching the limit, write performance will be lower: the quota unit becomes very small near the limit, so there are many more quota acquire/release operations between the OSTs and the MDT, and, even worse, write operations sometimes have to block on the quota acquire. We did not measure performance in such an unusual case.
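          To make that effect concrete, here is a minimal sketch (plain Python, not Lustre code; the qunit sizes are assumed purely for illustration) of how a shrinking per-acquire unit multiplies the number of acquire operations needed to cover the same amount of written data:

# Illustrative only: models how the number of quota-acquire round trips grows
# as the per-acquire unit ("qunit") shrinks near the limit. The qunit sizes
# below are assumptions for the example, not values read from this system.

def acquires_needed(write_bytes, qunit_bytes):
    """Acquire round trips needed to cover write_bytes at a given qunit."""
    return -(-write_bytes // qunit_bytes)  # ceiling division

ONE_MB = 1 << 20
total_write = 512 * ONE_MB  # 512MB written under one uid/gid

for qunit in (128 * ONE_MB, 16 * ONE_MB, ONE_MB):  # far from the limit -> at the limit
    print(f"qunit = {qunit // ONE_MB:>3} MB -> "
          f"{acquires_needed(total_write, qunit)} acquire operations")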

          kitwestneat Kit Westneat (Inactive) added a comment - Is it possible that this could cause performance issues when the UID/GID is approaching the quota limit? Do you know if there have ever been performance tests run for this case? Thanks.

          niu Niu Yawei (Inactive) added a comment - Now I think it isn't a problem, after checking the large log covering all processes (the small one I checked before covered only a single process). When a uid/gid is approaching its quota limit, the qunit size is shrunk to the minimum (1MB), so each quota acquire can obtain only 1MB of limit. If many OST threads are writing to the same uid/gid at that time, the acquired limit is consumed very quickly, and some writing threads have to cycle through quota_chk_acq_common() several times until they obtain the required quota or eventually fail with -EDQUOT.
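          A hedged sketch of that cycle (simplified Python, not the real quota_chk_acq_common(); the function and parameter names are invented for illustration): a writing thread keeps re-checking and acquiring until it either has enough quota or the master answers -EDQUOT, and it starts logging once it passes 10 cycles:

# Simplified model of the retry behaviour, not Lustre source: shows how a
# thread can loop in the check/acquire cycle when the qunit has shrunk to 1MB
# and other threads are consuming the freshly acquired limit concurrently.
import errno

ONE_MB = 1 << 20

def chk_acq_sketch(need_bytes, acquire):
    """acquire() models one acquire RPC to the quota master; it returns the
    bytes effectively gained locally, or -errno.EDQUOT at the hard limit."""
    granted = 0
    cycle = 0
    while granted < need_bytes:
        cycle += 1
        rc = acquire()
        if rc == -errno.EDQUOT:
            return rc        # out of quota on the master
        granted += rc        # may be far less than 1MB if other threads raced us
        if cycle == 10:
            print("still haven't managed to acquire quota space "
                  "from the quota master after 10 retries")
    return 0

# Example: only 64KB of each 1MB grant is left for this thread, so a 1MB write
# needs 16 cycles and the warning is printed when the 10th cycle is reached.
print(chk_acq_sketch(ONE_MB, lambda: 64 * 1024))  # prints the warning, returns 0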

          niu Niu Yawei (Inactive) added a comment - Thank you, Kit. I checked the OSS log:

          00040000:00000400:6:1364224183.636169:0:1558:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
          00040000:04000000:6:1364224183.636179:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 10
          00040000:04000000:6:1364224183.636181:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636190:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768
          00040000:04000000:6:1364224183.636193:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00002000:04000000:6:1364224183.636199:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3
          00040000:04000000:6:1364224183.636205:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636209:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.
          00040000:04000000:6:1364224183.636214:0:1558:0:(quota_interface.c:482:quota_chk_acq_common()) recheck quota with rc: 0, cycle: 11
          00040000:04000000:6:1364224183.636217:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636224:0:1558:0:(quota_interface.c:345:quota_check_common()) count=256 lqs_pending=6340608 qd_count=5087232 isblk=2 mb=0 pending[1]=1056768
          00040000:04000000:6:1364224183.636227:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00002000:04000000:6:1364224183.636231:0:1558:0:(fsfilt-ldiskfs.c:1601:fsfilt_ldiskfs_quotactl()) quotactl command 0x800007, id 10311, type 0 failed: -3
          00040000:04000000:6:1364224183.636239:0:1558:0:(quota_adjust_qunit.c:172:quota_search_lqs()) lqs(ffff810846b6a280) id(19302) flag(2147483649) type(g) bunit(1048576) btune(524288) iunit(5120) itune(2560) lqs_bwrite_pending(6340608) lqs_iwrite_pending(0) ino_rec(0) blk_rec(2097152)refcount(11): search lqs
          00040000:04000000:6:1364224183.636243:0:1558:0:(quota_context.c:347:check_cur_qunit()) type: b, limit: 50417631232, usage: 50416787456, pending_write: 6340608, record: 2097152, qunit_sz: 1048576, tune_sz: 524288, ret: 0.
          

          It looks like there is a pending acquire request (record: 2097152), and the local limit would be able to satisfy the current write request once that pending acquire completes, so the OST should wait for the pending request. It actually did not wait, however, so quota_chk_acq_common() cycled many times until the pending request was done.

          I haven't yet seen why the OST did not wait on the pending acquire; maybe there is something wrong with the lqs_blk_rec calculation? I'll look into it further.

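          The choice being questioned can be sketched roughly as follows (hypothetical Python with invented names and made-up numbers, not the real check_cur_qunit()/quota_chk_acq_common() logic): if the acquire already in flight (blk_rec) would cover this thread's shortfall, the thread should wait on that request rather than cycle again:

# Hypothetical illustration of the wait-vs-recycle decision discussed above.
# Function names and example numbers are invented; this is not Lustre code.

def should_wait_on_pending(local_remaining, this_write, blk_rec):
    """True if the acquire already in flight (blk_rec bytes) will cover the
    part of this write that the locally held limit cannot satisfy."""
    shortfall = this_write - local_remaining
    return 0 < shortfall <= blk_rec

def next_step(local_remaining, this_write, blk_rec):
    if this_write <= local_remaining:
        return "proceed"   # local limit already covers the write
    if should_wait_on_pending(local_remaining, this_write, blk_rec):
        return "wait"      # the in-flight acquire will make enough room
    return "acquire"       # need to ask the quota master for more

# Made-up example: 512KB left locally, a 1MB write, and a 2MB acquire in flight.
print(next_step(local_remaining=512 * 1024,
                this_write=1 << 20,
                blk_rec=2 << 20))   # -> "wait"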

          kitwestneat Kit Westneat (Inactive) added a comment - Hi Niu,

          I've uploaded the logs here:
          http://eu.ddn.com:8080/lustre/LU-2289.tar.bz2

          It's about a 3-second window on each server, but it's still about 20GB uncompressed. For the message at 1364224183, it looks like the thread retries 19 times before it finally acquires the quota. Is there anything in the MDT logs that would indicate why it took so long? I didn't see anything in particular, but the logs are tough to read.


          niu Niu Yawei (Inactive) added a comment - Hi, Kit

          If you have a debug log, you can check whether the quota is eventually acquired by following the pid: find the pid in the "still haven't managed ..." message and see whether that thread keeps printing the message or eventually moves on.

          When you attach the log, I suggest extracting just the useful part; otherwise it could be too large. The useful part is the messages around "still haven't managed ..." on the OST and the corresponding messages on the MDS.

          I didn't find anything abnormal in the output of "lfs quota -u -v", thank you for providing this.

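          Following that suggestion, a small helper along these lines could pull one thread's trail out of a Lustre debug log (plain Python sketch; the script name, the lines_for_pid helper, and the assumption that the pid is the sixth colon-separated field, as in the excerpts above, are all illustrative):

# Sketch of the log triage described above: given a Lustre debug log and the
# pid from a "still haven't managed ..." line, print only that thread's lines
# so you can see whether it eventually acquires quota or keeps cycling.
# Assumes the colon-separated layout shown in the excerpts above, with the pid
# as the sixth field; adjust the index if your log format differs.
import sys

def lines_for_pid(path, pid):
    with open(path, errors="replace") as log:
        for line in log:
            fields = line.split(":", 7)
            if len(fields) > 5 and fields[5] == pid:
                yield line.rstrip("\n")

if __name__ == "__main__":
    # usage (names illustrative): python filter_pid.py <debug-log> 1558
    log_path, pid = sys.argv[1], sys.argv[2]
    for entry in lines_for_pid(log_path, pid):
        print(entry)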

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: orentas Oz Rentas (Inactive)
            Votes: 0
            Watchers: 6
