
Significant performance issue when user over soft quota limit

Details


    Description

      When a user goes over their softlimit there is a major performance hit.

      Testing showed a file that copied in 3 seconds when under the softlimit took 7 minutes when over the softlimit.

      This can be reproduced simply by testing below and then over the softlimit.

      See the attached trace from when the copy was slow.


          Activity

            [LU-4139] Significant performance issue when user over soft quota limit
            pjones Peter Jones added a comment -

            ok Mahmoud


            mhanafi Mahmoud Hanafi added a comment -

            Please close this one.
            yujian Jian Yu added a comment - edited

            Patch http://review.whamcloud.com/8078 landed on master branch and was cherry-picked to Lustre b2_4 branch.


            mhanafi Mahmoud Hanafi added a comment -

            New benchmark numbers with the patch:

            Direct I/O
            UnderSoftlimit: 383MB/sec
            OverSoftlimit: 359MB/sec

            Buffered I/O
            UnderSoftlimit: 316MB/sec
            OverSoftlimit: 304MB/sec

            So it looks good!

            niu Niu Yawei (Inactive) added a comment -

            How does this patch help with the 4k io sizes? I think that is the real issue with the performance.

            The 4K io size is caused by the over-quota flag on the client. With this patch, the slave can acquire/pre-acquire a little more spare limit each time it is over the softlimit, so the over-quota flag won't be set on the client anymore.
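            That mechanism can be sketched roughly as follows. This is a minimal illustration with hypothetical names and sizes, not the actual change in http://review.whamcloud.com/8078: the slave asks for a little spare limit beyond its immediate need whenever the ID is over the softlimit, provided the extra grant cannot push usage past the hardlimit.

                /* Minimal sketch (hypothetical names and sizes, not the landed
                 * patch): pre-acquire spare limit once over the softlimit so
                 * the over-quota flag need not be reported to clients. */
                #include <stdint.h>

                #define QUNIT_MIN   (4ULL << 10)   /* 4KB: assumed minimum grant */
                #define QUNIT_SPARE (1ULL << 20)   /* 1MB: assumed extra headroom */

                static uint64_t slave_acquire(uint64_t usage, uint64_t softlimit,
                                              uint64_t hardlimit)
                {
                        uint64_t grant = QUNIT_MIN;

                        /* over softlimit: grab more than strictly needed, as
                         * long as that cannot push the ID past the hardlimit */
                        if (softlimit != 0 && usage >= softlimit &&
                            (hardlimit == 0 ||
                             usage + grant + QUNIT_SPARE <= hardlimit))
                                grant += QUNIT_SPARE;

                        return grant;
                }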

            mhanafi Mahmoud Hanafi added a comment -

            How does this patch help with the 4k io sizes? I think that is the real issue with the performance.

            niu Niu Yawei (Inactive) added a comment -

            Lose some grace time accuracy to improve write performance when over softlimit: http://review.whamcloud.com/8078

            niu Niu Yawei (Inactive) added a comment -

            While the servers run 2.4.1 the clients are 2.1.5. The client code has no knowledge of the new quota rules. Which variable/field enforces sync write, and how does the server tell clients to start using sync write? I found where the qunit is adjusted, but I have not figured out how the sync write is enforced.

            The new quota code didn't change the client protocol, so sync write is triggered when approaching the limit the same way as before; please check the server code in qsd_op_begin0():

                __u64   usage;

                lqe_read_lock(lqe);
                /* projected usage: current usage plus all pending and
                 * queued writes, padded by the sync threshold */
                usage  = lqe->lqe_usage;
                usage += lqe->lqe_pending_write;
                usage += lqe->lqe_waiting_write;
                usage += qqi->qqi_qsd->qsd_sync_threshold;

                /* if we should notify client to start sync write */
                if (usage >= lqe->lqe_granted - lqe->lqe_pending_rel)
                        *flags |= LQUOTA_OVER_FL(qqi->qqi_qtype);
                else
                        *flags &= ~LQUOTA_OVER_FL(qqi->qqi_qtype);
                lqe_read_unlock(lqe);

            And the client code osc_queue_async_io() -> osc_quota_chkdq().

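            For readers tracing that client path, here is a simplified model of the check. The structure below is a toy stand-in, not the real osc_quota_chkdq(), which consults a per-OSC hash of IDs the server flagged with LQUOTA_OVER_FL:

                #include <stdbool.h>
                #include <stdint.h>

                enum { USRQUOTA, GRPQUOTA, MAXQUOTAS };

                struct quota_flags {
                        /* toy stand-in for the per-OSC hash of flagged IDs */
                        bool over[MAXQUOTAS][1 << 16];
                };

                /* returns true when the write may stay in the client cache */
                static bool quota_chkdq(const struct quota_flags *qf,
                                        uint32_t uid, uint32_t gid)
                {
                        if (qf->over[USRQUOTA][uid] ||
                            qf->over[GRPQUOTA][gid])
                                return false;   /* flagged: force sync write */
                        return true;            /* async write may proceed */
                }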

            jaylan Jay Lan (Inactive) added a comment -

            While the servers run 2.4.1 the clients are 2.1.5. The client code has no knowledge of the new quota rules. Which variable/field enforces sync write, and how does the server tell clients to start using sync write? I found where the qunit is adjusted, but I have not figured out how the sync write is enforced.

            niu Niu Yawei (Inactive) added a comment -

            What I noticed is when I was over my softlimit using cp, all the I/O was 4KB RPCs. I was able to see this happen in the middle of my test: as I went over my softlimit, the RPCs would drop from 1MB to 4KB. Also, using IOR with buffered I/O, all the RPCs were 4K. It seems that the smaller I/O sizes are the main issue.

            For quota accuracy, when approaching (or over) the quota hardlimit (or softlimit), the client turns to sync write (see bug 16642), and in consequence the RPC size will be the page size, 4K (pages can't be cached on the client; they have to be synced out on write).
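            To make the cost concrete, a back-of-the-envelope comparison, with illustrative sizes only (4KB pages, 1MB brw RPCs):

                #include <stdio.h>

                #define PAGE_SIZE    4096u            /* sync write: one page per RPC */
                #define MAX_RPC_SIZE (1024u * 1024u)  /* cached write: 1MB brw RPC */

                int main(void)
                {
                        unsigned int bytes = 64u * 1024u * 1024u;  /* a 64MB copy */

                        /* 64 one-MB RPCs when caching, vs 16384 page-sized RPCs
                         * when every page must be synced out individually */
                        printf("cached: %u RPCs of %u bytes\n",
                               bytes / MAX_RPC_SIZE, MAX_RPC_SIZE);
                        printf("sync:   %u RPCs of %u bytes\n",
                               bytes / PAGE_SIZE, PAGE_SIZE);
                        return 0;
                }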

            I think it is unintentional that over-softlimit IO is done in 4kB chunks, even if the qunit is getting 1MB chunks. Is it possible to avoid throttling the clients if there is a large gap between the soft and hard quota limits (i.e. treating over softlimit the same as under softlimit if there is still a large margin before the hardlimit)?

            As I described above, the page size (4KB) io is because of sync write on the client. To avoid sync write on the client after going over the softlimit, I think we can probably tweak the qunit size differently when over the softlimit. I'll try to cook up a patch.
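            One possible shape of that tweak, sketched under assumed names (not the final patch): keep the qunit large after the softlimit is crossed while the hardlimit is still far away, and shrink it only as usage closes in on the hardlimit.

                #include <stdint.h>

                /* hypothetical qunit adjustment: the softlimit alone no longer
                 * shrinks qunit; only proximity to the hardlimit does */
                static uint64_t adjust_qunit(uint64_t qunit, uint64_t usage,
                                             uint64_t hardlimit)
                {
                        if (hardlimit == 0)     /* no hardlimit: nothing to overrun */
                                return qunit;

                        /* shrink only when fewer than two qunits of room remain */
                        while (qunit > 1 && usage + 2 * qunit > hardlimit)
                                qunit /= 2;

                        return qunit;
                }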

            adilger Andreas Dilger added a comment -

            Niu,
            I think it is unintentional that over-softlimit IO is done in 4kB chunks, even if the qunit is getting 1MB chunks. Is it possible to avoid throttling the clients if there is a large gap between the soft and hard quota limits (i.e. treating over softlimit the same as under softlimit if there is still a large margin before the hardlimit)?

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 6
