
LU-7015: Grant space and reserved blocks percent parameters

Details

    • Type: Question/Request
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.4
    • Component/s: None
    • Environment: RHEL-6.6, lustre-2.5.4

    Description

      Our system space utilization on one of our systems is high, and as we work to prune some of this data, we're exploring some other space tunings.

      One of our admins noted the "cur_grant_bytes" osc parameter. When we looked at a few clients, we saw that this variable often exceeds max_dirty_mb, sometimes by an order of magnitude. We usually use 64MB of dirty cache per osc per client. Is there an upper limit to this cur_grant_bytes parameter? What are the side effects of setting this value to some lower value (or 0)? Can we reduce this client grant while there is active I/O, and can we do this for all osc connections simultaneously (for a collective of millions of osc connections) for a system? Is this documented well anywhere?

      Additionally, we are looking into tuning the reserved_blocks_percent parameter. The Lustre manual states that 5% is the minimum, but is that a sane value for all OST sizes?
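
      For reference, a quick way to dump the two client-side values being compared here is something like:

      lctl get_param osc.*.cur_grant_bytes osc.*.max_dirty_mb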

      Thanks,

      Jesse

    Activity

            [LU-7015] Grant space and reserved blocks percent parameters

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Landed for 2.8.0

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16216/
            Subject: LU-7015 ofd: Fix wanted grant calculation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 091988499717c729f8870b331ab3774b249d5818

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/16216
            Subject: LU-7015 ofd: Fix wanted grant calculation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 22c2ad105d9d420058f653f03030ce2e4a3f017b

            adilger Andreas Dilger added a comment -

            The upper limit for wanted grant is typically at least 2x max_dirty_mb, so setting max_dirty_mb=64M doubles the potential amount of grant per client. With 22,000 clients that will result in a large amount of grant, as you have seen. Presumably the change to max_dirty_mb=64M was done to improve single-client write performance and/or increase writeback caching on the client without blocking the IO?

            Unfortunately, without the grant shrinker active, writing to cur_grant_bytes will not permanently affect the amount of grant held by the client unless the filesystem is nearly out of space. Otherwise, the client will try to surrender the grant but the server will reply that there is still grant available and return the same amount back. Only when the available space begins to get constrained will the OST not return the grant, so writing to cur_grant_bytes can still be used on clients as an emergency measure when available space is running short, something like:

            pdsh -w <clients> lctl set_param osc.*.cur_grant_bytes=2M
            

            (or even 1MB if necessary) and then clients which do not need any grant will not get any more.
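
            As a rough survey before doing that (a sketch only; <clients> is a placeholder host list), the total grant currently held by each client can be tallied from the same parameter:

            pdsh -w <clients> lctl get_param -n osc.*.cur_grant_bytes 2>/dev/null |
                awk '{sum[$1] += $2} END {for (h in sum) printf "%s %.0f MB\n", h, sum[h]/1048576}'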

            The grant shrinking code had problems when it was first introduced (before 2.0 was released) and has never been fixed since then.

            ezell Matt Ezell added a comment -

            I just ran a quick test on our TDS system. I took a newly mounted client and created 50 files striped across OST 0. I backgrounded 50 dd processes against those files and gathered logs with +cache enabled on the client and server.

            The first thing I noticed is that the server very quickly increased the grant to the client, maybe even before the client had a chance to realize it.

            00002000:00000020:4.0:1439910325.382841:0:36645:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:4.0:1439910325.383027:0:31053:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:4.0:1439910325.383615:0:36646:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:4.0:1439910325.383775:0:36647:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:4.0:1439910325.384272:0:36648:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:4.0:1439910325.385007:0:36649:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:4.0:1439910325.385154:0:36650:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:6.0:1439910325.416668:0:36648:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 335872 granting: 8388608
            00002000:00000020:5.0:1439910325.417207:0:36649:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:6.0:1439910325.417262:0:36645:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 0 granting: 8388608
            00002000:00000020:6.0:1439910325.433766:0:31053:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 29917184 granting: 8388608
            00002000:00000020:4.0:1439910325.433773:0:36646:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 8187904 granting: 8388608
            00002000:00000020:5.0:1439910325.433789:0:31052:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 22822912 granting: 8388608
            00002000:00000020:6.0:1439910325.434528:0:31054:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 25923584 granting: 8388608
            00002000:00000020:4.0:1439910325.434534:0:36647:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 25845760 granting: 8388608
            00002000:00000020:5.0:1439910325.591676:0:36650:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 32403456 granting: 8388608
            00002000:00000020:4.0:1439910325.591852:0:36652:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 32382976 granting: 8388608
            00002000:00000020:5.0:1439910325.591860:0:36647:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 32608256 granting: 8388608
            00002000:00000020:6.0:1439910325.593790:0:31054:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 30371840 granting: 8388608
            00002000:00000020:5.0:1439910325.595378:0:36651:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 29700096 granting: 8388608
            00002000:00000020:4.0:1439910325.595384:0:31052:0:(ofd_grant.c:662:ofd_grant()) atlastds-OST0000: cli ce2fb5d0-e502-410d-675d-3b8d0dd26305/ffff880806fe3c00 wants: 33554432 current grant 29696000 granting: 8388608
            

            The server granted it 56MB before the client even reported having a grant.

            I haven't read all of the grant-related code, so take this analysis with a grain of salt.

            Is the want parameter supposed to be an absolute or relative value?

            lustre/ofd/ofd_grant.c:ofd_grant()
                    /* Grant some fraction of the client's requested grant space so that
                     * they are not always waiting for write credits (not all of it to
                     * avoid overgranting in face of multiple RPCs in flight).  This
                     * essentially will be able to control the OSC_MAX_RIF for a client.
                     *
                     * If we do have a large disparity between what the client thinks it
                     * has and what we think it has, don't grant very much and let the
                     * client consume its grant first.  Either it just has lots of RPCs
                     * in flight, or it was evicted and its grants will soon be used up. */
                    if (curgrant >= want || curgrant >= fed->fed_grant + grant_chunk)
                               RETURN(0);
            

            This looks like want is being used as an absolute value. Assuming want should be absolute, do we also need a check to ensure that fed->fed_grant isn't much larger than want?

            lustre/ofd/ofd_grant.c:ofd_grant()
            grant = min(want, left);
            ...
                    /* Limit to ofd_grant_chunk() if not reconnect/recovery */
                    if ((grant > grant_chunk) && conservative)
                            grant = grant_chunk;
            ...
                    ofd->ofd_tot_granted += grant;
                    fed->fed_grant += grant;
            

            This looks like want is a relative value.

            So the client repeatedly says "I want 32MB", and the server takes that request, lowers it to grant_chunk (8MB), and grants 8MB repeatedly until the client claims it has at least 32MB.
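
            To make that concrete, here is a toy shell model (not Lustre code, and it ignores the fed_grant check in the real function) of what the log shows: several RPCs in flight all carry the same stale curgrant, so each one gets topped up by a full grant_chunk:

            # toy model only; numbers taken from the log above (want=32MB, grant_chunk=8MB)
            want=$((32 * 1024 * 1024))
            chunk=$((8 * 1024 * 1024))
            fed_grant=0                          # server-side view of the client's grant
            for rpc in 1 2 3 4 5 6 7; do
                curgrant=0                       # every in-flight RPC still reports the stale client view
                if (( curgrant < want )); then
                    (( fed_grant += chunk ))     # a full chunk per RPC; want is never reduced
                fi
            done
            echo "server-side grant after 7 RPCs: $(( fed_grant / 1048576 )) MB (client asked for 32 MB)"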

            According to Andreas in LU-3859, OBD_CONNECT_GRANT_SHRINK isn't set, so this is never cleaned up automatically. Is there a reason this is disabled?

            ezell Matt Ezell added a comment -

            It looks like ofd_grant_space_left() uses ofd->ofd_osfs.os_bavail, so it appears to take the reserved space into account.

            ezell Matt Ezell added a comment -

            I guess until we get usage down or a patch for this, we will need to periodically shrink grants on clients to avoid ENOSPC.

            The source of the question about reserved space was to better understand when a user might get ENOSPC. Would it be when a client has exhausted its grant and (kbytesfree - (tot_granted/1024)) <= 0 or does it use (kbytesavail - (tot_granted/1024)) <= 0 ?
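
            For what it's worth, a rough per-OST headroom check on an OSS might look like the following (a sketch; it uses kbytesavail per the os_bavail observation above, and assumes the usual obdfilter parameter names, but which formula the server actually applies is exactly the open question):

            for ost in $(lctl list_param obdfilter.*); do
                avail_kb=$(lctl get_param -n $ost.kbytesavail)
                granted_b=$(lctl get_param -n $ost.tot_granted)
                echo "$ost headroom: $(( avail_kb - granted_b / 1024 )) KB"
            done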

            The Lustre Operations manual has a pretty strong warning about lowering the reserved space:

            Reducing the space reservation can cause severe performance degradation as the OST file system becomes more than 95% full, due to difficulty in locating large areas of contiguous free space. This performance degradation may persist even if the space usage drops below 95% again. It is recommended NOT to reduce the reserved disk space below 5%.

            But if that gives us a little headroom, we know the grants will help keep us from getting too close to completely out of space.

            hanleyja Jesse Hanley added a comment -

            Thanks Oleg for the detail!

            These servers were originally formatted with Lustre 2.4. When I checked, it looks like the OSTs are at 5% reserved:

            Block count: 3755999232
            Reserved block count: 187799961

            187799961 / 3755999232 * 100 ~= 5%

            With that being the case, can/should we lower this to a smaller reserved block count?

            Also, do I need to submit a new case about the server logic?

            Thanks!

            Jesse

            green Oleg Drokin added a comment -

            cur_grant_bytes is how much grant was received from this OST by a client. It has no direct relation to max_dirty_mb other than max_dirty_mb cannot be higher than this.

            Technically the calculation for the grant request is

                            long max_in_flight = (cli->cl_max_pages_per_rpc <<
                                                  PAGE_CACHE_SHIFT) *
                                                 (cli->cl_max_rpcs_in_flight + 1);
                            oa->o_undirty = max(cli->cl_dirty_max_pages << PAGE_CACHE_SHIFT,
                                                max_in_flight);
            

            This is how much grant every client write RPC requests.
            It's the maximum of either (max_rpcs_in_flight + 1) times the RPC size, or max_dirty_mb.
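
            As a worked example (the 4 MiB RPC size and 8 RPCs in flight here are assumed illustrative values, not taken from this system; 64 is the max_dirty_mb mentioned in the description):

            rpc_mb=4; rif=8; dirty_mb=64
            max_in_flight=$(( rpc_mb * (rif + 1) ))                              # 36 MiB
            o_undirty=$(( dirty_mb > max_in_flight ? dirty_mb : max_in_flight )) # 64 MiB
            echo "each write RPC asks the OST for ${o_undirty} MiB of grant"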

            Theoretically we should not exceed this value (want = o_undirty):

                    /* Grant some fraction of the client's requested grant space so that
                     * they are not always waiting for write credits (not all of it to
                     * avoid overgranting in face of multiple RPCs in flight).  This
                     * essentially will be able to control the OSC_MAX_RIF for a client.
                     *
                     * If we do have a large disparity between what the client thinks it
                     * has and what we think it has, don't grant very much and let the
                     * client consume its grant first.  Either it just has lots of RPCs
                     * in flight, or it was evicted and its grants will soon be used up. */
                    if (curgrant >= want || curgrant >= fed->fed_grant + grant_chunk)
                               RETURN(0);
            ...
                    grant = min(want, left);
                    /* round grant upt to the next block size */
                    grant = (grant + (1 << ofd->ofd_blockbits) - 1) &
                            ~((1ULL << ofd->ofd_blockbits) - 1);
                    /* Limit to ofd_grant_chunk() if not reconnect/recovery */
                    if ((grant > grant_chunk) && conservative)
                            grant = grant_chunk;
            ...
                    ofd->ofd_tot_granted += grant;
                    fed->fed_grant += grant;
            

            So I imagine the biggest case could be that if a client sends a bunch of requests while its grant is nearly at the max, then every one of those RPCs would return 2M of grant each, which theoretically should only allow a client to receive at most 2x the max_dirty_mb or max_rpcs_in_flight megabytes (though if you are in recovery, then every request could bring as much grant).
            Overall there seems to be a bit of a logic flaw in the server-side granting logic, where after the initial checks want should be recalculated as want -= grant or something like that.
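
            Reading that suggestion against the toy model in Matt's test comment above, recomputing the remaining want against the server-side view caps the total at what the client asked for (a sketch of the idea only, not the actual patch):

            # same toy numbers, but each RPC only tops the client up to what it still lacks
            want=$((32 * 1024 * 1024)); chunk=$((8 * 1024 * 1024)); fed_grant=0
            for rpc in 1 2 3 4 5 6 7; do
                remaining=$(( want - fed_grant ))            # recompute against the server-side view
                (( remaining <= 0 )) && continue
                grant=$(( remaining < chunk ? remaining : chunk ))
                (( fed_grant += grant ))
            done
            echo "server-side grant after 7 RPCs: $(( fed_grant / 1048576 )) MB"   # capped at 32 MB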

            While you can write a low value into the proc file, it would only have the effect of immediately releasing the extra grant above the value you write there; the value does not stick, and the grant will keep accumulating according to the calculations above.

            As for the reserved_blocks_percent, do you mean the ext4 reservation? I think recent e2fsprogs reduced that to a smaller value for large filesystem sizes by default already.
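
            If it is the ext4/ldiskfs reservation, the standard ext4 tools can inspect and change it; a sketch, with /dev/sdX as a placeholder OST device (and keeping in mind the manual's warning about going below 5% quoted elsewhere in this ticket):

            # show the current reservation on an ldiskfs OST (/dev/sdX is a placeholder)
            dumpe2fs -h /dev/sdX 2>/dev/null | grep -i 'reserved block'
            # lower the reservation to e.g. 2% (tune2fs -r sets an absolute block count instead)
            tune2fs -m 2 /dev/sdX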

            ezell Matt Ezell added a comment -

            To include some numbers:

            # ls exports | wc -l
            20190
            # echo "$(cat tot_granted) / $(ls exports | wc -l)" | bc
            116961044
            # echo "100 * $(cat tot_granted) / 1024 / $(cat kbytestotal)" | bc
            15
            

            Our 20,000 clients average 116MB of grants per OST, resulting in 15% of the OST reserved for grants. That means when any OST hits 85% full, users start getting ENOSPC. I picked a random client and grant sizes range from 2MB to 343MB per OSC.


            People

              Assignee: green Oleg Drokin
              Reporter: hanleyja Jesse Hanley
              Votes: 0
              Watchers: 13