<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:07:04 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14125] client starved for grant but OST has plenty of free space</title>
                <link>https://jira.whamcloud.com/browse/LU-14125</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Something is causing cur_grant_bytes for some OSCs to go below 1MB.  Which OSCs and which nodes are affected appears to be random.  The OSTs themselves have many TB of free space.  Sequential writes (e.g. dd if=/dev/urandom of=file_on_ost_3 bs=1M count=40, where the file has just one stripe) produce osc_enter_cache() debug entries reporting that the client is falling back to sync I/O.  We also see osc_update_grant() report that it got 0 extra grant.&lt;/p&gt;
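The failing path, reduced to a toy sketch using the avail/need values from the debug log below (logic simplified for illustration, not Lustre's actual osc_enter_cache() code):

```python
# Toy version of the cache-entry decision seen in the log below:
# with avail=997461 and need=1703936 the reservation fails and the
# write falls back to sync I/O.  Simplified, not the real code path.
def enter_cache(avail_grant, need):
    if need > avail_grant:
        return "no grant space, fall back to sync i/o"
    return "cached"

print(enter_cache(997461, 1703936))
```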

&lt;p&gt;We have not been able to identify a workload or other trigger that pushes cur_grant_bytes low in the first place.  We also have not been able to find a workaround that results in the OST returning extra grant.&lt;/p&gt;

&lt;p&gt;We set grant_shrink=0 on all clients (using set_param -P on the mgs) and then stopped and started all the OSTs on the file system.  This did not change the symptoms in any obvious way.&lt;/p&gt;

&lt;p&gt;Client snippet with debug=&quot;+cache&quot;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000008:00000020:55.0:1604686702.850333:0:15766:0:(osc_cache.c:1613:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936

00000008:00000020:55.0:1604686702.850335:0:15766:0:(osc_cache.c:1543:osc_enter_cache_try()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936

00000008:00000020:55.0:1604686702.850337:0:15766:0:(osc_cache.c:1658:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }no grant space, fall back to sync i/o

00000008:00400020:55.0:1604686702.850352:0:15766:0:(osc_io.c:127:osc_io_submit()) 256 1
00000008:00000020:55.0:1604686702.850385:0:15766:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- update pending cmd 2 delta 256.
00000008:00000020:55.0:1604686702.850387:0:15766:0:(osc_cache.c:2297:osc_io_unplug0()) Queue writeback work for client ffff8b1efb0d25e0.
00000008:00000020:19.0:1604686702.850400:0:20698:0:(osc_request.c:3171:brw_queue_work()) Run writeback work for client obd ffff8b1efb0d25e0.
00000008:00000020:19.0:1604686702.850402:0:20698:0:(osc_cache.c:2222:osc_check_rpcs()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- 0 in flight
00000008:00000020:19.0:1604686702.850404:0:20698:0:(osc_cache.c:1697:osc_makes_rpc()) high prio request forcing RPC
00000008:00000020:19.0:1604686702.850405:0:20698:0:(osc_cache.c:1888:try_to_add_extent_for_io()) extent ffff8b2c6b9c7c30@{[9984 -&amp;gt; 10239/10239], [1|0|+|lockdone|wShu|ffff8b2c6bc58640], [0|256|+|-|          (null)|256|          (null)]} trying to add this extent
00000008:00000020:19.0:1604686702.850408:0:20698:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 0|-|- rd 0|- update pending cmd 2 delta -256.
00000008:00000020:19.0:1604686702.850441:0:20698:0:(osc_request.c:705:osc_announce_cached()) dirty: 0 undirty: 1879048191 dropped 0 grant: 997461
00000008:00000020:19.0:1604686702.850443:0:20698:0:(osc_request.c:714:osc_update_next_shrink()) next time 6200398 to shrink grant
00000008:00000020:60.0:1604686703.244890:0:20699:0:(osc_request.c:727:osc_update_grant()) got 0 extra grant
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Server snippet:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:00000020:1.0:1604686702.851666:0:15401:0:(tgt_grant.c:413:tgt_grant_statfs()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 free: 263316194721792 avail: 263316186333184
00000020:00000020:1.0:1604686702.851668:0:15401:0:(tgt_grant.c:477:tgt_grant_space_left()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 avail 263316186333184 left 262158115930112 unstable 3407872 tot_grant 1158069646026 pending 3407872
00000020:00000020:1.0:1604686702.851670:0:15401:0:(tgt_grant.c:519:tgt_grant_incoming()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 reports grant 997461 dropped 0, local 1882456063
00000020:00000020:1.0:1604686702.851672:0:15401:0:(tgt_grant.c:848:tgt_grant_check()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 granted: 0 ungranted: 1703936 grant: 1882456063 dirty: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each pool contains only one OST, and the storage is used for nothing else.  All the OSTs have about the same amount of free space.  This is the pool containing OST0003.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@brass8:toss-4917-grant]# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
brass8   580T   243T   337T         -    25%    41%  1.00x  ONLINE  -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>zfs-0.7&lt;br/&gt;
lustre-2.12.5_5.llnl-1.ch6.x86_64</environment>
        <key id="61579">LU-14125</key>
            <summary>client starved for grant but OST has plenty of free space</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="4" iconUrl="https://jira.whamcloud.com/images/icons/statuses/reopened.png" description="This issue was once resolved, but the resolution was deemed incorrect. From here issues are either marked assigned or resolved.">Reopened</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>topllnl</label>
                    </labels>
                <created>Fri, 6 Nov 2020 18:44:42 +0000</created>
                <updated>Fri, 29 Jul 2022 23:11:06 +0000</updated>
                                                            <fixVersion>Lustre 2.14.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>21</watches>
                                                                            <comments>
                            <comment id="284522" author="ofaaland" created="Fri, 6 Nov 2020 18:50:34 +0000"  >&lt;p&gt;We can get a debug patch on either the server or the client within about 1 week (it would need to be pushed to gerrit and pass tests first).  For instance, there are no CDEBUGs in osc_should_shrink_grant(), so a debug patch could add some.&lt;/p&gt;

&lt;p&gt;We haven&apos;t identified a workaround other than stopping and starting servers, which is obviously not good enough.   Workaround suggestions would be very useful, as this is significantly impacting our users.&lt;/p&gt;</comment>
                            <comment id="284523" author="pjones" created="Fri, 6 Nov 2020 18:54:51 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="284524" author="ofaaland" created="Fri, 6 Nov 2020 18:57:24 +0000"  >&lt;p&gt;Thanks Peter&lt;/p&gt;</comment>
                            <comment id="284525" author="ofaaland" created="Fri, 6 Nov 2020 18:58:07 +0000"  >&lt;p&gt;For my tracking, my local ticket is TOSS-4917&lt;/p&gt;</comment>
                            <comment id="284531" author="adilger" created="Fri, 6 Nov 2020 19:17:52 +0000"  >&lt;p&gt;There was a grant-related overflow fixed in patch &lt;a href=&quot;https://review.whamcloud.com/39380&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39380&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13763&quot; title=&quot;ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13763&quot;&gt;&lt;del&gt;LU-13763&lt;/del&gt;&lt;/a&gt; osc: don&apos;t allow negative grants&lt;/tt&gt;&quot; that may be causing issues here. If the client thinks it has little or no grant, but the server thinks it has lots of grant, the client will keep asking, but the server will not give it more grant until it uses its current amount.&lt;/p&gt;

&lt;p&gt;The patch was landed to b2_12 but not until after 2.12.5 was released. &lt;/p&gt;

&lt;p&gt;Mike, for debugging purposes, it would be useful to add per-export grant parameters on the server (&quot;&lt;tt&gt;obdfilter.&amp;#42;.exports.&amp;#42;.grant&amp;#42;&lt;/tt&gt;&quot; to match the clients).  That would allow comparing the values on the client and server to see what kind of difference there is.&lt;/p&gt;</comment>
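The stall described here can be sketched as a toy model; the function name and the 2 MiB threshold are illustrative, not the actual tgt_grant.c logic:

```python
# Toy model of the client/server grant mismatch: the server already
# believes this client holds ~1.8 GiB of grant, so it hands out
# nothing more until that amount is consumed, while the client's own
# cur_grant_bytes is under 1 MiB and every write goes synchronous.
def server_extra_grant(server_granted, want):
    if server_granted > 2 * 1048576:
        return 0          # server: "you already have plenty"
    return want

# Values from this ticket: granted=1882456063 on the OSS,
# cur_grant_bytes=997461 on the client, need=1703936 per write.
extra = server_extra_grant(1882456063, 1703936)
assert extra == 0         # client stays starved
```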
                            <comment id="284532" author="ofaaland" created="Fri, 6 Nov 2020 19:20:35 +0000"  >&lt;p&gt;Thanks, Andreas.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Mike, for debugging purposes, it would be useful to add per-export grant parameters on the server (&quot;obdfilter.*.exports.*.grant*&quot; to match the clients). That would allow comparing the values on the client and server and see what kind of difference there is.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Like this?&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/39324/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/39324/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="284533" author="ofaaland" created="Fri, 6 Nov 2020 19:24:40 +0000"  >&lt;blockquote&gt;&lt;p&gt;There was a grant-related overflow fixed in patch &lt;a href=&quot;https://review.whamcloud.com/39380&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39380&lt;/a&gt; &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13763&quot; title=&quot;ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13763&quot;&gt;&lt;del&gt;LU-13763&lt;/del&gt;&lt;/a&gt; osc: don&apos;t allow negative grants&quot; that may be causing issues here. If the client thinks it has little or no grant, but the server thinks it has lots of grant, the client will keep asking, but the server will not give it more grant until it uses its current amount.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ah.  Next week, the systems are scheduled to get an update with that patch.&lt;/p&gt;

&lt;p&gt;Specifically, this:&lt;br/&gt;
&lt;a href=&quot;https://github.com/LLNL/lustre/compare/2.12.5_5.llnl...2.12.5_10.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/compare/2.12.5_5.llnl...2.12.5_10.llnl&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="284538" author="adilger" created="Fri, 6 Nov 2020 20:03:14 +0000"  >&lt;p&gt;Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/40563&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40563&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; obdclass: add grant fields to export procfile&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: efeaa656a5533fb6cff20c06447c7bf686db2bce&lt;/p&gt;</comment>
                            <comment id="284548" author="simmonsja" created="Fri, 6 Nov 2020 20:40:29 +0000"  >&lt;p&gt;ORNL is also running into this problem and we have the grant overflow fix patch.&lt;/p&gt;</comment>
                            <comment id="284551" author="ofaaland" created="Fri, 6 Nov 2020 21:05:23 +0000"  >&lt;blockquote&gt;&lt;p&gt;ORNL is also running into this problem and we have the grant overflow fix patch.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Thanks James, that&apos;s good to know.&lt;/p&gt;</comment>
                            <comment id="284554" author="ofaaland" created="Fri, 6 Nov 2020 21:19:23 +0000"  >&lt;p&gt;I&apos;d forgotten, the lustre version running on these two machines already has the per-export grant parameters patch.&lt;/p&gt;

&lt;p&gt;From the server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@brass8:192.168.128.133@o2ib32]# cat export 
726797b8-322a-1989-0cb5-3645daf9a6ce:
    name: lsrza-OST0003
    client: 192.168.128.133@o2ib32
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, grant_shrink, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, grant_param, lvb_type, short_io, lfsck, bulk_mbits, second_flags, lockaheadv2 ]
    connect_data:
       flags: 0xa0425af2e3440478
       instance: 42
       target_version: 2.12.5.0
       initial_grant: 2097152
       max_brw_size: 1048576
       grant_block_size: 1048576
       grant_inode_size: 4096
       grant_max_extent_size: 1073741824
       grant_extent_tax: 655360
       cksum_types: 0xf7
       max_object_bytes: 9223372036854775807
    export_flags: [  ]
    grant:
       granted: 1882456063
       dirty: 0
       pending: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From the client:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rztopaz133:toss-4917-grant]# lctl list_nids
192.168.128.133@o2ib32
[root@rztopaz133:toss-4917-grant]# lctl get_param osc.lsrza-OST0003*.cur_grant_bytes
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.cur_grant_bytes=997461
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="284577" author="adilger" created="Sat, 7 Nov 2020 01:50:27 +0000"  >&lt;p&gt;So &lt;tt&gt;1882456063 = 0x7033ffff&lt;/tt&gt;, which means that it is very close to overflowing, and likely had overflowed at some point in the past?&lt;/p&gt;

&lt;p&gt;Would it be possible to check all of the exports on that OSS (and/or other OSSes) and see whether a high &lt;tt&gt;granted:&lt;/tt&gt; value on the OSS and a low &lt;tt&gt;cur_grant_bytes&lt;/tt&gt; on the client correlate with the clients that are having slow writes?  In theory they should be identical between client and OSS when the client does not have any dirty cached pages.   It may not be a 100% correlation, because some clients might not have actually overflowed yet.&#160;&lt;/p&gt;</comment>
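A quick check of that arithmetic (pure Python, using only the logged value):

```python
# The stuck per-export granted value is 0x7033ffff, leaving only
# ~253 MiB of headroom below the signed 32-bit limit, consistent
# with a counter that overflowed (or is about to).
granted = 1882456063
assert hex(granted) == "0x7033ffff"
headroom = 0x7fffffff - granted
assert headroom == 265027584   # about 253 MiB
```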
                            <comment id="284657" author="ofaaland" created="Mon, 9 Nov 2020 07:04:43 +0000"  >&lt;p&gt;Hi Andreas, the high &lt;b&gt;granted:&lt;/b&gt; line on the OST does not correlate well with low &lt;b&gt;cur_grant_bytes&lt;/b&gt; on the clients.&lt;/p&gt;

&lt;p&gt;One cluster, rztopaz, has 6 clients with &lt;b&gt;cur_grant_bytes&lt;/b&gt; &amp;lt; 1M for OST0003. The OST reports that 238 exports to nodes in that cluster have &lt;b&gt;granted:&lt;/b&gt; value of 1882456063.&lt;/p&gt;

&lt;p&gt;It&apos;s interesting there are so many nodes with granted = 1882456063.&#160;&lt;br/&gt;
 I spot-checked 5 clients with that value for granted, and cur_grant_bytes on those 5 clients are all over the place.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.102@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.104@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.108@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.109@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.110@o2ib32/export:       granted: 1882456063
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;vs&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;e102: osc.lsrza-OST0003-osc-ffff9bbe688de000.cur_grant_bytes=82790394
e104: osc.lsrza-OST0003-osc-ffff9b69c9dea000.cur_grant_bytes=1412694016
e108: osc.lsrza-OST0003-osc-ffffa1006893e000.cur_grant_bytes=271487350
e109: osc.lsrza-OST0003-osc-ffff8ac65aa26800.cur_grant_bytes=1072398336
e110: osc.lsrza-OST0003-osc-ffff8f737194f800.cur_grant_bytes=1875640319
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="284742" author="adilger" created="Mon, 9 Nov 2020 19:39:23 +0000"  >&lt;p&gt;Olaf, what values do you have on the client(s) for &lt;tt&gt;osc.&amp;#42;.max_pages_per_rpc&lt;/tt&gt;, &lt;tt&gt;osc.&amp;#42;.max_rpcs_in_flight&lt;/tt&gt;, and &lt;tt&gt;osc.&amp;#42;.max_dirty_mb&lt;/tt&gt;?  I&apos;m trying to figure out if there is some kind of overflow in the calculation of the max grant.  There are several patches in this area that could be related:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/32288&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32288&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10990&quot; title=&quot;Get rid of per-osc max_dirty_mb setting&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10990&quot;&gt;&lt;del&gt;LU-10990&lt;/del&gt;&lt;/a&gt; osc: increase default max_dirty_mb to 2G&lt;/tt&gt;&quot; (new in 2.12.0)&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/39380&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39380&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13763&quot; title=&quot;ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13763&quot;&gt;&lt;del&gt;LU-13763&lt;/del&gt;&lt;/a&gt; osc: don&apos;t allow negative grants&lt;/tt&gt;&quot; (new in 2.12.6, but I think it is in use at LLNL)&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://review.whamcloud.com/35896&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35896&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12687&quot; title=&quot;Fast ENOSPC on direct I/O&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12687&quot;&gt;&lt;del&gt;LU-12687&lt;/del&gt;&lt;/a&gt; osc: consume grants for direct I/O&lt;/tt&gt;&quot; (new in 2.12.6, but &lt;em&gt;may&lt;/em&gt; be in use at LLNL?)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;but this is just speculation so far.&lt;/p&gt;

&lt;p&gt;The other question is whether there are any error messages in the client logs like &quot;&lt;tt&gt;dirty NNNN &amp;gt; dirty_max MMMM&lt;/tt&gt;&quot;, or related to grant?&lt;/p&gt;

&lt;p&gt;Some things to try if this is relatively easily reproduced on a client after a fresh mount:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;run &quot;&lt;tt&gt;lctl set_param osc.&amp;#42;.max_dirty_mb=1024&lt;/tt&gt;&quot; to limit the amount of dirty data per OSC to see if this prevents the overflow.  This should still be below &lt;tt&gt;(max_rpcs_in_flight &amp;#42; max_pages_per_rpc)&lt;/tt&gt;, typically &lt;tt&gt;(8 * 16MB = 128MB)&lt;/tt&gt; so would not limit performance.&lt;/li&gt;
&lt;/ul&gt;
</comment>
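The overflow speculation can be illustrated with a toy signed-32-bit conversion; the field width here is an assumption for illustration, not a statement about which counters in the 2.12 code are actually 32-bit:

```python
# If a per-OSC grant ceiling derived from max_dirty_mb were held in a
# signed 32-bit field, the default of 2000 MB sits ~2% below the limit
# and 2048 MB would wrap negative.
def mb_to_signed32(mb):
    b = (mb * 1048576) % 4294967296   # wrap modulo 2**32
    if b > 2147483647:
        b -= 4294967296               # reinterpret as signed
    return b

assert mb_to_signed32(2000) == 2097152000    # still positive
assert mb_to_signed32(2048) == -2147483648   # 2 GiB goes negative
```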
                            <comment id="284745" author="ofaaland" created="Mon, 9 Nov 2020 20:02:40 +0000"  >&lt;p&gt;Andreas, we have (I just show OST0003, but the values are the same for all OSTs and all clients)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_pages_per_rpc=256
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_rpcs_in_flight=8
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_dirty_mb=2000 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;No console log warnings or errors on the clients at all, except for reconnections when we bounced OSTs or MDTs.&lt;/p&gt;

&lt;p&gt;The current patch stack does not include the negative grants / direct IO patches.  It is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;* 27e1e03 (tag: 2.12.5_5.llnl) log lfs setstripe paths to syslog
* f7a3e6e (tag: 2.12.5_4.llnl) LU-13599 mdt: fix mti_big_lmm buffer usage
* 8179deb LU-13599 mdt: fix logic of skipping local locks in reply_state
* 52a48df (tag: 2.12.5_3.llnl) LU-13709 utils: &apos;lfs mkdir -i -1&apos; doesn&apos;t work
* 81479ad LU-13657 kernel: kernel update RHEL8.2 [4.18.0-193.6.3.el8_2]
* 2744b2f LU-13421 kernel: kernel update RHEL8.1 [4.18.0-147.8.1.el8_1]
* 0aaf383 LU-10395 osd: stop OI at device shutdown
* fb8ae25 LU-13766 obdclass: add grant fields to export procfile
* e6ee866 (tag: 2.12.5_2.chaos) LU-13667 ptlrpc: fix endless loop issue
* 049ed85 LU-11623 llite: hash just created files if lock allows
* e1e865f (tag: 2.12.5_1.llnl, tag: 2.12.5_1.chaos) Don&apos;t install lustre init script on systemd systems
* 6eebb51 LLNL build customizations
* 407ac6c TOSS-4431 build: build ldiskfs only for x86_64
* 78d712a (tag: v2_12_5, tag: 2.12.5) New release 2.12.5
* &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="284747" author="ofaaland" created="Mon, 9 Nov 2020 20:11:17 +0000"  >&lt;p&gt;We have several clusters where we see this occur in fairly short order (4-8 hours) after a reboot.  But so far we do not have a procedure for reproducing the issue.&lt;/p&gt;</comment>
                            <comment id="284755" author="charr" created="Mon, 9 Nov 2020 21:00:55 +0000"  >&lt;p&gt;I&apos;m dropping max_dirty_mb to 1024 on Pascal; however, the CZ is relatively clean now after restarting all the OSTs last week (and remounting Lustre on a handful of uncooperative compute nodes), so I don&apos;t think we&apos;d be hitting the issue anytime soon anyway.&lt;/p&gt;

&lt;p&gt;I will also set it on rztopaz which hasn&apos;t had the cleanup.&lt;/p&gt;</comment>
                            <comment id="284982" author="gerrit" created="Wed, 11 Nov 2020 23:09:24 +0000"  >&lt;p&gt;Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/40615&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40615&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 968f580554f5211ea78bb825d8666cf70a17c288&lt;/p&gt;</comment>
                            <comment id="284992" author="dauchy" created="Thu, 12 Nov 2020 00:57:22 +0000"  >&lt;p&gt;As another reference point, we seem to be hitting this issue at NOAA as well, with&#160;2.12.3_ddn44 on the clients and&#160;lustre-2.12.3_ddn31 on the servers.&#160; Unmounting and remounting a client seems to improve performance, at least for a little while.&lt;/p&gt;</comment>
                            <comment id="284993" author="dauchy" created="Thu, 12 Nov 2020 01:04:23 +0000"  >&lt;p&gt;A colleague found a way to force reconnection between client and server while working on a different issue, and I tried it in this case and was able to get cur_grant_bytes to renegotiate without the heavy-handed unmount.&#160; Is this a dangerous thing to do as a workaround?&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@Jet:k11 ~&amp;#93;&lt;/span&gt;# lctl set_param osc.*.grant_shrink=0&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@Jet:k11 ~&amp;#93;&lt;/span&gt;# lctl get_param osc.*.cur_grant_bytes | sort -nk2 -t = | head -n 5&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST0002-osc-ffff8e7aed73c800.cur_grant_bytes=66638&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST0018-osc-ffff8e7aed73c800.cur_grant_bytes=651802&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST0029-osc-ffff8e7aed73c800.cur_grant_bytes=3013787&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST001a-osc-ffff8e7aed73c800.cur_grant_bytes=3146809&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST003b-osc-ffff8e7aed73c800.cur_grant_bytes=3363014&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@Jet:k11 ~&amp;#93;&lt;/span&gt;# lctl device_list | egrep &quot;lfs1-OST00(02|18)&quot;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&#160; 4 UP osc lfs1-OST0018-osc-ffff8e7aed73c800 d39fcae6-1604-7b71-2bba-0ddeb32aa971 4&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&#160;25 UP osc lfs1-OST0002-osc-ffff8e7aed73c800 d39fcae6-1604-7b71-2bba-0ddeb32aa971 4&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@Jet:k11 ~&amp;#93;&lt;/span&gt;# lctl --device 4 deactivate; sleep 1;&#160;lctl --device 4 activate&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@Jet:k11 ~&amp;#93;&lt;/span&gt;# lctl --device 25 deactivate; sleep 1;&#160;lctl --device 25 activate&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@Jet:k11 ~&amp;#93;&lt;/span&gt;# lctl get_param osc.*.cur_grant_bytes | egrep &quot;lfs1-OST00(02|18)&quot;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST0002-osc-ffff8e7aed73c800.cur_grant_bytes=541450240&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;osc.lfs1-OST0018-osc-ffff8e7aed73c800.cur_grant_bytes=542035968&lt;/tt&gt;&lt;/p&gt;</comment>
                            <comment id="285041" author="adilger" created="Thu, 12 Nov 2020 16:00:51 +0000"  >&lt;p&gt;Nathan,&lt;br/&gt;
You can also likely use &quot;&lt;tt&gt;--device Xxxx recover&lt;/tt&gt;&quot; to do the same thing in one step.&lt;/p&gt;

&lt;p&gt;However, it should be noted that disconnecting and reconnecting the device like this could cause in-flight RPCs to that OST to be aborted, so this should only be done on quiescent clients after &quot;&lt;tt&gt;lctl set_param ldlm.namespaces.&amp;#42;.lru_size=clear&lt;/tt&gt;&quot; to flush all dirty data from the client.&lt;/p&gt;

&lt;p&gt;So it is &lt;em&gt;mostly&lt;/em&gt; ok, but not safe to do randomly during the day. &lt;/p&gt;</comment>
                            <comment id="285116" author="scadmin" created="Fri, 13 Nov 2020 08:30:49 +0000"  >&lt;p&gt;Hiya,&lt;/p&gt;

&lt;p&gt;using the small dd test from our &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14124&quot; title=&quot;super slow i/o on client maybe related to low grant&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14124&quot;&gt;&lt;del&gt;LU-14124&lt;/del&gt;&lt;/a&gt;, I&apos;ve been tracking how the issue appears on our compute nodes over time, and setting slow nodes to drain. it&apos;s interesting that after they are drained of jobs, perhaps 1/2 or 2/3 of our compute nodes recover and no longer seem to have any grant or i/o issues. but some nodes still have 1 or more slow OSTs even when there are no user processes left on them.&lt;/p&gt;

&lt;p&gt;before the slow nodes drained of jobs, I used lsof and lfs getstripe to find all files using the slow OSTs from those nodes. in all of these cases there were exe&apos;s (&apos;txt&apos; in lsof) or .so&apos;s (&apos;mem&apos;) using the slow OSTs. sometimes there was a regular file as well, but in most cases the slow OST was only referenced by one exe (eg. 32 references to a 32-way OpenMP exe) or a few .so&apos;s (with eg. 22 refs each).&lt;/p&gt;

&lt;p&gt;(( a caveat that this could be a skewed observation for many reasons - unknown node history prior to going slow / 2 user&apos;s jobs are on a disproportionate number of these nodes  / there can be a lot of .so&apos;s open compared to regular files / ... but it seemed like a pattern. ))&lt;/p&gt;

&lt;p&gt;so could this issue be something to do with mmap rather than regular i/o?&lt;br/&gt;
on the nodes that recovered it looked like these &apos;txt&apos; and &apos;mem&apos; files were holding the grant low until the processes ended. no idea why it was some nodes and not others that recovered... perhaps there is more than one thing going on...&lt;/p&gt;

&lt;p&gt;probably way off-track, but I&apos;m reminded of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13588&quot; title=&quot;sigbus sent to mmap writer that is a long way below quota&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13588&quot;&gt;&lt;del&gt;LU-13588&lt;/del&gt;&lt;/a&gt; which judging from Oleg&apos;s comment was something to do with grant and mmap, and perhaps not entirely to do with quota.&lt;/p&gt;

&lt;p&gt;also FYI max_dirty_mb=1000 doesn&apos;t seem to work. we set that on 2 login nodes after a reboot, and one was broken again within a day or so.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="285121" author="adilger" created="Fri, 13 Nov 2020 08:56:47 +0000"  >&lt;p&gt;I was looking at where the &lt;tt&gt;cl_lost_grant&lt;/tt&gt; might be coming from.  According to a comment in the code:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
/**
 * Free grant after IO is finished or canceled.
 *
 * @lost_grant is used to remember how many grants we have allocated but not
 * used, we should &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; these grants to OST. There&apos;re two cases where grants
 * can be lost:
 * 1. truncate;
 * 2. blocksize at OST is less than PAGE_SIZE and a partial page was
 *    written. In &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;case&lt;/span&gt; OST may use less chunks to serve &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; partial
 *    write. OSTs don&apos;t actually know the page size on the client side. so
 *    clients have to calculate lost grant by the blocksize on the OST.
 *    See filter_grant_check() &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; details.
 */
&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; void osc_free_grant(struct client_obd *cli, unsigned &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; nr_pages,
                           unsigned &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; lost_grant, unsigned &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; dirty_grant)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;so it is possible that the &lt;tt&gt;cl_lost_grant&lt;/tt&gt; overflow problem from Olaf&apos;s patch &lt;a href=&quot;https://review.whamcloud.com/40615&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40615&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;/tt&gt;&quot; happens more frequently for ZFS OSTs.  I can&apos;t imagine that this would be the case for ldiskfs OSTs.&lt;/p&gt;

&lt;p&gt;There definitely still seems to be some kind of grant accounting error.  Even on my idle single-client test system, the client &quot;&lt;tt&gt;osc.&amp;#42;.cur_grant_bytes&lt;/tt&gt;&quot; (about 8MB) does not match what the OST &quot;&lt;tt&gt;obdfilter.&amp;#42;.tot_granted&lt;/tt&gt;&quot; thinks it has granted (about 256KB).&lt;/p&gt;</comment>
                            <comment id="285279" author="gerrit" created="Mon, 16 Nov 2020 23:04:37 +0000"  >&lt;p&gt;Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/40659&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40659&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d80c9e075c0422994865435b35929f9af30476a7&lt;/p&gt;</comment>
                            <comment id="285293" author="ofaaland" created="Tue, 17 Nov 2020 02:49:20 +0000"  >&lt;p&gt;Our Lustre 2.12 machines all got updates over the last week, to a new Lustre version that has the two grant patches.  Everything started out without the symptom because of the client + server reboots.  We are still waiting for the issue to reappear.&lt;/p&gt;</comment>
                            <comment id="286062" author="gerrit" created="Thu, 26 Nov 2020 09:25:48 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/40563/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40563/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; obdclass: add grant fields to export procfile&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 53ee416097a9a77ca0ee352714af02e77489e3f8&lt;/p&gt;</comment>
                            <comment id="286076" author="gerrit" created="Thu, 26 Nov 2020 09:27:11 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/40659/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40659/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 82e9a11056a55289c880786da71d8b1125f357b2&lt;/p&gt;</comment>
                            <comment id="286092" author="pjones" created="Thu, 26 Nov 2020 14:41:54 +0000"  >&lt;p&gt;The patches have landed for 2.14. Let&apos;s reopen if the problem still reappears with them in place.&lt;/p&gt;</comment>
                            <comment id="286225" author="adilger" created="Sun, 29 Nov 2020 17:57:39 +0000"  >&lt;p&gt;Olaf, any news on the client grant front?&lt;/p&gt;</comment>
                            <comment id="286226" author="dauchy" created="Sun, 29 Nov 2020 23:57:09 +0000"  >&lt;p&gt;Are the patches to prevent overflow of o_dropped and add debugging information available for 2.12.x?&#160; We have a maintenance window coming up, during which we are planning to upgrade clients to 2.12.5+ anyway.&lt;/p&gt;</comment>
                            <comment id="286232" author="adilger" created="Mon, 30 Nov 2020 07:12:19 +0000"  >&lt;p&gt;There is patch &lt;a href=&quot;https://review.whamcloud.com/40615&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40615&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="286294" author="ofaaland" created="Mon, 30 Nov 2020 17:49:31 +0000"  >&lt;p&gt;Andreas,&lt;br/&gt;
No news.  We haven&apos;t seen the issue now in about two weeks, but we believe that may be due to the system updates which resulted in everything getting a clean boot.&lt;/p&gt;</comment>
                            <comment id="290273" author="ofaaland" created="Mon, 25 Jan 2021 16:10:36 +0000"  >&lt;p&gt;Since my last update in November our production machines for the most part have been running Lustre 2.12.5_10.llnl, which has these grant-related patches landed post 2.12.5:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;663e688 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12687&quot; title=&quot;Fast ENOSPC on direct I/O&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12687&quot;&gt;&lt;del&gt;LU-12687&lt;/del&gt;&lt;/a&gt; osc: consume grants for direct I/O&lt;/li&gt;
	&lt;li&gt;ef65452 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13763&quot; title=&quot;ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13763&quot;&gt;&lt;del&gt;LU-13763&lt;/del&gt;&lt;/a&gt; osc: don&apos;t allow negative grants&lt;/li&gt;
	&lt;li&gt;fb8ae25 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13766&quot; title=&quot;tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13766&quot;&gt;&lt;del&gt;LU-13766&lt;/del&gt;&lt;/a&gt; obdclass: add grant fields to export procfile&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;With that patch set, disabling grant_shrink and then umounting and remounting the OSTs not only gets cur_grant_bytes and tot_granted back in sync but also prevents grant starvation.&lt;/p&gt;

&lt;p&gt;Our next update coming up in a few weeks will be running 2.12.6_3.llnl which also has:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;8d78d2e &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;After that update we will re-enable grant_shrink and see if the problem recurs.&lt;/p&gt;</comment>
                            <comment id="291267" author="spitzcor" created="Thu, 4 Feb 2021 21:03:21 +0000"  >&lt;p&gt;&amp;gt; fb8ae25 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13766&quot; title=&quot;tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13766&quot;&gt;&lt;del&gt;LU-13766&lt;/del&gt;&lt;/a&gt; obdclass: add grant fields to export procfile&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt;, can you say more about that one?  I don&apos;t see it posted to gerrit yet.&lt;/p&gt;</comment>
                            <comment id="291277" author="ofaaland" created="Thu, 4 Feb 2021 22:16:43 +0000"  >&lt;p&gt;Hi Cory, &lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/39324/5&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/39324/5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;that patch has been on our 2.12 stack at LLNL for several weeks and hasn&apos;t caused us any trouble.  I&apos;m not sure why it hasn&apos;t landed to b2_12.  I added gerrit gatekeeper as a reviewer and posted a query to that effect.&lt;/p&gt;</comment>
                            <comment id="291322" author="adilger" created="Fri, 5 Feb 2021 07:55:26 +0000"  >&lt;p&gt;The landings to b2_12 have been paused while we focus efforts on getting 2.14.0 out the door.  There are a number of patches queued up for 2.12.7 once Oleg has cycles to run them through his test rig.&lt;/p&gt;</comment>
                            <comment id="291804" author="spitzcor" created="Thu, 11 Feb 2021 21:45:59 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt;, ah, thanks for pointing it out.  That patch is from &lt;b&gt;this&lt;/b&gt; ticket and not directly from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13766&quot; title=&quot;tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13766&quot;&gt;&lt;del&gt;LU-13766&lt;/del&gt;&lt;/a&gt;.  I just didn&apos;t see a patch in gerrit for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13766&quot; title=&quot;tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13766&quot;&gt;&lt;del&gt;LU-13766&lt;/del&gt;&lt;/a&gt; (&lt;a href=&quot;https://review.whamcloud.com/#/q/LU-13766&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/q/LU-13766&lt;/a&gt;) and thought I was missing something.  Thanks for letting us know!&lt;/p&gt;</comment>
                            <comment id="292700" author="ofaaland" created="Mon, 22 Feb 2021 23:00:20 +0000"  >&lt;p&gt;My update from Jan 25 said:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Our next update coming up in a few weeks will be running 2.12.6_3.llnl which also has:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;* 8d78d2e &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;After that update we will re-enable grant_shrink and see if the problem recurs.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Our updates are delayed so it will be another week or two before we can re-enable grant_shrink and then another couple of weeks before we see if the symptoms reappear.&lt;/p&gt;</comment>
                            <comment id="293095" author="adilger" created="Thu, 25 Feb 2021 22:30:32 +0000"  >&lt;p&gt;I&apos;m reopening this, because I don&apos;t think it is clear the problem has been fixed.  I believe it was just closed because the two &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; patches were landed for 2.14.0.&lt;/p&gt;</comment>
                            <comment id="293098" author="adilger" created="Thu, 25 Feb 2021 22:47:31 +0000"  >&lt;p&gt;Another site reported that the &lt;tt&gt;o_dropped&lt;/tt&gt; patch alone does not seem to have (fully?) resolved the problem, since they are still seeing clients with low grant.  I&apos;m still trying to find out whether the site is running with &lt;tt&gt;grant_shrink=1&lt;/tt&gt; or not.&lt;/p&gt;

&lt;p&gt;One theory I had is that something in grant shrink causes incremental loss of grant because it uses non-PAGE_SIZE grant amounts. I notice that the clients and servers have values that are not even PAGE_SIZE multiples, so maybe there is some kind of rounding problem between the client and server?&lt;/p&gt;</comment>
                            <comment id="293915" author="gerrit" created="Thu, 4 Mar 2021 08:36:10 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/39324/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39324/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; obdclass: add grant fields to export procfile&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 7a354e82d99d57103ed52cb7872cd64090b43383&lt;/p&gt;</comment>
                            <comment id="294133" author="spitzcor" created="Fri, 5 Mar 2021 22:47:58 +0000"  >&lt;p&gt;FWIW, we have a couple of reports.  One customer running 2.12.6 saw trouble with performance.  One set of nodes was set to grant_shrink=0, which mitigated the problem.  Another customer running ~2.12.4 saw their grant problems disappear with this patch and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11409&quot; title=&quot;only first 100 OSC can shrink grants&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11409&quot;&gt;&lt;del&gt;LU-11409&lt;/del&gt;&lt;/a&gt;.  Related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11409&quot; title=&quot;only first 100 OSC can shrink grants&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11409&quot;&gt;&lt;del&gt;LU-11409&lt;/del&gt;&lt;/a&gt;: &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=vsaveliev&quot; class=&quot;user-hover&quot; rel=&quot;vsaveliev&quot;&gt;vsaveliev&lt;/a&gt; has spotted that tgt_grant_sanity_check() is a no-op and the grant check gets turned off with more than 100 exports.&lt;/p&gt;</comment>
                            <comment id="294186" author="adilger" created="Sun, 7 Mar 2021 20:09:08 +0000"  >&lt;p&gt;Cory, thanks for the update. You may be conflating two issues here. &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11409&quot; title=&quot;only first 100 OSC can shrink grants&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11409&quot;&gt;&lt;del&gt;LU-11409&lt;/del&gt;&lt;/a&gt; is applicable to clients connecting to more than 100 OSTs. It would be useful to know from the reporters on this ticket if that applies to the systems where this problem is being seen.&lt;/p&gt;

&lt;p&gt;Separately, &lt;tt&gt;tgt_grant_sanity_check()&lt;/tt&gt; is a server-side verification of the grants given to the clients vs. the total granted counters for the target at disconnect time, which is disabled on systems with more than 100 connected clients because it adds significant overhead at that point (O(n^2) with the number of connected clients).  However, that check is not being triggered in this case (AFAIK) because it isn&apos;t a problem with the per-export vs. global counters on the OST, but a disconnect between what the client is counting and what the server is counting. &lt;/p&gt;</comment>
                            <comment id="294406" author="ofaaland" created="Tue, 9 Mar 2021 17:11:27 +0000"  >&lt;blockquote&gt;&lt;p&gt;Cory, thanks for the update. You may be conflating two issues here.&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11409&quot; title=&quot;only first 100 OSC can shrink grants&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11409&quot;&gt;&lt;del&gt;LU-11409&lt;/del&gt;&lt;/a&gt;&#160;is applicable to clients connecting to more than 100 OSTs. It would be useful to know from the reporters on this ticket if that applies to the systems where this problem is being seen.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;At LLNL the clients where we&apos;re seeing this problem do connect to more than 100 OSTs (across 3 file systems).&lt;/p&gt;</comment>
                            <comment id="294412" author="adilger" created="Tue, 9 Mar 2021 17:53:18 +0000"  >&lt;p&gt;Olaf, the backported patch &lt;a href=&quot;https://review.whamcloud.com/40564&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40564&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11409&quot; title=&quot;only first 100 OSC can shrink grants&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11409&quot;&gt;&lt;del&gt;LU-11409&lt;/del&gt;&lt;/a&gt; osc: grant shrink shouldn&apos;t account skipped OSC&lt;/tt&gt;&quot; just landed to b2_12 but is not in any tag yet.  This is a very simple client-only patch, so could be added to your clients relatively easily to see if it solves the problem. &lt;/p&gt;</comment>
                            <comment id="295280" author="gerrit" created="Wed, 17 Mar 2021 23:21:20 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/40615/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40615/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14125&quot; title=&quot;client starved for grant but OST has plenty of free space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14125&quot;&gt;LU-14125&lt;/a&gt; osc: prevent overflow of o_dropped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 1da8349175a136df0aadb28ae0e0f64ac0385961&lt;/p&gt;</comment>
                            <comment id="297166" author="adilger" created="Tue, 30 Mar 2021 03:18:59 +0000"  >&lt;p&gt;There is a new patch &lt;a href=&quot;https://review.whamcloud.com/42129&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/42129&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14543&quot; title=&quot;tgt_grant_discard(): avoid  tgd-&amp;gt;tgd_tot_granted overflowing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14543&quot;&gt;&lt;del&gt;LU-14543&lt;/del&gt;&lt;/a&gt; target: prevent overflowing of tgd-&amp;gt;tgd_tot_granted&lt;/tt&gt;&quot; that may be of interest here.  I&apos;m not 100% sure it is related, since it involves an &lt;b&gt;underflow&lt;/b&gt; of &lt;tt&gt;tot_grant&lt;/tt&gt; and/or &lt;tt&gt;tot_dirty&lt;/tt&gt; AFAICS, so if that happened it would likely affect all clients, but worthwhile to mention it here.&lt;/p&gt;</comment>
                            <comment id="299865" author="ofaaland" created="Tue, 27 Apr 2021 23:25:48 +0000"  >&lt;p&gt;We&apos;ve seen the issue at LLNL again:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;with 40615 &quot;prevent overflow of o_dropped&quot;&lt;/li&gt;
	&lt;li&gt;without 40564 &quot;grant shrink shouldn&apos;t account skipped OSC&quot;&lt;/li&gt;
	&lt;li&gt;without 42129 &quot;prevent overflowing of tgd-&amp;gt;tgd_tot_granted&quot;&lt;/li&gt;
	&lt;li&gt;with grant_shrink enabled&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The same clients mounted another file system which had grant_shrink &lt;em&gt;disabled&lt;/em&gt; and those OSCs did &lt;em&gt;not&lt;/em&gt; encounter the issue.&lt;/p&gt;

&lt;p&gt;Our clients will get 40564 &quot;grant shrink shouldn&apos;t account skipped OSC&quot; in the next few weeks, but it typically takes weeks for the issue to become easily detectable.&lt;/p&gt;

</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="61569">LU-14124</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="53371">LU-11409</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="59896">LU-13766</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="65465">LU-14901</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="46830">LU-9704</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="63449">LU-14543</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01emn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>