<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:25:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9372] OOM happens on OSS during Lustre recovery for more than 5000 clients</title>
                <link>https://jira.whamcloud.com/browse/LU-9372</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I have been on-site to work with Bruno Travouillon (Atos) on one of the crash-dumps they have.&lt;/p&gt;

&lt;p&gt;After joint analysis, it looks like a huge part of memory is being consumed by &quot;ptlrpc_request_buffer_desc&quot; objects (17KB each due to the embedded req, and allocated from 32KB Slabs, which nearly doubles the footprint as a side effect!).&lt;/p&gt;

&lt;p&gt;Looking at the relevant source code, it appears that these &quot;ptlrpc_request_buffer_desc&quot; objects can be allocated on demand by ptlrpc_check_rqbd_pool(), but are never freed until OST umount/stop by ptlrpc_service_purge_all().&lt;/p&gt;

&lt;p&gt;This problem has caused several OSS failovers to fail due to OOM.&lt;/p&gt;</description>
                <environment>Server running with b2_7_fe&lt;br/&gt;
Clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)</environment>
        <key id="45601">LU-9372</key>
            <summary>OOM happens on OSS during Lustre recovery for more than 5000 clients</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="bfaccini">Bruno Faccini</reporter>
                        <labels>
                            <label>cea</label>
                    </labels>
                <created>Thu, 20 Apr 2017 09:55:17 +0000</created>
                <updated>Mon, 8 Jun 2020 17:48:46 +0000</updated>
                            <resolved>Wed, 31 Jan 2018 13:49:42 +0000</resolved>
                                                    <fixVersion>Lustre 2.11.0</fixVersion>
                    <fixVersion>Lustre 2.10.6</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="192793" author="bfaccini" created="Thu, 20 Apr 2017 10:06:28 +0000"  >&lt;p&gt;According to the relevant source code (and its comments), one possible option would be to run with &quot;test_req_buffer_pressure=1&quot; as a ptlrpc module parameter, to avoid the dynamic allocation of new &quot;ptlrpc_request_buffer_desc&quot; objects, but this will need to be tested carefully.&lt;/p&gt;

&lt;p&gt;On the other hand, it looks like the &quot;history&quot; code in ptlrpc_server_drop_request() could be changed a little in order to progressively free &quot;ptlrpc_request_buffer_desc&quot; objects.&lt;/p&gt;</comment>
                            <comment id="192795" author="gerrit" created="Thu, 20 Apr 2017 10:16:30 +0000"  >&lt;p&gt;Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/26752&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26752&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: drain &quot;ptlrpc_request_buffer_desc&quot; objects&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2d0dff4f5d2dd25ca55f5401c1139519207a0a02&lt;/p&gt;</comment>
                            <comment id="193841" author="bruno.travouillon" created="Thu, 27 Apr 2017 21:48:49 +0000"  >&lt;p&gt;To complete the analysis, there are 958249 size-32768 slabs in the vmcore, and&lt;br/&gt;
 1068498 size-1024 slabs (ptlrpc_request_buffer_desc are kmalloc of size-1024)&lt;/p&gt;

&lt;p&gt;There are 4 instances of ptlrpc_service_part which have a lot of rqbds:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809cf4dd800
  scp_nrqbds_total = 98342
crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809dbe9ec00
  scp_nrqbds_total = 302031
crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc14000
  scp_nrqbds_total = 272040
crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_total 0xffff8809ddc1c400
  scp_nrqbds_total = 285039


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Most of the other ptlrpc_service_part instances have scp_nrqbds_total &amp;lt;= 64.&lt;/p&gt;

&lt;p&gt;For these 4 instances, the rqbds are in the scp_rqbd_posted list, while &lt;br/&gt;
 scp_nrqbds_posted is quite low:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809cf4dd800
  scp_nrqbds_posted = 12
crash&amp;gt; struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809cf4dd800
  scp_rqbd_posted = {
    next = 0xffff8809e0758800,
    prev = 0xffff8809db055800
  }
crash&amp;gt; list 0xffff8809e0758800|wc -l
98343

crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809dbe9ec00
  scp_nrqbds_posted = 191
crash&amp;gt; struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809dbe9ec00
  scp_rqbd_posted = {
    next = 0xffff8809ed5b7400,
    prev = 0xffff8809cf4d1000
  }
crash&amp;gt; list 0xffff8809ed5b7400|wc -l
302032

crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc14000
  scp_nrqbds_posted = 1
crash&amp;gt; struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc14000
  scp_rqbd_posted = {
    next = 0xffff8809ec199400,
    prev = 0xffff8809dc6e7800
  }
crash&amp;gt; list 0xffff8809ec199400|wc -l
272041

crash&amp;gt; struct ptlrpc_service_part.scp_nrqbds_posted 0xffff8809ddc1c400
  scp_nrqbds_posted = 0
crash&amp;gt; struct ptlrpc_service_part.scp_rqbd_posted 0xffff8809ddc1c400
  scp_rqbd_posted = {
    next = 0xffff8809e4880800,
    prev = 0xffff88097c4ddc00
  }
crash&amp;gt; list 0xffff8809e4880800|wc -l
285040

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In request_in_callback(), svcpt-&amp;gt;scp_nrqbds_posted is decreased if ev-&amp;gt;unlinked, but it looks like no rqbd is removed from the scp_rqbd_posted list. I have to admit that I don&apos;t clearly understand whether that is normal or not...&lt;/p&gt;</comment>
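The figures above can be cross-checked with a short back-of-envelope sketch (an illustration derived from the numbers quoted in this comment, not part of the original ticket): summing scp_nrqbds_total over the four busy service parts lands within 0.1% of the size-32768 object count reported from the vmcore, and puts the rqbd memory footprint around 30 GiB.

```python
# Cross-check of the crash analysis above (figures copied from this comment).
# Each ptlrpc_request_buffer_desc has a ~17KB request buffer, served from a
# 32KB (size-32768) slab object, plus a size-1024 descriptor allocation.
rqbds_per_service_part = {
    "0xffff8809cf4dd800": 98342,
    "0xffff8809dbe9ec00": 302031,
    "0xffff8809ddc14000": 272040,
    "0xffff8809ddc1c400": 285039,
}
total_rqbds = sum(rqbds_per_service_part.values())
print(total_rqbds)                  # 957452, vs 958249 size-32768 objects in the vmcore

buffer_bytes = total_rqbds * 32768  # one 32KB buffer per rqbd
desc_bytes = total_rqbds * 1024     # one size-1024 descriptor per rqbd
print(round((buffer_bytes + desc_bytes) / 2**30, 1))  # ~30.1 GiB consumed
```

The near-exact match between the rqbd total and the size-32768 object count supports the conclusion that the rqbds dominate the OSS memory consumption.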
                            <comment id="193851" author="bfaccini" created="Fri, 28 Apr 2017 00:31:57 +0000"  >&lt;p&gt;Bruno,&lt;br/&gt;
Thanks for this clarification, which makes me rethink my patch: before I had access to this necessary information, I had assumed that all these rqbds, allocated on demand and now in excess because they are never freed, would be linked on the scp_rqbd_idle/scp_hist_rqbds lists.&lt;/p&gt;

&lt;p&gt;I will now try to get a new patch version available asap.&lt;/p&gt;</comment>
                            <comment id="194125" author="bfaccini" created="Tue, 2 May 2017 09:02:56 +0000"  >&lt;p&gt;Bruno,&lt;br/&gt;
Since you have told me that you have some OSSs in this situation, could you also check these same counters/lists for me (in fact, all of the ptlrpc_service_part counters/lists values/entry-counts would be nice to have) on a live system where rqbds may have been massively allocated during a similar recovery storm but which fortunately avoided the OOM?&lt;/p&gt;</comment>
                            <comment id="195617" author="gerrit" created="Fri, 12 May 2017 05:06:26 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/26752/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26752/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: drain &quot;ptlrpc_request_buffer_desc&quot; objects&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 220124bff7b13cd26b1b7b81ecf46e137ac174d3&lt;/p&gt;</comment>
                            <comment id="195649" author="pjones" created="Fri, 12 May 2017 12:32:06 +0000"  >&lt;p&gt;Landed for 2.10&lt;/p&gt;</comment>
                            <comment id="199181" author="bruno.travouillon" created="Wed, 14 Jun 2017 08:50:35 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;The patch has been backported into the CEA 2.7 branch.&lt;/p&gt;

&lt;p&gt;FYI, we have been able to successfully recover 5000+ clients this morning. Thank you!&lt;/p&gt;</comment>
                            <comment id="206121" author="bruno.travouillon" created="Wed, 23 Aug 2017 14:08:59 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;For the record, we hit a similar occurrence after the upgrade of our Lustre servers to CentOS 7. The OSS panicked on OOM during the recovery process (20 OSTs per OSS).&lt;/p&gt;

&lt;p&gt;We have been able to start the filesystem by mounting successive subsets of 5 OSTs on each OSS. With a few OSTs, the memory consumption increases more slowly, and thanks to your patch &lt;a href=&quot;https://review.whamcloud.com/#/c/26752/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/26752/&lt;/a&gt;, the memory was freed after the completion of the recovery.&lt;/p&gt;

&lt;p&gt;Unfortunately, I don&apos;t have any vmcore available.&lt;/p&gt;</comment>
                            <comment id="206225" author="bfaccini" created="Thu, 24 Aug 2017 08:02:56 +0000"  >&lt;p&gt;Hello Bruno,&lt;br/&gt;
Well, too bad... both about the new occurrence with the patch and about no crash-dump being available!&lt;br/&gt;
But can you at least provide the syslog/dmesg/console log for one of the crashed OSSs?&lt;br/&gt;
And also, do I understand correctly that the last successful failover/recovery with the patch occurred with CentOS 6.x and not CentOS 7?&lt;/p&gt;
</comment>
                            <comment id="207799" author="bfaccini" created="Thu, 7 Sep 2017 17:52:23 +0000"  >&lt;p&gt;Hello Bruno,&lt;br/&gt;
Is &quot;smurf623.log-20170709&quot; the log of one of the crashed OSSs from the last occurrence of the problem, running with CentOS and my patch?&lt;/p&gt;

&lt;p&gt;By the way, the fact that your Servers are now running CentOS7 may have some implications since, in terms of Kernel memory allocation, SLUBs are now used by the Kernel instead of the SLABs used by the Kernels shipped for CentOS6.&lt;/p&gt;

&lt;p&gt;Anyway, I am investigating new possible ways to avoid OOM during Server failover+recovery, such as setting a hard limit on the volume/number of ptlrpc_request_buffer_desc+buffer allocations, better understanding the reason for the 17K size of the most commonly used buffers, and/or maybe creating a specific kmem_cache for this purpose to limit wasted space...&lt;/p&gt;</comment>
                            <comment id="207823" author="bruno.travouillon" created="Thu, 7 Sep 2017 20:41:16 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;This OSS smurf623 has been upgraded to CentOS 7, Lustre 2.7.3 with patch &lt;a href=&quot;https://review.whamcloud.com/#/c/26752/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;26752&lt;/a&gt; on July 5th. Because of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8685&quot; title=&quot;Fix JBD2 issue in EL7 Kernels&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8685&quot;&gt;&lt;del&gt;LU-8685&lt;/del&gt;&lt;/a&gt;, we had to upgrade the kernel the next day, on July 6th, while 5201 clients had the filesystem mounted. We then hit the OOM issue while mounting the 20 OSTs on the OSS (between 14:38:48 and 15:03:13).&lt;/p&gt;</comment>
                            <comment id="207825" author="bruno.travouillon" created="Thu, 7 Sep 2017 20:53:14 +0000"  >&lt;p&gt;About the 17K size, look at OST_MAXREQSIZE and&#160;OST_BUFSIZE into lustre/include/lustre_net.h.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;#define OST_MAXREQSIZE (16 * 1024)&lt;br/&gt;
/** OST_BUFSIZE = max_reqsize + max sptlrpc payload size */&lt;br/&gt;
#define OST_BUFSIZE max_t(int, OST_MAXREQSIZE + 1024, 16 * 1024)&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;&#160;&lt;/p&gt;</comment>
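The quoted defines explain the waste: a short sketch can illustrate it (an assumed model of kmalloc power-of-two size classes, not Lustre code; the `kmalloc_bucket` helper is hypothetical). OST_BUFSIZE works out to 17KB, which a power-of-two allocator must round up to a 32KB (size-32768) slab object, wasting nearly half of each object.

```python
# Assumed model of kmalloc power-of-two size classes; the constants
# mirror the lustre_net.h excerpt quoted above.
OST_MAXREQSIZE = 16 * 1024                           # 16KB
OST_BUFSIZE = max(OST_MAXREQSIZE + 1024, 16 * 1024)  # max_reqsize + sptlrpc payload = 17KB

def kmalloc_bucket(size):
    """Smallest power-of-two slab size class that can hold `size` bytes (hypothetical helper)."""
    bucket = 32
    while size > bucket:
        bucket *= 2
    return bucket

bucket = kmalloc_bucket(OST_BUFSIZE)
print(OST_BUFSIZE, bucket)                 # 17408 32768
print(round(1 - OST_BUFSIZE / bucket, 2))  # 0.47 -> ~47% of each slab object wasted
```

This is why later comments discuss shrinking the buffer to 16k or using a dedicated kmem_cache: either would roughly halve the per-rqbd memory footprint.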
                            <comment id="207832" author="bfaccini" created="Thu, 7 Sep 2017 22:25:24 +0000"  >&lt;p&gt;But of course!!!! Well, I must have been really tired today, since I was not able to find the forced crash in the log and I was looking for MDS request sizes!!...&lt;br/&gt;
Thanks Bruno, I will try to push another patch soon, based on the thoughts in my earlier comment.&lt;/p&gt;</comment>
                            <comment id="207858" author="bfaccini" created="Fri, 8 Sep 2017 09:26:29 +0000"  >&lt;p&gt;OK, now that I have had a better look at the OSS log you provided, I am concerned about the multiple MMP errors encountered for the OSTs: they tend to indicate that ownership of some of the OSTs was transferred to another/HA OSS during the restart+recovery process, which may have induced some &quot;anarchy&quot; in the recovery process...&lt;/p&gt;

&lt;p&gt;Also, the OOM stats print shows only one NUMA node; is that the case? And this node&apos;s SLAB content does not look so excessive relative to the memory size (&quot;Node 0 Normal free:64984kB ... present:91205632kB managed:89727696kB ... slab_reclaimable:1356412kB slab_unreclaimable:6092980kB ...&quot;), so do you confirm that you could still see the &quot;size-1024&quot; and &quot;size-32768&quot; Slabs growing to billions of objects???&lt;/p&gt;</comment>
                            <comment id="208349" author="bruno.travouillon" created="Thu, 14 Sep 2017 12:37:53 +0000"  >&lt;p&gt;We use pacemaker for HA. When the first OSS crashed, the target resources failover to the partner OSS, which explains the MMP messages.&lt;/p&gt;

&lt;p&gt;These OSS are KVM guests running on top of a DDN SFA14KX-E controller. Indeed, there is only one NUMA node in an OSS.&lt;/p&gt;

&lt;p&gt;Unfortunately, I can&apos;t confirm the slabs were size-1024 and size-32768. As far as I remember they were, but I can&apos;t assert it... my bad.&lt;/p&gt;

&lt;p&gt;I will provide an action plan to capture some relevant data during the next occurrence.&lt;/p&gt;</comment>
                            <comment id="208704" author="gerrit" created="Mon, 18 Sep 2017 23:29:36 +0000"  >&lt;p&gt;Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/29064&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29064&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: allow to limit number of service&apos;s rqbds&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 78d7f8f02ba35bbcd84dd44f103ed1326d94c30c&lt;/p&gt;</comment>
                            <comment id="208706" author="bfaccini" created="Mon, 18 Sep 2017 23:36:51 +0000"  >&lt;p&gt;It looks like my first change #26752 (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: drain &quot;ptlrpc_request_buffer_desc&quot; objects) may not be efficient enough to drain rqbds during a long period of heavy request load from a huge number of Clients, and thus may still allow an OOM...&lt;br/&gt;
So I have just pushed #29064 (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: allow to limit number of service&apos;s rqbds) to allow setting a limit on the max number of rqbds per service.&lt;/p&gt;</comment>
                            <comment id="208878" author="bfaccini" created="Wed, 20 Sep 2017 06:23:06 +0000"  >&lt;p&gt;J-B, it would be great if we could give this new/2nd patch a try on a significant peacock setup (let&apos;s say with hundreds of clients, or thousands?), by forcing an OSS/OSTs failover+recovery and monitoring kmem consumption, to verify that the new limitation mechanism works as expected and has no side effects.&lt;/p&gt;</comment>
                            <comment id="209155" author="adilger" created="Thu, 21 Sep 2017 23:28:58 +0000"  >&lt;p&gt;It isn&apos;t clear what the benefit of a tunable to limit the number of RQBDs is, if it is off by default, since most users will not even know it exists.  Even if users do know this tunable exists, there probably isn&apos;t an easy way to know a good default value for the maximum number of rqbd buffers, since that depends greatly based on RAM size on the server, the number of clients, and the client load.&lt;/p&gt;

&lt;p&gt;Looking at older comments here, there are several things that concern me:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;the total number of RQBDs allocated seem far more than could possibly ever be used, since clients should typically only have at most 8 RPCs in flight per OST&lt;/li&gt;
	&lt;li&gt;during recovery, clients should normally only have a single RPC in flight per OST&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This means there shouldn&apos;t be more than about 5000 clients * 20 OSTs/OSS = 100000 RPCs/OSS outstanding on the OSS (maybe 200000 RPCs if you have 40 OSTs/OSS in failover mode, or is 10 OSTs/OSS the normal config and 20 OSTs/OSS is the failover config?).&lt;/p&gt;

&lt;p&gt;Are we sure that clients are not generating a flood of RPCs per OST during recovery (more than &lt;tt&gt;max_rpcs_in_flight&lt;/tt&gt;)?  I also recall there may be a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed if there are many threads trying to send an RPC at the same time and the buffers run out.&lt;/p&gt;

&lt;p&gt;On a related note, having so many OSTs on a single OSS is not really a good configuration, since it doesn&apos;t provide very good performance, and as you see it has a very high RAM requirement, and also causes a larger point of failure if the OSS goes down.  In addition to the number of outstanding RPCs that the clients may send, there is also a bunch of RAM used by the ldiskfs journals.&lt;/p&gt;

&lt;p&gt;Also, if you are seeing messages like the following in your logs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LDISKFS-fs warning (device sfa0049): ldiskfs_multi_mount_protect:331: Device is already active on another node
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;then there is something significantly wrong with your HA or STONITH configuration. MMP is meant as a backup sanity check that prevents double mounts and the resulting filesystem corruption (which it did successfully for 20 OSTs in this case), but it is not intended to be the primary HA exclusion method for the storage.&lt;/p&gt;</comment>
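The bound argued above can be made concrete with a quick sketch (figures taken from the comments in this ticket; a rough illustration, not a measurement):

```python
# Upper bound on outstanding RPCs per OSS, following the reasoning above.
clients = 5000
osts_per_oss = 20        # up to 40 if all failover targets land on one OSS
rpcs_in_flight = 1       # during recovery; normally up to 8 per OST

expected_rpcs = clients * osts_per_oss * rpcs_in_flight
print(expected_rpcs)     # 100000 RPCs/OSS

# Yet the earlier vmcore analysis showed ~957452 rqbds across the busy
# service parts, far beyond any plausible request load:
observed_rqbds = 957452
print(round(observed_rqbds / expected_rpcs, 1))  # ~9.6x the bound
```

The nearly tenfold gap between buffers allocated and RPCs that clients could plausibly have outstanding is what suggests either a flood of RPCs beyond max_rpcs_in_flight or a race in RQBD allocation.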
                            <comment id="210168" author="bfaccini" created="Tue, 3 Oct 2017 10:34:53 +0000"  >&lt;p&gt;Andreas, I can&apos;t answer your Lustre RPC and HA config/behavior concerns, but I am sure Bruno will do so soon. You may be right that something should be done to prevent a peer from trying to mount all targets upon restart/reboot.&lt;/p&gt;

&lt;p&gt;Concerning the memory consumption, I can confirm that the huge number of size-1024 and size-32768 objects was very close to the current number of RQBDs (sum of scp_nrqbds_total).&lt;/p&gt;

&lt;p&gt;About your comment concerning &quot;a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed ...&quot;, I presume you refer to the checks/code in the ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() routines, where the case of a lot of RQBDs+buffers already being handled may occur. And this is where my 2 patches try to limit the allocations.&lt;/p&gt;

&lt;p&gt;Lastly, do you mean that I should add some auto-tuning code to my patch (based on memory size/load, but also the number of targets? only while failover/recovery is running? ...) in addition to the current, manual-only way of setting a limit?&lt;/p&gt;</comment>
                            <comment id="210276" author="adilger" created="Wed, 4 Oct 2017 08:40:22 +0000"  >&lt;p&gt;Bruno, my main concern is that a static tunable will not avoid a similar problem for most users.&lt;/p&gt;</comment>
                            <comment id="214422" author="bfaccini" created="Wed, 22 Nov 2017 10:11:52 +0000"  >&lt;p&gt;I was just talking about this problem, and I have found that I had never clearly indicated in this ticket that the reason for this 32k alloc for each ptlrpc_rqbd (for a real size of 17k) is the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4755&quot; title=&quot;ASSERTION( req-&amp;gt;rq_reqbuf_len &amp;gt;= msgsize ) failed when using 4MB RPC&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4755&quot;&gt;&lt;del&gt;LU-4755&lt;/del&gt;&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4755&quot; title=&quot;ASSERTION( req-&amp;gt;rq_reqbuf_len &amp;gt;= msgsize ) failed when using 4MB RPC&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4755&quot;&gt;&lt;del&gt;LU-4755&lt;/del&gt;&lt;/a&gt; ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC&quot;).&lt;/p&gt;

&lt;p&gt;Since the way this size was identified looks a bit empirical, we may also want to try 15k (+ payload size, thus leading to 16k) in order to halve the real consumed size. &lt;br/&gt;
To achieve almost the same size reduction, we could also use a specific kmem_cache/Slab for ptlrpc_rqbd/17k, keeping in mind that this may be rendered useless by Kernel merging of Slabs.&lt;/p&gt;</comment>
                            <comment id="216459" author="adilger" created="Sat, 16 Dec 2017 17:14:23 +0000"  >&lt;p&gt;I couldn&#8217;t find in the ticket how much RAM is on this OSS for the 20 OSTs. I&#8217;m wondering if we are also having problems here with CPT allocations all happening on one CPT and hitting OOM while there is plenty of RAM available on a second CPT?&lt;/p&gt;</comment>
                            <comment id="216462" author="bruno.travouillon" created="Sat, 16 Dec 2017 17:57:44 +0000"  >&lt;p&gt;We allocate 90 GB of RAM and 8 CPU cores to each OSS. We can&apos;t allocate more resources per virtual guest in a SFA14KXE. The cores are HT.&lt;br/&gt;
&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; oss# cat /proc/sys/lnet/cpu_partition_table
 0 : 0 1 2 3 
 1 : 4 5 6 7 
 2 : 8 9 10 11 
 3 : 12 13 14 15
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Last time we hit OOM, the memory consumption of the OSS was at its maximum (90GB).&lt;/p&gt;</comment>
                            <comment id="219497" author="gerrit" created="Wed, 31 Jan 2018 05:51:53 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/29064/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29064/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: allow to limit number of service&apos;s rqbds&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: d9e57a765e73e1bc3046124433eb6e2186f7e07c&lt;/p&gt;</comment>
                            <comment id="219528" author="pjones" created="Wed, 31 Jan 2018 13:49:42 +0000"  >&lt;p&gt;Landed for 2.11&lt;/p&gt;</comment>
                            <comment id="219554" author="gerrit" created="Wed, 31 Jan 2018 15:47:56 +0000"  >&lt;p&gt;Minh Diep (minh.diep@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31108&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31108&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: allow to limit number of service&apos;s rqbds&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 69ad99bf62cf461df93419e57adb323a6d537e31&lt;/p&gt;</comment>
                            <comment id="221321" author="bfaccini" created="Tue, 20 Feb 2018 21:05:22 +0000"  >&lt;p&gt;Master patch &lt;a href=&quot;https://review.whamcloud.com/31162&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31162&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10603&quot; title=&quot;ptlrpc_lprocfs_req_buffers_max_fops unused&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10603&quot;&gt;&lt;del&gt;LU-10603&lt;/del&gt;&lt;/a&gt; is required to make associated tunable visible to the external world and thus to allow this &lt;a href=&quot;https://review.whamcloud.com/29064/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29064/&lt;/a&gt; patch/feature to be usable.&lt;/p&gt;

&lt;p&gt;So just in case, Minh: any back-port of #29064 also requires back-porting #31162.&lt;/p&gt;</comment>
                            <comment id="223300" author="gerrit" created="Mon, 12 Mar 2018 12:14:58 +0000"  >&lt;p&gt;Wang Shilong (wshilong@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31622&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31622&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; ptlrpc: fix req_buffers_max and req_history_max setting&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: aa9005eb5c9e873e9e83619ff830ba848917f118&lt;/p&gt;</comment>
                            <comment id="224092" author="bfaccini" created="Wed, 21 Mar 2018 09:00:37 +0000"  >&lt;p&gt;Both patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10803&quot; title=&quot;req_buffers_max and req_history_max setting problems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10803&quot;&gt;&lt;del&gt;LU-10803&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10826&quot; title=&quot;Regression in LU-9372 on OPA enviroment and no recovery triggered&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10826&quot;&gt;&lt;del&gt;LU-10826&lt;/del&gt;&lt;/a&gt; are also must-have/follow-ons to the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9372&quot; title=&quot;OOM happens on OSS during Lustre recovery for more than 5000 clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9372&quot;&gt;&lt;del&gt;LU-9372&lt;/del&gt;&lt;/a&gt; series.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="13170">LU-1099</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="51283">LU-10803</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="51413">LU-10826</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="50524">LU-10603</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="59319">LU-13600</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="28227" name="smurf623.log-20170709" size="322582" author="bruno.travouillon" created="Thu, 7 Sep 2017 15:16:18 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzaov:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>