<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:13:02 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14815] memory issues leading to blocked new connections until drop_cache set</title>
                <link>https://jira.whamcloud.com/browse/LU-14815</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have a system with 2 OSS and 2 MDS nodes running the community edition of Lustre with patched kernels. The specs of both OSS nodes are:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;2 socket 20 core Xeon Gold 6230 @ 2.1GHz&lt;/li&gt;
	&lt;li&gt;384GB of RAM&lt;/li&gt;
	&lt;li&gt;single port EDR IB&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Under load, the system typically shows the following load average and memory usage:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss2 ~&amp;#93;&lt;/span&gt;# uptime&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; 15:54:37 up 7:33, 2 users, load average: 185.68, 161.83, 164.17&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss2 ~&amp;#93;&lt;/span&gt;# free&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; total used free shared buff/cache available&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;Mem: 394501208 11600480 21953712 968424 360947016 380784736&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;Swap: 4194300 8968 4185332&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss2 ~&amp;#93;&lt;/span&gt;#&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;While the system is in use, we frequently see the following (once the messages start, they continue until either a reboot or an &quot;echo 3 &amp;gt; /proc/sys/vm/drop_caches&quot;):&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; kworker/13:0: page allocation failure: order:8, mode:0x80d0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; CPU: 13 PID: 115770 Comm: kworker/13:0 Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.9.1.el7_lustre.x86_64 #1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.10.0 11/12/2020&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Workqueue: ib_cm cm_work_handler &lt;span class=&quot;error&quot;&gt;&amp;#91;ib_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Call Trace:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffabf7ac23&amp;gt;&amp;#93;&lt;/span&gt; dump_stack+0x19/0x1b&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9c3d70&amp;gt;&amp;#93;&lt;/span&gt; warn_alloc_failed+0x110/0x180&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9c6af0&amp;gt;&amp;#93;&lt;/span&gt; ? drain_pages+0xb0/0xb0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9c897f&amp;gt;&amp;#93;&lt;/span&gt; __alloc_pages_nodemask+0x9df/0xbe0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffaba16b28&amp;gt;&amp;#93;&lt;/span&gt; alloc_pages_current+0x98/0x110&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9c2b1e&amp;gt;&amp;#93;&lt;/span&gt; __get_free_pages+0xe/0x40&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffabbadafc&amp;gt;&amp;#93;&lt;/span&gt; swiotlb_alloc_coherent+0x5c/0x160&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab86ead1&amp;gt;&amp;#93;&lt;/span&gt; x86_swiotlb_alloc_coherent+0x41/0x50&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc06d8394&amp;gt;&amp;#93;&lt;/span&gt; mlx5_dma_zalloc_coherent_node+0xb4/0x110 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_core&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc06d8c89&amp;gt;&amp;#93;&lt;/span&gt; mlx5_buf_alloc_node+0x89/0x120 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_core&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffaba674e1&amp;gt;&amp;#93;&lt;/span&gt; ? alloc_inode+0x51/0xa0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc06d8d34&amp;gt;&amp;#93;&lt;/span&gt; mlx5_buf_alloc+0x14/0x20 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_core&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d5e63f&amp;gt;&amp;#93;&lt;/span&gt; create_kernel_qp.isra.65+0x43a/0x741 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_ib&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d48d1c&amp;gt;&amp;#93;&lt;/span&gt; create_qp_common+0x8ec/0x17a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_ib&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffaba25286&amp;gt;&amp;#93;&lt;/span&gt; ? kmem_cache_alloc_trace+0x1d6/0x200&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d49d1a&amp;gt;&amp;#93;&lt;/span&gt; mlx5_ib_create_qp+0x14a/0x820 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_ib&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9ddc2d&amp;gt;&amp;#93;&lt;/span&gt; ? kvmalloc_node+0x8d/0xe0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9ddc2d&amp;gt;&amp;#93;&lt;/span&gt; ? kvmalloc_node+0x8d/0xe0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab9ddcb5&amp;gt;&amp;#93;&lt;/span&gt; ? kvfree+0x35/0x40&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d411e6&amp;gt;&amp;#93;&lt;/span&gt; ? mlx5_ib_create_cq+0x346/0x6f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mlx5_ib&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0cbccab&amp;gt;&amp;#93;&lt;/span&gt; ib_create_qp+0x8b/0x320 &lt;span class=&quot;error&quot;&gt;&amp;#91;ib_core&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d98974&amp;gt;&amp;#93;&lt;/span&gt; rdma_create_qp+0x34/0xb0 &lt;span class=&quot;error&quot;&gt;&amp;#91;rdma_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a0b85c&amp;gt;&amp;#93;&lt;/span&gt; kiblnd_create_conn+0xe5c/0x19b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ko2iblnd&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffaba2630d&amp;gt;&amp;#93;&lt;/span&gt; ? kmem_cache_alloc_node_trace+0x11d/0x210&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a1a24c&amp;gt;&amp;#93;&lt;/span&gt; kiblnd_passive_connect+0xa2c/0x1760 &lt;span class=&quot;error&quot;&gt;&amp;#91;ko2iblnd&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0a1b6d5&amp;gt;&amp;#93;&lt;/span&gt; kiblnd_cm_callback+0x755/0x23a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ko2iblnd&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d978af&amp;gt;&amp;#93;&lt;/span&gt; ? _cma_attach_to_dev+0x5f/0x70 &lt;span class=&quot;error&quot;&gt;&amp;#91;rdma_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d9cca0&amp;gt;&amp;#93;&lt;/span&gt; cma_ib_req_handler+0xce0/0x12a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;rdma_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d87beb&amp;gt;&amp;#93;&lt;/span&gt; cm_process_work+0x2b/0x130 &lt;span class=&quot;error&quot;&gt;&amp;#91;ib_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d89468&amp;gt;&amp;#93;&lt;/span&gt; cm_req_handler+0xaa8/0xf80 &lt;span class=&quot;error&quot;&gt;&amp;#91;ib_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab82b59e&amp;gt;&amp;#93;&lt;/span&gt; ? __switch_to+0xce/0x580&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0d89d1d&amp;gt;&amp;#93;&lt;/span&gt; cm_work_handler+0x15d/0xfcf &lt;span class=&quot;error&quot;&gt;&amp;#91;ib_cm&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffabf805a2&amp;gt;&amp;#93;&lt;/span&gt; ? __schedule+0x402/0x840&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab8be21f&amp;gt;&amp;#93;&lt;/span&gt; process_one_work+0x17f/0x440&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab8bf336&amp;gt;&amp;#93;&lt;/span&gt; worker_thread+0x126/0x3c0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab8bf210&amp;gt;&amp;#93;&lt;/span&gt; ? manage_workers.isra.26+0x2a0/0x2a0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab8c61f1&amp;gt;&amp;#93;&lt;/span&gt; kthread+0xd1/0xe0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab8c6120&amp;gt;&amp;#93;&lt;/span&gt; ? insert_kthread_work+0x40/0x40&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffabf8dd37&amp;gt;&amp;#93;&lt;/span&gt; ret_from_fork_nospec_begin+0x21/0x21&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffab8c6120&amp;gt;&amp;#93;&lt;/span&gt; ? insert_kthread_work+0x40/0x40&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Mem-Info:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; active_anon:133507 inactive_anon:148978 isolated_anon:0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; active_file:10646819 inactive_file:81974614 isolated_file:0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; unevictable:19088 dirty:1079 writeback:0 unstable:0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; slab_reclaimable:1867029 slab_unreclaimable:195034&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; mapped:30616 shmem:229818 pagetables:2598 bounce:0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; free:1061087 free_pcp:148 free_cma:0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; lowmem_reserve[]: 0 1281 191724 191724&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 DMA32 free:762136kB min:296kB low:368kB high:444kB active_anon:16kB inactive_anon:152kB active_file:4548kB inactive_file:4916kB unevictable:216kB isolated(anon):0kB isolated(file):0kB present:1566348kB managed:1312364kB mlocked:216kB dirty:4kB writeback:0kB mapped:216kB shmem:80kB slab_reclaimable:213740kB slab_unreclaimable:38944kB kernel_stack:144kB pagetables:236kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; lowmem_reserve[]: 0 0 190442 190442&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 Normal free:2062308kB min:44544kB low:55680kB high:66816kB active_anon:111288kB inactive_anon:128592kB active_file:15617196kB inactive_file:169639488kB unevictable:8768kB isolated(anon):0kB isolated(file):0kB present:198180864kB managed:195015992kB mlocked:8768kB dirty:1668kB writeback:0kB mapped:39240kB shmem:126220kB slab_reclaimable:3365336kB slab_unreclaimable:340572kB kernel_stack:14336kB pagetables:4008kB unstable:0kB bounce:0kB free_pcp:420kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; lowmem_reserve[]: 0 0 0 0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 1 Normal free:1404008kB min:45260kB low:56572kB high:67888kB active_anon:422868kB inactive_anon:467168kB active_file:26965532kB inactive_file:158254052kB unevictable:67368kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198156940kB mlocked:67368kB dirty:2644kB writeback:0kB mapped:83008kB shmem:792972kB slab_reclaimable:3889040kB slab_unreclaimable:400620kB kernel_stack:20768kB pagetables:6148kB unstable:0kB bounce:0kB free_pcp:216kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; lowmem_reserve[]: 0 0 0 0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 DMA32: 1512*4kB (UEM) 1277*8kB (UEM) 881*16kB (UEM) 767*32kB (UEM) 720*64kB (UEM) 457*128kB (UEM) 252*256kB (UEM) 155*512kB (UM) 22*1024kB (UEM) 51*2048kB (UE) 81*4096kB (M) = 762104kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 Normal: 5268*4kB (UE) 4145*8kB (UE) 125630*16kB (UEM) 1*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2064344kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 1 Normal: 57139*4kB (UEM) 5107*8kB (UE) 70991*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1405268kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; 92854017 total pagecache pages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; 86 pages in swap cache&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Swap cache stats: add 520, delete 434, find 402/518&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Free swap = 4185332kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; Total swap = 4194300kB&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; 100272446 pages RAM&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; 0 pages HighMem/MovableOnly&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:40:39 2021&amp;#93;&lt;/span&gt; 1647148 pages reserved&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;Other Jira entries (e.g. &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10133&quot; title=&quot;Multi-page allocation failures in mlx4/mlx5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10133&quot;&gt;&lt;del&gt;LU-10133&lt;/del&gt;&lt;/a&gt;) mention that particular versions of the MLNX code may be at fault. This system uses the OFED version that ships prebuilt with the Lustre packages.&lt;/p&gt;

&lt;p&gt;Once the above messages begin to repeat, new mounts cannot succeed (they hang at mount). A compute node that had the filesystem mounted and was then rebooted cannot mount it again until either the OSS nodes are rebooted (clearing memory and going through recovery) or &quot;echo 3 &amp;gt; /proc/sys/vm/drop_caches&quot; is run.&lt;/p&gt;

&lt;p&gt;Currently, all Lustre module tunables are at their defaults. We tried a few different options in the hope of getting better performance, but the same issues occurred.&lt;/p&gt;</description>
                <environment>CentOS 7.8.2003&lt;br/&gt;
Kernel -- kernel-3.10.0-1062.9.1.el7_lustre.x86_64&lt;br/&gt;
Lustre -- lustre-2.12.4-1.el7.x86_64&lt;br/&gt;
e2fsprogs -- e2fsprogs-1.45.6.wc5-0.el7.x86_64&lt;br/&gt;
IML -- yes</environment>
        <key id="64985">LU-14815</key>
            <summary>memory issues leading to blocked new connections until drop_cache set</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="makia">Makia Minich</reporter>
                        <labels>
                    </labels>
                <created>Mon, 5 Jul 2021 14:00:09 +0000</created>
                <updated>Thu, 22 Jul 2021 15:56:29 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="306215" author="makia" created="Mon, 5 Jul 2021 14:11:26 +0000"  >&lt;p&gt;During this, on the client side we would see the following:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:56:39 2021&amp;#93;&lt;/span&gt; LNet: 3570:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.15.7@o2ib: 0 seconds&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:56:39 2021&amp;#93;&lt;/span&gt; LNet: 3570:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 2 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 15:59:54 2021&amp;#93;&lt;/span&gt; LNetError: 23802:0:(o2iblnd_cb.c:2957:kiblnd_rejected()) 172.19.15.7@o2ib rejected: o2iblnd no resources&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;At this point, from the client&apos;s view, file operations appear to be working (I was able to df, touch/dd new files, etc.), but attempting to umount or mount caused an &quot;in use&quot; error (with lsof and ps showing no jobs using the filesystem).&lt;/p&gt;</comment>
                            <comment id="306216" author="makia" created="Mon, 5 Jul 2021 14:14:35 +0000"  >&lt;p&gt;I ran a drop_caches on the OSS while the system was locked up, and as soon as it ran, the OSS&apos;s dmesg showed it starting to process the backlog of requests:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:05:06 2021&amp;#93;&lt;/span&gt; Lustre: lustre3p-OST000f: Export ffff90889aaf2c00 already connecting from 172.19.5.98@o2ib&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:05:06 2021&amp;#93;&lt;/span&gt; Lustre: Skipped 1 previous similar message&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:06:22 2021&amp;#93;&lt;/span&gt; Lustre: lustre3p-OST000f: Client ad43237e-1cd2-3183-5c17-fd718830486d (at 172.19.4.231@o2ib) reconnecting&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:06:22 2021&amp;#93;&lt;/span&gt; Lustre: Skipped 104 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:06:55 2021&amp;#93;&lt;/span&gt; LustreError: 137-5: lustre3p-OST000a_UUID: not available for connect from 172.19.2.49@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:06:55 2021&amp;#93;&lt;/span&gt; LustreError: Skipped 853 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:07:22 2021&amp;#93;&lt;/span&gt; Lustre: lustre3p-OST0006: Connection restored to 7f8fbac2-b395-511a-0c54-0db2e0bb9f4b (at 172.19.4.178@o2ib)&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:07:22 2021&amp;#93;&lt;/span&gt; Lustre: Skipped 2624 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:07:41 2021&amp;#93;&lt;/span&gt; Lustre: 9662:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:5s); client may timeout. req@ffff90463dd80850 x1703551324093056/t0(0) o8-&amp;gt;c52b4ac8-56a4-5335-0855-f4c02ca42bf4@172.19.1.70@o2ib:0/0 lens 520/384 e 0 to 0 dl 1625494121 ref 1 fl Complete:/0/0 rc 0/0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:07:42 2021&amp;#93;&lt;/span&gt; Lustre: 8283:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:21s); client may timeout. req@ffff907f651c7050 x1703549555959744/t0(0) o8-&amp;gt;192b6053-7423-c4b1-ecb3-5c1d5b6ce24e@172.19.4.137@o2ib:0/0 lens 520/384 e 0 to 0 dl 1625494107 ref 1 fl Complete:/0/0 rc 0/0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:07:42 2021&amp;#93;&lt;/span&gt; Lustre: 8283:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 3 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Mon Jul 5 16:09:41 2021&amp;#93;&lt;/span&gt; Lustre: lustre3p-OST0006: denying duplicate export for 42b2a082-b982-1dae-1423-ee8ad63fe6bc, -114&lt;/tt&gt;&lt;/p&gt;

</comment>
                            <comment id="306217" author="pjones" created="Mon, 5 Jul 2021 14:20:37 +0000"  >&lt;p&gt;Does 2.12.7 RC1 exhibit the same behaviour?&lt;/p&gt;</comment>
                            <comment id="306220" author="makia" created="Mon, 5 Jul 2021 14:34:02 +0000"  >&lt;p&gt;At the moment I&apos;m unsure; this system is meant to be running in production, so our ability to change versions is limited. Is there something in 2.12.7 RC1 that I should be looking at, rather than the latest 2.12.6 release?&lt;/p&gt;</comment>
                            <comment id="307165" author="makia" created="Tue, 13 Jul 2021 13:14:05 +0000"  >&lt;p&gt;We have upgraded the Lustre servers to 2.12.6 and continue to see the originally reported error. In addition, we&apos;re seeing the following messages intermingled with it:&lt;/p&gt;


&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:05:31 2021&amp;#93;&lt;/span&gt; Lustre: lustre3p-OST0008: Connection restored to 2ec09a79-f161-5925-2c1a-06ba4e2425a8 (at 172.19.2.66@o2ib)&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:05:31 2021&amp;#93;&lt;/span&gt; Lustre: Skipped 643 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:05:42 2021&amp;#93;&lt;/span&gt; Lustre: 12315:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/5), not sending early reply&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt; req@ffff8dae96cb6050 x1703545892360832/t0(0) o2-&amp;gt;d30c2397-6a5e-d86a-b8eb-3c398c7190ad@172.19.1.160@o2ib:74/0 lens 440/432 e 8 to 0 dl 1626181739 ref 2 fl Interpret:/0/0 rc 0/0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:05:42 2021&amp;#93;&lt;/span&gt; Lustre: 12315:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:06:20 2021&amp;#93;&lt;/span&gt; LustreError: 12383:0:(ldlm_lib.c:3287:target_bulk_io()) @@@ bulk WRITE failed: rc -107 req@ffff8d99924f0850 x1703545892358336/t0(0) o4-&amp;gt;d30c2397-6a5e-d86a-b8eb-3c398c7190ad@172.19.1.160@o2ib:229/0 lens 488/448 e 0 to 0 dl 1626181894 ref 1 fl Interpret:/0/0 rc 0/0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:06:20 2021&amp;#93;&lt;/span&gt; Lustre: lustre3p-OST000e: Bulk IO write error with d30c2397-6a5e-d86a-b8eb-3c398c7190ad (at 172.19.1.160@o2ib), client will retry: rc = -107&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:06:20 2021&amp;#93;&lt;/span&gt; LustreError: 12383:0:(ldlm_lib.c:3287:target_bulk_io()) Skipped 6 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:06:20 2021&amp;#93;&lt;/span&gt; LNet: Service thread pid 19455 completed after 941.21s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:06:20 2021&amp;#93;&lt;/span&gt; LNet: Skipped 1 previous similar message&lt;/tt&gt;&lt;/p&gt;
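
&lt;p&gt;For reading the return codes in these messages: Lustre generally reports negated Linux errno values, so the &lt;tt&gt;rc = -107&lt;/tt&gt; above corresponds to ENOTCONN and the &lt;tt&gt;-114&lt;/tt&gt; in the earlier &quot;denying duplicate export&quot; message to EALREADY. The following is only a minimal sketch of pulling such a code out of a log line; the helper name &lt;tt&gt;decode_rc&lt;/tt&gt; and the two-entry lookup table are illustrative assumptions, not part of any Lustre tool.&lt;/p&gt;

```python
import re

# Linux errno values for the two codes seen in this ticket (illustrative subset).
LINUX_ERRNO_NAMES = {107: "ENOTCONN", 114: "EALREADY"}


def decode_rc(message):
    """Return (errno, name) for an 'rc = -N' field in a log message, else None."""
    match = re.search(r"rc = -(\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, LINUX_ERRNO_NAMES.get(code, "unknown")


# Message text copied from the log excerpt above (timestamp omitted).
line = (
    "Lustre: lustre3p-OST000e: Bulk IO write error with "
    "d30c2397-6a5e-d86a-b8eb-3c398c7190ad (at 172.19.1.160@o2ib), "
    "client will retry: rc = -107"
)
print(decode_rc(line))
```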


&lt;p&gt;The reference to service threads is interesting, as we have been running with a limited number of threads since seeing this message earlier. At the moment we have:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss1 ~&amp;#93;&lt;/span&gt;# lctl get_param ost.OSS.*.threads_started&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ost.OSS.ost.threads_started=112&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ost.OSS.ost_create.threads_started=24&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ost.OSS.ost_io.threads_started=248&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ost.OSS.ost_out.threads_started=4&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ost.OSS.ost_seq.threads_started=4&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss1 ~&amp;#93;&lt;/span&gt;#&lt;/tt&gt;&lt;/p&gt;
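
&lt;p&gt;To compare these counts across nodes or over time, the output can be parsed into per-service totals. This is a rough sketch that assumes the &lt;tt&gt;key=value&lt;/tt&gt; layout shown above; the function name &lt;tt&gt;parse_threads_started&lt;/tt&gt; is hypothetical and the sample text is copied from the output above.&lt;/p&gt;

```python
def parse_threads_started(text):
    """Return {service_name: started_thread_count} from lctl get_param output."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip shell prompts and blank lines
        key, value = line.split("=", 1)
        parts = key.split(".")
        # Expected key shape: ost.OSS.SERVICE.threads_started
        if len(parts) == 4 and parts[3] == "threads_started":
            counts[parts[2]] = int(value)
    return counts


SAMPLE = """\
ost.OSS.ost.threads_started=112
ost.OSS.ost_create.threads_started=24
ost.OSS.ost_io.threads_started=248
ost.OSS.ost_out.threads_started=4
ost.OSS.ost_seq.threads_started=4
"""

counts = parse_threads_started(SAMPLE)
total = sum(counts.values())
print(counts)
print("total service threads started:", total)
```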

</comment>
                            <comment id="307173" author="makia" created="Tue, 13 Jul 2021 13:28:08 +0000"  >&lt;p&gt;I&apos;m unsure whether this is related, a cause, a result, or a different issue, but we have also captured the following errors in the logs:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; LNet: Service thread pid 12287 was inactive for 1014.41s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debuggi&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ng purposes:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; Pid: 12287, comm: ll_ost01_034 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; Call Trace:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0682085&amp;gt;&amp;#93;&lt;/span&gt; wait_transaction_locked+0x85/0xd0 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0682378&amp;gt;&amp;#93;&lt;/span&gt; add_transaction_credits+0x278/0x310 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0682601&amp;gt;&amp;#93;&lt;/span&gt; start_this_handle+0x1a1/0x430 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0682cce&amp;gt;&amp;#93;&lt;/span&gt; jbd2__journal_restart+0xfe/0x160 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0682d43&amp;gt;&amp;#93;&lt;/span&gt; jbd2_journal_restart+0x13/0x20 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1cb23fe&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_truncate_restart_trans+0x4e/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1c683a2&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_ext_truncate_extend_restart+0x42/0x60 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1c6bd5a&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_ext_remove_space+0x56a/0x1150 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1c6e880&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_ext_truncate+0xb0/0xe0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1cb7967&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_truncate+0x3b7/0x3f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1cb86ca&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_evict_inode+0x58a/0x630 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa386c154&amp;gt;&amp;#93;&lt;/span&gt; evict+0xb4/0x180&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa386c58c&amp;gt;&amp;#93;&lt;/span&gt; iput+0xfc/0x190&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0c280c8&amp;gt;&amp;#93;&lt;/span&gt; osd_object_delete+0x1f8/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;osd_ldiskfs&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc17f4698&amp;gt;&amp;#93;&lt;/span&gt; lu_object_free.isra.32+0x68/0x170 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc17f81d5&amp;gt;&amp;#93;&lt;/span&gt; lu_object_put+0xc5/0x3e0 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0b26b5e&amp;gt;&amp;#93;&lt;/span&gt; ofd_destroy_by_fid+0x20e/0x510 &lt;span class=&quot;error&quot;&gt;&amp;#91;ofd&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc0b1d797&amp;gt;&amp;#93;&lt;/span&gt; ofd_destroy_hdl+0x257/0x9d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ofd&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1aeaf1a&amp;gt;&amp;#93;&lt;/span&gt; tgt_request_handle+0xada/0x1570 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1a8f88b&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_server_handle_request+0x24b/0xab0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffc1a931f4&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_main+0xb34/0x1470 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa36c5c21&amp;gt;&amp;#93;&lt;/span&gt; kthread+0xd1/0xe0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa3d93df7&amp;gt;&amp;#93;&lt;/span&gt; ret_from_fork_nospec_end+0x0/0x39&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffffffffff&amp;gt;&amp;#93;&lt;/span&gt; 0xffffffffffffffff&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; LustreError: dumping log to /tmp/lustre-log.1626182339.12287&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:15:46 2021&amp;#93;&lt;/span&gt; LustreError: 12397:0:(tgt_grant.c:758:tgt_grant_check()) lustre3p-OST000e: cli eaa76b78-0728-7cc7-0fc0-9aa3218a1a41 claims 9076736 GRANT, real grant 0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:16:01 2021&amp;#93;&lt;/span&gt; Lustre: 12301:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (100:30s); client may timeout. req@ffff8dcef0976050 x1704649368564480/t0(0) o8-&amp;gt;aecbb3c1-bc2b-9f53-4e37-e2efc6ef0638@172.19.4.220@o2ib:0/0 lens 520/384 e 0 to 0 dl 1626182324 ref 1 fl Complete:/0/0 rc 0/0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:16:01 2021&amp;#93;&lt;/span&gt; Lustre: 12301:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 5 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:17:24 2021&amp;#93;&lt;/span&gt; LustreError: 12712:0:(tgt_grant.c:758:tgt_grant_check()) lustre3p-OST000e: cli eaa76b78-0728-7cc7-0fc0-9aa3218a1a41 claims 9076736 GRANT, real grant 7868416&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;Tue Jul 13 15:17:24 2021&amp;#93;&lt;/span&gt; LustreError: 12712:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 1 previous similar message&lt;/tt&gt;&lt;/p&gt;</comment>
                            <comment id="308079" author="eaujames" created="Thu, 22 Jul 2021 07:44:30 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;Sorry to interfere here.&lt;/p&gt;

&lt;p&gt;You seem to be missing the following commit in your MOFED:&lt;br/&gt;
 &lt;a href=&quot;https://github.com/torvalds/linux/commit/34f4c9554d8b2a7d2deb9503e9373b598ee3279f&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;34f4c9554d8b2a7d2deb9503e9373b598ee3279f&lt;/a&gt;: IB/mlx5: Use fragmented QP&apos;s buffer for in-kernel users&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;mlx5_frag_buf_alloc_node&lt;/tt&gt; should be used instead of &lt;tt&gt;mlx5_buf_alloc_node&lt;/tt&gt;.&lt;/p&gt;
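
&lt;p&gt;One possible way to check whether a running node exposes the symbol is to look for it in &lt;tt&gt;/proc/kallsyms&lt;/tt&gt;, where each line has the form &quot;address type name [module]&quot;. The sketch below is an assumption about how one might do that, not a procedure from this ticket: the helper name &lt;tt&gt;symbol_modules&lt;/tt&gt; is hypothetical, and the sample lines are made up for illustration (real addresses and module names may differ).&lt;/p&gt;

```python
def symbol_modules(kallsyms_text, symbol):
    """Return the owners ("vmlinux" or a module name) that define `symbol`."""
    owners = []
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Each line looks like: ADDRESS TYPE NAME [MODULE]
        if len(fields) >= 3 and fields[2] == symbol:
            owners.append(fields[3].strip("[]") if len(fields) > 3 else "vmlinux")
    return owners


# Made-up sample lines; on a real node, read the text from /proc/kallsyms.
SAMPLE = """\
0000000000000000 T kthread
0000000000000000 t mlx5_buf_alloc_node [mlx5_core]
0000000000000000 T mlx5_frag_buf_alloc_node [mlx5_core]
"""

print(symbol_modules(SAMPLE, "mlx5_frag_buf_alloc_node"))
```

&lt;p&gt;On a live system the text would come from something like &lt;tt&gt;open(&quot;/proc/kallsyms&quot;).read()&lt;/tt&gt;; an empty result would suggest the loaded driver does not provide the symbol.&lt;/p&gt;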

&lt;p&gt;MOFED 4.7 should have this symbol (in the kmod-mlnx-ofa_kernel.x86_64 rpm).&lt;/p&gt;</comment>
                            <comment id="308118" author="makia" created="Thu, 22 Jul 2021 15:56:29 +0000"  >&lt;p&gt;It does look like that patch is missing from all of the kernel options in the community release. I&apos;m looking at the patch and at options for trying it on this system. Thank you for the link.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01yhr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>