<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:42:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11305] MDS&apos; all cpu cores eaten up by ldlm_cn [100% cpu sys]</title>
                <link>https://jira.whamcloud.com/browse/LU-11305</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The MDS would sometimes have all CPU cores at 100% system CPU, all of it consumed by ldlm_cn threads.&lt;/p&gt;


&lt;p&gt; PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND&lt;br/&gt;
 4362 root 20 0 0 0 0 R 80.3 0.0 3:20.09 ldlm_cn00_009&lt;br/&gt;
 3575 root 20 0 0 0 0 R 78.4 0.0 3:14.47 ldlm_cn00_001&lt;/p&gt;

&lt;p&gt;...... [a lot more]&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;When this occurs, the MDS stops servicing requests entirely; all clients become nearly stuck on every Lustre operation.&lt;/p&gt;

&lt;p&gt;I am wondering whether this might be caused by too many lock cancellations.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Sample logs are shown below:&lt;/p&gt;


&lt;p&gt;Aug 28 15:40:13 mds01 kernel: Lustre: 173646:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff887c82057b00 x1606968047458384/t0(0) o103-&amp;gt;0dd14e56-602f-6dfd-904c-c16839ee8771@10.5.130.198@o2ib:22/0 lens 328/192 e 0 to 0 dl 1535442012 ref 1 fl Complete:/0/0 rc 0/0&lt;br/&gt;
Aug 28 15:40:13 mds01 kernel: Lustre: shlfs1-MDT0000: Client bd0cfc4b-0556-18d5-e2f1-d639eae3e01d (at 10.5.130.204@o2ib) reconnecting&lt;br/&gt;
Aug 28 15:40:13 mds01 kernel: Lustre: Skipped 2 previous similar messages&lt;br/&gt;
Aug 28 15:40:13 mds01 kernel: Lustre: shlfs1-MDT0000: Connection restored to 1107c10e-a39c-cb3c-70cd-a12f619c2dc7 (at 10.5.130.204@o2ib)&lt;br/&gt;
Aug 28 15:40:13 mds01 kernel: Lustre: Skipped 2 previous similar messages&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 195350:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff887ab4261e00 x1606966870393364/t0(0) o103-&amp;gt;71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 9552:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff883b11cd9200 x1606966870394288/t0(0) o103-&amp;gt;71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 9552:0:(service.c:1961:ptlrpc_server_handle_req_in()) Skipped 49 previous similar messages&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 2827:0:(service.c:1961:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@ffff887ac59e2050 x1606964113976448/t0(0) o103-&amp;gt;bf7e471f-a94d-caa4-7d7a-d7d587e62d38@10.5.130.207@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 2827:0:(service.c:1961:ptlrpc_server_handle_req_in()) Skipped 266 previous similar messages&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=7 reqQ=61491 recA=0, svcEst=36, delay=945(jiff)&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883b12896600 x1606964113937204/t0(0) o103-&amp;gt;bf7e471f-a94d-caa4-7d7a-d7d587e62d38@10.5.130.207@o2ib:36/0 lens 328/0 e 0 to 0 dl 1535442026 ref 2 fl New:/2/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:40:27 mds01 kernel: Lustre: 9550:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 15326 previous similar messages&lt;br/&gt;
Aug 28 15:40:28 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).&lt;/p&gt;


&lt;p&gt;Aug 28 15:45:13 mds01 kernel: LNet: Service thread pid 5649 was inactive for 200.66s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:&lt;br/&gt;
Aug 28 15:45:15 mds01 kernel: Pid: 5649, comm: mdt01_041&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: #012Call Trace:&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0aa9f91&amp;gt;] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff816ab6b9&amp;gt;] schedule+0x29/0x70&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff816a9004&amp;gt;] schedule_timeout+0x174/0x2c0&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff8109a6c0&amp;gt;] ? process_timeout+0x0/0x10&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0900e81&amp;gt;] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c6d020&amp;gt;] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c6d811&amp;gt;] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff810c6440&amp;gt;] ? default_wake_function+0x0/0x20&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c6f730&amp;gt;] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c6d260&amp;gt;] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc1144f10&amp;gt;] ? mdt_blocking_ast+0x0/0x2e0 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc1154352&amp;gt;] mdt_object_local_lock+0x452/0xaf0 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc1144f10&amp;gt;] ? mdt_blocking_ast+0x0/0x2e0 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c6d260&amp;gt;] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0ac9165&amp;gt;] ? lu_object_find_at+0x95/0x290 [obdclass]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc1154a4e&amp;gt;] mdt_object_lock_internal+0x5e/0x300 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc1154d04&amp;gt;] mdt_object_lock+0x14/0x20 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc11784a1&amp;gt;] mdt_reint_open+0xc71/0x31a0 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0adc4ce&amp;gt;] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc115eaa3&amp;gt;] ? ucred_set_jobid+0x53/0x70 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc116e8a0&amp;gt;] mdt_reint_rec+0x80/0x210 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc115030b&amp;gt;] mdt_reint_internal+0x5fb/0x9c0 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc1150832&amp;gt;] mdt_intent_reint+0x162/0x430 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc115b59e&amp;gt;] mdt_intent_policy+0x43e/0xc70 [mdt]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c5a12f&amp;gt;] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c53277&amp;gt;] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0c7c903&amp;gt;] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0ca4ae0&amp;gt;] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0d01ea2&amp;gt;] tgt_enqueue+0x62/0x210 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0d05da5&amp;gt;] tgt_request_handle+0x925/0x1370 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0caeb16&amp;gt;] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0cab148&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff810c6452&amp;gt;] ? default_wake_function+0x12/0x20&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff810bc0f8&amp;gt;] ? __wake_up_common+0x58/0x90&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0cb2252&amp;gt;] ptlrpc_main+0xa92/0x1e40 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffffc0cb17c0&amp;gt;] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff810b252f&amp;gt;] kthread+0xcf/0xe0&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff810b2460&amp;gt;] ? kthread+0x0/0xe0&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff816b8798&amp;gt;] ret_from_fork+0x58/0x90&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: [&amp;lt;ffffffff810b2460&amp;gt;] ? kthread+0x0/0xe0&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel:&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1535442313.5649&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: Lustre: 173675:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-93s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff883a352ccb00 x1607113994636712/t0(0) o103-&amp;gt;&amp;lt;?&amp;gt;@&amp;lt;?&amp;gt;:232/0 lens 328/0 e 0 to 0 dl 1535442222 ref 2 fl New:/0/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:45:18 mds01 kernel: Lustre: 173675:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 430820 previous similar messages&lt;br/&gt;
Aug 28 15:45:21 mds01 kernel: Lustre: 16782:0:(service.c:2112:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (45:264s); client may timeout. req@ffff883b9911ec00 x1606966869834560/t0(0) o103-&amp;gt;71fcac4f-4a38-7ef2-79f9-1674a8f4335c@10.5.130.108@o2ib:67/0 lens 328/0 e 0 to 0 dl 1535442057 ref 1 fl Interpret:/0/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:45:21 mds01 kernel: Lustre: 16782:0:(service.c:2112:ptlrpc_server_handle_request()) Skipped 6215 previous similar messages&lt;br/&gt;
Aug 28 15:45:36 mds01 kernel: LustreError: 5581:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff883a2f27e600 x1609956713313792/t0(0) o104-&amp;gt;shlfs1-MDT0000@10.5.130.231@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1&lt;br/&gt;
Aug 28 15:45:43 mds01 kernel: Lustre: MGS: haven&apos;t heard from client 0c714426-f6a5-e005-8bca-baf99256c140 (at 10.5.130.54@o2ib) in 209 seconds. I think it&apos;s dead, and I am evicting it. exp ffff887f7a386800, cur 1535442342 expire 1535442192 last 1535442133&lt;br/&gt;
Aug 28 15:45:47 mds01 kernel: Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).&lt;br/&gt;
Aug 28 15:45:47 mds01 kernel: Lustre: Skipped 8172 previous similar messages&lt;br/&gt;
Aug 28 15:45:47 mds01 kernel: Lustre: 9552:0:(service.c:1511:ptlrpc_at_check_timed()) earlyQ=44471 reqQ=1316948 recA=0, svcEst=364, delay=703(jiff)&lt;br/&gt;
Aug 28 15:45:47 mds01 kernel: Lustre: 9552:0:(service.c:1511:ptlrpc_at_check_timed()) Skipped 8173 previous similar messages&lt;br/&gt;
Aug 28 15:45:49 mds01 kernel: LNet: Service thread pid 2647 was inactive for 200.56s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment>CentOS 7.4, SSD MDT * 2&lt;br/&gt;
MDS: E5-2697A v4 * 2, 512GB</environment>
        <key id="53161">LU-11305</key>
            <summary>MDS&apos; all cpu cores eaten up by ldlm_cn [100% cpu sys]</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ys">Yang Sheng</assignee>
                                    <reporter username="anhua">anhua</reporter>
                        <labels>
                            <label>ldlm</label>
                    </labels>
                <created>Thu, 30 Aug 2018 12:42:47 +0000</created>
                <updated>Tue, 26 Jul 2022 03:54:42 +0000</updated>
                            <resolved>Tue, 26 Jul 2022 03:54:42 +0000</resolved>
                                    <version>Lustre 2.10.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="252808" author="icostelloddn" created="Fri, 9 Aug 2019 02:36:00 +0000"  >&lt;p&gt;I am seeing the same issue as well however on the oss&apos;s, the similarities I am seeing are the ldlm threads consuming all except 2 cores (libcfs cpu_pattern) on the oss&apos;s (DNE). &lt;/p&gt;</comment>
                            <comment id="254763" author="degremoa" created="Mon, 16 Sep 2019 22:29:38 +0000"  >&lt;p&gt;We were able to reproduce a very similar issue.&lt;/p&gt;

&lt;p&gt;We are seeing an unresponsive MDS with all CPUs at 100%: lots of &lt;b&gt;ldlm_cn_xxx&lt;/b&gt; threads and a few &lt;b&gt;ldlm_bl&lt;/b&gt; threads are consuming all the CPUs.&lt;/p&gt;

&lt;p&gt;The &lt;b&gt;ldlm_cn&lt;/b&gt; threads have this kind of stack trace:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&#160;Aug 26 13:38:31 ip-198-19-95-73 kernel: Lustre: 79258:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:136s); client may timeout. req@ffff88d6dfb47850 x1642929689998160/t0(0) o103-&amp;gt;0e4989ad-ae70-2c70-9db7-05f5fafae50e@10.26.86.9@tcp:745/0 lens 3584/0 e 0 to 0 dl 1566826575 ref 1 fl Interpret:/2/ffffffff rc 0/-1
 Aug 26 13:38:31 ip-198-19-95-73 kernel: Lustre: 79258:0:(service.c:2114:ptlrpc_server_handle_request()) Skipped 11126 previous similar messages
 Aug 26 13:39:11 ip-198-19-95-73 kernel: LNet: Service thread pid 14956 was inactive for 204.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
 Aug 26 13:39:11 ip-198-19-95-73 kernel: Pid: 14956, comm: mdt01_094 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019
 Aug 26 13:39:11 ip-198-19-95-73 kernel: Call Trace:
 Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_completion_ast+0x63e/0x910 [ptlrpc]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_cli_enqueue_local+0x27c/0x880 [ptlrpc]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_object_local_lock+0x759/0xa00 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_object_lock_internal+0x5b/0x2e0 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_reint_open+0xb4e/0x3020 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_reint_rec+0x7f/0x250 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_reint_internal+0x5ee/0x680 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_intent_reint+0x173/0x3e0 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_intent_opc+0x385/0xa00 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: mdt_intent_policy+0x1c5/0x370 [mdt]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
 Aug 26 13:39:11 ip-198-19-95-73 kernel: ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;A couple of &lt;b&gt;ldlm_bl_NN&lt;/b&gt; threads have this kind of stack, obtained from crash bt -T:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ffffc9001eecfa58] try_to_wake_up at ffffffff810b3b74
 [ffffc9001eecfac8] __update_load_avg_se at ffffffff810b9258
 [ffffc9001eecfad8] __update_load_avg_se at ffffffff810b9258
 [ffffc9001eecfb10] update_load_avg at ffffffff810bbacd
 [ffffc9001eecfb30] account_entity_enqueue at ffffffff810bdb45
 [ffffc9001eecfb68] enqueue_entity at ffffffff810bfa63
 [ffffc9001eecfbf8] tgt_blocking_ast at ffffffffa0c78a28 [ptlrpc]
 [ffffc9001eecfc40] ldlm_export_lock_hash at ffffffffa0bea6d0 [ptlrpc]
 --------------
 [ffffc9001eecfc48] cfs_hash_bd_from_key at ffffffffa084fad1 [libcfs]
 [ffffc9001eecfc68] cfs_hash_rw_lock at ffffffffa084f370 [libcfs]
 [ffffc9001eecfc70] cfs_hash_hh_hnode_del at ffffffffa084f3e0 [libcfs]
 [ffffc9001eecfc78] cfs_hash_bd_del_locked at ffffffffa084f7df [libcfs]
 [ffffc9001eecfc90] cfs_hash_rw_unlock at ffffffffa084ff20 [libcfs]
 [ffffc9001eecfc98] cfs_hash_del at ffffffffa085305e [libcfs]
 [ffffc9001eecfcd8] cfs_hash_bd_from_key at ffffffffa084fad1 [libcfs]
 [ffffc9001eecfd18] cfs_hash_bd_from_key at ffffffffa084fad1 [libcfs]
 [ffffc9001eecfd80] ldlm_lock_put at ffffffffa0bc3f73 [ptlrpc]
 [ffffc9001eecfd90] cfs_hash_hh_hhead at ffffffffa084f390 [libcfs]
 [ffffc9001eecfd98] cfs_hash_for_each_relax at ffffffffa0850458 [libcfs]
 [ffffc9001eecfda0] ldlm_cancel_locks_for_export_cb at ffffffffa0bcbe50 [ptlrpc]
 [ffffc9001eecfdb8] ldlm_export_lock_put at ffffffffa0bea740 [ptlrpc]
 [ffffc9001eecfdf0] ldlm_cancel_locks_for_export_cb at ffffffffa0bcbe50 [ptlrpc]
 [ffffc9001eecfe18] cfs_hash_for_each_empty at ffffffffa08538b8 [libcfs]
 [ffffc9001eecfe58] ldlm_export_cancel_locks at ffffffffa0bcc10f [ptlrpc]
 [ffffc9001eecfe80] ldlm_bl_thread_main at ffffffffa0bf5147 [ptlrpc]
 [ffffc9001eecfec0] default_wake_function at ffffffff810b4040
 [ffffc9001eecff08] ldlm_bl_thread_main at ffffffffa0bf4c60 [ptlrpc]
 [ffffc9001eecff10] kthread at ffffffff810a90fa
 [ffffc9001eecff20] kthread at ffffffff810a8fe0
 [ffffc9001eecff50] ret_from_fork at ffffffff81800205&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;b&gt;perf top&lt;/b&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Samples: 10M of event &apos;cpu-clock&apos;, Event count (approx.): 118365912201 
 Overhead Shared Object Symbol 
 45.75% [libcfs] [k] cfs_hash_hh_hhead
 21.23% [libcfs] [k] cfs_hash_for_each_relax
 6.93% [kernel] [k] __x86_indirect_thunk_rax
 3.85% [obdclass] [k] class_handle2object
 2.72% [kernel] [k] _raw_read_lock
 2.46% [libcfs] [k] cfs_hash_rw_unlock&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This issue appears when there are lots of locks in the MDT namespace.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ldlm.namespaces.mdt-lustre-MDT0000_UUID.contended_locks=32
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.contention_seconds=2
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.ctime_age_limit=10
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.early_lock_cancel=0
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_count=46344331
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_timeouts=55
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.lock_unused_count=0
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.lru_max_age=3900000
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.lru_size=9600
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.max_nolock_bytes=0
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.max_parallel_ast=1024
 ldlm.namespaces.mdt-lustre-MDT0000_UUID.resource_count=14281836&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ldlm.namespaces.mdt-lustre-MDT0000_UUID.pool.stats=
snapshot_time 1568643075.466635065 secs.nsecs
granted 59817 samples [locks] 0 55589762 171302800233 8049595240895311875
grant 153081925 samples [locks] 1 1 153081925 153081925
cancel 106817757 samples [locks] 1 1 106817757 106817757
grant_rate 59817 samples [locks/s] 0 527737 153081896 36151088814526
cancel_rate 59817 samples [locks/s] 0 315229 106817738 15370244224014
grant_plan 59817 samples [locks/s] 382861 47857748 189705106051 7747978793807775269
slv 59817 samples [slv] 1 1378303200000 77899468648378585 6008560669154165631
recalc_freed 259959295 samples [locks] 0 0 0 0
recalc_timing 59791 samples [sec] 1 52689 140658 3151201290&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
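
&lt;p&gt;(The dumps above can be collected on the MDS with lctl get_param; the namespace name mdt-lustre-MDT0000_UUID is specific to our test filesystem, so this sketch uses a wildcard instead:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# per-namespace LDLM counters on the MDS
lctl get_param ldlm.namespaces.mdt-*.lock_count ldlm.namespaces.mdt-*.resource_count

# pool statistics for the same namespace
lctl get_param ldlm.namespaces.mdt-*.pool.stats&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;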
&lt;ul&gt;
	&lt;li&gt;The MDT is receiving plenty of LDLM_CANCEL requests from clients.&lt;/li&gt;
	&lt;li&gt;As we have 747 GB of RAM, LDLM computes an LDLM pool limit of about 36 million locks (rough arithmetic below the list).&lt;/li&gt;
	&lt;li&gt;When the number of active locks exceeds this limit, the MDT cannot cope and the problem starts appearing.&lt;/li&gt;
	&lt;li&gt;If we tune ldlm on the client side to limit the number of locks, we do not see the problem.&lt;/li&gt;
	&lt;li&gt;Based on the MDT hardware we are using, the threshold is around 10 million locks.&lt;/li&gt;
&lt;/ul&gt;
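
&lt;p&gt;(Rough arithmetic behind that limit, assuming the server-side cap is still the LDLM_POOL_HOST_L heuristic of roughly 50 locks per MiB of cacheable memory: 747 GB is about 765,000 MiB, and 765,000 * 50 is about 38 million locks; since cacheable memory is somewhat less than total RAM, this lands in the same ballpark as the ~36 million limit we observed.)&lt;/p&gt;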


&lt;p&gt;For anybody hitting a similar issue, I would recommend checking the number of locks on the server side and limiting it on the client side via&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;ldlm.namespaces.*.lru_max_age&lt;/tt&gt; and &lt;tt&gt;ldlm.namespaces.*.lru_size&lt;/tt&gt;&lt;/p&gt;
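
&lt;p&gt;A minimal sketch of the client-side tuning (the numeric values here are made-up examples rather than our production settings, and the lru_max_age unit depends on the Lustre release; the dump above suggests milliseconds on this version):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# on each client: cap the lock LRU and shorten how long unused locks are kept
lctl set_param ldlm.namespaces.*.lru_size=2000
lctl set_param ldlm.namespaces.*.lru_max_age=600000

# on the MDS: check that the server-side lock count drops afterwards
lctl get_param ldlm.namespaces.mdt-*.lock_count&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;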

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Is this a scalability issue in the LDLM code? Too many active locks?&lt;br/&gt;
 Or do you think the problem is elsewhere?&lt;/p&gt;</comment>
                            <comment id="254956" author="ys" created="Wed, 18 Sep 2019 08:55:18 +0000"  >&lt;p&gt;Hi, Aurelien,&lt;/p&gt;

&lt;p&gt;Would it be possible to share the vmcore with us? Or, if not, just paste the output of &apos;foreach bt&apos; from crash. Many thanks for your help.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng&lt;/p&gt;</comment>
                            <comment id="255047" author="degremoa" created="Thu, 19 Sep 2019 15:50:59 +0000"  >&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/33515/33515_foreach_bt.log&quot; title=&quot;foreach_bt.log attached to LU-11305&quot;&gt;foreach_bt.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It is not easy to provide you with a vmcore, but here is a &apos;foreach bt&apos; log. (This is a new occurrence of the problem, so do not compare this output with the exact output I posted above.)&lt;/p&gt;</comment>
                            <comment id="341527" author="ys" created="Tue, 26 Jul 2022 03:54:42 +0000"  >&lt;p&gt;Please feel free to reopen it once have new info.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="33515" name="foreach_bt.log" size="1523071" author="degremoa" created="Thu, 19 Sep 2019 15:49:45 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>ldlm</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i001hz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>