<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:07:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14221] Client hangs when using DoM with a fixed mdc lru_size</title>
                <link>https://jira.whamcloud.com/browse/LU-14221</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After enabling DoM and beginning to use one of our file systems more heavily recently, we discovered a bug seemingly related to locking.&lt;/p&gt;

&lt;p&gt;Basically, with any fixed `lru_size`, everything will work normally until the number of locks hits the `lru_size`. From that point, everything will hang until the `lru_max_age` is hit, at which point it will clear the locks and move on until filling again. We confirmed this by setting the number of locks pretty low, then setting a low (10s) `lru_max_age`, and kicking off a tar extraction. The tar would extract until the `lock_count` hit our `lru_size` value (basically 1 for 1 with the number of files), then hang for 10s, then continue with another batch after the locks had been cleared. The same behavior can be replicated by letting it hang and then running `lctl set_param ldlm.namespaces.*mdc*.lru_size=clear`, which will free up the process temporarily as well.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Our current workaround is to set `lru_size` to 0 and set the `lru_max_age` to 30s to keep the number of locks to a manageable level.&lt;/p&gt;
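For reference, the workaround described above corresponds to something like the following sketch (the `*mdc*` namespace glob is an assumption; the exact namespace name depends on the filesystem and mount):

```shell
# Use dynamic LRU sizing (0 = dynamic) so lock enqueue no longer stalls
# against a fixed cap
lctl set_param ldlm.namespaces.*mdc*.lru_size=0
# Age out unused locks quickly (30s) to keep the per-client lock count
# at a manageable level
lctl set_param ldlm.namespaces.*mdc*.lru_max_age=30s
```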

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;This appears to only occur on our SLES clients. RHEL clients running the same Lustre version encounter no such problems. This may be due to the kernel version on SLES (4.12.14-197) vs. RHEL (3.10.0-1160).&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;James believes this may be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11518&quot; title=&quot;lock_count is exceeding lru_size&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11518&quot;&gt;&lt;del&gt;LU-11518&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;lru_size and lock_count while it&apos;s stuck:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;lctl get_param ldlm.namespaces.*.lru_size&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lru_size=200&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;lctl get_param ldlm.namespaces.*.lock_count&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lock_count=201&lt;/tt&gt;&lt;/p&gt;
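A simple way to watch these counters while reproducing is a polling loop like the sketch below (the `*mdc*` namespace glob is an assumption; adjust it to the actual mdc namespace name):

```shell
# Poll the mdc lock counters once per second while the tar extraction runs;
# in the broken state, lock_count sits pinned at (or just above) lru_size
while sleep 1; do
    lctl get_param ldlm.namespaces.*mdc*.lru_size \
                   ldlm.namespaces.*mdc*.lock_count
done
```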

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Process stack while it&apos;s stuck:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0ad1932&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_set_wait+0x362/0x700 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0ad1d57&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_queue_wait+0x87/0x230 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0ab7217&amp;gt;&amp;#93;&lt;/span&gt; ldlm_cli_enqueue+0x417/0x8f0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0a6105d&amp;gt;&amp;#93;&lt;/span&gt; mdc_enqueue_base+0x3ad/0x1990 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0a62e38&amp;gt;&amp;#93;&lt;/span&gt; mdc_intent_lock+0x288/0x4c0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0bf29ca&amp;gt;&amp;#93;&lt;/span&gt; lmv_intent_lock+0x9ca/0x1670 &lt;span class=&quot;error&quot;&gt;&amp;#91;lmv&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0cfea99&amp;gt;&amp;#93;&lt;/span&gt; ll_layout_intent+0x319/0x660 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d09fe2&amp;gt;&amp;#93;&lt;/span&gt; ll_layout_refresh+0x282/0x11d0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d47c73&amp;gt;&amp;#93;&lt;/span&gt; vvp_io_init+0x233/0x370 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa085d4d1&amp;gt;&amp;#93;&lt;/span&gt; cl_io_init0.isra.15+0xa1/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa085d641&amp;gt;&amp;#93;&lt;/span&gt; cl_io_init+0x41/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa085fb64&amp;gt;&amp;#93;&lt;/span&gt; cl_io_rw_init+0x104/0x200 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d02c5b&amp;gt;&amp;#93;&lt;/span&gt; ll_file_io_generic+0x2cb/0xb70 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0d03825&amp;gt;&amp;#93;&lt;/span&gt; ll_file_write_iter+0x125/0x530 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81214c9b&amp;gt;&amp;#93;&lt;/span&gt; __vfs_write+0xdb/0x130&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81215581&amp;gt;&amp;#93;&lt;/span&gt; vfs_write+0xb1/0x1a0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81216ac6&amp;gt;&amp;#93;&lt;/span&gt; SyS_write+0x46/0xa0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81002af5&amp;gt;&amp;#93;&lt;/span&gt; do_syscall_64+0x75/0xf0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8160008f&amp;gt;&amp;#93;&lt;/span&gt; entry_SYSCALL_64_after_hwframe+0x42/0xb7&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffffffffff&amp;gt;&amp;#93;&lt;/span&gt; 0xffffffffffffffff&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;I can reproduce and provide any other debug data as necessary.&lt;/p&gt;</description>
                <environment></environment>
        <key id="62004">LU-14221</key>
            <summary>Client hangs when using DoM with a fixed mdc lru_size</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="nilesj">Jeff Niles</reporter>
                        <labels>
                            <label>ORNL</label>
                    </labels>
                <created>Tue, 15 Dec 2020 19:37:14 +0000</created>
                <updated>Tue, 1 Feb 2022 19:43:41 +0000</updated>
                                            <version>Lustre 2.12.5</version>
                    <version>Lustre 2.12.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="287620" author="pjones" created="Tue, 15 Dec 2020 19:38:59 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="287621" author="nilesj" created="Tue, 15 Dec 2020 19:39:06 +0000"  >&lt;p&gt;Forgot to mention that James is currently building a 2.14 based client (incorporates the patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11518&quot; title=&quot;lock_count is exceeding lru_size&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11518&quot;&gt;&lt;del&gt;LU-11518&lt;/del&gt;&lt;/a&gt;) to test with. I&apos;ll update with results once that&apos;s complete.&lt;/p&gt;</comment>
                            <comment id="287656" author="adilger" created="Wed, 16 Dec 2020 01:11:00 +0000"  >&lt;p&gt;There was a recent landing of patch &lt;a href=&quot;https://review.whamcloud.com/36903&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36903&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10664&quot; title=&quot;DoM: make DoM lock enqueue non-blocking&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10664&quot;&gt;&lt;del&gt;LU-10664&lt;/del&gt;&lt;/a&gt; dom: non-blocking enqueue for DOM locks&lt;/tt&gt;&quot; and patch &lt;a href=&quot;https://review.whamcloud.com/34858&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34858&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12296&quot; title=&quot;ll_dom_lock_cancel() should zero kms attribute&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12296&quot;&gt;&lt;del&gt;LU-12296&lt;/del&gt;&lt;/a&gt; llite: improve ll_dom_lock_cancel&lt;/tt&gt;&quot; to master which &lt;em&gt;may&lt;/em&gt; help this situation, but I&apos;m not sure.  &lt;/p&gt;

&lt;p&gt;That said, it is possible that the MDT DLM LRU is getting full while there are still dirty pages under the MDT locks, and the next lock enqueue has to block while the dirty data is flushed to the MDS before a new lock can be granted.  That would definitely be more likely if the LRU size is small, and that isn&apos;t something that we have been testing.&lt;/p&gt;

&lt;p&gt;As for possible causes of dirty data under the locks, it seems possible that the usage pattern of DoM (i.e. small files that are below a single RPC in size) means that the RPC generation engine does not submit the writes to the MDT in a timely manner, preferring to wait in case more data is written to the file.  It might be better to more aggressively generate the write RPC for DoM files shortly after close so that the DLM locks do not linger on the client with dirty data in memory.&lt;/p&gt;</comment>
                            <comment id="287669" author="nilesj" created="Wed, 16 Dec 2020 04:13:02 +0000"  >&lt;p&gt;Hey Andreas, thanks for the response!&lt;/p&gt;

&lt;p&gt;Wanted to provide some more details after testing today. We have a reproducer (small file based workload) that we&apos;ve been running to easily trigger this issue. In a non-DoM directory, it takes right about 20 minutes to complete. With the 2.12.5 and 2.12.6 clients in a DoM directory, that reproducer would never finish when left overnight (12+ hours) at an `lru_size` of 200 (our tunable from an older system).&lt;/p&gt;

&lt;p&gt;After building a 2.14 client, that same reproducer with an `lru_size` of 200 actually completed, but with a time of 223 minutes. Pretty rough for performance, but this isn&apos;t a benchmark, and it at least completed. After this, I set the `lru_size` to 2000, which we&apos;re using on some other clients that we have. I would consider this a more reasonable tuning anyway. With this, the reproducer completes in 20 minutes, identical to the non-DoM directory. Since it&apos;s a small file workload, a speedup would be ideal, but this at least isn&apos;t a failure or loss of performance. After the success of the `lru_size=2000`, I wanted to baseline against performance with `lru_size=0`, so I ran that, and it mirrors the same 20 minute result with a peak lock count around 36,000.&lt;/p&gt;

&lt;p&gt;Bottom line: something between 2.12.6 and 2.14 fixed the issue at least; I guess we will need to work backward to identify what that something is now. I&apos;ll talk with James about pulling the patches you identified into our 2.12.6 build.&lt;/p&gt;

&lt;p&gt;You mention issues when the LRU size is small; what do you consider small in current Lustre? The manual states that 100/core is a good starting point for tuning it, but at 128 cores/node, we&apos;re looking at an LRU size of 12,800, which across a large number of clients seems like it&apos;ll put a large memory strain on the MDSs. Would love to hear your thoughts on tuning LRU size.&lt;/p&gt;</comment>
                            <comment id="287685" author="adilger" created="Wed, 16 Dec 2020 09:49:38 +0000"  >&lt;p&gt;I&apos;m pretty sure I wrote the 100 locks/core recommendation many years ago, when there were 2-4 cores in compute nodes...  I agree that &lt;tt&gt;lru_size=2000&lt;/tt&gt; is pretty reasonable even if you have a large number of clients (e.g. 10k), as long as the MDS nodes have enough RAM, since it is unlikely that every client will have that many locks on every MDT at the same time.&lt;/p&gt;

&lt;p&gt;Using dynamic LRU size (&lt;tt&gt;lru_size=0&lt;/tt&gt;) is possible, and the MDS &lt;em&gt;should&lt;/em&gt; provide back-pressure on the clients when it is getting low on memory, but since it is a dynamic system it may not always work in the optimal way.  Having 36k locks on a single client for a short time is not unreasonable, so long as &lt;b&gt;all&lt;/b&gt; clients don&apos;t do this at the same time.  It would be useful/interesting to know how quickly the number of locks on the client(s) dropped after the job completed?&lt;/p&gt;

&lt;p&gt;As for performance, it would be useful to calculate what the file_size * count / runtime = bandwidth is for the file IO, and how that compares to the MDS bandwidth.  While DoM can speed small file IO to some extent, often the aggregate bandwidth of the OSSes is equivalent to the bandwidth of the MDSes.  That said, one of the significant, but often unseen, benefits of DoM is that shifting the small file IOPS off of the HDD OSTs will avoid contention with large file IO. That reduces access latency for small files when there is a concurrent large-file IO workload, and can also improve the performance of the large-file IO because of reduced HDD OST seeking and RPCs (see e.g. the DoM presentation at a previous LUG).&lt;/p&gt;</comment>
                            <comment id="287740" author="nilesj" created="Wed, 16 Dec 2020 16:49:30 +0000"  >&lt;blockquote&gt;&lt;p&gt;It would be useful/interesting to know how quickly the number of locks on the client(s) dropped after the job completed?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Agreed. I didn&apos;t have this data, so I just ran a test. On the unpatched 2.12.6 client (the broken one) with `lru_size=0`, it seems like it&apos;s not releasing them at all. This was checked at 1, 5, and 10 minutes post-job. I guess this was expected and tracks with the issues seen while trying to run a job (locks not clearing, ever). Running on the 2.14 client that James built, I see the same thing as well though, which is a bit odd.&lt;/p&gt;

&lt;p&gt;As part of our workaround, we&apos;ve set the `lru_max_age` rather low (30s in this case). This is doing what we&apos;d expect and helping to clear things up faster. Once we&apos;re patched and running, we plan to adjust this upward, but I&apos;m not sure what a good setting is here. If we were to use dynamic lru size, is there a sane default there, or is the typical 65 minutes okay?&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;As for performance, it would be useful to calculate what the file_size * count / runtime = bandwidth is for the file IO, and how that compares to the MDS bandwidth.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;We&apos;ve done this calculation in the past with some benchmarks, and it always seems like we run out of IOPS before we run out of bandwidth. As an example, at a 256k DoM size (and writing the entire DoM size), 50,000 operations/second should only be in the ~12GB/s bandwidth range.&lt;/p&gt;</comment>
                            <comment id="287789" author="adilger" created="Wed, 16 Dec 2020 20:37:36 +0000"  >&lt;blockquote&gt;
&lt;p&gt;If we were to use dynamic lru size, is there a sane default there, or is the typical 65 minutes okay?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The 65 minute lock timeout was chosen because this allows locks to remain on the client across hourly checkpoints.  However, it isn&apos;t clear if that is worthwhile for a small number of locks vs. flushing the unused locks more quickly.  I think anything around 5-10 minutes would allow active locks to be reused, and would keep the client from accumulating too many unused locks, but it depends heavily on your workflows.  The dynamic LRU will expire old locks when the sum(lock_ages) gets too large, so it &lt;em&gt;should&lt;/em&gt; keep this in check.&lt;/p&gt;</comment>
                            <comment id="287998" author="tappro" created="Fri, 18 Dec 2020 13:08:56 +0000"  >&lt;p&gt;What sort of test or IO pattern was used when the MDT hung? Is there a shared file being accessed or many files, how many clients, and so on? I&apos;d try to run it locally.&lt;br/&gt;
 In the current state you can try to decrease lock contention with this command:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
 lctl set_param mdt.*.dom_lock=trylock
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which takes the DoM lock at open only optionally. And as mentioned by Andreas, the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10664&quot; title=&quot;DoM: make DoM lock enqueue non-blocking&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10664&quot;&gt;&lt;del&gt;LU-10664&lt;/del&gt;&lt;/a&gt; is helpful, though it may not apply cleanly; I will check.&lt;/p&gt;</comment>
                            <comment id="288010" author="nilesj" created="Fri, 18 Dec 2020 15:39:41 +0000"  >&lt;p&gt;Small file creates; you can probably reproduce locally with a Linux Kernel source tarball extract or similar. We were seeing it on tar extracts here. This is many small files rather than shared file. We can reproduce with a single client. The MDS itself never hangs, only the client.&lt;/p&gt;

&lt;p&gt;Do you mind going into a bit more detail as to what dom_lock=trylock does? I can&apos;t seem to find much info on it.&lt;/p&gt;</comment>
                            <comment id="288020" author="tappro" created="Fri, 18 Dec 2020 16:06:56 +0000"  >&lt;p&gt;Jeff, with a DoM file, the server can return the DoM IO lock in the open reply, in advance. The default option is &apos;always&apos;, which means file open will always take that lock; the other options are &apos;trylock&apos; - take the DoM lock only if there are no conflicting locks - and &apos;never&apos; - no DoM lock at open, i.e. the same behavior as for OST files.&lt;/p&gt;

&lt;p&gt;That &apos;trylock&apos; can be helpful with the shared file access I was thinking of, because it prevents ping-pong lock taking for the same file from different clients, but that doesn&apos;t look helpful for the untar case.&lt;/p&gt;

&lt;p&gt;Is that still true that problem exists with SLES client only?&lt;/p&gt;</comment>
                            <comment id="288046" author="nilesj" created="Fri, 18 Dec 2020 17:55:15 +0000"  >&lt;p&gt;Thanks for the info. Yeah, I don&apos;t think it would affect the case we saw the issue with. Do you think it&apos;s a good idea to turn it on anyway, maybe for different workloads?&lt;/p&gt;

&lt;p&gt;I believe the &quot;SLES only&quot; issue was more that we had some tunings that we didn&apos;t have mirrored on the SLES clients. I&apos;ve since updated the RHEL clients, so I may not be able to confirm.&lt;/p&gt;</comment>
                            <comment id="288050" author="spitzcor" created="Fri, 18 Dec 2020 18:58:42 +0000"  >&lt;p&gt;Hi, &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=nilesj&quot; class=&quot;user-hover&quot; rel=&quot;nilesj&quot;&gt;nilesj&lt;/a&gt;.  Could you please clarify something.  You said:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt; With the 2.12.5 and 2.12.6 clients in a DoM directory, that reproducer would never finish when left overnight (12+ hours) at an `lru_size` of 200 (our tunable from an older system).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;But also,&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt; Bottom line: something between 2.12.6 and 2.14 fixed the issue at least&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;But, I don&apos;t quite see how you made that determination (that something was wrong with 2.12 LTS).&lt;br/&gt;
It actually sounds to me that it was fine because you also said:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I wanted to baseline against performance with `lru_size=0`, so I ran that, and it mirrors the same 20 minute result with a peak lock count around 36,000&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;To be clear, have you run the experiment with default lru_size and lru_max_age?  Does the LTS client behave poorly?  Or, does it match the non-DoM performance?&lt;/p&gt;</comment>
                            <comment id="288058" author="nilesj" created="Fri, 18 Dec 2020 19:48:32 +0000"  >&lt;p&gt;Hey Cory,&lt;/p&gt;

&lt;p&gt;When set to a&#160;&lt;b&gt;fixed&lt;/b&gt; LRU size, a 2.12.6 client will complete write actions in a DoM directory in a time about equal to (number of files to process / lru_size) * lru_max_age. Essentially it completes work as the max age is hit, 200 (or whatever number) tasks at a time.&lt;/p&gt;
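To illustrate that estimate with hypothetical numbers (not taken from this ticket), a quick shell calculation:

```shell
# Rough completion-time estimate for the broken fixed-lru behavior:
#   time ≈ (files / lru_size) * lru_max_age
# Hypothetical values for illustration only:
files=20000
lru_size=200
lru_max_age=10   # seconds

echo $(( files / lru_size * lru_max_age ))   # prints 1000 (seconds)
```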

&lt;p&gt;When set to a &lt;b&gt;dynamic&lt;/b&gt; LRU size (0), a 2.12.6 client will work as expected, except that it will leave every single lock open until they hit the max_age limit (by default 65 minutes). Obviously this is less than ideal for a large scale system with a bunch of clients all at 50k locks. This is the basis of our workaround: set a dynamic LRU size and set a max_age of 30s or so to time them out quickly. Not ideal, but it&apos;ll work for now.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;The determination that something was fixed between 2.12.6 and 2.14 was based on our reproducer finishing in a normal amount of time with a&#160;&lt;b&gt;fixed&lt;/b&gt; LRU size (2000) on 2.14, rather than in (number of files to process / lru_size) * lru_max_age, as we were seeing with 2.12.6. Since I don&apos;t think I said it above, even with lru_size=2000 on 2.12.6, we were still seeing issues where it would process about 2000 files, hang until those 2000 locks hit the max_age value, and then proceed. The issue isn&apos;t just limited to low lru_size settings.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;To be clear, have you run the experiment with default lru_size and lru_max_age? Does the LTS client behave poorly? Or, does it match the non-DoM performance?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Yes. A 2.12.6 LTS client works great with default (0) lru_size, except that it keeps all the locks open until max_age. This LU is specifically about the bug as it relates to fixed mdc lru_size settings.&lt;/p&gt;</comment>
                            <comment id="288062" author="spitzcor" created="Fri, 18 Dec 2020 20:27:30 +0000"  >&lt;p&gt;Thanks for the clarification.  Good sleuthing too!&lt;br/&gt;
May I ask what harm there is with the large (default) lru_max_age?  You say that it is bad that lots of clients may have lots of locks.  Is the server not able to handle the lock pressure?  Does back pressure not get applied to the clients?  Are the servers unable to revoke locks upon client request in a timely manner?  I guess I just don&apos;t understand why it is inherently bad to use the defaults.  Could you explain more?  Thanks!&lt;/p&gt;</comment>
                            <comment id="288064" author="nilesj" created="Fri, 18 Dec 2020 21:06:26 +0000"  >&lt;p&gt;Unfortunately that&apos;s where my knowledge ends. I do know that a large number of locks puts memory pressure on the MDSs, but from Andreas&apos; comment above, it seems like it &lt;em&gt;should&lt;/em&gt; start applying back pressure to the clients at some point?&lt;/p&gt;

&lt;p&gt;Historically, on our large systems we&apos;ve had to limit the lru_size to prevent overload issues with the MDS. This was the info that we were operating off of, but maybe that&apos;s not the case any more.&lt;/p&gt;</comment>
                            <comment id="288301" author="tappro" created="Wed, 23 Dec 2020 15:10:39 +0000"  >&lt;p&gt;I am able to reproduce that issue on the initial 2.12.5 release with a 3.10 kernel RHEL client and also checked that all works with the latest 2.12.6 version. It seems there is a patch in between that fixed the issue. I will run&#160;&lt;tt&gt;git bisect&lt;/tt&gt;&#160;to find it, if that is what we need.&lt;/p&gt;

&lt;p&gt;With the latest 2.12.6 I have no problems with a fixed &lt;tt&gt;lru_size=100&lt;/tt&gt;, but maybe my testset is just not big enough.&lt;/p&gt;</comment>
                            <comment id="288304" author="nilesj" created="Wed, 23 Dec 2020 15:29:28 +0000"  >&lt;p&gt;Glad you&apos;re able to reproduce on 2.12.5. I do find it a bit odd that we experience problems with 2.12.6 while you don&apos;t; perhaps it&apos;s the larger dataset, like you mention. I think it would be beneficial to figure out what code changed to fix the issue for you in 2.12.6, as it may reveal why we still see issues. Probably not the highest priority work though.&lt;/p&gt;</comment>
                            <comment id="288317" author="simmonsja" created="Wed, 23 Dec 2020 18:12:24 +0000"  >&lt;p&gt;I think the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11518&quot; title=&quot;lock_count is exceeding lru_size&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11518&quot;&gt;&lt;del&gt;LU-11518&lt;/del&gt;&lt;/a&gt; work should resolve the rest of the problems.&lt;/p&gt;</comment>
                            <comment id="288385" author="tappro" created="Thu, 24 Dec 2020 11:14:03 +0000"  >&lt;p&gt;For anyone interested, patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11518&quot; title=&quot;lock_count is exceeding lru_size&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11518&quot;&gt;&lt;del&gt;LU-11518&lt;/del&gt;&lt;/a&gt; &lt;a href=&quot;https://review.whamcloud.com/41008&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/41008&lt;/a&gt;&#160;is the one solving problem in 2.12.6 for me. After it untar is not freezing anymore when lru_size has fixed size.&#160;&lt;/p&gt;</comment>
                            <comment id="291376" author="adilger" created="Fri, 5 Feb 2021 22:49:40 +0000"  >&lt;p&gt;Cory previously asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;May I ask what harm there is with the large (default) &lt;tt&gt;lru_max_age&lt;/tt&gt;? You say that it is bad that lots of clients may have lots of locks. Is the server not able to handle the lock pressure? Does back pressure not get applied to the clients? Are the servers unable to revoke locks upon client request in a timely manner? I guess I just don&apos;t understand why it is inherently bad to use the defaults. Could you explain more? Thanks!&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I think there are two things going on here.  Having a large &lt;tt&gt;lru_max_age&lt;/tt&gt; means that &lt;b&gt;unused&lt;/b&gt; locks (and potentially data cached under those locks) may linger on the client for a long time.  That consumes memory on the MDS and OSS for every lock that every client holds, which could probably be better used somewhere else.  Also, there is more work needed at recovery time if the MDS/OSS crashes to recover those locks.  Also, having a large number of locks on the client or server adds some overhead to all lock processing due to having more locks to deal with because of longer hash collision chains.&lt;/p&gt;

&lt;p&gt;There is the &quot;dynamic LRU&quot; code that has existed for many years to try and balance MDS lock memory usage vs. client lock requests, but I&apos;ve never really been convinced that it works properly (see e.g. &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7266&quot; title=&quot;Fix LDLM pool to make LRUR working properly&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7266&quot;&gt;LU-7266&lt;/a&gt; and related tickets).  I also think that when clients have so much RAM these days, it can cause a large number of locks to stay in memory for a long time until there is a sudden shortage of memory on the server, and the server only has limited mechanisms to revoke locks from the clients.  It can reduce the &quot;lock volume&quot; (part of the &quot;dynamic LRU&quot; functionality) but this is at best a &quot;slow burn&quot; that is intended (if working properly) to keep the steady-state locking traffic in check.  More recently, there was work done under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6529&quot; title=&quot;Server side lock limits to avoid unnecessary memory exhaustion&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6529&quot;&gt;&lt;del&gt;LU-6529&lt;/del&gt;&lt;/a&gt; &quot;&lt;tt&gt;Server side lock limits to avoid unnecessary memory exhaustion&lt;/tt&gt;&quot; to allow more direct reclaim of DLM memory on the server when it is under pressure.  We want to avoid the server cancelling locks that are actively in use by the client, but the server has no real idea about which locks the client is reusing, and which ones were only used once, so it does the best job it can with the information it has, but it is better if the client does a better job of keeping the number of locks under control.&lt;/p&gt;</comment>

&lt;p&gt;So there is definitely a balance between being able to cache locks and data on the client vs. sending more RPCs to the server and reducing memory usage on both sides.    That is why having a shorter &lt;tt&gt;lru_max_age&lt;/tt&gt; is useful, but longer term &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11509&quot; title=&quot;LDLM: replace lock LRU with improved cache algorithm&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11509&quot;&gt;LU-11509&lt;/a&gt; &quot;&lt;tt&gt;LDLM: replace lock LRU with improved cache algorithm&lt;/tt&gt;&quot; would improve the selection of which locks to keep cached on the client, and which (possibly newer, but use-once locks) should be dropped.  That is as much a research task as a development effort.&lt;/p&gt;</comment>
                            <comment id="324773" author="simmonsja" created="Tue, 1 Feb 2022 19:43:41 +0000"  >&lt;p&gt;Patch 41008 is ready to land.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="32543">LU-7266</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="53598">LU-11518</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="58604">LU-13413</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="29728">LU-6529</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="53583">LU-11509</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01h8n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>