<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:11:41 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-917] shared single file client IO submission with many cores starves OST DLM locks due to max_rpcs_in_flight</title>
                <link>https://jira.whamcloud.com/browse/LU-917</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt;, shared single-file IOR testing was run with 512 threads on 32 clients (16 cores per client) writing 128MB chunks to a file striped over 2 OSTs.  This testing showed clients timing out on DLM locks.  The threads on a single client write to disjoint parts of the file (i.e. each thread has its own DLM extent that is not adjacent to the extents written by other threads on that client).&lt;/p&gt;

&lt;p&gt;For example, a scaled-down version of this workload with 4 clients (A, B, C, D) against 2 OSTs (1, 2) has the write pattern:&lt;/p&gt;

&lt;p&gt;Client     ABCDABCDABCD...&lt;br/&gt;
OST        121212121212...&lt;/p&gt;

&lt;p&gt;While this IOR test is running, other tests are also running on different clients to create a very heavy IO load on the OSTs.&lt;/p&gt;

&lt;p&gt;It may be that DLM locks on the OST are not getting any IO requests sent to refresh the DLM locks:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;because the number of active DLM locks on the client for a single OST exceeds the number of RPCs in flight, some locks may be starved of the BRW RPCs that need to be sent under them to the OST to refresh the lock timeout&lt;/li&gt;
	&lt;li&gt;due to the IO ordering of the BRW requests on the client, it may be that all of the pages for the lower-offset extent are sent to the OST before the pages for a higher-offset extent are ever sent&lt;/li&gt;
	&lt;li&gt;the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Some solutions that might help this (individually, or in combination):&lt;br/&gt;
1. increase max_rpcs_in_flight to match the core count, but I think this is bad in the long run since it can dramatically increase the number of RPCs that each OST needs to handle at one time&lt;br/&gt;
2. always allow at least one BRW RPC in flight for each lock that is being canceled&lt;br/&gt;
3. prioritize ALL BRW RPCs for a blocked lock in advance of non-blocked BRW requests (e.g. like a high-priority request queue on the client)&lt;br/&gt;
4. both (2) and (3) may be needed in order to avoid starvation as the client core count increases&lt;/p&gt;</description>
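The starvation mechanism described above can be illustrated with a toy model (a hypothetical sketch only: the function names, sizes, and timings are made up and are not Lustre internals). When RPCs are dispatched in file-offset order and the client holds more locks than max_rpcs_in_flight, the highest-offset lock sees no I/O traffic until nearly every other page has been sent:

```python
# Toy model of the starvation described above. A client holds num_locks
# DLM extent locks, each with pages_per_lock dirty pages to flush via BRW
# RPCs, but may keep only max_rpcs_in_flight RPCs outstanding. RPCs are
# dispatched in file-offset order, so high-offset locks see no traffic
# until almost everything else has been sent.

def first_refresh_times(num_locks, pages_per_lock, max_rpcs_in_flight,
                        rpc_service_time=1.0):
    """Return, per lock, the time its first BRW RPC completes (the earliest
    moment the OST could refresh that lock's timeout)."""
    # Pending RPCs in offset order: lock 0's pages, then lock 1's, ...
    pending = [(lock, page) for lock in range(num_locks)
               for page in range(pages_per_lock)]
    first_done = {}
    t = 0.0
    while pending:
        batch = pending[:max_rpcs_in_flight]
        pending = pending[max_rpcs_in_flight:]
        t += rpc_service_time   # simplification: the whole batch completes together
        for lock, _ in batch:
            first_done.setdefault(lock, t)
    return first_done

refresh = first_refresh_times(num_locks=16, pages_per_lock=8,
                              max_rpcs_in_flight=8)
print(refresh[0], refresh[15])  # the last lock waits many service intervals
```

In this toy model the lowest-offset lock is refreshed after a single RPC service interval, while the highest-offset lock waits through every preceding batch; if that wait exceeds the DLM lock timeout, the client is evicted.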
                <environment></environment>
        <key id="12654">LU-917</key>
            <summary>shared single file client IO submission with many cores starves OST DLM locks due to max_rpcs_in_flight</summary>
                <type id="7" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/task_agile.png">Technical task</type>
                            <parent id="12519">LU-874</parent>
                                    <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 13 Dec 2011 04:14:39 +0000</created>
                <updated>Thu, 8 Feb 2018 18:21:26 +0000</updated>
                            <resolved>Thu, 8 Feb 2018 18:21:26 +0000</resolved>
                                    <version>Lustre 2.1.0</version>
                    <version>Lustre 2.2.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="30771" author="nrutman" created="Fri, 9 Mar 2012 14:13:02 +0000"  >&lt;p&gt;&quot;the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time&quot;&lt;/p&gt;

&lt;p&gt;You mean the HP thread can&apos;t handle multiple cancel callbacks before some time out?  I was wondering why we don&apos;t reserve more threads for HP reqs, or, alternately, limit the number of threads doing any 1 op (i.e. no more than 75% of threads can be doing ldlm ops, and no more than 75% of threads can be doing io ops), so that we &quot;balance&quot; the load a little better and don&apos;t get stuck in these corner cases.&lt;/p&gt;</comment>
                            <comment id="30775" author="morrone" created="Fri, 9 Mar 2012 17:21:49 +0000"  >&lt;p&gt;Nathan, the issue is that the client is only allowed a fixed number of outstanding RPCs to the OST.  Let&apos;s call that N.  Now let&apos;s assume that the OST is processing RPCs very slowly (minutes each), but otherwise operating normally.&lt;/p&gt;

&lt;p&gt;If the OST revokes N+1 locks from the client now, the client stands a real risk of being evicted.  In order to avoid eviction the client must constantly have rpcs enqueued on the server for EACH of the revoked locks.  (We fixed some things in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt; to help make even that work.)  Otherwise one of the locks will time out, and the client will be evicted.&lt;/p&gt;

&lt;p&gt;This ticket is looking at ways to alleviate the problem from the client side.  I do worry that these client side solutions increase the load on a server that is already heavily loaded.&lt;/p&gt;

&lt;p&gt;Ultimately, we need to look at making the OST smarter whether or not we decide that client side changes have value.  The OST really needs to assume that if the client is making progress on other revoked locks, then it should extend all lock timers for that client in good faith.&lt;/p&gt;</comment>
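The N+1-revoked-locks scenario above also suggests why proposals (2) and (3) in the description could help: with the same number of RPC slots, scheduling one RPC per revoked lock in round-robin refreshes every lock quickly, while plain offset-order dispatch starves the last lock. A toy comparison (a hypothetical sketch; the scheduling policies and numbers are assumptions, not Lustre behaviour):

```python
# Toy comparison of dispatch orders when the OST has revoked more locks
# than the client may have RPCs in flight.

def worst_first_refresh(schedule, n_locks, batch, service_time=1.0):
    """Time until the slowest lock gets its first completed RPC, given a
    list of lock ids dispatched batch-at-a-time in order."""
    t, seen = 0.0, {}
    for i in range(0, len(schedule), batch):
        t += service_time
        for lock in schedule[i:i + batch]:
            seen.setdefault(lock, t)
        if len(seen) == n_locks:
            return t

LOCKS, PAGES, N = 5, 6, 4      # 5 revoked locks, N = max_rpcs_in_flight

# Offset order: all of lock 0's pages, then all of lock 1's, and so on.
fifo = [l for l in range(LOCKS) for _ in range(PAGES)]
# Round-robin: one RPC for each revoked lock before any lock's second RPC.
rr = [l for _ in range(PAGES) for l in range(LOCKS)]

print(worst_first_refresh(fifo, LOCKS, N))  # last lock waits a long time
print(worst_first_refresh(rr, LOCKS, N))    # every lock refreshed early
```

The contrast is only in scheduling order, not in total work: both schedules send the same RPCs, but only the round-robin order guarantees every revoked lock an early refresh.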
                            <comment id="30779" author="nrutman" created="Fri, 9 Mar 2012 18:41:57 +0000"  >&lt;p&gt;There are a few different issues here; I agree the rpcs_in_flight scenario seems to be one problem, but I was more interested in the limited-server-thread problem (even if it&apos;s not causing &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-874&quot; title=&quot;Client eviction on lock callback timeout &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-874&quot;&gt;&lt;del&gt;LU-874&lt;/del&gt;&lt;/a&gt;) because it is causing other problems as well.  For example, we&apos;re tracking a bug (MRP-455) where we experience cascading client evictions because all MDS threads are stuck pending ldlm enqueues, leaving no room for PING or CONNECT rpcs.  (That one is a direct result of a mishandled HP queue, but it made me realize we have no &quot;wiggle room&quot; in the code today.  As with all our bugs, we&apos;ll submit it upstream when we&apos;re done.) &lt;/p&gt;</comment>
                            <comment id="30782" author="morrone" created="Fri, 9 Mar 2012 21:22:53 +0000"  >&lt;p&gt;Why wait until you are done?  I&apos;d certainly like to be made aware of the problem and progress as you go along in a new ticket.&lt;/p&gt;</comment>
                            <comment id="30914" author="nrutman" created="Mon, 12 Mar 2012 18:15:14 +0000"  >&lt;p&gt;It&apos;s difficult to track progress in two different places; our primary tracker is our own internal Jira. &lt;/p&gt;</comment>
                            <comment id="30954" author="morrone" created="Mon, 12 Mar 2012 20:25:38 +0000"  >&lt;p&gt;Nathan, it really does the community a disservice to keep your issues secret.  Telling us an internal Xyratex ticket number is of no use to us.&lt;/p&gt;

&lt;p&gt;I can only imagine that working in secret like this would make it more difficult to get patches landed as well.  If outside developers aren&apos;t tapped into the discussion about the issue all along, it just increases the burden on you to present a complete and detailed explanation of both the problem and the solution.  Should there be a disagreement about approach, you may find that you&apos;ve wasted your time.&lt;/p&gt;

&lt;p&gt;LLNL has the same issues of dealing with multiple trackers.  It is just one that needs to be accepted, I think.  We use our internal tracker to discuss and track issues with admins and users, but keep most of the technical discussion in jira where the world can see it.&lt;/p&gt;</comment>
                            <comment id="30972" author="nrutman" created="Tue, 13 Mar 2012 00:29:24 +0000"  >&lt;p&gt;Chris, I appreciate your concerns here. There are good reasons why we must keep our bug tracking system internal: the privacy of our customers; our time tracking and billing systems; our requirement to track non-Lustre bugs as well.&lt;br/&gt;
Perhaps something could be set up to automatically mirror Lustre bug comments out to Whamcloud&apos;s system.  Please email me directly nathan_rutman@xyratex.com for further discussion on this topic and let&apos;s leave this poor bug alone &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

</comment>
                            <comment id="31924" author="nrutman" created="Thu, 22 Mar 2012 18:29:40 +0000"  >&lt;p&gt;Xyratex MRP-455 posted in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1239&quot; title=&quot;cascading client evictions&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1239&quot;&gt;&lt;del&gt;LU-1239&lt;/del&gt;&lt;/a&gt; with patch.&lt;/p&gt;</comment>
                            <comment id="220457" author="jay" created="Thu, 8 Feb 2018 18:21:26 +0000"  >&lt;p&gt;close old tickets&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw0uv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10219</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>