<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:01:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6643] write hang up with small max_cached_mb</title>
                <link>https://jira.whamcloud.com/browse/LU-6643</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Running multiple WRITEs at the same time with a small max_cached_mb results in a hang.&lt;br/&gt;
According to my investigation, this happens because the WRITEs consume all the LRU slots at once, yet none of them holds enough LRU slots to start its I/O. To make matters worse, no WRITE releases the LRU slots it has reserved until it completes its I/O. As a result, all the WRITEs wait for each other forever.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 4896   TASK: ffff880c35f1e040  CPU: 3   COMMAND: &quot;dd&quot;
 #0 [ffff880bb9a33788] schedule at ffffffff81528162
 #1 [ffff880bb9a33850] osc_page_init at ffffffffa0b1392d [osc]
 #2 [ffff880bb9a338f0] lov_page_init_raid0 at ffffffffa0b7d481 [lov]
 #3 [ffff880bb9a33960] lov_page_init at ffffffffa0b740c1 [lov]
 #4 [ffff880bb9a33970] cl_page_alloc at ffffffffa0f9b40a [obdclass]
 #5 [ffff880bb9a339d0] cl_page_find at ffffffffa0f9b79e [obdclass]
 #6 [ffff880bb9a33a30] ll_write_begin at ffffffffa18229ec [lustre]
 #7 [ffff880bb9a33ab0] generic_file_buffered_write at ffffffff81120703
 #8 [ffff880bb9a33b80] __generic_file_aio_write at ffffffff81122160
 #9 [ffff880bb9a33c40] vvp_io_write_start at ffffffffa1833f3e [lustre]
#10 [ffff880bb9a33ca0] cl_io_start at ffffffffa0f9d63a [obdclass]
#11 [ffff880bb9a33cd0] cl_io_loop at ffffffffa0fa11c4 [obdclass]
#12 [ffff880bb9a33d00] ll_file_io_generic at ffffffffa17d75c4 [lustre]
#13 [ffff880bb9a33e20] ll_file_aio_write at ffffffffa17d7d13 [lustre]
#14 [ffff880bb9a33e80] ll_file_write at ffffffffa17d83a9 [lustre]
#15 [ffff880bb9a33ef0] vfs_write at ffffffff811893a8
#16 [ffff880bb9a33f30] sys_write at ffffffff81189ca1
#17 [ffff880bb9a33f80] tracesys at ffffffff8100b288 (via system_call)
    RIP: 0000003c990db790  RSP: 00007fffdc56e778  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8100b288  RCX: ffffffffffffffff
    RDX: 0000000000400000  RSI: 00007f4ed96db000  RDI: 0000000000000001
    RBP: 00007f4ed96db000   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 0000000000402003  R11: 0000000000000246  R12: 00007f4ed96dafff
    R13: 0000000000000000  R14: 0000000000400000  R15: 0000000000400000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;My solution is to let exactly one WRITE ignore the LRU limit. That one WRITE can then proceed, and we can expect it to release some LRU slots so that the next one can proceed (or another WRITE becomes the next &quot;privileged&quot; WRITE).&lt;/p&gt;

&lt;p&gt;I know it&apos;s a somewhat dirty fix, but I think it is better than having all the WRITEs hang. The &quot;privileged&quot; WRITE exceeds max_cached_mb only by its own I/O size, which is a smaller problem than a hang.&lt;/p&gt;
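
&lt;p&gt;As a rough illustration only (this is a toy model, not Lustre code and not the attached patch), the deadlock and the &quot;privileged&quot; WRITE idea can be sketched as below. The function name simulate, its parameters, and the round-robin, one-slot-at-a-time reservation order are all assumptions made for the sketch.&lt;/p&gt;

```python
def simulate(total_slots, writers, pages_per_io, allow_privileged):
    """Toy model: writers reserve LRU slots one at a time, round-robin.

    A writer can start (and finish) its I/O only once it holds
    pages_per_io slots; its slots are released only at that point.
    Returns (number of writers that finished, peak slots over the limit).
    All names here are hypothetical; this is not the Lustre implementation.
    """
    free = total_slots
    held = [0] * writers   # slots each writer currently holds
    over = [0] * writers   # of those, how many were taken beyond the limit
    done = [False] * writers
    privileged = None      # at most one writer may ignore the limit
    peak_over = 0
    progressed = True
    while progressed and not all(done):
        progressed = False
        for w in range(writers):
            if done[w]:
                continue
            if free > 0:
                free -= 1
            elif allow_privileged and privileged in (None, w):
                # No free slot: let exactly one writer exceed the limit.
                privileged = w
                over[w] += 1
                peak_over = max(peak_over, over[w])
            else:
                continue   # blocked, waiting for someone to release slots
            held[w] += 1
            progressed = True
            if held[w] == pages_per_io:
                # I/O completes: give back only the slots that were within the limit.
                free += held[w] - over[w]
                done[w] = True
                if privileged == w:
                    privileged = None
    return sum(done), peak_over


if __name__ == "__main__":
    print(simulate(8, 4, 4, allow_privileged=False))
    print(simulate(8, 4, 4, allow_privileged=True))
```

&lt;p&gt;In this model, with 8 slots, 4 writers and 4 slots needed per I/O, no writer finishes when the limit is strict, while all four finish when one writer at a time may exceed it, and the overshoot never exceeds one I/O size.&lt;/p&gt;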

&lt;p&gt;BTW, you can reproduce the situation easily by setting max_cached_mb to a small value such as 4 and running many dd commands (or similar) at the same time.&lt;/p&gt;

&lt;p&gt;I attached the backtraces taken in this situation.&lt;/p&gt;</description>
                <environment></environment>
        <key id="30364">LU-6643</key>
            <summary>write hang up with small max_cached_mb</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="nozaki">Hiroya Nozaki</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Tue, 26 May 2015 06:26:02 +0000</created>
                <updated>Mon, 22 Jan 2018 22:45:05 +0000</updated>
                            <resolved>Mon, 22 Jan 2018 22:45:05 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="116356" author="gerrit" created="Tue, 26 May 2015 06:50:10 +0000"  >&lt;p&gt;Hiroya Nozaki (nozaki.hiroya@jp.fujitsu.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/14932&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14932&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6643&quot; title=&quot;write hang up with small max_cached_mb&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6643&quot;&gt;&lt;del&gt;LU-6643&lt;/del&gt;&lt;/a&gt; llite: write hang up with small max_cached_mb&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 29997fd8157dc5b293db17f460583fc76b63c361&lt;/p&gt;</comment>
                            <comment id="116406" author="jay" created="Tue, 26 May 2015 16:30:10 +0000"  >&lt;p&gt;hmm.. if extra slots are allowed under some situation, why do you set that pathological max_cached_mb in the first place? &lt;/p&gt;

&lt;p&gt;Anyway, if this really needs fixing, coo_page_init() should take a parameter (or an extra flag in cl_page) to tell OSC how to handle the situation if there are no LRU slots. For readahead, it isn&apos;t necessary to sleep waiting for LRU slots if they run out; the same policy can be applied to write with non-empty write queue on the LLITE layer.&lt;/p&gt;</comment>
                            <comment id="116486" author="nozaki" created="Wed, 27 May 2015 08:16:24 +0000"  > &lt;blockquote&gt;
&lt;p&gt;if extra slots are allowed under some situation, why do you set that pathological max_cached_mb in the first place? &lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Precisely.&lt;br/&gt;
4 MiB is an extreme case, but the situation can also be reproduced with 64 MiB, 128 MiB and more if multiple writes are running. At my company I developed a feature that improves single-I/O performance by using multiple worker threads in the llite layer, which is why I sometimes have to face and deal with this problem; this patch was a kind of workaround at the time. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the same policy can be applied to write with non-empty write queue on the LLITE layer.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;OK, I&apos;ll check it, thanks !&lt;/p&gt;</comment>
                            <comment id="130573" author="adilger" created="Fri, 16 Oct 2015 00:06:38 +0000"  >&lt;p&gt;Could you please retest this with the latest master, since it appears this was fixed with &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5108&quot; title=&quot;osc: Performance tune for LRU&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5108&quot;&gt;&lt;del&gt;LU-5108&lt;/del&gt;&lt;/a&gt; osc: Performance tune for LRU&quot;.  If the problem is gone, please abandon your patch.&lt;/p&gt;</comment>
                            <comment id="130981" author="nozaki" created="Wed, 21 Oct 2015 06:02:23 +0000"  >&lt;p&gt;OK, I abandoned the patch here.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="24834">LU-5108</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="17953" name="foreach_dd_bt.log" size="27375" author="nozaki" created="Tue, 26 May 2015 06:26:03 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxe5z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>