<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:46:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4856] osc_lru_reserve()) ASSERTION( atomic_read(cli-&gt;cl_lru_left) &gt;= 0 ) failed</title>
                <link>https://jira.whamcloud.com/browse/LU-4856</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The atomic_t used to count LRU entries is overflowing on systems with large memory configurations:&lt;/p&gt;

&lt;p&gt;LustreError: 22141:0:(osc_page.c:892:osc_lru_reserve()) ASSERTION(atomic_read(cli-&amp;gt;cl_lru_left) &amp;gt;= 0 ) failed:&lt;/p&gt;

&lt;p&gt;PID: 54214  TASK: ffff88fdef4e4100  CPU: 40  COMMAND: &quot;cat&quot;&lt;br/&gt;
 #3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823900&amp;#93;&lt;/span&gt; lbug_with_loc at ffffffffa07fedc3 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823920&amp;#93;&lt;/span&gt; osc_lru_reserve at ffffffffa0c2a28a &lt;span class=&quot;error&quot;&gt;&amp;#91;osc&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf08239a0&amp;#93;&lt;/span&gt; cl_page_alloc at ffffffffa09a7122 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf08239e0&amp;#93;&lt;/span&gt; cl_page_find0 at ffffffffa09a742d &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823a40&amp;#93;&lt;/span&gt; lov_page_init_raid0 at ffffffffa0cc0f21 &lt;span class=&quot;error&quot;&gt;&amp;#91;lov&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #8 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823aa0&amp;#93;&lt;/span&gt; cl_page_alloc at ffffffffa09a7122 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
 #9 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823ae0&amp;#93;&lt;/span&gt; cl_page_find0 at ffffffffa09a742d &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
#10 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823b40&amp;#93;&lt;/span&gt; ll_cl_init at ffffffffa0d74123 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
#11 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823bd0&amp;#93;&lt;/span&gt; ll_readpage at ffffffffa0d74485 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
#12 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823c00&amp;#93;&lt;/span&gt; do_generic_file_read at ffffffff810fa39e&lt;br/&gt;
#13 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823c80&amp;#93;&lt;/span&gt; generic_file_aio_read at ffffffff810fad4c&lt;br/&gt;
#14 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823d40&amp;#93;&lt;/span&gt; vvp_io_read_start at ffffffffa0da2fb0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
#15 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823da0&amp;#93;&lt;/span&gt; cl_io_start at ffffffffa09af979 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
#16 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823dd0&amp;#93;&lt;/span&gt; cl_io_loop at ffffffffa09b3d33 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
#17 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823e00&amp;#93;&lt;/span&gt; ll_file_io_generic at ffffffffa0d49c32 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
#18 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823e70&amp;#93;&lt;/span&gt; ll_file_aio_read at ffffffffa0d4a3b3 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
#19 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823ec0&amp;#93;&lt;/span&gt; ll_file_read at ffffffffa0d4aec3 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
#20 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823f10&amp;#93;&lt;/span&gt; vfs_read at ffffffff8115b237&lt;br/&gt;
#21 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88fdf0823f40&amp;#93;&lt;/span&gt; sys_read at ffffffff8115b3a3&lt;/p&gt;

&lt;p&gt;In this case, the atomic_t (signed int) held:&lt;br/&gt;
crash&amp;gt; pd &lt;b&gt;(int&lt;/b&gt;)0xffff943de11780fc&lt;br/&gt;
$10 = -1506317746&lt;/p&gt;
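A hedged sketch of the arithmetic (the constants below are illustrative, not taken from the Lustre source): a signed 32-bit counter holds at most 2**31 - 1, about 2.1 billion pages, while 11 TB of 4 KiB pages is roughly 2.95 billion, so a counter sized from physical memory wraps negative.

```python
# Hedged illustration (not Lustre code): why a signed 32-bit atomic_t
# page counter such as cl_lru_left can wrap on very large memory systems.
INT32_MAX = 2**31 - 1          # largest value a signed 32-bit int can hold
PAGE_SIZE = 4096               # 4 KiB pages on x86_64

def pages(tib):
    """Number of PAGE_SIZE pages in `tib` tebibytes of RAM."""
    return tib * 2**40 // PAGE_SIZE

# 11 TB of physmem holds more pages than INT32_MAX, so any signed 32-bit
# counter derived from physical memory size is at risk of overflow.
print(pages(11))               # 2952790016
print(pages(11) > INT32_MAX)   # True
```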

&lt;p&gt;We&apos;ve triggered this specific problem with configurations down to 11TB of physmem.  A 10.5TB system can cat a small file without crashing.&lt;/p&gt;

&lt;p&gt;I noticed several other cases where page counts are handled using a signed int, and suspect anything more than 4TB is problematic.  The kernel itself is consistently using unsigned long for page counts on all architectures.&lt;/p&gt;</description>
                <environment></environment>
        <key id="24044">LU-4856</key>
            <summary>osc_lru_reserve()) ASSERTION( atomic_read(cli-&gt;cl_lru_left) &gt;= 0 ) failed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yujian">Jian Yu</assignee>
                                    <reporter username="schamp">Stephen Champion</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Thu, 3 Apr 2014 03:00:56 +0000</created>
                <updated>Thu, 1 Oct 2015 14:23:07 +0000</updated>
                            <resolved>Tue, 30 Sep 2014 13:32:42 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                    <version>Lustre 2.6.0</version>
                    <version>Lustre 2.4.2</version>
                                    <fixVersion>Lustre 2.7.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="80914" author="jay" created="Thu, 3 Apr 2014 03:26:53 +0000"  >&lt;p&gt;we can use atomic64_t instead.&lt;/p&gt;</comment>
                            <comment id="81403" author="schamp" created="Thu, 10 Apr 2014 23:26:16 +0000"  >&lt;p&gt;I&apos;ve been digging at this, trying to identify the changes required.  To support large memory systems, all global accounting of pages needs to be done with 64-bit types.  Just tracing usage of cfs_num_physpages (which cl_lru_left is derived from), the problem snowballs quickly, and affects almost every subsystem in Lustre.&lt;/p&gt;

&lt;p&gt;Some casting will be required, but it should not be a problem to use 32-bit counters for page vectors.  I doubt any networks support 8 TB transactions yet.&lt;/p&gt;</comment>
                            <comment id="83449" author="schamp" created="Wed, 7 May 2014 21:21:48 +0000"  >&lt;p&gt;I have been working on a patch against master to address easily identified overflow hazards.  This cascaded into lock management as well.&lt;/p&gt;

&lt;p&gt;I am about to give it a whirl on internal systems to make sure I didn&apos;t break anything, then allocate time on a system with 16T of memory to make sure it addresses the problem.  I won&apos;t be able to run acceptance on the large system anytime soon, but will do some basic functionality testing.&lt;/p&gt;

&lt;p&gt;I will also need to clean up the patch for coding standards.&lt;/p&gt;</comment>
                            <comment id="85391" author="schamp" created="Sat, 31 May 2014 01:32:21 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/10537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10537/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="85536" author="schamp" created="Tue, 3 Jun 2014 03:29:42 +0000"  >&lt;p&gt;I set up an i686 build environment and worked through the initial errors.&lt;/p&gt;

&lt;p&gt;The kernel does not implement atomic64_add_unless on this arch, so I&apos;ll have to find a way around this problem.  I will push the updated patch for feedback, but there will certainly be additional revisions, possibly major.&lt;/p&gt;
</comment>
                            <comment id="90069" author="jfc" created="Fri, 25 Jul 2014 17:02:13 +0000"  >&lt;p&gt;Hello Stephen,&lt;br/&gt;
Do you want us to keep this ticket open?&lt;/p&gt;

&lt;p&gt;Many thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="90075" author="schamp" created="Fri, 25 Jul 2014 17:52:18 +0000"  >&lt;p&gt;Yes please.  The patch needs to have i686 build problems addressed, and I need to sync up with everyone who offered comments.  I expect to get back to it during the week of Aug 5.&lt;/p&gt;</comment>
                            <comment id="92568" author="schamp" created="Wed, 27 Aug 2014 03:12:14 +0000"  >&lt;p&gt;I pushed a new revision of the patch this morning.  I expected tests to start automatically - do I need to add Test-Parameters?&lt;/p&gt;

&lt;p&gt;I will be testing on my own x86_64 / IB test environment today, but do not have a means to test i686.&lt;br/&gt;
This has not yet been tested on a large memory system.  We are starting that process.&lt;/p&gt;</comment>
                            <comment id="92662" author="pjones" created="Wed, 27 Aug 2014 21:55:32 +0000"  >&lt;p&gt;Hi Steve&lt;/p&gt;

&lt;p&gt;It&apos;s started testing now. There was just a higher than usual load on the test system. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="93256" author="schamp" created="Thu, 4 Sep 2014 23:55:12 +0000"  >&lt;p&gt;Revision 4 of &lt;a href=&quot;http://review.whamcloud.com/#/c/10537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10537/&lt;/a&gt; is tested and working.&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;rpm -q lustre-client&lt;br/&gt;
lustre-client-2.6.51-3.0.101_0.35_default_gc69b1a0&lt;/li&gt;
	&lt;li&gt;grep ^processor /proc/cpuinfo | wc -l&lt;br/&gt;
3072&lt;/li&gt;
	&lt;li&gt;grep ^MemTotal /proc/meminfo&lt;br/&gt;
MemTotal:       32825421388 kB&lt;/li&gt;
	&lt;li&gt;mount -t lustre mds1-esa@tcp0:/esa-uv /mnt/esa-uv&lt;/li&gt;
	&lt;li&gt;cd /mnt/esa-uv/schamp&lt;/li&gt;
	&lt;li&gt;ls -l&lt;br/&gt;
total 3145740&lt;br/&gt;
-rw-r--r-- 1 schamp sgiemp_00 1073741824 Sep  4 14:36 foo.1&lt;br/&gt;
-rw-r--r-- 1 schamp sgiemp_00 1073741824 Sep  4 15:20 foo.2&lt;/li&gt;
	&lt;li&gt;cp foo.2 foo.3&lt;/li&gt;
	&lt;li&gt;ls -l&lt;br/&gt;
total 3145740&lt;br/&gt;
-rw-r--r-- 1 schamp sgiemp_00 1073741824 Sep  4 14:36 foo.1&lt;br/&gt;
-rw-r--r-- 1 schamp sgiemp_00 1073741824 Sep  4 15:20 foo.2&lt;br/&gt;
-rw-r--r-- 1 root   root      1073741824 Sep  4 18:24 foo.3&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="93666" author="schamp" created="Wed, 10 Sep 2014 08:08:27 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/10537/5&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10537/5&lt;/a&gt; is confirmed as resolving this problem on a 32TB system.&lt;br/&gt;
I also ran sanity and sanityn without serious failure.&lt;/p&gt;</comment>
                            <comment id="94188" author="jfc" created="Tue, 16 Sep 2014 22:48:28 +0000"  >&lt;p&gt;Stephen,&lt;br/&gt;
Can you please review the comments on patch set 5?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="94194" author="schamp" created="Tue, 16 Sep 2014 23:19:50 +0000"  >&lt;p&gt;In the middle of it right now.  Had to rebase to master again.&lt;/p&gt;

&lt;p&gt;I am hesitant to simply #define the lprocfs_.._long functions to _u64 functions, as sign conversion hazards might catch unsuspecting users.  Seems like a great way to introduce very obscure bugs.&lt;/p&gt;

&lt;p&gt;I think I can eliminate the introduction of the long functions by using the _64 functions in the cases where my patch was using them.&lt;br/&gt;
This does force 32-bit systems to unnecessarily use 64-bit types, but not in critical paths.  This is what I have started on.&lt;/p&gt;</comment>
                            <comment id="94352" author="schamp" created="Thu, 18 Sep 2014 06:49:28 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/10537/7&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10537/7&lt;/a&gt; eliminates the lprocfs_..long functions entirely.&lt;/p&gt;</comment>
                            <comment id="94394" author="jfc" created="Thu, 18 Sep 2014 15:35:42 +0000"  >&lt;p&gt;Made it through autotest.&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="95283" author="pjones" created="Tue, 30 Sep 2014 13:32:42 +0000"  >&lt;p&gt;Landed for 2.7&lt;/p&gt;</comment>
                            <comment id="96146" author="yujian" created="Fri, 10 Oct 2014 19:52:39 +0000"  >&lt;p&gt;Here is the patch for master branch to resolve the issue that if &quot;val&quot; is larger than 2^32 on a 32-bit system, the code in proc_max_dirty_pages_in_mb() may truncate &quot;val&quot; when assigning it to obd_max_dirty_pages: &lt;a href=&quot;http://review.whamcloud.com/12269/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12269/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="121017" author="manish" created="Fri, 10 Jul 2015 17:59:48 +0000"  >&lt;p&gt;Hi &lt;/p&gt;

&lt;p&gt;We are seeing the same issue with SLES 11 SP3, using Lustre version 2.4.3.&lt;/p&gt;

&lt;p&gt;Lustre client installed on a 2048-core SGI UV1000 running:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; cat /etc/SuSE-release

SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3
hungabee:~ # lsb_release -a
LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 11 (x86_64)
Release: 11
Codename: n/a
hungabee:~ # rpm -qa | egrep &quot;(lustre|ofed)&quot;
lustre-client-modules-2.4.3-3.0.101_0.29_default
ofed-doc-1.5.4.1-0.11.5
ofed-1.5.4.1-0.11.5
ofed-kmp-trace-1.5.4.1_3.0.76_0.11-0.11.5
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
lustre-client-2.4.3-3.0.101_0.29_default
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Can we have a backport patch for the Lustre client v2.4.3? Is this patch included in the 2.5.x branches? If not, can we have a backport patch for the 2.5 branch as well?&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
                   Manish&lt;/p&gt;</comment>
                            <comment id="128999" author="gerrit" created="Thu, 1 Oct 2015 14:23:07 +0000"  >&lt;p&gt;Gr&#233;goire Pichon (gregoire.pichon@bull.net) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/16697&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/16697&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4856&quot; title=&quot;osc_lru_reserve()) ASSERTION( atomic_read(cli-&amp;gt;cl_lru_left) &amp;gt;= 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4856&quot;&gt;&lt;del&gt;LU-4856&lt;/del&gt;&lt;/a&gt; misc: Reduce exposure to overflow on page counters.&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_5&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f14f45c4e52246efe2c478b87c703705a30b3774&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwj5z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13394</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>