<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:24:30 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16156] stale read during IOR test due to LU-14541</title>
                <link>https://jira.whamcloud.com/browse/LU-16156</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;It looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14541&quot; title=&quot;Memory reclaim caused a stale data read&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14541&quot;&gt;&lt;del&gt;LU-14541&lt;/del&gt;&lt;/a&gt; has returned and the last patch isn&apos;t enough.&lt;br/&gt;
Reproduced on cray-2.15-int with the latest LU-14541 patch applied.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000080:00010000:3.0:1661448429.922359:0:17969:0:(file.c:5884:ll_layout_lock_set()) ### file [0x200044d2e:0x22f:0x0](000000008945c220) being reconfigured ns: kjcf04-MDT0000-mdc-ffff88879388d000 lock: 00000000265351e4/0x2ea
4f3791074a3c7 lrc: 3/1,0 mode: CR/CR res: [0x200044d2e:0x22f:0x0].0x0 bits 0x8/0x0 rrc: 2 type: IBT gid 0 flags: 0x0 nid: local remote: 0x76fc642919c90d07 expref: -99 pid: 17969 timeout: 0 lvb_type: 3
00000080:00200000:3.0:1661448429.922913:0:17969:0:(file.c:5779:ll_layout_conf()) [0x200044d2e:0x22f:0x0]: layout version change: 4294967294 -&amp;gt; 15
DoM
00010000:00010000:3.0:1661448429.926955:0:17969:0:(ldlm_lock.c:902:ldlm_lock_decref_internal()) ### add lock into lru list ns: kjcf04-MDT0000-mdc-ffff88879388d000 lock: 00000000ca0cb036/0x2ea4f3791074a3e3 lrc: 3/0,0 mode: P
R/PR res: [0x200044d2e:0x22f:0x0].0x0 bits 0x40/0x0 rrc: 3 type: IBT gid 0 flags: 0x800020000000000 nid: local remote: 0x76fc642919c90d3f expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.926982:0:17969:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 3/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [0-&amp;gt;0] (req 0-&amp;gt;0) gid 0 flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.926985:0:17969:0:(ldlm_request.c:1028:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x0 ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 3/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [1048576-&amp;gt;4194303] (req 1048576-&amp;gt;4194303) gid 0 flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.926994:0:17969:0:(ldlm_request.c:1115:ldlm_cli_enqueue()) ### sending request ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 3/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [1048576-&amp;gt;4194303] (req 1048576-&amp;gt;4194303) gid 0 flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.927377:0:17969:0:(ldlm_resource.c:1686:ldlm_resource_add_lock()) ### About to add this lock ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 4/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [1048576-&amp;gt;4194303] (req 1048576-&amp;gt;4194303) gid 0 flags: 0x10000000020000 nid: local remote: 0x10ff82d82b217a69 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00000080:00200000:3.0:1661448430.113932:0:17969:0:(vvp_io.c:312:vvp_io_fini()) [0x200044d2e:0x22f:0x0] ignore/verify layout 0/0, layout version 15 need write layout 0, restore needed 0
00000008:00100000:17.0:1661448430.236630:0:10311:0:(osc_request.c:1804:osc_brw_prep_request()) brw rpc 00000000ff43cadf - object 0x0:633777469 offset 104857600&amp;lt;&amp;gt;109056000
00000100:00100000:23.0:1661448430.292928:0:10312:0:(client.c:2220:ptlrpc_check_set()) Completed RPC req@00000000ff43cadf pname:cluuid:pid:xid:nid:opc:job:rc ptlrpcd_00_04:2f7f0639-3f88-4f3f-9faf-e0b16a082f10:10312:1741723660248256:10.17.100.60@o2ib:3::0/4194304
00000080:00200000:3.0:1661448430.310692:0:17969:0:(vvp_io.c:847:vvp_io_read_start()) IORfile_4m.00000011: read [100663296, 104857600)
00000080:00200008:3.0:1661448430.321303:0:17969:0:(file.c:1997:ll_file_read_iter()) file IORfile_4m.00000011:[0x200044d2e:0x22f:0x0], ppos: 104857600, count: 4194304
00010000:00010000:3.0:1661448430.314974:0:17969:0:(ldlm_lock.c:909:ldlm_lock_decref_internal()) ### do not add lock into lru list ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 5/1,0 mode: PR/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [0-&amp;gt;18446744073709551615] (req 1048576-&amp;gt;4194303) gid 0 flags: 0x810020000020000 nid: local remote: 0x10ff82d82b217a69 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A new LDLM lock is requested and granted, but the BRW RPC is created with a single-page hole inside it.&lt;br/&gt;
This caused a stale read; the first patch, which clears the uptodate page flag, helps, and stale data is no longer reported.&lt;/p&gt;</description>
                <environment></environment>
        <key id="72344">LU-16156</key>
            <summary>stale read during IOR test due to LU-14541</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="shadow">Alexey Lyashkov</reporter>
                        <labels>
                    </labels>
                <created>Wed, 14 Sep 2022 07:17:05 +0000</created>
                <updated>Thu, 1 Dec 2022 08:05:15 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="354776" author="shadow" created="Thu, 1 Dec 2022 08:01:18 +0000"  >&lt;p&gt;In fact, the problem is simple, while the fix is not.&lt;/p&gt;

&lt;p&gt;CLIO breaks the kernel MM protocol for memory reclaim, and the ll_releasepage checks are racy and buggy. This bug dates from CLIO&apos;s beginning.&lt;/p&gt;

&lt;p&gt;CLIO holds a vmpage reference until the CL page is released (aka vvp_page_fini). This causes a situation where memory reclaim or a layout change needs to flush pages, but a blocking AST arrives at the same time. osc_page_gang_lookup holds a CL page reference while memory reclaim starts to delete the page, but the existing check is racy, so ll_releasepage exits after cl_page_delete with an extra vmpage reference.&lt;br/&gt;
This causes __remove_mapping to fail to remove the page from the mapping.&lt;br/&gt;
On the other side, the blocking AST handler is unable to run cl_page_own because the page was moved into CPS_FREEING by cl_page_delete.&lt;/p&gt;

&lt;p&gt;That&apos;s all: the VM page is still in the page cache, and any read may pick up this stale data.&lt;/p&gt;

&lt;p&gt;It&apos;s easy to check: add an assertion at the end of ll_releasepage that verifies the refcount, and run a dual-mount fsx test from sanityN.&lt;/p&gt;



&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000008:00000001:9.0:1666910059.632700:0:5016:0:(osc_cache.c:3088:osc_page_gang_lookup()) Process entered
but a cl_page ref was held here:
the BL AST entered and was interrupted by ll_releasepage, i.e. a cache flush.
00000020:00000001:8.0:1666910059.632703:0:11668:0:(cl_page.c:545:cl_vmpage_page()) Process leaving (rc=18446624413482391544 : -119660227160072 : ffff932b6eaa97f8)
00000020:00000001:8.0:1666910059.632708:0:11668:0:(cl_page.c:444:cl_page_state_set0()) page@ffff932b6eaa97f8[3 ffff932b5cd3f2b0 1 1 0000000000000000]
00000020:00000001:8.0:1666910059.632709:0:11668:0:(cl_page.c:445:cl_page_state_set0()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0002015 count 3 priv ffff932b6eaa97f8:
00000020:00000001:8.0:1666910059.633545:0:11668:0:(cl_page.c:489:cl_pagevec_put()) page@ffff932b6eaa97f8[2 ffff932b5cd3f2b0 5 1 0000000000000000]
00000020:00000001:8.0:1666910059.633546:0:11668:0:(cl_page.c:490:cl_pagevec_put()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0000015 count 3 priv 0:
00000080:00008000:8.0:1666910059.633548:0:11668:0:(rw26.c:175:ll_releasepage()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0000015 count 3 priv 0: clpage ffff932b6eaa97f8 : 1
ll_releasepage exits and expects to free the cl_page, but a ref is held by the BL AST thread,
so the vmpage still has 3 refs while __remove_mapping wants 2.
00000020:00000001:9.0:1666910059.642999:0:5016:0:(cl_page.c:489:cl_pagevec_put()) page@ffff932b6eaa97f8[1 ffff932b5cd3f2b0 5 1 0000000000000000]
00000020:00000001:9.0:1666910059.643000:0:5016:0:(cl_page.c:490:cl_pagevec_put()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0000014 count 2 priv 0:
00000020:00000010:9.0:1666910059.643003:0:5016:0:(cl_page.c:178:__cl_page_free()) slab-freed &apos;cl_page&apos;: 472 at ffff932b6eaa97f8.
cl_page freed -&amp;gt; vmpage ref released; the vmpage now has 2 refs and could be removed from the pagecache, but nothing does it, and the uptodate page remains in the pagecache.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="63444">LU-14541</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="72362">LU-16160</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0303r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>