<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:17:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8396] ll_dirty_page_discard_warn dropping pages during re-connection.</title>
                <link>https://jira.whamcloud.com/browse/LU-8396</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;During client I/O performed by a gridftp transfer, some files wind up shorter than they should be. In the log files we see this set of messages:&lt;/p&gt;

&lt;p&gt;Jul 13 03:17:30 gridftp01 kernel: Lustre: 2937:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 3 previous similar messages&lt;br/&gt;
Jul 13 03:17:58 gridftp01 kernel: LustreError: 2902:0:(events.c:201:client_bulk_callback()) event type 2, status -5, desc ffff8807799da000&lt;br/&gt;
Jul 13 03:17:58 gridftp01 kernel: Lustre: osg-OST0010-osc-ffff8802bb781000: Connection restored to 192.168.129.131@tcp8 (at 192.168.129.131@tcp8)&lt;br/&gt;
Jul 13 03:17:58 gridftp01 kernel: LustreError: 11-0: osg-OST0010-osc-ffff8802bb781000: operation ost_read to node 192.168.129.131@tcp8 failed: rc = -107&lt;br/&gt;
Jul 13 03:17:58 gridftp01 kernel: LustreError: 167-0: osg-OST0010-osc-ffff8802bb781000: This client was evicted by osg-OST0010; in progress operations using this service will fail.&lt;br/&gt;
Jul 13 03:17:58 gridftp01 kernel: Lustre: 2937:0:(llite_lib.c:2626:ll_dirty_page_discard_warn()) osg: dirty page discard: 192.168.129.128@tcp8:/osg/fid: &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200008124:0x23e:0x0&amp;#93;&lt;/span&gt;// may get corrupted (rc -108)&lt;/p&gt;

&lt;p&gt;Using that FID we can usually trace back to the file and confirm that it is shorter than it should be. We know this because the file size at creation time is stored in an external database; when the file is later read, errors are reported because the checksum is incorrect and the file size is wrong.&lt;/p&gt;

&lt;p&gt;File sizes are in the 30-50 MB range.&lt;/p&gt;

&lt;p&gt;I&apos;ve tried working back from the dirty_page error in the code to determine why this specific I/O does not get retried after the reconnection, which happens almost immediately. We are trying to figure out why the reconnections are happening, but it would also be nice to know why these I/Os don&apos;t complete properly.&lt;/p&gt;</description>
                <environment>Servers:&lt;br/&gt;
Lustre 2.8.0 &lt;br/&gt;
kernel 3.10.0-327.3.1.el7_lustre.x86_64&lt;br/&gt;
zfs backed OST&lt;br/&gt;
&lt;br/&gt;
Clients:&lt;br/&gt;
Lustre 2.8.0&lt;br/&gt;
kernel 2.6.32-573.12.1.el6_lustre.x86_64&lt;br/&gt;
&lt;br/&gt;
ipoib network&lt;br/&gt;
Connected Mode, 65520 MTU</environment>
        <key id="38168">LU-8396</key>
            <summary>ll_dirty_page_discard_warn dropping pages during re-connection.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="karcaw">Evan Felix</reporter>
                        <labels>
                    </labels>
                <created>Thu, 14 Jul 2016 01:14:11 +0000</created>
                <updated>Thu, 14 Jul 2016 16:26:47 +0000</updated>
                                            <version>Lustre 2.8.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="158780" author="adilger" created="Thu, 14 Jul 2016 04:41:59 +0000"  >&lt;p&gt;The message from ll_dirty_page_discard_warn() is a symptom and not a cause. The root cause is that the client was evicted and it is unsafe for these pages to be written after the client has lost the lock, and this is just telling the admin/user that this has happened so you are aware of it.  You need to look at the OST to see why the client was evicted, either because it was unresponsive due to load, or due to network problems.&lt;/p&gt;</comment>
                            <comment id="158844" author="karcaw" created="Thu, 14 Jul 2016 16:26:47 +0000"  >&lt;p&gt;I recreated this error overnight, corrupting about 30 files.&lt;/p&gt;

&lt;p&gt;Messages on an OST:&lt;br/&gt;
Jul 14 00:44:27 ost2 kernel: Lustre: 24815:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1468482255/real 0&amp;#93;&lt;/span&gt;  req@ffff8807b2520c00 x1539223653305916/t0(0) o106-&amp;gt;osg-OST0010@192.168.129.253@tcp8:15/16 lens 296/280 e 0 to 1 dl 1468482267 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;br/&gt;
Jul 14 00:44:27 ost2 kernel: Lustre: 24815:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 1 previous similar message&lt;br/&gt;
Jul 14 00:44:45 ost2 kernel: Lustre: osg-OST0010: Client 1c17e0bb-8bab-418a-c37e-05fd619bc199 (at 192.168.129.253@tcp8) reconnecting&lt;br/&gt;
Jul 14 00:44:45 ost2 kernel: Lustre: osg-OST0010: Connection restored to  (at 192.168.129.253@tcp8)&lt;br/&gt;
Jul 14 00:44:45 ost2 kernel: LustreError: 25363:0:(ldlm_lib.c:3169:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff8808e042b600 x1534080529094832/t0(0) o3-&amp;gt;1c17e0bb-8bab-418a-c37e-05fd619bc199@192.168.129.253@tcp8:519/0 lens 608/432 e 1 to 0 dl 1468482314 ref 1 fl Interpret:/0/0 rc 0/0&lt;br/&gt;
Jul 14 00:44:45 ost2 kernel: LustreError: 25363:0:(ldlm_lib.c:3169:target_bulk_io()) Skipped 8 previous similar messages&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: LustreError: 23640:0:(events.c:447:server_bulk_callback()) event type 5, status -5, desc ffff880f502a0000&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: Lustre: osg-OST0010: Client 0cb248f6-b529-6f71-f9c9-385f43487517 (at 192.168.129.200@tcp8) reconnecting&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: Lustre: osg-OST0010: Connection restored to  (at 192.168.129.200@tcp8)&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: LustreError: 24831:0:(ldlm_lib.c:3169:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff8808d1755d00 x1539686089732876/t0(0) o3-&amp;gt;0cb248f6-b529-6f71-f9c9-385f43487517@192.168.129.200@tcp8:521/0 lens 608/432 e 1 to 0 dl 1468482316 ref 1 fl Interpret:/0/0 rc 0/0&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: LustreError: 24831:0:(ldlm_lib.c:3169:target_bulk_io()) Skipped 5 previous similar messages&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: Lustre: osg-OST0010: Bulk IO read error with 0cb248f6-b529-6f71-f9c9-385f43487517 (at 192.168.129.200@tcp8), client will retry: rc -110&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: LustreError: 24815:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 192.168.129.253@tcp8) returned error from glimpse AST (req status &lt;del&gt;107 rc -107), evict it ns: filter-osg-OST0010_UUID lock: ffff8808865f9800/0x9d142acb34c6d17 lrc: 4/0,0 mode: PW/PW res: &lt;span class=&quot;error&quot;&gt;&amp;#91;0xc96be:0x0:0x0&amp;#93;&lt;/span&gt;.0x0 rrc: 2 type: EXT &lt;span class=&quot;error&quot;&gt;&amp;#91;0-&amp;gt;18446744073709551615&amp;#93;&lt;/span&gt; (req 1048576&lt;/del&gt;&amp;gt;1310719) flags: 0x40000000000000 nid: 192.168.129.253@tcp8 remote: 0x1e646562be53f775 expref: 221 pid: 23826 timeout: 0 lvb_type: 0&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: LustreError: 138-a: osg-OST0010: A client on nid 192.168.129.253@tcp8 was evicted due to a lock glimpse callback time out: rc -107&lt;br/&gt;
Jul 14 00:44:59 ost2 kernel: LustreError: 24815:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) Skipped 2 previous similar messages&lt;br/&gt;
Jul 14 00:45:10 ost2 kernel: Lustre: osg-OST0010: Connection restored to  (at 192.168.129.253@tcp8)&lt;/p&gt;

&lt;p&gt;sar shows that CPU load on the OST was about 10% system, 4% user during the time the issues were seen.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>dataloss</label>
    
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyhfz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>