<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:47:47 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11886] OSTs and MDTs become unreachable under load, 2.11</title>
                <link>https://jira.whamcloud.com/browse/LU-11886</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi everyone,&lt;br/&gt;
 &#160;&lt;br/&gt;
 On a v2.11.0 installation, there is an ongoing file transfer in Lustre running on one of the client nodes. Every now and then the process stops with I/O errors and the file system stops responding. After a while access to the FS resumes, but by then the file transfer has already aborted with an error.&lt;br/&gt;
 &#160;&lt;br/&gt;
 `lctl get_param *.*.state` shows a history of target evictions:&lt;br/&gt;
 &#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
current_state: FULL
state_history:
&#160;- [ 1546444758, CONNECTING ]
&#160;- [ 1546444758, FULL ]
&#160;&amp;lt;...&amp;gt;
&#160;- [ 1546594365, DISCONN ]
&#160;- [ 1546594388, CONNECTING ]
&#160;- [ 1546594388, RECOVER ]
&#160;- [ 1546594388, FULL ]

mgc.MGC10.149.0.183@o2ib.state=
current_state: FULL
state_history:
&#160;- [ 1546444756, CONNECTING ]
&#160;- [ 1546444756, FULL ]
&#160;&amp;lt;...&amp;gt;
&#160;- [ 1546594354, DISCONN ]
&#160;- [ 1546594363, CONNECTING ]
&#160;- [ 1546594363, EVICTED ]
&#160;- [ 1546594363, RECOVER ]
&#160;- [ 1546594363, FULL ]
osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
current_state: CONNECTING
state_history:
&#160;- [ 1548316546, DISCONN ]
&#160;- [ 1548316551, CONNECTING ]
&#160;&amp;lt;...&amp;gt;
&#160;- [ 1548316851, CONNECTING ]
&#160;- [ 1548316921, DISCONN ]
&#160;- [ 1548316926, CONNECTING ]
&#160;- [ 1548316996, DISCONN ]
&#160;- [ 1548317001, CONNECTING ]
&#160;- [ 1548317071, DISCONN ]
&#160;- [ 1548317076, CONNECTING ] 

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and so on for all the OSTs affected by the file transfer.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;In the OSS&apos;s logs, all sorts of connectivity problems are reported, both by Lustre:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.149.255.238@o2ib arrived at 1548314992 with bad export cookie 13579970295878280357
ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous similar messages
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 10.149.255.238@o2ib&#160; ns: filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc: 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT [0-&amp;gt;18446744073709551615] (req 716800-&amp;gt;720895) flags: 0x60000400020020 nid: 10.149.255.238@o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849 timeout: 3689597 lvb_type: 0
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar messages
ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to /tmp/lustre-log.1548315195.186010
ods01: Jan 24 08:44:41 ods01 kernel: Lustre: 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548314995/real 1548315881]&#160; req@ffff88975cc9d100 x1619663992901664/t0(0) o104-&amp;gt;scratch-OST0002@10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;and LNet:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous similar messages
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with 10.149.255.254@o2ib (52): c: 0, oc: 0, rc: 63
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous similar messages&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;while `opareport` doesn&apos;t detect any Omni-Path errors at all.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;From the client side:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jan 24 05:43:03 admin kernel: LustreError: 133-1: scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from 10.149.0.185@o2ib via 0@&amp;lt;0:0&amp;gt; inode [0x200002366:0x160f1:0x0] object 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
Jan 24 05:43:03 admin kernel: LustreError: 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable error -11&#160; req@ffff8ecb19cd0000 x1621600181397872/t0(0) o3-&amp;gt;scratch-OST0001-osc-ffff8ecec4bf2800@10.149.0.185@o2ib:6/4 lens 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and on another client:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1548316966/real 1548316966]&#160; req@ffff995a6dab4e00 x1623278997809600/t0(0) o101-&amp;gt;scratch-OST0024-osc-ffff995226fca000@10.149.0.191@o2ib:28/4 lens 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;


&lt;p&gt;Kind regards,&lt;/p&gt;

&lt;p&gt;Konstantin&lt;/p&gt;</description>
                <environment>Lustre 2.11.0, ZFS 0.7.9, CentOS 7.5, Omni-Path</environment>
        <key id="54662">LU-11886</key>
            <summary>OSTs and MDTs become unreachable under load, 2.11</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="cv_eng">cv_eng</reporter>
                        <labels>
                    </labels>
                <created>Fri, 25 Jan 2019 11:48:55 +0000</created>
                <updated>Fri, 25 Jan 2019 16:39:10 +0000</updated>
                            <resolved>Fri, 25 Jan 2019 16:39:10 +0000</resolved>
                                    <version>Lustre 2.11.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="240701" author="cv_eng" created="Fri, 25 Jan 2019 11:54:36 +0000"  >&lt;p&gt;The problem persists when using different client nodes and OSTs (attached to different OSSs) as well.&lt;/p&gt;

&lt;p&gt;Also reproducible with the mdtest benchmark, at 1 hour into the test:&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;-- started at 01/24/2019 08:24:19 --
mdtest-1.9.3 was launched with 96 total task(s) on 3 node(s)
Command line used: /home/kku/iotests/benchs/ior-mdtest/src/mdtest &quot;-i&quot; &quot;3&quot; &quot;-I&quot; &quot;128&quot; &quot;-z&quot; &quot;3&quot; &quot;-b&quot; &quot;5&quot; &quot;-u&quot; &quot;-w&quot; &quot;512k&quot; &quot;-d&quot; &quot;/scratch/mdtest2/test&quot;
Path: /scratch/mdtest2
FS: 1753.2 TiB Used FS: 6.0% Inodes: 911.7 Mi Used Inodes: 8.0%
96 tasks, 1916928 files/directories
01/24/2019 09:28:49: Process 9: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 11: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 19: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 23: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 25: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 1: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 24: FAILED in mdtest_stat, unable to stat file: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Kind regards,&lt;/p&gt;

&lt;p&gt;Konstantin&lt;/p&gt;</comment>
                            <comment id="240716" author="pjones" created="Fri, 25 Jan 2019 16:30:52 +0000"  >&lt;p&gt;Hi there&lt;/p&gt;

&lt;p&gt;Is this a test system that you are experimenting with or a customer deployment? If the latter, which site?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="240718" author="cv_eng" created="Fri, 25 Jan 2019 16:36:28 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;That&apos;s a production system without a maintenance contract.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&amp;#8211;&lt;/p&gt;

&lt;p&gt;Konstantin&lt;/p&gt;</comment>
                            <comment id="240719" author="pjones" created="Fri, 25 Jan 2019 16:39:10 +0000"  >&lt;p&gt;OK, then you are better off enquiring on the mailing lists than the community issue tracker.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00aa7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>