<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:08:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7425] client evicted with high rate</title>
                <link>https://jira.whamcloud.com/browse/LU-7425</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The error occurred during soak testing of build &apos;20151112&apos; (see: &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151112&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151112&lt;/a&gt;). DNE is enabled. OSTs had been formatted using &lt;em&gt;zfs&lt;/em&gt;, MDTs using &lt;em&gt;ldiskfs&lt;/em&gt; as backend.&lt;/p&gt;

&lt;p&gt;Approximately 20% of the jobs executed during soak testing fail with the typical error shown below. A representative instance of the eviction error (8 jobs affected) reads as follows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-2.log:Nov 13 01:18:11 lola-2 kernel: LustreError: 0:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.133@o2ib100  ns: filter-soaked-OST0000_UUID lock: ffff8801cb9ded00/0x4823861a3a714ead lrc: 3/0,0 mode: PR/PR res: [0x2d2261b:0x0:0x0].0x0 rrc: 5 type: EXT [105534627840-&amp;gt;18446744073709551615] (req 105534627840-&amp;gt;105556344831) flags: 0x60000000030020 nid: 192.168.1.133@o2ib100 remote: 0x1b74614c7d18de8f expref: 20 pid: 9256 timeout: 4311080944 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-31.log:Nov 13 01:18:12 lola-31 kernel: LustreError: 11-0: soaked-OST0000-osc-ffff8808657d2800: operation ost_read to node 192.168.1.102@o2ib10 failed: rc = -107
lola-31.log:Nov 13 01:18:12 lola-31 kernel: Lustre: soaked-OST0000-osc-ffff8808657d2800: Connection to soaked-OST0000 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
lola-31.log:Nov 13 01:18:12 lola-31 kernel: Lustre: Skipped 1 previous similar message
lola-31.log:Nov 13 01:18:12 lola-31 kernel: LustreError: 167-0: soaked-OST0000-osc-ffff8808657d2800: This client was evicted by soaked-OST0000; in progress operations using this service will fail.
lola-31.log:Nov 13 01:18:22 lola-31 kernel: Lustre: soaked-OST0000-osc-ffff8808657d2800: Connection restored to 192.168.1.102@o2ib10 (at 192.168.1.102@o2ib10)
lola-33.log:Nov 13 01:18:12 lola-33 kernel: LustreError: 11-0: soaked-OST0000-osc-ffff88033a596800: operation ldlm_enqueue to node 192.168.1.102@o2ib10 failed: rc = -107
lola-33.log:Nov 13 01:18:12 lola-33 kernel: Lustre: soaked-OST0000-osc-ffff88033a596800: Connection to soaked-OST0000 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
lola-33.log:Nov 13 01:18:12 lola-33 kernel: Lustre: Skipped 1 previous similar message
lola-33.log:Nov 13 01:18:12 lola-33 kernel: LustreError: 167-0: soaked-OST0000-osc-ffff88033a596800: This client was evicted by soaked-OST0000; in progress operations using this service will fail.
lola-33.log:Nov 13 01:18:12 lola-33 kernel: Lustre: 32171:0:(llite_lib.c:2628:ll_dirty_page_discard_warn()) soaked: dirty page discard: 192.168.1.108@o2ib10:192.168.1.109@o2ib10:/soaked/fid: [0x24006e1d4:0xd58d:0x0]/ may get corrupted (rc -108)
lola-33.log:Nov 13 01:19:01 lola-33 kernel: Lustre: soaked-OST0000-osc-ffff88033a596800: Connection restored to 192.168.1.102@o2ib10 (at 192.168.1.102@o2ib10)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;No other messages or kernel debug logs were written on the other nodes. No recovery was in progress, nor was any error injected during the lifetime of the jobs.&lt;/p&gt;

&lt;p&gt;The syslog, console, and kernel debug logs of the OSS node (lola-2) and of the soak clients involved, as well as the slurm job log files, have been attached to the ticket.&lt;/p&gt;</description>
                <environment>lola&lt;br/&gt;
build: 2.7.62-40-gebda41d, ebda41d8de7956f19fd27f86208c668e43c6957c + patches</environment>
        <key id="33125">LU-7425</key>
            <summary>client evicted with high rate</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="heckes">Frank Heckes</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Fri, 13 Nov 2015 13:04:19 +0000</created>
                <updated>Mon, 8 Feb 2016 13:13:42 +0000</updated>
                            <resolved>Mon, 8 Feb 2016 13:13:42 +0000</resolved>
                                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="133436" author="heckes" created="Fri, 13 Nov 2015 13:06:02 +0000"  >&lt;p&gt;I double-checked the IB fabric and found no indication of errors in general, and especially none for the nodes in question.&lt;/p&gt;</comment>
                            <comment id="133438" author="heckes" created="Fri, 13 Nov 2015 13:41:21 +0000"  >&lt;p&gt;Attached the slurm job log files; the most important information they contain is the time stamp of the event.&lt;/p&gt;</comment>
                            <comment id="133615" author="pjones" created="Mon, 16 Nov 2015 18:19:55 +0000"  >&lt;p&gt;Bruno&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="134703" author="bfaccini" created="Mon, 30 Nov 2015 09:50:39 +0000"  >&lt;p&gt;I have spent some time parsing the provided logs/info, but unfortunately the Lustre debug logs of the OSS/clients do not contain anything related or interesting at the time of the problem: no log record/line with the eviction msg, nor any referencing the concerned remote/local lock.&lt;/p&gt;

&lt;p&gt;Frank, since you indicate this is a frequent issue on the soak-test system, I wonder if you could re-run the same workload with the maximum debug mask enabled on clients/servers?&lt;/p&gt;
</comment>
                            <comment id="134726" author="heckes" created="Mon, 30 Nov 2015 16:24:17 +0000"  >&lt;p&gt;Yes, this would be possible, but it has to be synchronized with Di, who currently needs a less verbose debug mask.&lt;br/&gt;
I&apos;ll update the ticket as soon as I have reached an agreement with Di.&lt;/p&gt;</comment>
                            <comment id="141156" author="bfaccini" created="Thu, 4 Feb 2016 14:25:51 +0000"  >&lt;p&gt;Hello Frank,&lt;br/&gt;
Is this still occurring during soak testing?&lt;br/&gt;
If not, we may want to close/resolve this ticket as Cannot Reproduce; what do you think?&lt;/p&gt;</comment>
                            <comment id="141175" author="heckes" created="Thu, 4 Feb 2016 16:07:14 +0000"  >&lt;p&gt;Hi Bruno, I agree, we haven&apos;t observed the scenario again. Let&apos;s close it with the resolution you suggested.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="19596" name="console-lola-2.log.bz2" size="91053" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19597" name="console-lola-31.log.bz2" size="40446" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19598" name="console-lola-33.log.bz2" size="23391" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19605" name="jobs.tar" size="1341440" author="heckes" created="Fri, 13 Nov 2015 13:41:20 +0000"/>
                            <attachment id="19599" name="lola-2-lustre-log-job-crash.log.bz2" size="3278958" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19600" name="lola-31-lustre-log-job-crash.log.bz2" size="268" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19601" name="lola-33-lustre-log-job-crash.log.bz2" size="3576512" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19602" name="messages-lola-2.log.bz2" size="534421" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19603" name="messages-lola-31.log.bz2" size="467860" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                            <attachment id="19604" name="messages-lola-33.log.bz2" size="459202" author="heckes" created="Fri, 13 Nov 2015 13:39:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxszz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>