<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:49:35 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
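A full request might then look like the following (an illustrative sketch only; the '/si/jira.issueviews:issue-xml/...' path is assumed to be the standard JIRA XML issue-view endpoint for this instance):
    https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-12091/LU-12091.xml?field=key&field=summary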
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12091] DNE/DOM: client evictions -108</title>
                <link>https://jira.whamcloud.com/browse/LU-12091</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Lustre versions involved:&lt;br/&gt;
 Clients (Sherlock): 2.12.0 + patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11964&quot; title=&quot;Heavy load and soft lockups on MDS with DOM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11964&quot;&gt;&lt;del&gt;LU-11964&lt;/del&gt;&lt;/a&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;mdc: prevent glimpse lock count grow&amp;#93;&lt;/span&gt;&lt;br/&gt;
 Servers (Fir): 2.12.0 + patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;del&gt;LU-12037&lt;/del&gt;&lt;/a&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;mdt: call mdt_dom_discard_data() after rename unlock&amp;#93;&lt;/span&gt;+&lt;span class=&quot;error&quot;&gt;&amp;#91;mdt: add option for cross-MDT rename&amp;#93;&lt;/span&gt;&lt;/p&gt;


&lt;p&gt;Last night, while &lt;tt&gt;fir-md1-s1&lt;/tt&gt; was relatively quiet, we had a lot of call traces showing up on the second MDS &lt;tt&gt;fir-md1-s2&lt;/tt&gt; (serving MDT0001 and MDT0003). The first trace was:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 19 21:19:02 fir-md1-s2 kernel: LustreError: 90840:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.9.108.46@o2ib4  ns: mdt-fir-MDT0001_UUID lock: ffff8ecffb20f500/0xefacb2c18c9ab3c7 lrc: 3/0,0 mode: PW/PW res: [0x24000dd55:0x1cca7:0x0].0x0 bits 0x40/0x0 rrc: 79 type: IBT flags: 0x60200400000020 nid: 10.9.108.46@o2ib4 remote: 0x8374126d39604757 expref: 38560 pid: 91529 timeout: 900085 lvb_type: 0
Mar 19 21:19:43 fir-md1-s2 kernel: Lustre: 91243:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/-5), not sending early reply
                                     req@ffff8eb9af2cb000 x1627853341443984/t0(0) o101-&amp;gt;df44ff7c-4e8a-070f-774f-84780b4dab3d@10.9.108.48@o2ib4:18/0 lens 600/3264 e 0 to 0 dl 1553055588 ref 2 fl Interpret:/0/0 rc 0/0
Mar 19 21:19:43 fir-md1-s2 kernel: Lustre: 91243:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 31 previous similar messages
Mar 19 21:19:48 fir-md1-s2 kernel: LustreError: 90840:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.9.108.44@o2ib4  ns: mdt-fir-MDT0001_UUID lock: ffff8eba9122a400/0xefacb2c18c9ab3dc lrc: 3/0,0 mode: PW/PW res: [0x24000dd55:0x1cca7:0x0].0x0 bits 0x40/0x0 rrc: 77 type: IBT flags: 0x60200400000020 nid: 10.9.108.44@o2ib4 remote: 0xe63432aea7141892 expref: 920 pid: 91353 timeout: 900131 lvb_type: 0
Mar 19 21:19:53 fir-md1-s2 kernel: LNet: Service thread pid 91651 was inactive for 200.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 19 21:19:53 fir-md1-s2 kernel: Pid: 91651, comm: mdt03_027 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 19 21:19:53 fir-md1-s2 kernel: Call Trace:
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0f930bd&amp;gt;] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0f93dcc&amp;gt;] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14e64bb&amp;gt;] mdt_object_local_lock+0x50b/0xb20 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14e6b40&amp;gt;] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14e6f0c&amp;gt;] mdt_reint_object_lock+0x2c/0x60 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14fe1ac&amp;gt;] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc1501b68&amp;gt;] mdt_reint_setattr+0x6c8/0x12d0 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc1503c53&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14e2143&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14ed4a7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc103035a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0fd492b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0fd825c&amp;gt;] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffb36c1c31&amp;gt;] kthread+0xd1/0xe0
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffb3d74c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
Mar 19 21:19:53 fir-md1-s2 kernel:  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
Mar 19 21:19:53 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1553055593.91651
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One thing I noticed was a high lock count on MDT0001:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ clush -w@mds &apos;lctl get_param ldlm.namespaces.mdt-fir-MDT000*_UUID.lock_count&apos;
 fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=972046
 fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=480758
 fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=3661720
 fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=186735
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m attaching last night&apos;s kernel logs for &lt;tt&gt;fir-md1-s2&lt;/tt&gt; as &lt;tt&gt;fir-md1-s2-kern.log&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;This is not like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;del&gt;LU-12037&lt;/del&gt;&lt;/a&gt;, as I don&apos;t think the filesystem was globally stuck. The impact was more focused on specific clients, such as:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;&lt;tt&gt;sh-101-09&lt;/tt&gt; 10.9.101.9@o2ib4 &#8211; logs attached as &lt;tt&gt;sh-101-09-kern.log&lt;/tt&gt;&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;sh-101-19&lt;/tt&gt; 10.9.101.19@o2ib4 &#8211; logs attached as &lt;tt&gt;sh-101-19-kern.log&lt;/tt&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;On &lt;tt&gt;sh-101-09&lt;/tt&gt;, the Lustre eviction impacted running jobs, as shown in this log excerpt:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 20 02:22:15 sh-101-09 kernel: LustreError: 167-0: fir-MDT0001-mdc-ffff9dc7b03c0000: This client was evicted by fir-MDT0001; in progress operations using this service will fail.
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 78084:0:(llite_lib.c:1551:ll_md_setattr()) md_setattr fails: rc = -5
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 78079:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240005ab2:0x1e49e:0x0] error: rc = -5
Mar 20 02:22:15 sh-101-09 kernel: Lustre: 93503:0:(llite_lib.c:2733:ll_dirty_page_discard_warn()) fir: dirty page discard: 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir/fid: [0x24000ed00:0x10:0x0]// may get corrupted (rc -108)
Mar 20 02:22:15 sh-101-09 kernel: LustreError: 107259:0:(vvp_io.c:1495:vvp_io_init()) fir: refresh file layout [0x24000ed00:0x10:0x0] error -108.
Mar 20 02:22:20 sh-101-09 kernel: julia[117508]: segfault at 18 ip 00007fc5da76f3d2 sp 00007fff3acf86e0 error 4 in libhdf5.so.100.0.1[7fc5da68f000+37c000]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There is also a call trace involving &lt;tt&gt;mdt_dom_discard_data&lt;/tt&gt;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Mar 20 09:16:05 fir-md1-s2 kernel: Pid: 14629, comm: mdt02_105 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Mar 20 09:16:05 fir-md1-s2 kernel: Call Trace:
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0f930bd&amp;gt;] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0f93dcc&amp;gt;] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc1523ef1&amp;gt;] mdt_dom_discard_data+0x101/0x130 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14fecb1&amp;gt;] mdt_reint_unlink+0x331/0x14b0 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc1503c53&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14e2143&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc14ed4a7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc103035a&amp;gt;] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0fd492b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffc0fd825c&amp;gt;] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffb36c1c31&amp;gt;] kthread+0xd1/0xe0
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffb3d74c24&amp;gt;] ret_from_fork_nospec_begin+0xe/0x21
Mar 20 09:16:05 fir-md1-s2 kernel:  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, since 09:16 it seems to have stopped. We rebooted sh-101-09 and a few other clients this morning, so perhaps that helped.&lt;/p&gt;</description>
                <environment>Clients: 2.12.0+&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11964&quot; title=&quot;Heavy load and soft lockups on MDS with DOM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11964&quot;&gt;&lt;strike&gt;LU-11964&lt;/strike&gt;&lt;/a&gt;, Servers: 2.12.0+&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12037&quot; title=&quot;Possible DNE issue leading to hung filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12037&quot;&gt;&lt;strike&gt;LU-12037&lt;/strike&gt;&lt;/a&gt; (3.10.0-957.1.3.el7_lustre.x86_64), CentOS 7.6</environment>
        <key id="55202">LU-12091</key>
            <summary>DNE/DOM: client evictions -108</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Wed, 20 Mar 2019 17:20:42 +0000</created>
                <updated>Sat, 23 Mar 2019 14:54:05 +0000</updated>
                            <resolved>Sat, 23 Mar 2019 14:53:48 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="244420" author="pjones" created="Thu, 21 Mar 2019 14:41:49 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="244475" author="laisiyao" created="Fri, 22 Mar 2019 01:00:19 +0000"  >&lt;p&gt;This looks to be same issue addressed by &lt;a href=&quot;https://review.whamcloud.com/#/c/34071/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/34071/&lt;/a&gt;, we can wait to see how it goes when that one is ready to test.&lt;/p&gt;</comment>
                            <comment id="244486" author="sthiell" created="Fri, 22 Mar 2019 03:47:59 +0000"  >&lt;p&gt;Hi Lai &#8211; Thanks for checking and the update. I&apos;ll keep an eye on this patch&apos;s activity. I really hope you&apos;ll find a proper way to fix this.&lt;/p&gt;</comment>
                            <comment id="244587" author="pjones" created="Sat, 23 Mar 2019 14:53:48 +0000"  >&lt;p&gt;Duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11359&quot; title=&quot;racer test 1 times out with client hung in dir_create.sh, ls, &#8230; and MDS in ldlm_completion_ast()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11359&quot;&gt;&lt;del&gt;LU-11359&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="53269">LU-11359</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="32278" name="fir-md1-s2-kern.log" size="384122" author="sthiell" created="Wed, 20 Mar 2019 17:20:21 +0000"/>
                            <attachment id="32277" name="sh-101-09-kern.log" size="38841" author="sthiell" created="Wed, 20 Mar 2019 17:20:33 +0000"/>
                            <attachment id="32276" name="sh-101-19-kern.log" size="803730" author="sthiell" created="Wed, 20 Mar 2019 17:20:40 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00dlj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>