<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:39:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
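For instance, the XML view of this issue restricted to those fields would be fetched from a URL of the following form (illustrative; substitute the desired issue key):
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-10961/LU-10961.xml?field=key&field=summary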
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10961] Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4</title>
                <link>https://jira.whamcloud.com/browse/LU-10961</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We are seeing repeated hard hangs on clients after server failover. &lt;br/&gt;
&apos;df&apos; on a client will hang and user tasks do not complete. So far there are no hard faults; the node just grinds to a halt. Yesterday this occurred on soak-17 and soak-23. I have dumped stacks on both nodes, and crash dumps are available on soak. &lt;br/&gt;
We see: &lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;connections to one or more OSTs drop, and the client does not reconnect:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; sent delay: [sent 1524799714/real 0]  req@ffff8808fee67500 x1598738343197024/t0(0) o400-&amp;gt;soaked-OST000b-osc-ffff8807f6ba0800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1524799721 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST0011-osc-ffff8807f6ba0800: Connection to soaked-OST0011 (at 192.168.1.107@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete
Apr 27 03:28:42 soak-23 kernel: Lustre: Skipped 3 previous similar messages
Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST000b-osc-ffff8807f6ba0800: Connection to soaked-OST000b (at 192.168.1.107@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As of 1700 hours (14 hours after failover) the node still has not reconnected to this OST.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We also see repeated errors referencing the MDT:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Apr 27 17:25:42 soak-23 kernel: LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The error appears very repeatable. Logs and stack traces are attached. &lt;/p&gt;</description>
                <environment>soak cluster </environment>
        <key id="52029">LU-10961</key>
            <summary>Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="cliffw">Cliff White</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Fri, 27 Apr 2018 17:54:10 +0000</created>
                <updated>Mon, 10 Sep 2018 17:43:38 +0000</updated>
                            <resolved>Mon, 10 Sep 2018 17:43:38 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                                    <fixVersion>Lustre 2.12.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="226948" author="cliffw" created="Mon, 30 Apr 2018 15:33:14 +0000"  >&lt;p&gt;During the weekend, most of the clients hung. Issued the command &apos;umount -f&apos; all clients except soak-24 successfully un-mounted Lustre. &lt;br/&gt;
Dumped stacks, lustre log and crash dumped soak-24, stacks and lustre-log attached. &lt;br/&gt;
At this point this is seriously hampering soak testing. &lt;/p&gt;</comment>
                            <comment id="226953" author="cliffw" created="Mon, 30 Apr 2018 17:14:34 +0000"  >&lt;p&gt;Some new errors in the latest hang:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[431310.219421] traps: lfs[157180] trap divide error ip:40dc1a sp:7ffda599e940 error:0 in lfs[400000+20000]
[431346.590086] traps: lfs[157395] trap divide error ip:40dc1a sp:7ffee73a3900 error:0 in lfs[400000+20000]
[431385.007477] Lustre: 154503:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; sent delay: [sent 1525103886/real 0]  req@ffff8805bcb47200 x1599185360888640/t0(0) o4-&amp;gt;soaked-MDT0000-mdc-ffff880c502a0000@192.168.1.108@o2ib:13/10 lens 608/448 e 0 to 1 dl 1525103893 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="228376" author="cliffw" created="Tue, 22 May 2018 22:23:32 +0000"  >&lt;p&gt;Seeing this error again on current master. version=2.11.52_28_ge2d7e67&lt;/p&gt;</comment>
                            <comment id="228433" author="cliffw" created="Wed, 23 May 2018 14:22:14 +0000"  >&lt;p&gt;More logs. &lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[12385.187946] LustreError: 22349:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6151550@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff9998b4fb5040], [28672|1|+|-|ffff999a22957980|256|ffff999b27449fa0]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.187949] LustreError: 12910:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6151130@{[0 -&amp;gt; 0/255], [2|0|+|cache|wihuY|ffff9998b4fb75c0], [28672|1|+|-|ffff999a22957bc0|256|          (&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt;)]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.187951] LustreError: 12901:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6150e70@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff9998b4fb5400], [28672|1|+|-|ffff99998672c140|256|ffff9997bfc20000]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.187953] LustreError: 12749:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c61500b0 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff999986729b00/0x9ba73c642656dc6b lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8ccb:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be10ed expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.187955] LustreError: 22349:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6151550 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff999a22957980/0x9ba73c642656dd13 lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cd0:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be2a07 expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.187958] LustreError: 12910:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6151130 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff999a22957bc0/0x9ba73c642656dd8a lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cd5:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be408f expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.187960] LustreError: 12901:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6150e70 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff99998672c140/0x9ba73c642656dcaa lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8ccd:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be18bf expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.187980] LustreError: 12751:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6150420@{[0 -&amp;gt; 0/255], [2|0|+|cache|wihuY|ffff9998b4fb4000], [28672|1|+|-|ffff999a22954800|256|          (&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt;)]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.187986] LustreError: 12751:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6150420 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff999a22954800/0x9ba73c642656dd6e lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cd4:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be3965 expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.188015] LustreError: 16607:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6150a50@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff9998b4fb7c00], [28672|1|+|-|ffff99998672ad00|256|ffff9997bfc26eb0]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.188021] LustreError: 16607:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6150a50 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff99998672ad00/0x9ba73c642656dc02 lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cc7:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be004d expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.188032] LustreError: 16605:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6151d90@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff9998b4fb4a00], [28672|1|+|-|ffff99998672e0c0|256|ffff999b6bb62f70]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.188038] LustreError: 16605:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6151d90 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff99998672e0c0/0x9ba73c642656dc48 lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cc9:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be0adb expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.188259] LustreError: 12677:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6150210@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff9998b4fb6800], [28672|1|+|-|ffff99998672d580|256|ffff999b6b272f70]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.188264] LustreError: 12677:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6150210 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff99998672d580/0x9ba73c642656dbd1 lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cc5:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88bdf6b4 expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12385.188911] LustreError: 12746:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6151340@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff9998b4fb6300], [28672|1|+|-|ffff999a22956540|256|ffff999f5b654f10]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12385.188918] LustreError: 12746:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6151340 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff999a22956540/0x9ba73c642656ddd7 lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cd7:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be48d1 expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12388.654048] LustreError: 12904:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c61502c0 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff999986728d80/0x9ba73c642656dc25 lrc: 4/0,0 mode: PW/PW res: [0x200010e09:0x8cc8:0x0].0x0 bits 0x40/0x40 rrc: 2 type: IBT flags: 0x449400000000 nid: local remote: 0xaa9e212e88be05da expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[12484.834610] INFO: task simul:25858 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
[12484.857231] &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
[12484.884393] simul           D ffff999910485ee0     0 25858  25851 0x00000080
[12484.908994] Call Trace:
[12484.918386]  [&amp;lt;ffffffffbb029e72&amp;gt;] ? path_lookupat+0x122/0x8b0
[12484.938662]  [&amp;lt;ffffffffbb514e99&amp;gt;] schedule_preempt_disabled+0x29/0x70
[12484.961219]  [&amp;lt;ffffffffbb512c57&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[12484.982900]  [&amp;lt;ffffffffbb51203f&amp;gt;] mutex_lock+0x1f/0x2f
[12485.001116]  [&amp;lt;ffffffffbb0272d0&amp;gt;] lock_rename+0xc0/0xe0
[12485.019600]  [&amp;lt;ffffffffbb02d39f&amp;gt;] SYSC_renameat2+0x22f/0x5a0
[12485.039503]  [&amp;lt;ffffffffbaed849c&amp;gt;] ? update_curr+0x14c/0x1e0
[12485.059063]  [&amp;lt;ffffffffbaed62dc&amp;gt;] ? set_next_entity+0x3c/0xe0
[12485.079200]  [&amp;lt;ffffffffbb51367e&amp;gt;] ? __schedule+0x14e/0xa20
[12485.098433]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12485.121395]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12485.144330]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12485.167214]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12485.190098]  [&amp;lt;ffffffffbb02e58e&amp;gt;] SyS_renameat2+0xe/0x10
[12485.208651]  [&amp;lt;ffffffffbb02e5ce&amp;gt;] SyS_rename+0x1e/0x20
[12485.226616]  [&amp;lt;ffffffffbb52082f&amp;gt;] system_call_fastpath+0x1c/0x21
[12485.247435]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12605.265688] INFO: task simul:25858 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
[12605.343807] &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
[12605.438557] simul           D ffff999910485ee0     0 25858  25851 0x00000080
[12605.524156] Call Trace:
[12605.554383]  [&amp;lt;ffffffffbb029e72&amp;gt;] ? path_lookupat+0x122/0x8b0
[12605.624122]  [&amp;lt;ffffffffbb514e99&amp;gt;] schedule_preempt_disabled+0x29/0x70
[12605.702164]  [&amp;lt;ffffffffbb512c57&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[12605.777070]  [&amp;lt;ffffffffbb51203f&amp;gt;] mutex_lock+0x1f/0x2f
[12605.839473]  [&amp;lt;ffffffffbb0272d0&amp;gt;] lock_rename+0xc0/0xe0
[12605.902894]  [&amp;lt;ffffffffbb02d39f&amp;gt;] SYSC_renameat2+0x22f/0x5a0
[12605.971484]  [&amp;lt;ffffffffbaed849c&amp;gt;] ? update_curr+0x14c/0x1e0
[12606.039000]  [&amp;lt;ffffffffbaed62dc&amp;gt;] ? set_next_entity+0x3c/0xe0
[12606.108587]  [&amp;lt;ffffffffbb51367e&amp;gt;] ? __schedule+0x14e/0xa20
[12606.175047]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12606.254985]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12606.334925]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12606.414838]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12606.494700]  [&amp;lt;ffffffffbb02e58e&amp;gt;] SyS_renameat2+0xe/0x10
[12606.558964]  [&amp;lt;ffffffffbb02e5ce&amp;gt;] SyS_rename+0x1e/0x20
[12606.621092]  [&amp;lt;ffffffffbb52082f&amp;gt;] system_call_fastpath+0x1c/0x21
[12606.693633]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12726.768761] INFO: task simul:25858 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
[12726.776720] &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
[12726.786099] simul           D ffff999910485ee0     0 25858  25851 0x00000080
[12726.794629] Call Trace:
[12726.797974]  [&amp;lt;ffffffffbb029e72&amp;gt;] ? path_lookupat+0x122/0x8b0
[12726.804992]  [&amp;lt;ffffffffbb514e99&amp;gt;] schedule_preempt_disabled+0x29/0x70
[12726.812770]  [&amp;lt;ffffffffbb512c57&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[12726.820246]  [&amp;lt;ffffffffbb51203f&amp;gt;] mutex_lock+0x1f/0x2f
[12726.826543]  [&amp;lt;ffffffffbb0272d0&amp;gt;] lock_rename+0xc0/0xe0
[12726.832932]  [&amp;lt;ffffffffbb02d39f&amp;gt;] SYSC_renameat2+0x22f/0x5a0
[12726.839803]  [&amp;lt;ffffffffbaed849c&amp;gt;] ? update_curr+0x14c/0x1e0
[12726.846573]  [&amp;lt;ffffffffbaed62dc&amp;gt;] ? set_next_entity+0x3c/0xe0
[12726.853530]  [&amp;lt;ffffffffbb51367e&amp;gt;] ? __schedule+0x14e/0xa20
[12726.860202]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12726.868137]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12726.876068]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12726.883992]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12726.891915]  [&amp;lt;ffffffffbb02e58e&amp;gt;] SyS_renameat2+0xe/0x10
[12726.898373]  [&amp;lt;ffffffffbb02e5ce&amp;gt;] SyS_rename+0x1e/0x20
[12726.904628]  [&amp;lt;ffffffffbb52082f&amp;gt;] system_call_fastpath+0x1c/0x21
[12726.911868]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12846.915800] INFO: task simul:25858 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
[12846.937671] &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
[12846.964035] simul           D ffff999910485ee0     0 25858  25851 0x00000080
[12846.987832] Call Trace:
[12846.996459]  [&amp;lt;ffffffffbb029e72&amp;gt;] ? path_lookupat+0x122/0x8b0
[12847.015963]  [&amp;lt;ffffffffbb514e99&amp;gt;] schedule_preempt_disabled+0x29/0x70
[12847.037750]  [&amp;lt;ffffffffbb512c57&amp;gt;] __mutex_lock_slowpath+0xc7/0x1d0
[12847.058669]  [&amp;lt;ffffffffbb51203f&amp;gt;] mutex_lock+0x1f/0x2f
[12847.076143]  [&amp;lt;ffffffffbb0272d0&amp;gt;] lock_rename+0xc0/0xe0
[12847.093894]  [&amp;lt;ffffffffbb02d39f&amp;gt;] SYSC_renameat2+0x22f/0x5a0
[12847.113076]  [&amp;lt;ffffffffbaed849c&amp;gt;] ? update_curr+0x14c/0x1e0
[12847.131964]  [&amp;lt;ffffffffbaed62dc&amp;gt;] ? set_next_entity+0x3c/0xe0
[12847.151421]  [&amp;lt;ffffffffbb51367e&amp;gt;] ? __schedule+0x14e/0xa20
[12847.170022]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12847.192344]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12847.214662]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12847.236971]  [&amp;lt;ffffffffbb52076f&amp;gt;] ? system_call_after_swapgs+0xbc/0x160
[12847.259272]  [&amp;lt;ffffffffbb02e58e&amp;gt;] SyS_renameat2+0xe/0x10
[12847.277283]  [&amp;lt;ffffffffbb02e5ce&amp;gt;] SyS_rename+0x1e/0x20
[12847.294717]  [&amp;lt;ffffffffbb52082f&amp;gt;] system_call_fastpath+0x1c/0x21
[12847.315019]  [&amp;lt;ffffffffbb52077b&amp;gt;] ? system_call_after_swapgs+0xc8/0x160
[12977.554496] LustreError: 12274:0:(osc_cache.c:955:osc_extent_wait()) extent ffff9997c6151290@{[0 -&amp;gt; 0/255], [2|0|+|cache|wihuY|ffff9998b4fb6940], [28672|1|+|-|ffff99998672b180|256|          (&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt;)]} soaked-MDT0000-mdc-ffff999f58e9f000: wait ext to 0 timedout, recovery in progress?
[12977.587096] LustreError: 12274:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff9997c6151290 ns: soaked-MDT0000-mdc-ffff999f58e9f000 lock: ffff99998672b180/0x9ba73c642656db99 lrc: 3/0,0 mode: PW/PW res: [0x200010e09:0x8cc4:0x0].0x0 bits 0x40/0x0 rrc: 2 type: IBT flags: 0x49400000000 nid: local remote: 0xaa9e212e88bdf0a2 expref: -99 pid: 24926 timeout: 0 lvb_type: 0
[24406.001241] LustreError: 22890:0:(vvp_io.c:1495:vvp_io_init()) soaked: refresh file layout [0x200010dc3:0x1d336:0x0] error -4.
[25696.331238] LustreError: 24884:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000401:0x2:0x0] error: rc = -4
[27088.300797] perf: interrupt took too &lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt; (5045 &amp;gt; 4917), lowering kernel.perf_event_max_sample_rate to 39000
[62130.364898] perf: interrupt took too &lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt; (6328 &amp;gt; 6306), lowering kernel.perf_event_max_sample_rate to 31000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="228464" author="pjones" created="Wed, 23 May 2018 17:58:07 +0000"  >&lt;p&gt;Dmitry&lt;/p&gt;

&lt;p&gt;Could you please investigate?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="228579" author="cliffw" created="Thu, 24 May 2018 20:42:15 +0000"  >&lt;p&gt;Working backwards to find a &apos;good&apos; version. Seeing the error once in 2.10.58_139_g630cd49 with a different result code. (EIO instead of EINTR)&lt;br/&gt;
[ 8145.869669] LustreError: 7130:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20002b381:0x341:0x0&amp;#93;&lt;/span&gt; error: rc = -5&lt;/p&gt;</comment>
                            <comment id="228648" author="dmiter" created="Fri, 25 May 2018 21:56:58 +0000"  >&lt;p&gt;After several connection change during recovery it hangs with following states:&lt;br/&gt;
 MGS -&amp;gt; CONNECTING&lt;br/&gt;
 MDT -&amp;gt; REPLAY_LOCKS&lt;br/&gt;
 OST -&amp;gt; CONNECTING&lt;/p&gt;

&lt;p&gt;The last recovery restart was for the MGS, but it looks like REPLAY_LOCKS was not completed.&lt;/p&gt;

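&lt;p&gt;(For reference, a quick way to read those import states from a client; a minimal sketch assuming standard client proc paths, device names will differ per filesystem:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# current import state for the MDT and OST connections
lctl get_param mdc.*.state osc.*.state
# full import details (connection, recovery) for all configured targets
lctl get_param *.*.import
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;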
</comment>
                            <comment id="228765" author="cliffw" created="Tue, 29 May 2018 13:20:54 +0000"  >&lt;p&gt;So, is there a fix, or something to be done? Is the more data you require? This is halting soak testing ATM&lt;/p&gt;</comment>
                            <comment id="228768" author="dmiter" created="Tue, 29 May 2018 13:37:17 +0000"  >&lt;p&gt;I cannot find out the root cause yet. I&apos;m looking into logs and try to understand what was wrong.&lt;/p&gt;

&lt;p&gt;Meanwhile, could you check the value of the following parameter:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl get_param osc.*.pinger_recov&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;What activity was on this cluster during these hangs?&lt;/p&gt;

</comment>
                            <comment id="228769" author="cliffw" created="Tue, 29 May 2018 14:53:16 +0000"  >&lt;p&gt;Cluster was running a large mix of jobs for the stress tests. multiple failovers were occurring&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# lctl get_param osc.*.pinger_recov
osc.soaked-OST0000-osc-ffff880823116800.pinger_recov=1
osc.soaked-OST0001-osc-ffff880823116800.pinger_recov=1
osc.soaked-OST0002-osc-ffff880823116800.pinger_recov=1
osc.soaked-OST0003-osc-ffff880823116800.pinger_recov=1
osc.soaked-OST0004-osc-ffff880823116800.pinger_recov=1
osc.soaked-OST0005-osc-ffff880823116800.pinger_recov=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="228772" author="cliffw" created="Tue, 29 May 2018 16:10:00 +0000"  >&lt;p&gt;I can repeat the error and crash dump the systems, if that will help&lt;/p&gt;</comment>
                            <comment id="228774" author="dmiter" created="Tue, 29 May 2018 16:25:42 +0000"  >&lt;p&gt;It would be good if you can increase the level of debug logs during this issue.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl set_param debug=&lt;span class=&quot;code-quote&quot;&gt;&quot;+error +warning +info&quot;&lt;/span&gt; debug_mb=1024
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
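&lt;p&gt;(A follow-up sketch, assuming standard lctl usage; the output path is hypothetical. After reproducing the hang, the enlarged debug buffer can be dumped to a file for attachment.)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# dump the kernel debug buffer to a file (&apos;dk&apos; is short for debug_kernel)
lctl dk &amp;gt; /tmp/lustre-debug.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>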
                            <comment id="228802" author="cliffw" created="Tue, 29 May 2018 20:36:08 +0000"  >&lt;p&gt;Repeated the error with more debug, lustre log attached. There is a crash dump available on spirit - I dumped stack and forced a crash dump on a client. &lt;/p&gt;</comment>
                            <comment id="228867" author="cliffw" created="Wed, 30 May 2018 20:34:37 +0000"  >&lt;p&gt;I sm testing today without MDS failures (OSS failover only) I believe this is triggered by MDS failover.&lt;/p&gt;</comment>
                            <comment id="228912" author="cliffw" created="Thu, 31 May 2018 14:39:36 +0000"  >&lt;p&gt;Ran successfully for 24 without MDS failover - So, i believe the problem is only triggered by MDS failover. &lt;br/&gt;
Next steps?&lt;/p&gt;</comment>
                            <comment id="228933" author="dmiter" created="Thu, 31 May 2018 18:19:57 +0000"  >&lt;p&gt;&#160;I see the recovery process and locks replay.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;(&lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt;.c:1498:ptlrpc_import_recovery_state_machine()) replay requested by soaked-MDT0000_UUID
(recover.c:88:ptlrpc_replay_next()) &lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt; ffff8804024da000 from soaked-MDT0000_UUID committed 25770229782 last 25770229782
(&lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt;.c:1502:ptlrpc_import_recovery_state_machine()) ffff8804024da000 soaked-MDT0000_UUID: changing &lt;span class=&quot;code-keyword&quot;&gt;import&lt;/span&gt; state from REPLAY to REPLAY_LOCKS
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But then there appears to be unexpected communication:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;(osc_cache.c:955:osc_extent_wait()) extent ffff880412a6be40@{[0 -&amp;gt; 0/255], [3|0|+|rpc|wihY|ffff880058d16bc0], [28672|1|+|-|ffff8803ce09f500|256|ffff880827146eb0]} soaked-MDT0000-mdc-ffff88083f83d000: wait ext to 0 timedout, recovery in progress?
(osc_cache.c:955:osc_extent_wait()) ### extent: ffff880412a6be40 ns: soaked-MDT0000-mdc-ffff88083f83d000 lock: ffff8803ce09f500/0xe64f32d5b6631b06 lrc: 3/0,0 mode: PW/PW res: [0x20000240e:0x1ba06:0x0].0x0 bits 0x40/0x0 rrc: 3 type: IBT flags: 0x49400000000 nid: local remote: 0xfcd90309686d82d6 expref: -99 pid: 28799 timeout: 0 lvb_type: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;After this it hangs.&lt;/p&gt;

&lt;p&gt;Could you attach logs from MDSes?&lt;/p&gt;</comment>
                            <comment id="228934" author="cliffw" created="Thu, 31 May 2018 18:20:45 +0000"  >&lt;p&gt;Yes, give me a bit to reproduce the error. Currently only one MDS&lt;/p&gt;</comment>
                            <comment id="228935" author="dmiter" created="Thu, 31 May 2018 18:21:30 +0000"  >&lt;p&gt;Do you have DOM configured?&lt;/p&gt;</comment>
                            <comment id="228936" author="cliffw" created="Thu, 31 May 2018 18:23:15 +0000"  >&lt;p&gt;Yes, a percentage of the tests uses DOM striping, basically striping is randomly chosen, either normal striping, a PFL layout or a DOM layout. &lt;/p&gt;</comment>
                            <comment id="228937" author="cliffw" created="Thu, 31 May 2018 18:24:29 +0000"  >&lt;p&gt;It works like this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;coin=$((RANDOM % 2))
toss=$((RANDOM % 2))
# &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 2.8 testing
# coin=1
# Use shift to get most random result (last bit)
# Set directory striping
&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; [ $coin -eq 1 ];then
        lfs setstripe -c $((RANDOM % (nrdt + 1))) $dirname
&lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; [ $toss -eq 1 ];then
                lfs setstripe -E 32M -c 1 -S 1M -E 1G -c 4 -E -1 -c -1 -S 4m $dirname
        &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
                lfs setstripe -E 1M -L mdt -E EOF -c -1 $dirname
        fi
fi
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="228938" author="dmiter" created="Thu, 31 May 2018 18:26:36 +0000"  >&lt;p&gt;Could you also check with DOM disabled? I tend to think this is issue with DOM.&lt;/p&gt;</comment>
                            <comment id="228939" author="cliffw" created="Thu, 31 May 2018 18:28:22 +0000"  >&lt;p&gt;Yes, i can do this.&lt;/p&gt;</comment>
                            <comment id="228940" author="dmiter" created="Thu, 31 May 2018 18:37:10 +0000"  >&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; osc_extent_wait(&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct lu_env *env, struct osc_extent *ext,
			   &lt;span class=&quot;code-keyword&quot;&gt;enum&lt;/span&gt; osc_extent_state state)
{
	struct osc_object *obj = ext-&amp;gt;oe_obj;
	struct l_wait_info lwi = LWI_TIMEOUT_INTR(cfs_time_seconds(600), NULL,
						  LWI_ON_SIGNAL_NOOP, NULL);
	&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rc = 0;
	ENTRY;

	osc_object_lock(obj);
	LASSERT(sanity_check_nolock(ext) == 0);
	/* `Kick&apos; &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; extent only &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; the caller is waiting &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; it to be
	 * written out. */
	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (state == OES_INV &amp;amp;&amp;amp; !ext-&amp;gt;oe_urgent &amp;amp;&amp;amp; !ext-&amp;gt;oe_hp) {
		&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ext-&amp;gt;oe_state == OES_ACTIVE) {
			ext-&amp;gt;oe_urgent = 1;
		} &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ext-&amp;gt;oe_state == OES_CACHE) {
			ext-&amp;gt;oe_urgent = 1;
			osc_extent_hold(ext);
			rc = 1;
		}
	}
	osc_object_unlock(obj);
	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc == 1)
		osc_extent_release(env, ext);

	&lt;span class=&quot;code-comment&quot;&gt;/* wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; the extent until its state becomes @state */&lt;/span&gt;
	rc = l_wait_event(ext-&amp;gt;oe_waitq, extent_wait_cb(ext, state), &amp;amp;lwi);
	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc == -ETIMEDOUT) {
		OSC_EXTENT_DUMP(D_ERROR, ext,
			&lt;span class=&quot;code-quote&quot;&gt;&quot;%s: wait ext to %u timedout, recovery in progress?\n&quot;&lt;/span&gt;,
			cli_name(osc_cli(obj)), state);

		lwi = LWI_INTR(NULL, NULL);
		rc = l_wait_event(ext-&amp;gt;oe_waitq, extent_wait_cb(ext, state),   &amp;lt;===== It waits here forever.
				  &amp;amp;lwi);
	}
	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (rc == 0 &amp;amp;&amp;amp; ext-&amp;gt;oe_rc &amp;lt; 0)
		rc = ext-&amp;gt;oe_rc;
	RETURN(rc);
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
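&lt;p&gt;(Note on the listing above: the first l_wait_event() is bounded by LWI_TIMEOUT_INTR at cfs_time_seconds(600), so the timeout message appears after 600 seconds; the second l_wait_event() uses LWI_INTR, which has no timeout, so unless the extent reaches the expected state or a signal arrives, the task blocks there indefinitely, matching the hung-task traces above.)&lt;/p&gt;</comment>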
                            <comment id="228941" author="cliffw" created="Thu, 31 May 2018 19:02:00 +0000"  >&lt;p&gt;i have started soak with only normal striping, no PFL or DOM. Will let that run for awhile, then add PFL back in.&lt;/p&gt;</comment>
                            <comment id="228951" author="cliffw" created="Fri, 1 Jun 2018 00:10:59 +0000"  >&lt;p&gt;Ran through 9 MDT restarts/failovers ( + 6 hours) with no errors.&lt;br/&gt;
I am enabling DOM striping and restarting&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lfs setstripe -E 1M -L mdt -E EOF -c -1 $dirname
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="229038" author="cliffw" created="Mon, 4 Jun 2018 14:55:29 +0000"  >&lt;p&gt;Test with only DOM, got the failure. Tested with only PFL, got the failure. Restarting now and will get some log dumps.&lt;/p&gt;</comment>
                            <comment id="229086" author="cliffw" created="Tue, 5 Jun 2018 14:07:55 +0000"  >&lt;p&gt;Replicated error, but wasn&apos;t able to catch the log after the immediate failure, i fell asleep. However, crash-dumped the MDS with full debug. crash dumped the node with full debug. &lt;/p&gt;</comment>
                            <comment id="229087" author="cliffw" created="Tue, 5 Jun 2018 14:09:05 +0000"  >&lt;p&gt;Lustre logs attached&lt;/p&gt;</comment>
                            <comment id="229127" author="dmiter" created="Tue, 5 Jun 2018 17:46:28 +0000"  >&lt;p&gt;Where MDS logs and crash are placed? Can I see them?&lt;/p&gt;</comment>
                            <comment id="229130" author="cliffw" created="Tue, 5 Jun 2018 17:49:10 +0000"  >&lt;p&gt;The crash dump and all logs are on the spirit cluster, do you have a login? /scratch/logs and /scratch/dumps. &lt;/p&gt;</comment>
                            <comment id="229132" author="cliffw" created="Tue, 5 Jun 2018 17:52:37 +0000"  >&lt;p&gt;I have attached the console and system logs from the MDS (soak-8) to this bug. &lt;/p&gt;</comment>
                            <comment id="229147" author="dmiter" created="Tue, 5 Jun 2018 18:31:21 +0000"  >&lt;p&gt;Sorry, I don&apos;t have login on spirit. Could you copy it to somewhere on onyx?&lt;/p&gt;</comment>
                            <comment id="229148" author="cliffw" created="Tue, 5 Jun 2018 18:37:19 +0000"  >&lt;p&gt;You should be able to get a Spirit account quickly, file a DCO ticket and label it account-mgmnt. Usually happens in minutes - we&apos;ve advised DCO, and they are ready to do it now. All they need is your public ssh key&lt;/p&gt;</comment>
                            <comment id="229150" author="cliffw" created="Tue, 5 Jun 2018 18:47:33 +0000"  >&lt;p&gt;In the meantime, the crash dump is on onyx - /home/cliffwhi/lu-10961 - if you can&apos;t reach it, point me to a better directory and I&apos;ll put it there. &lt;/p&gt;</comment>
                            <comment id="229153" author="dmiter" created="Tue, 5 Jun 2018 18:58:27 +0000"  >&lt;p&gt;Thanks. I submit the DCO ticket and now I&apos;m coping core from onyx...&lt;/p&gt;</comment>
                            <comment id="229154" author="cliffw" created="Tue, 5 Jun 2018 19:05:42 +0000"  >&lt;p&gt;Thanks, I forced the core dump several hours after the fault, it&apos;s difficult to catch as the fault normally occurs in the middle of my night &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="229207" author="tappro" created="Wed, 6 Jun 2018 13:26:44 +0000"  >&lt;p&gt;Cliff, what type of load is used in this testing? Is it something particular, e.g. &apos;dd&apos; or &apos;tar&apos; or mix? I am thinking about possibility to reproduce that with simpler test.&lt;/p&gt;</comment>
                            <comment id="229213" author="cliffw" created="Wed, 6 Jun 2018 14:09:37 +0000"  >&lt;p&gt;I though we had explained soak to you. The test running at the time of failure were:&lt;br/&gt;
blogbench, iorssf, iorfpp, kcompile, mdtestfpp, mdtestssf, simul,&lt;br/&gt;
    fio( random, sequential, SAS simulation)  &lt;br/&gt;
The random mix is distributed across the clients, with the intent of seriously loading each client. It is difficult to tell exactly what is running on a specific node at the time of failure, but generally a node will have 2-3 different jobs running at any given time. &lt;/p&gt;</comment>
                            <comment id="229280" author="tappro" created="Thu, 7 Jun 2018 09:41:55 +0000"  >&lt;p&gt;yes, my question was about just particular time when error happened, so it is not possible to select any specific load causing that. In that case only logs at the moment of failure could make that clear. For this we can try to inject code which will cause error immediately instead of timeout.&lt;/p&gt;

&lt;p&gt;Dmitry showed already place where timeout happens, probably here we can output more debug info and return error immediately instead if waiting, so lustre logs will contain useful data.&lt;/p&gt;</comment>
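&lt;p&gt;(A minimal sketch of such an injection, written against the osc_extent_wait() listing above and not a landed patch: instead of entering the second, unbounded l_wait_event(), dump the extent and fail fast with the timeout error so the debug buffer still covers the moment of failure.)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;	if (rc == -ETIMEDOUT) {
		OSC_EXTENT_DUMP(D_ERROR, ext,
			&quot;%s: wait ext to %u timedout, recovery in progress?\n&quot;,
			cli_name(osc_cli(obj)), state);
		/* debug injection: return the timeout error immediately
		 * rather than waiting forever in the LWI_INTR wait */
		RETURN(rc);
	}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>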
                            <comment id="229463" author="jamesanunez" created="Tue, 12 Jun 2018 17:45:48 +0000"  >&lt;p&gt;Mike - Would you please upload a patch with any necessary debug information that will cause an error so we can get fail instead of waiting for a timeout?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="229498" author="gerrit" created="Wed, 13 Jun 2018 15:38:12 +0000"  >&lt;p&gt;Mike Pershin (mike.pershin@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/32710&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32710&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10961&quot; title=&quot;Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10961&quot;&gt;&lt;del&gt;LU-10961&lt;/del&gt;&lt;/a&gt; osc: add debug code&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 523ac2ce306d182b2dc5db7c9a0f401b39124963&lt;/p&gt;</comment>
                            <comment id="229527" author="cliffw" created="Thu, 14 Jun 2018 00:09:23 +0000"  >&lt;p&gt;Hit the bug right away, lustre logs for two clients attached. Also dumped stacks and crash dumped those nodes, bits are on Spirit. Restarting with full debug. &lt;/p&gt;</comment>
                            <comment id="229528" author="cliffw" created="Thu, 14 Jun 2018 00:26:24 +0000"  >&lt;p&gt;In the same incident, soak-18/19 were hung, but not showing the _fini error. Lustre logs attached. Also lustre log from MDS. &lt;/p&gt;</comment>
                            <comment id="229529" author="cliffw" created="Thu, 14 Jun 2018 00:45:48 +0000"  >&lt;p&gt;Console logs from soak-17,42 and MDS (soak-8) attached. &lt;/p&gt;</comment>
                            <comment id="229623" author="pjones" created="Tue, 19 Jun 2018 17:58:10 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Has the information provided any insight?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="229880" author="tappro" created="Tue, 3 Jul 2018 13:16:05 +0000"  >&lt;p&gt;Peter,&lt;/p&gt;

&lt;p&gt;Yes, there are clues; I am working on a reproducer and a patch.&lt;/p&gt;</comment>
                            <comment id="229952" author="tappro" created="Thu, 5 Jul 2018 11:23:45 +0000"  >&lt;p&gt;I was able to reproduce lock replay issue with DoM files, patch will be ready soon. The problem is the lock cancellation before lock replay, MDC cancels locks differently than OSC and that causes DoM locks to be handled improperly.&lt;/p&gt;

&lt;p&gt;Meanwhile that doesn&apos;t explain failover failure with PFL-only file because there is no MDC involved, so there can be another reasons for that.&#160;&lt;/p&gt;</comment>
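&lt;p&gt;(To illustrate the direction of the fix, a minimal hypothetical sketch, not the patch under review: a cancel-weight style check that refuses to cancel IBITS locks carrying the DoM bit before replay, so they survive failover. The function name is invented; the struct fields and MDS_INODELOCK_DOM are standard Lustre definitions.)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;/* hypothetical helper: decide whether a lock may be cancelled before
 * replay; DoM bit locks must be kept so that they can be replayed */
static int mdc_lock_cancel_weight(struct ldlm_lock *lock)
{
	if (lock-&amp;gt;l_resource-&amp;gt;lr_type == LDLM_IBITS &amp;amp;&amp;amp;
	    lock-&amp;gt;l_policy_data.l_inodebits.bits &amp;amp; MDS_INODELOCK_DOM)
		return 0;	/* do not cancel: needed for replay */
	return 1;		/* safe to cancel early */
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>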
                            <comment id="229970" author="gerrit" created="Thu, 5 Jul 2018 19:05:48 +0000"  >&lt;p&gt;Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/32791&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32791&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10961&quot; title=&quot;Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10961&quot;&gt;&lt;del&gt;LU-10961&lt;/del&gt;&lt;/a&gt; ldlm: don&apos;t cancel DoM locks before replay&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 5c5ef9be5904a7f16aa41a341c561b0c9c6beb42&lt;/p&gt;</comment>
                            <comment id="230452" author="tappro" created="Wed, 18 Jul 2018 11:26:04 +0000"  >&lt;p&gt;There is another problem with replay - if DOM write cause new component instantiation  then replay of both layout change and write RPC failed. I am investigating that.&lt;/p&gt;</comment>
                            <comment id="230512" author="tappro" created="Wed, 18 Jul 2018 23:32:58 +0000"  >&lt;p&gt;It seems that problem is not in DOM but PFL. When  a new component is instantiated during failover then that is not replayed properly. It is worth to create new ticket for that.&lt;br/&gt;
That explains also why Cliff reported problems with PFL-only setup as well.&lt;/p&gt;</comment>
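&lt;p&gt;(For context, an illustrative sketch with hypothetical paths: PFL components are instantiated lazily, on the first write into their extent, so a write that crosses a component boundary during failover is exactly the case that has to be replayed.)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# two-component PFL file: first 1M on one stripe, remainder on 4 OSTs
lfs setstripe -E 1M -c 1 -E EOF -c 4 /mnt/soaked/pfl_file
# writing past 1M instantiates the second component on first use
dd if=/dev/zero of=/mnt/soaked/pfl_file bs=1M count=2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>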
                            <comment id="230967" author="gerrit" created="Thu, 26 Jul 2018 20:12:36 +0000"  >&lt;p&gt;Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/32889&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32889&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10961&quot; title=&quot;Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10961&quot;&gt;&lt;del&gt;LU-10961&lt;/del&gt;&lt;/a&gt; ldlm: extended testing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 43febdf398958c18590fc6a9d306a615b907305b&lt;/p&gt;</comment>
                            <comment id="233263" author="gerrit" created="Mon, 10 Sep 2018 16:53:42 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/32791/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32791/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10961&quot; title=&quot;Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10961&quot;&gt;&lt;del&gt;LU-10961&lt;/del&gt;&lt;/a&gt; ldlm: don&apos;t cancel DoM locks before replay&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: b44b1ff8c7fc50d5b0438926c89c69eb85817168&lt;/p&gt;</comment>
                            <comment id="233282" author="pjones" created="Mon, 10 Sep 2018 17:43:38 +0000"  >&lt;p&gt;Landed for 2.12&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10120">
                    <name>Blocker</name>
                                                                <inwardlinks description="is blocked by">
                                        <issuelink>
            <issuekey id="52752">LU-11158</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="30338" name="mds.lustre.log.txt.gz" size="18717370" author="cliffw" created="Thu, 14 Jun 2018 00:26:50 +0000"/>
                            <attachment id="30096" name="s-17.client.hang.txt.gz" size="8174328" author="cliffw" created="Fri, 27 Apr 2018 17:53:59 +0000"/>
                            <attachment id="30342" name="soak-17.log.gz" size="287907" author="cliffw" created="Thu, 14 Jun 2018 00:45:24 +0000"/>
                            <attachment id="30336" name="soak-17.lustre.log.txt.gz" size="892" author="cliffw" created="Thu, 14 Jun 2018 00:09:45 +0000"/>
                            <attachment id="30095" name="soak-17.stacktrace.txt" size="566069" author="cliffw" created="Fri, 27 Apr 2018 17:53:55 +0000"/>
                            <attachment id="30339" name="soak-18.lustre.log.txt.gz" size="1757857" author="cliffw" created="Thu, 14 Jun 2018 00:26:41 +0000"/>
                            <attachment id="30340" name="soak-19.lustre.log.txt.gz" size="1596713" author="cliffw" created="Thu, 14 Jun 2018 00:26:41 +0000"/>
                            <attachment id="30298" name="soak-21.06-05-2018.gz" size="18250014" author="cliffw" created="Tue, 5 Jun 2018 14:08:26 +0000"/>
                            <attachment id="30094" name="soak-23.client.hang.txt.gz" size="8305916" author="cliffw" created="Fri, 27 Apr 2018 17:53:59 +0000"/>
                            <attachment id="30093" name="soak-23.stacks.txt" size="587982" author="cliffw" created="Fri, 27 Apr 2018 17:53:55 +0000"/>
                            <attachment id="30105" name="soak-24.0430.txt.gz" size="20265590" author="cliffw" created="Mon, 30 Apr 2018 15:35:11 +0000"/>
                            <attachment id="30106" name="soak-24.stack.txt" size="581070" author="cliffw" created="Mon, 30 Apr 2018 15:34:58 +0000"/>
                            <attachment id="30343" name="soak-42.log.gz" size="363791" author="cliffw" created="Thu, 14 Jun 2018 00:45:24 +0000"/>
                            <attachment id="30337" name="soak-42.lustre.log.txt.gz" size="1193419" author="cliffw" created="Thu, 14 Jun 2018 00:09:45 +0000"/>
                            <attachment id="30276" name="soak-44.fini.txt" size="142606336" author="cliffw" created="Tue, 29 May 2018 20:38:10 +0000"/>
                            <attachment id="30301" name="soak-8.console.log.gz" size="2544880" author="cliffw" created="Tue, 5 Jun 2018 17:52:17 +0000"/>
                            <attachment id="30341" name="soak-8.log.gz" size="155251" author="cliffw" created="Thu, 14 Jun 2018 00:45:24 +0000"/>
                            <attachment id="30297" name="soak-8.lustre.log.2018-06-05.gz" size="28060669" author="cliffw" created="Tue, 5 Jun 2018 14:08:32 +0000"/>
                            <attachment id="30302" name="soak-8.syslog.log.gz" size="3586429" author="cliffw" created="Tue, 5 Jun 2018 17:52:17 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzwh3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>