<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:46:42 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
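As an illustration (assuming JIRA's standard XML issue-view URL layout, which may differ on this server), a field-restricted request for this issue might look like:
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-11762/LU-11762.xml?field=key&field=summary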
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11762] replay-single test 0d fails with  &apos;post-failover df failed&apos;</title>
                <link>https://jira.whamcloud.com/browse/LU-11762</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;replay-single test_0d fails with &apos;post-failover df failed&apos; due to all clients being evicted and not recovering. Looking at the logs from a recent failure, &lt;a href=&quot;https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc&lt;/a&gt;, in the client test_log, we see there is a problem mounting the file system on the second client (vm4)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Started lustre-MDT0000
Starting client: trevis-26vm3:  -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
CMD: trevis-26vm3 mkdir -p /mnt/lustre
CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
trevis-26vm4: error: invalid path &apos;/mnt/lustre&apos;: Input/output error
 replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Looking at the dmesg log from client 2 (vm4), we see the following errors&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
[44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
[44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
[44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
[44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
[44429.547533] Lustre: Skipped 1 previous similar message
[44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the dmesg log for the MDS (vm6), we see&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
[44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
[44196.728849] Lustre: Skipped 21 previous similar messages
[44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[44311.674391] Lustre: Skipped 1 previous similar message
[44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
[44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36-&amp;gt;94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
[44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
[44311.931571] Lustre: Skipped 46 previous similar messages
[44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We see replay-single test 0c also fail with similar messages in the logs: &lt;a href=&quot;https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;More logs for these failures are at&lt;br/&gt;
 &lt;a href=&quot;https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc&lt;/a&gt;&lt;br/&gt;
 &lt;a href=&quot;https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc&lt;/a&gt;&lt;/p&gt;</description>
                <environment></environment>
        <key id="54262">LU-11762</key>
            <summary>replay-single test 0d fails with  &apos;post-failover df failed&apos;</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="simmonsja">James A Simmons</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Tue, 11 Dec 2018 23:20:28 +0000</created>
                <updated>Sat, 17 Oct 2020 13:41:24 +0000</updated>
                            <resolved>Sat, 17 Oct 2020 13:41:24 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                    <version>Lustre 2.12.1</version>
                    <version>Lustre 2.12.2</version>
                                    <fixVersion>Lustre 2.14.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="238462" author="pjones" created="Wed, 12 Dec 2018 18:37:31 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Can you please assess this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="238571" author="yujian" created="Fri, 14 Dec 2018 00:07:08 +0000"  >&lt;p&gt;replay-single test 62 hit the similar issue on master branch:&lt;br/&gt;
 &lt;a href=&quot;https://testing.whamcloud.com/test_sets/81a8b6dc-fdf0-11e8-b837-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/81a8b6dc-fdf0-11e8-b837-52540065bddc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note this appears to be &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12304&quot; title=&quot;replay-single test_62: &amp;#39;unlinkmany /mnt/lustre/d62.replay-single/f62.replay-single failed&amp;#39; &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12304&quot;&gt;LU-12304&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="239012" author="gerrit" created="Fri, 21 Dec 2018 11:02:45 +0000"  >&lt;p&gt;Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/33907&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33907&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; target: don&apos;t exceed hard timeout&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c87c2948c5ef142375338b396c4e1d7eb70b5dea&lt;/p&gt;</comment>
                            <comment id="243776" author="gerrit" created="Tue, 12 Mar 2019 22:02:25 +0000"  >&lt;p&gt;James Simmons (uja.ornl@yahoo.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/34408&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34408&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: don&apos;t exceed hard timeout&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 888c8f18c25924ab24d893cf4a1811718679847d&lt;/p&gt;</comment>
                            <comment id="245482" author="sarah" created="Tue, 9 Apr 2019 17:34:48 +0000"  >&lt;p&gt;SOAK hit similar error when running on b2_12-next #6 which cause clients were evicted.&lt;/p&gt;

&lt;p&gt;application error 259372-mdtestfpp.out&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;04/07/2019 10:44:51: Process 1(soak-29.spirit.whamcloud.com): FAILED in create_remove_directory_tree, Unable to create directory: Input/output error
[soak-28][[62764,1],0][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[soak-43][[62764,1],6][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmstepd: error: *** STEP 259372.0 ON soak-28 CANCELLED AT 2019-04-07T10:45:00 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-29: task 1: Exited with exit code 1
srun: Terminating job step 259372.0
srun: error: soak-28: task 0: Killed
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: soak-30: task 2: Killed
srun: error: soak-31: task 3: Killed
srun: error: soak-42: task 5: Killed
srun: error: soak-41: task 4: Killed
srun: error: soak-43: task 6: Killed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;client console (soak-28)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr  7 10:43:47 soak-28 systemd-logind: Removed session 5255.
Apr  7 10:43:47 soak-28 systemd: Removed slice User Slice of root.
Apr  7 10:44:08 soak-28 kernel: Lustre: soaked-MDT0003-mdc-ffffa001df2d3800: Connection to soaked-MDT0003 (at 192.168.1.110@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Apr  7 10:44:08 soak-28 kernel: LustreError: 167-0: soaked-MDT0003-mdc-ffffa001df2d3800: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
Apr  7 10:44:08 soak-28 kernel: Lustre: soaked-MDT0003-mdc-ffffa001df2d3800: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
Apr  7 10:44:48 soak-28 kernel: LustreError: 98728:0:(llite_lib.c:2412:ll_prep_inode()) new_inode -fatal: rc -2
Apr  7 10:44:49 soak-28 kernel: LustreError: 98764:0:(llite_lib.c:2412:ll_prep_inode()) new_inode -fatal: rc -2
Apr  7 10:44:49 soak-28 kernel: LustreError: 98764:0:(llite_lib.c:2412:ll_prep_inode()) Skipped 13 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;MDS console (soak-11)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr  7 10:43:59 soak-11 systemd: Removed slice User Slice of root.
Apr  7 10:43:59 soak-11 kernel: LustreError: 13326:0:(lod_dev.c:434:lod_sub_recovery_thread()) soaked-MDT0003-osd get update log failed: rc = -108
Apr  7 10:43:59 soak-11 kernel: LustreError: 13326:0:(lod_dev.c:434:lod_sub_recovery_thread()) Skipped 2 previous similar messages
Apr  7 10:44:01 soak-11 multipathd: 360080e50001fedb80000015952012962: sdp - rdac checker reports path is up
Apr  7 10:44:01 soak-11 multipathd: 8:240: reinstated
Apr  7 10:44:01 soak-11 multipathd: 360080e50001fedb80000015952012962: load table [0 31247843328 multipath 2 queue_if_no_path retain_attached_hw_handler 1 rdac 2 1 round-robin 0 1 1 8:240 1 round-robin 0 1 1 8:128 1]
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to 18fc806b-f2ae-c23f-8207-c2b41766066a (at 192.168.1.129@o2ib)
Apr  7 10:44:05 soak-11 kernel: Lustre: 13330:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: disconnecting 23 stale clients
Apr  7 10:44:05 soak-11 kernel: LustreError: 13330:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 8388608
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Denying connection for new client 0a90ab6f-d87f-2ae2-08eb-fb39e165047b(at 192.168.1.136@o2ib), waiting for 24 known clients (1 recovered, 0 in progress, and 23 evicted) already passed deadline 9:50
Apr  7 10:44:05 soak-11 kernel: Lustre: Skipped 3 previous similar messages
Apr  7 10:44:05 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to b662025f-1a48-f603-425e-9a1b0fe94ac8 (at 192.168.1.125@o2ib)
Apr  7 10:44:05 soak-11 kernel: Lustre: Skipped 1 previous similar message
Apr  7 10:44:07 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to f2e27319-0d77-c9dc-e387-bbb2fb32fd1e (at 192.168.1.130@o2ib)
Apr  7 10:44:07 soak-11 kernel: Lustre: Skipped 5 previous similar messages
Apr  7 10:44:09 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to ea0c92cc-ca60-a13b-ca50-b173ec24752c (at 192.168.1.116@o2ib)
Apr  7 10:44:09 soak-11 kernel: Lustre: Skipped 10 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="245490" author="simmonsja" created="Tue, 9 Apr 2019 19:48:02 +0000"  >&lt;p&gt;We have a potential fix for this as well as another fix that is needed for recovery. Both will need to be landed.&lt;/p&gt;</comment>
                            <comment id="247098" author="jamesanunez" created="Tue, 14 May 2019 01:08:57 +0000"  >&lt;p&gt;We&apos;re seeing something similar with replay-single test 62 for ldiskfs/DNE for 2.12.2 RC1 at &lt;a href=&quot;https://testing.whamcloud.com/test_sets/78994818-753c-11e9-a6f9-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/78994818-753c-11e9-a6f9-52540065bddc&lt;/a&gt; . We see the following in the client 2 dmesg&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[64633.042303] Lustre: DEBUG MARKER: == replay-single test 0d: expired recovery with no clients =========================================== 09:24:46 (1557653086)
[64633.892025] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[64634.219536] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[64647.402309] LustreError: 166-1: MGC10.2.4.96@tcp: Connection to MGS (at 10.2.4.96@tcp) was lost; in progress operations using this service will fail
[64655.316668] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-30vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64655.543605] Lustre: DEBUG MARKER: onyx-30vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64657.419063] Lustre: Evicted from MGS (at 10.2.4.96@tcp) after server handle changed from 0x4e283717d3799726 to 0x4e283717d3799d54
[64657.421383] LustreError: 17190:0:(import.c:1267:ptlrpc_connect_interpret()) lustre-MDT0000_UUID went back in time (transno 4295093706 was previously committed, server now claims 4295093699)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
[64837.930298] LustreError: 11-0: lustre-MDT0000-mdc-ffff91839c315000: operation mds_reint to node 10.2.4.96@tcp failed: rc = -107
[64842.704776] LustreError: 167-0: lustre-MDT0000-mdc-ffff91839c315000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[64843.836846] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2&amp;gt;/dev/null
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m not sure if this error has the same cause as the others in this ticket; the &quot;went back in time&quot; error message has not been mentioned in this ticket before.&lt;/p&gt;

&lt;p&gt;And why in the world are we referring to bugzilla.lustre.org?&lt;/p&gt;</comment>
                            <comment id="249573" author="simmonsja" created="Thu, 20 Jun 2019 14:27:11 +0000"  >&lt;p&gt;So I just talked to James Numez and he has reported this bug has not been seen in the last 4 weeks on master. I&apos;m thinking its possible patch&#160;&lt;a href=&quot;https://review.whamcloud.com/34710&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34710&#160;&lt;/a&gt;resolved this. I do like to do a cleanup patch since that code is not easy to understand.&lt;/p&gt;</comment>
                            <comment id="251187" author="gerrit" created="Fri, 12 Jul 2019 05:20:40 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/34408/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34408/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: don&apos;t exceed hard timeout&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 8bfe8939d810f5ac16484d3d4b81f829c7d7d0d7&lt;/p&gt;</comment>
                            <comment id="251234" author="simmonsja" created="Fri, 12 Jul 2019 13:23:03 +0000"  >&lt;p&gt;The patch that landed was a cleanup patch but Oleg does see it in his test harness. So I need to work with him to reproduce this problem.&lt;/p&gt;</comment>
                            <comment id="252070" author="gerrit" created="Fri, 26 Jul 2019 12:21:40 +0000"  >&lt;p&gt;Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/35627&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35627&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: ensure the recovery timer is armed&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2ae2c2cc41f3c82393e8a19eb4da1c3644846e93&lt;/p&gt;</comment>
                            <comment id="255267" author="simmonsja" created="Mon, 23 Sep 2019 16:25:54 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12769&quot; title=&quot;replay-dual test 0b hangs in client mount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12769&quot;&gt;&lt;del&gt;LU-12769&lt;/del&gt;&lt;/a&gt; has a real fix.&lt;/p&gt;</comment>
                            <comment id="259236" author="gerrit" created="Thu, 5 Dec 2019 19:06:56 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/36936&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36936&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: don&apos;t exceed hard timeout&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 373b1cb9232caa4457d63dd04bf6d16af83f493f&lt;/p&gt;</comment>
                            <comment id="259289" author="gerrit" created="Fri, 6 Dec 2019 01:06:48 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/35627/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35627/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: ensure the recovery timer is armed&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: fe5c801657f9ddb5e148bb6076e476df6ba31bba&lt;/p&gt;</comment>
                            <comment id="260580" author="gerrit" created="Fri, 3 Jan 2020 23:41:36 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/36936/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36936/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: don&apos;t exceed hard timeout&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: bc27e0f6efbdbd256c6459d15391754ce1b36d32&lt;/p&gt;</comment>
                            <comment id="260603" author="gerrit" created="Sat, 4 Jan 2020 04:59:49 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/37141&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/37141&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11762&quot; title=&quot;replay-single test 0d fails with  &amp;#39;post-failover df failed&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11762&quot;&gt;&lt;del&gt;LU-11762&lt;/del&gt;&lt;/a&gt; ldlm: ensure the recovery timer is armed&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8e9dae2b1aa53b3be114922c825742271be08a0b&lt;/p&gt;</comment>
                            <comment id="277795" author="tappro" created="Thu, 20 Aug 2020 12:27:51 +0000"  >&lt;p&gt;there are several reports pointing to this ticket as reason of failures, check linked tickets:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13614&quot; title=&quot;replay-single test_117: LBUG: ASSERTION( atomic_read(&amp;amp;obd-&amp;gt;obd_req_replay_clients) == 0 ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13614&quot;&gt;&lt;del&gt;LU-13614&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13339&quot; title=&quot;patch for LU-11762 causes an assertion in replay-dual&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13339&quot;&gt;&lt;del&gt;LU-13339&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="277814" author="simmonsja" created="Thu, 20 Aug 2020 16:52:20 +0000"  >&lt;p&gt;Hongchao Zhang, if we revert this patch do you see replay-single 0d start to fail again?&#160;&lt;/p&gt;</comment>
                            <comment id="282506" author="simmonsja" created="Sat, 17 Oct 2020 13:41:05 +0000"  >&lt;p&gt;&#160;A new patch landed to fix this problem -&lt;a href=&quot;https://review.whamcloud.com/#/c/39532/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/39532/&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="51976">LU-10950</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="55753">LU-12340</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="56919">LU-12769</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="59396">LU-13614</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="54282">LU-11771</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="58298">LU-13339</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="42994">LU-9019</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i007tr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>