<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:12:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1036] 1.8.7&lt;-&gt;2.1.54 Test failure on test suite replay-single (52)</title>
                <link>https://jira.whamcloud.com/browse/LU-1036</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for sarah &amp;lt;sarah@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;client 1.8.7 &amp;lt;--&amp;gt; server 2.1.54&lt;/p&gt;

</description>
                <environment></environment>
        <key id="12991">LU-1036</key>
            <summary>1.8.7&lt;-&gt;2.1.54 Test failure on test suite replay-single (52)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                            <label>yuc2</label>
                    </labels>
                <created>Wed, 25 Jan 2012 13:59:18 +0000</created>
                <updated>Mon, 29 May 2017 03:02:13 +0000</updated>
                            <resolved>Mon, 29 May 2017 03:02:13 +0000</resolved>
                                    <version>Lustre 2.2.0</version>
                    <version>Lustre 2.3.0</version>
                    <version>Lustre 2.4.0</version>
                    <version>Lustre 1.8.9</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="28147" author="pjones" created="Wed, 8 Feb 2012 09:21:36 +0000"  >&lt;p&gt;Bobi&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="28215" author="bobijam" created="Wed, 8 Feb 2012 20:19:33 +0000"  >&lt;p&gt;replay-single test_52() tests MDS failover; there is no MDS log in Maloo, and from the client&apos;s log:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;== test 52: time out lock replay (3764) == 03:16:32
Filesystem           1K-blocks      Used Available Use% Mounted on
fat-amd-3-ib@o2ib:/lustre
                      29528544   1349276  26679012   5% /mnt/lustre
fail_loc=0x8000030c
Failing mds on node fat-amd-3-ib
Stopping /mnt/mds (opts:)
affected facets: mds
df pid is 26116
Failover mds to fat-amd-3-ib
03:16:51 (1327490211) waiting for fat-amd-3-ib network 900 secs ...
03:16:51 (1327490211) network interface is UP
Starting mds: -o user_xattr,acl  /dev/sdc1 /mnt/mds
fat-amd-3-ib: debug=0xb3f0405
fat-amd-3-ib: subsystem_debug=0xffb7efff
fat-amd-3-ib: debug_mb=48
Started lustre-MDT0000
client-15-ib: stat: cannot read file system information for `/mnt/lustre&apos;: Interrupted system call
client-15-ib: stat: cannot read file system information for `/mnt/lustre&apos;: Interrupted system call
 replay-single test_52: @@@@@@ FAIL: post-failover df: 1 
Dumping lctl log to /tmp/test_logs/2012-01-24/195852/replay-single.test_52.*.1327490314.log
client-15-ib: Host key verification failed.
client-15-ib: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
client-15-ib: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;we can see that the MDS successfully recovered and then was lost; it looks like the MDS node was disconnected from the network for unknown reasons (ssh cannot verify the host key). IMO, it&apos;s not Lustre&apos;s issue.&lt;/p&gt;</comment>
                            <comment id="28300" author="sarah" created="Thu, 9 Feb 2012 16:34:11 +0000"  >&lt;p&gt;I will try to reproduce it and gather more logs for investigation.&lt;/p&gt;</comment>
                            <comment id="28457" author="sarah" created="Sun, 12 Feb 2012 20:07:01 +0000"  >&lt;p&gt;Hit the same issue again with a 1.8.7 RHEL6 client and a 2.1.55 RHEL6 server. &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/1d1c9c72-55de-11e1-9aa8-5254004bbbd3&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/1d1c9c72-55de-11e1-9aa8-5254004bbbd3&lt;/a&gt;. Please find the server log in the attachment.&lt;/p&gt;</comment>
                            <comment id="28473" author="sarah" created="Sun, 12 Feb 2012 22:59:04 +0000"  >&lt;p&gt;server logs&lt;/p&gt;</comment>
                            <comment id="28491" author="bobijam" created="Mon, 13 Feb 2012 08:20:52 +0000"  >&lt;p&gt;at 1329094435.2981     MDS recovery starts, setting recovery timeout in 60 seconds   (to xxxxx4495.xxx)&lt;br/&gt;
at 1329094450.2345     client recovery thread sent the replay req&lt;br/&gt;
at 1329094450.2352     MDS got the client&apos;s reconnection, and reset recovery timeout in 45 seconds (unchanged to xxxx4495.xxx)&lt;br/&gt;
at 1329094450.2361     MDS received the req and threw the reply as the test designed it&lt;br/&gt;
at 1329094495.2347     MDS recovery timed out, evicting the client&lt;br/&gt;
at 1329094525.2345     client timed out the replay req (75 seconds after the req was sent)&lt;/p&gt;</comment>
                            <comment id="28593" author="bobijam" created="Mon, 13 Feb 2012 20:43:37 +0000"  >&lt;p&gt;The &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-889&quot; title=&quot;rework extent_recovery_timer()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-889&quot;&gt;&lt;del&gt;LU-889&lt;/del&gt;&lt;/a&gt; patch will likely fix this issue.&lt;/p&gt;</comment>
                            <comment id="28872" author="pjones" created="Thu, 16 Feb 2012 08:35:01 +0000"  >&lt;p&gt;Believed to be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-889&quot; title=&quot;rework extent_recovery_timer()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-889&quot;&gt;&lt;del&gt;LU-889&lt;/del&gt;&lt;/a&gt; which has now landed for 2.2. Please reopen if this issue reoccurs with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-889&quot; title=&quot;rework extent_recovery_timer()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-889&quot;&gt;&lt;del&gt;LU-889&lt;/del&gt;&lt;/a&gt; patch in place.&lt;/p&gt;</comment>
                            <comment id="46562" author="yujian" created="Mon, 15 Oct 2012 07:13:07 +0000"  >&lt;p&gt;Lustre Client Build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b1_8/198&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b1_8/198&lt;/a&gt;&lt;br/&gt;
Lustre Server Build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b2_3/36&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b2_3/36&lt;/a&gt;&lt;br/&gt;
Distro/Arch: RHEL6.3/x86_64&lt;/p&gt;

&lt;p&gt;replay-single test_52 failed with the same issue:&lt;br/&gt;
&lt;a href=&quot;https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c&lt;/a&gt;&lt;/p&gt;

</comment>
                            <comment id="46627" author="bobijam" created="Tue, 16 Oct 2012 13:31:01 +0000"  >
&lt;p&gt;Here is how I interpret the debug logs between the evicted client and the MDS for the latest Maloo report (&lt;a href=&quot;https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;at 1350224248, fail_loc was set on the MDS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000001:02000400:0.0:1350224248.707485:0:22372:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.778103:0:22396:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.852376:0:22419:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param fail_loc=0x8000030c
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;at 1350224255, the MDS was stopped:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:02000400:0.0:1350224255.305890:0:22537:0:(obd_mount.c:1800:server_put_super()) server umount lustre-MDT0000 complete
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;at 1350224265, the MDS device was started:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000020:01000004:0.0:1350224265.767002:0:22687:0:(obd_mount.c:2554:lustre_fill_super()) Mounting server from /dev/mapper/lvm--MDS-P1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;at 1350224266, the client started to stat the Lustre filesystem:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:0:1350224266.514726:0:32437:0:(client.c:2095:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc stat:d72a06d1-9e9c-4600-4007-cde887950e68:32437:x1415810084000511:10.10.4.160@tcp:41
00000100:00080000:0:1350224266.514729:0:32437:0:(client.c:2112:ptlrpc_queue_wait()) @@@ &quot;stat&quot; waiting for recovery: (FULL != DISCONN)  req@ffff88007741e000 x1415810084000511/t0 o41-&amp;gt;lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 192/528 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;at 1350224337, the client reconnected to the server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:0:1350224337.260692:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000631:10.10.4.160@tcp:38
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The server handled it:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:0.0:1350224337.260952:0:22719:0:(service.c:1788:ptlrpc_server_handle_req_in()) got req x1415810084000631
00000100:00100000:0.0:1350224337.260963:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:0+-99:26771:x1415810084000631:12345-10.10.4.163@tcp:38
00010000:00080000:0.0:1350224337.260982:0:22719:0:(ldlm_lib.c:978:target_handle_connect()) lustre-MDT0000: connection from d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp t223338299393 exp (null) cur 1350224337 last 0
00000020:01000000:0.0:1350224337.261024:0:22719:0:(lprocfs_status.c:2007:lprocfs_exp_setup()) using hash ffff88007a6790c0
00000100:00100000:0.0:1350224337.261055:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+5:26771:x1415810084000631:12345-10.10.4.163@tcp:38 Request procesed in 96us (118us total) trans 0 rc 0/0
00000001:00100000:0.0:1350224337.294699:0:22693:0:(target.c:598:lut_cb_new_client()) lustre-MDT0000: committing for initial connect of d72a06d1-9e9c-4600-4007-cde887950e68
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;while the client thinks the server evicted it:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00080000:0:1350224337.261722:0:26771:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4)  req@ffff88007cd49400 x1415810084000631/t0 o38-&amp;gt;lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 368/584 e 0 to 1 dl 1350224347 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0:1350224337.261726:0:26771:0:(import.c:911:ptlrpc_connect_interpret()) ffff880078705000 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;the reply msg flag is only 4 == MSG_CONNECT_REPLAYABLE; I guess there is a misunderstanding in the connect-flags setting which caused the client to be evicted.&lt;/p&gt;</comment>
                            <comment id="46640" author="green" created="Tue, 16 Oct 2012 19:18:54 +0000"  >&lt;p&gt;Thanks for the timeline, Bobi!&lt;/p&gt;

&lt;p&gt;The reason the client thinks the server evicted it is because that&apos;s what happened.&lt;br/&gt;
The recovery window closed on the server at 1350224331:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:02000400:0.0:1350224331.521211:0:22717:0:(ldlm_lib.c:1773:target_recovery_overseer()) lustre-MDT0000: recovery is timed out, evict stale exports
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In fact, the vm2 client reconnected much earlier and went through recovery as far as it could (to the REPLAY_LOCKS phase) until it hit the wall of the server waiting for the other client.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00080000:0:1350224277.260518:0:26771:0:(import.c:1287:ptlrpc_import_recovery_state_machine()) ffff880078705000 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then it proceeds to replay locks:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:0:1350224277.260584:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000532:10.10.4.160@tcp:101
00000100:00100000:0:1350224277.260621:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
00010000:00010000:0:1350224277.265317:0:26771:0:(ldlm_request.c:1924:replay_lock_interpret()) ### replayed lock: ns: lustre-MDT0000-mdc-ffff88007cd48400 lock: ffff88007b623a00/0xe8bdbcd6a2bd8f2f lrc: 2/0,0 mode: PR/PR res: 8589945617/84 bits 0x3 rrc: 1 type: IBT flags: 0x0 remote: 0x7d703d0db3f157a expref: -99 pid: 32215 timeout: 0
00000100:00100000:0:1350224277.265325:0:26771:0:(client.c:1452:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note how x1415810084000533 completes right away, but x1415810084000532 does not.&lt;/p&gt;

&lt;p&gt;Switching to the server side, we did get the x1415810084000532 request there, and the reply was dropped:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:0.0:1350224277.260799:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+7:26771:x1415810084000532:12345-10.10.4.163@tcp:101
00010000:00080000:0.0:1350224277.260806:0:22719:0:(ldlm_lib.c:2245:target_queue_recovery_request()) @@@ not queueing  req@ffff8800689cf850 x1415810084000532/t0(0) o101-&amp;gt;d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/0 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/ffffffff rc 0/-1
...
00010000:02000000:0.0:1350224277.260846:0:22719:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=30c, val=2147483648***
00010000:00020000:0.0:1350224277.261517:0:22719:0:(ldlm_lib.c:2380:target_send_reply_msg()) @@@ dropping reply  req@ffff8800689cf850 x1415810084000532/t0(0) o101-&amp;gt;d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/384 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/0 rc 0/0
00000100:00100000:0.0:1350224277.264622:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+8:26771:x1415810084000532:12345-10.10.4.163@tcp:101 Request procesed in 3825us (3860us total) trans 0 rc 0/0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So req x1415810084000532 is &quot;lost&quot;; the test is working as expected, timing out lock recovery.&lt;/p&gt;

&lt;p&gt;Now, some 60 seconds later on the client:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000400:0:1350224337.260658:0:26771:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1415810084000532 sent from lustre-MDT0000-mdc-ffff88007cd48400 to NID 10.10.4.160@tcp 60s ago has timed out (60s prior to deadline).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;By which time the server had already timed out the recovery.&lt;/p&gt;

&lt;p&gt;So the problem at hand is that the server did not extend the deadline. Reading through the code, it appears that we only extend the timer on requests placed in the various recovery queues, but to get there the lock replay must be sent with the MSG_REQ_REPLAY_DONE flag set, and only 2.x clients set it.&lt;br/&gt;
1.8 clients don&apos;t set the flag, so the timer is not extended.&lt;/p&gt;

&lt;p&gt;Now, in the 1.8 server case there are no separate lock queues at all, and I suspect the initial timeout is just much higher (I don&apos;t have a log nearby; I will try to look for something in Maloo shortly) and as such allows enough margin for the reconnect to still succeed.&lt;/p&gt;

&lt;p&gt;Hm, actually I just looked into the Maloo report, and it seems the reconnect arrives ~50 seconds after the lost message, instead of 60 seconds in the 2.1 interop case. Sadly we don&apos;t collect any debug logs on success, so I cannot compare the results. I&apos;ll try to repeat the 1.8 test locally and see what&apos;s inside.&lt;/p&gt;

&lt;p&gt;The entire test is somewhat of a fringe case where we have a double failure of sorts: first the server dies, and then one of the lock replies is lost, so in my opinion it does not warrant a blocker. But I am curious to see why the 1.8-to-1.8 case replays sooner than during interop, so I&apos;ll make a final decision after that.&lt;/p&gt;</comment>
                            <comment id="46654" author="yujian" created="Wed, 17 Oct 2012 01:32:16 +0000"  >&lt;p&gt;Lustre Version: 1.8.8-wc1&lt;br/&gt;
Lustre Build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b1_8/198&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b1_8/198&lt;/a&gt;&lt;br/&gt;
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)&lt;br/&gt;
ENABLE_QUOTA=yes&lt;/p&gt;

&lt;p&gt;replay-single test 52 passed with debug logs gathered:&lt;br/&gt;
&lt;a href=&quot;https://maloo.whamcloud.com/test_sets/555a95bc-181b-11e2-a6a7-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/555a95bc-181b-11e2-a6a7-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="46656" author="green" created="Wed, 17 Oct 2012 01:59:28 +0000"  >&lt;p&gt;Hm, only 7 seconds; perhaps it needs a bit more ramp-up from previous tests, or it might just fluctuate wildly. A few more runs, this time with more preceding tests, would be useful to gauge that, I think.&lt;br/&gt;
Thank you!&lt;/p&gt;</comment>
                            <comment id="46665" author="bobijam" created="Wed, 17 Oct 2012 06:53:09 +0000"  >&lt;p&gt;Yujian,&lt;/p&gt;

&lt;p&gt;Is it possible to run only this test with a 1.8 client/2.1 server interop setup, and another time with a 1.8 server/client, and collect -1 logs so as to compare them?&lt;/p&gt;</comment>
                            <comment id="46666" author="yujian" created="Wed, 17 Oct 2012 07:31:03 +0000"  >&lt;blockquote&gt;&lt;p&gt;Is it possible to run only this test with a 1.8 client/2.1 server interop setup, and another time with a 1.8 server/client, and collect -1 logs so as to compare them?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Sure. Let me finish the following testing required by Oleg first:&lt;/p&gt;

&lt;p&gt;1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value&lt;br/&gt;
2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value&lt;/p&gt;

&lt;p&gt;Oleg wanted to see whether the times to recover fluctuate and how big the fluctuations are.&lt;/p&gt;

&lt;p&gt;The testing is still ongoing.&lt;/p&gt;</comment>
                            <comment id="46668" author="yujian" created="Wed, 17 Oct 2012 08:17:04 +0000"  >&lt;blockquote&gt;&lt;p&gt;1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Test logs are in /home/yujian/test_logs/2012-10-17/020854 on brent node.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The first run was finished, test 52 failed with the same issue. The second run is ongoing. Test logs are in /home/yujian/test_logs/2012-10-17/061138 on brent node.&lt;/p&gt;</comment>
                            <comment id="46677" author="green" created="Wed, 17 Oct 2012 12:43:14 +0000"  >&lt;p&gt;I see that the 1.8 no-interop times really jump around too, but mostly on the lower end of the scale.&lt;/p&gt;

&lt;p&gt;With interop the times also jump around, but on a somewhat higher scale. This also seems to be a long-standing thing, since there is this much of a difference between 1.8 and 2.3 in marking lock replay requests; 2.x non-interop mode should be much more consistent in having this test pass, for those reasons.&lt;/p&gt;

&lt;p&gt;Additionally, there appears to be a test problem of some sort triggering at times, where I do not see test 52 dropping a lock replay request on the MDT at all.&lt;/p&gt;

&lt;p&gt;As such, I think this does not warrant blocker priority for the 2.3 release (or possibly any other).&lt;/p&gt;</comment>
                            <comment id="64445" author="yujian" created="Mon, 19 Aug 2013 02:22:47 +0000"  >&lt;p&gt;Lustre client build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b1_8/258/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b1_8/258/&lt;/a&gt; (1.8.9-wc1)&lt;br/&gt;
Lustre server build: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b2_4/32/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b2_4/32/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;replay-single test_52 and test_62 also hit the same failure:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre&apos;: Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre&apos;: Interrupted system call
 replay-single test_52: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_52.*.1376740815.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre&apos;: Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre&apos;: Interrupted system call
 replay-single test_62: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_62.*.1376741797.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Maloo report: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="65803" author="yujian" created="Thu, 5 Sep 2013 07:20:17 +0000"  >&lt;p&gt;Lustre client: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b1_8/258/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b1_8/258/&lt;/a&gt; (1.8.9-wc1)&lt;br/&gt;
Lustre server: &lt;a href=&quot;http://build.whamcloud.com/job/lustre-b2_4/44/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-b2_4/44/&lt;/a&gt; (2.4.1 RC1)&lt;/p&gt;

&lt;p&gt;replay-single test 62 hit the same failure:&lt;br/&gt;
&lt;a href=&quot;https://maloo.whamcloud.com/test_sets/0c773904-15c2-11e3-87cb-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/0c773904-15c2-11e3-87cb-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="65811" author="bobijam" created="Thu, 5 Sep 2013 08:26:29 +0000"  >&lt;p&gt;it&apos;s more like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1473&quot; title=&quot;Test failure on test suite replay-single, subtest test_62&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1473&quot;&gt;&lt;del&gt;LU-1473&lt;/del&gt;&lt;/a&gt; issue.&lt;/p&gt;</comment>
                            <comment id="197353" author="adilger" created="Mon, 29 May 2017 03:02:13 +0000"  >&lt;p&gt;Close old ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="12892">LU-994</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="12583">LU-889</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="10831" name="1036.tar.gz" size="5273607" author="sarah" created="Sun, 12 Feb 2012 22:59:04 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv9pz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5109</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>