[LU-1036] 1.8.7<->2.1.54 Test failure on test suite replay-single (52) Created: 25/Jan/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: yuc2

Attachments: File 1036.tar.gz    
Issue Links:
Duplicate
is duplicated by LU-994 1.8<->2.1.54 Test failure on test sui... Resolved
Related
is related to LU-889 rework extent_recovery_timer() Resolved
Severity: 3
Rank (Obsolete): 5109

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3.

client 1.8.7 <--> server 2.1.54



 Comments   
Comment by Peter Jones [ 08/Feb/12 ]

Bobi

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 08/Feb/12 ]

replay-single test_52() tests MDS failover. There is no MDS log in the maloo report, and from the client's log:

== test 52: time out lock replay (3764) == 03:16:32
Filesystem           1K-blocks      Used Available Use% Mounted on
fat-amd-3-ib@o2ib:/lustre
                      29528544   1349276  26679012   5% /mnt/lustre
fail_loc=0x8000030c
Failing mds on node fat-amd-3-ib
Stopping /mnt/mds (opts:)
affected facets: mds
df pid is 26116
Failover mds to fat-amd-3-ib
03:16:51 (1327490211) waiting for fat-amd-3-ib network 900 secs ...
03:16:51 (1327490211) network interface is UP
Starting mds: -o user_xattr,acl  /dev/sdc1 /mnt/mds
fat-amd-3-ib: debug=0xb3f0405
fat-amd-3-ib: subsystem_debug=0xffb7efff
fat-amd-3-ib: debug_mb=48
Started lustre-MDT0000
client-15-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-15-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
 replay-single test_52: @@@@@@ FAIL: post-failover df: 1 
Dumping lctl log to /tmp/test_logs/2012-01-24/195852/replay-single.test_52.*.1327490314.log
client-15-ib: Host key verification failed.
client-15-ib: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
client-15-ib: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]

we can see the MDS successfully recovered and was then lost. It looks like, for unknown reasons, the MDS node was disconnected from the network (ssh cannot verify the host key). IMO, it's not a Lustre issue.

Comment by Sarah Liu [ 09/Feb/12 ]

I will try to reproduce it and gather more logs for investigation.

Comment by Sarah Liu [ 12/Feb/12 ]

Hit the same issue again with a 1.8.7 RHEL6 client and a 2.1.55 RHEL6 server: https://maloo.whamcloud.com/test_sets/1d1c9c72-55de-11e1-9aa8-5254004bbbd3. Please find the server logs attached.

Comment by Sarah Liu [ 12/Feb/12 ]

server logs

Comment by Zhenyu Xu [ 13/Feb/12 ]

at 1329094435.2981 MDS recovery started, setting a 60-second recovery timeout (to xxxxx4495.xxx)
at 1329094450.2345 the client recovery thread sent the replay req
at 1329094450.2352 the MDS got the client's reconnection and reset the recovery timeout to 45 seconds (unchanged, still xxxx4495.xxx)
at 1329094450.2361 the MDS received the req and threw away the reply, as the test is designed to do
at 1329094495.2347 MDS recovery timed out, evicting the client
at 1329094525.2345 the client timed out the replay req (75 seconds after the req was sent)
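
For clarity, here is a tiny standalone sketch (not Lustre code, just the arithmetic of the timeline above): the server's 60-second recovery window closes before the client's 75-second request timeout fires, so the client is evicted roughly 30 seconds before it would even notice the lost reply.

/* Standalone illustration of the timing race described above,
 * using the timestamps from the debug logs. */
#include <stdio.h>

int main(void)
{
        double mds_recovery_start = 1329094435.2981; /* MDS sets a 60s recovery window   */
        double client_replay_sent = 1329094450.2345; /* client sends the lock replay req */
        double recovery_deadline  = mds_recovery_start + 60.0; /* ...4495.x */
        double client_req_timeout = client_replay_sent + 75.0; /* ...4525.x */

        printf("recovery window closes at %.4f\n", recovery_deadline);
        printf("client would resend at    %.4f\n", client_req_timeout);

        /* The reply to the replay req is deliberately dropped (fail_loc
         * 0x8000030c) and the server never extends its recovery timer for
         * it, so the window closes before the client notices the timeout
         * and the client gets evicted. */
        if (recovery_deadline < client_req_timeout)
                printf("server evicts the client %.0fs before the client resends\n",
                       client_req_timeout - recovery_deadline);
        return 0;
}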

Comment by Zhenyu Xu [ 13/Feb/12 ]

The LU-889 patch would likely fix this issue.

Comment by Peter Jones [ 16/Feb/12 ]

Believed to be a duplicate of LU-889 which has now landed for 2.2. Please reopen if this issue reoccurs with the LU-889 patch in place.

Comment by Jian Yu [ 15/Oct/12 ]

Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/198
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64

replay-single test_52 failed with the same issue:
https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c

Comment by Zhenyu Xu [ 16/Oct/12 ]

Here is my interpretation of the debug logs between the evicted client and the MDS for the latest maloo report (https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c):

At 1350224228 the fail_loc was set on the MDS:

00000001:02000400:0.0:1350224248.707485:0:22372:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.778103:0:22396:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.852376:0:22419:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param fail_loc=0x8000030c

At 1350224255 the MDS stopped:

00000020:02000400:0.0:1350224255.305890:0:22537:0:(obd_mount.c:1800:server_put_super()) server umount lustre-MDT0000 complete

At 1350224265 the MDS device was started:

00000020:01000004:0.0:1350224265.767002:0:22687:0:(obd_mount.c:2554:lustre_fill_super()) Mounting server from /dev/mapper/lvm--MDS-P1

At 1350224266 the client started to stat the Lustre filesystem:

00000100:00100000:0:1350224266.514726:0:32437:0:(client.c:2095:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc stat:d72a06d1-9e9c-4600-4007-cde887950e68:32437:x1415810084000511:10.10.4.160@tcp:41
00000100:00080000:0:1350224266.514729:0:32437:0:(client.c:2112:ptlrpc_queue_wait()) @@@ "stat" waiting for recovery: (FULL != DISCONN)  req@ffff88007741e000 x1415810084000511/t0 o41->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 192/528 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0

At 1350224337 the client reconnected to the server:

00000100:00100000:0:1350224337.260692:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000631:10.10.4.160@tcp:38

The server handled it:

00000100:00100000:0.0:1350224337.260952:0:22719:0:(service.c:1788:ptlrpc_server_handle_req_in()) got req x1415810084000631
00000100:00100000:0.0:1350224337.260963:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:0+-99:26771:x1415810084000631:12345-10.10.4.163@tcp:38
00010000:00080000:0.0:1350224337.260982:0:22719:0:(ldlm_lib.c:978:target_handle_connect()) lustre-MDT0000: connection from d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp t223338299393 exp (null) cur 1350224337 last 0
00000020:01000000:0.0:1350224337.261024:0:22719:0:(lprocfs_status.c:2007:lprocfs_exp_setup()) using hash ffff88007a6790c0
00000100:00100000:0.0:1350224337.261055:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+5:26771:x1415810084000631:12345-10.10.4.163@tcp:38 Request procesed in 96us (118us total) trans 0 rc 0/0
00000001:00100000:0.0:1350224337.294699:0:22693:0:(target.c:598:lut_cb_new_client()) lustre-MDT0000: committing for initial connect of d72a06d1-9e9c-4600-4007-cde887950e68

while the client thought the server had evicted it:

00000100:00080000:0:1350224337.261722:0:26771:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4)  req@ffff88007cd49400 x1415810084000631/t0 o38->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 368/584 e 0 to 1 dl 1350224347 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0:1350224337.261726:0:26771:0:(import.c:911:ptlrpc_connect_interpret()) ffff880078705000 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED

The reply msg flag is only 4 == MSG_CONNECT_REPLAYABLE; I guess there is a misunderstanding in how the connect flags are set, which caused the client to be evicted.
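
To make that check concrete, here is a minimal self-contained sketch of the client-side decision behind the "evicting" line, assuming the usual MSG_CONNECT_* flag values; it paraphrases the logic and is not the actual ptlrpc_connect_interpret() source.

/* Sketch of the client-side eviction check: on a non-initial connect the
 * client expects RECONNECT and/or RECOVERING in the connect reply flags;
 * if neither is set, the server no longer knows this export and the
 * client must treat itself as evicted. */
#include <stdio.h>
#include <stdbool.h>

#define MSG_CONNECT_RECOVERING 0x00000001
#define MSG_CONNECT_RECONNECT  0x00000002
#define MSG_CONNECT_REPLAYABLE 0x00000004

static bool treat_as_evicted(unsigned int reply_flags, bool initial_connect)
{
        if (initial_connect)
                return false;
        return !(reply_flags & (MSG_CONNECT_RECONNECT | MSG_CONNECT_RECOVERING));
}

int main(void)
{
        /* The reply flags in the log were 4 == MSG_CONNECT_REPLAYABLE only. */
        unsigned int reply_flags = MSG_CONNECT_REPLAYABLE;

        printf("evicted: %s\n", treat_as_evicted(reply_flags, false) ? "yes" : "no");
        return 0;
}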

Comment by Oleg Drokin [ 16/Oct/12 ]

Thanks for the timeline, Bobi!

The reason the client thinks the server evicted it is that this is exactly what happened.
The recovery window closed on the server at 1350224331:

00010000:02000400:0.0:1350224331.521211:0:22717:0:(ldlm_lib.c:1773:target_recovery_overseer()) lustre-MDT0000: recovery is timed out, evict stale exports

In fact the vm2 client reconnected much earlier and went through recovery as far as it could (up to the REPLAY_LOCKS phase), until it hit the wall of the server waiting for the other client.

00000100:00080000:0:1350224277.260518:0:26771:0:(import.c:1287:ptlrpc_import_recovery_state_machine()) ffff880078705000 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS

Then it proceeds to replay locks:

00000100:00100000:0:1350224277.260584:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000532:10.10.4.160@tcp:101
00000100:00100000:0:1350224277.260621:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
00010000:00010000:0:1350224277.265317:0:26771:0:(ldlm_request.c:1924:replay_lock_interpret()) ### replayed lock: ns: lustre-MDT0000-mdc-ffff88007cd48400 lock: ffff88007b623a00/0xe8bdbcd6a2bd8f2f lrc: 2/0,0 mode: PR/PR res: 8589945617/84 bits 0x3 rrc: 1 type: IBT flags: 0x0 remote: 0x7d703d0db3f157a expref: -99 pid: 32215 timeout: 0
00000100:00100000:0:1350224277.265325:0:26771:0:(client.c:1452:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101

Note how x1415810084000533 completes right away, but x1415810084000532 does not.

Switching to the server side, we did get the x1415810084000532 request there and handled it, but the reply was dropped:

00000100:00100000:0.0:1350224277.260799:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+7:26771:x1415810084000532:12345-10.10.4.163@tcp:101
00010000:00080000:0.0:1350224277.260806:0:22719:0:(ldlm_lib.c:2245:target_queue_recovery_request()) @@@ not queueing  req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/0 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/ffffffff rc 0/-1
...
00010000:02000000:0.0:1350224277.260846:0:22719:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=30c, val=2147483648***
00010000:00020000:0.0:1350224277.261517:0:22719:0:(ldlm_lib.c:2380:target_send_reply_msg()) @@@ dropping reply  req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/384 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/0 rc 0/0
00000100:00100000:0.0:1350224277.264622:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+8:26771:x1415810084000532:12345-10.10.4.163@tcp:101 Request procesed in 3825us (3860us total) trans 0 rc 0/0

So req x1415810084000532 is "lost"; the test is working as expected, timing out lock recovery.

Now, some 60 seconds later on the client:

00000100:00000400:0:1350224337.260658:0:26771:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1415810084000532 sent from lustre-MDT0000-mdc-ffff88007cd48400 to NID 10.10.4.160@tcp 60s ago has timed out (60s prior to deadline).

By that time the server had already timed out the recovery.

So the problem at hand is that the server did not extend the deadline. Reading through the code, it appears that we only extend the timer for requests placed in the various recovery queues, but to get there a lock replay must be sent with the MSG_REQ_REPLAY_DONE flag set, and only 2.x clients set it.
1.8 clients don't set the flag, so the timer is not extended.
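
Here is a simplified sketch of that behaviour, with hypothetical helper names (the real logic lives around target_queue_recovery_request() and the recovery timer handling in ldlm_lib.c): the deadline is only pushed out when the replay request carries the flag, so a 1.8 client's lock replay leaves the recovery window untouched.

/* Sketch: the recovery deadline is extended only for requests that get
 * queued into a recovery queue, which for lock replay requires the
 * replay-done flag that 1.8 clients never set. */
#include <stdio.h>
#include <stdbool.h>

struct recovery_state {
        long deadline;        /* absolute time the recovery window closes */
};

static void extend_recovery_timer(struct recovery_state *rs, long now, long extend)
{
        if (now + extend > rs->deadline)
                rs->deadline = now + extend;
}

static void handle_lock_replay(struct recovery_state *rs, long now,
                               bool has_replay_done_flag)
{
        if (has_replay_done_flag) {
                /* 2.x client: the request is queued and the timer pushed out. */
                extend_recovery_timer(rs, now, 60);
        }
        /* 1.8 client: flag missing, the request is handled with "not queueing",
         * and the deadline stays put, so the window can expire while the client
         * is still waiting out its own 75s request timeout. */
}

int main(void)
{
        struct recovery_state rs = { .deadline = 1350224331 }; /* from the log */

        handle_lock_replay(&rs, 1350224277, false);            /* 1.8 client   */
        printf("deadline still %ld -> window closes before the client resends\n",
               rs.deadline);
        return 0;
}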

Now, in the 1.8 server case there are no separate lock queues at all, and I suspect the initial timeout is just much higher (I don't have a log nearby; I will try to look for something in maloo shortly), which allows enough margin for the reconnect to still succeed.

Hm, actually I just looked into the maloo report and it seems the reconnect there arrives ~50 seconds after the lost message, instead of 60 seconds as in the 2.1 interop case. Sadly we don't collect any debug logs on success, so I cannot compare the results. I'll try to repeat the 1.8 test locally and see what's inside.

The entire test is somewhat of a fringe case where we have a double failure of sorts: first the server dies and then one of the lock replies is lost, so in my opinion it does not warrant blocker priority. But I am curious to see why the 1.8-to-1.8 case replays sooner than during interop, so I'll make a final decision after that.

Comment by Jian Yu [ 17/Oct/12 ]

Lustre Version: 1.8.8-wc1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/198
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)
ENABLE_QUOTA=yes

replay-single test 52 passed with debug logs gathered:
https://maloo.whamcloud.com/test_sets/555a95bc-181b-11e2-a6a7-52540035b04c

Comment by Oleg Drokin [ 17/Oct/12 ]

Hm, only 7 seconds. Perhaps it needs a bit more ramp-up from previous tests, or it might be that it just wildly fluctuates; a few more runs, this time with more preceding tests, would be useful to gauge that, I think.
Thank you!

Comment by Zhenyu Xu [ 17/Oct/12 ]

Yujian,

Is it possible to run only this test once with a 1.8 client/2.1 server interop setup and another time with a 1.8 server and client, and collect -1 debug logs, so we can compare them?

Comment by Jian Yu [ 17/Oct/12 ]

Is it possible to run only this test once with a 1.8 client/2.1 server interop setup and another time with a 1.8 server and client, and collect -1 debug logs, so we can compare them?

Sure. Let me finish the following testing required by Oleg first:

1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value
2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value

Oleg wanted to see whether the recovery times fluctuate and how big the fluctuations are.

The testing is still ongoing.

Comment by Jian Yu [ 17/Oct/12 ]

1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value

Test logs are in /home/yujian/test_logs/2012-10-17/020854 on brent node.

2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value

The first run has finished; test 52 failed with the same issue. The second run is ongoing. Test logs are in /home/yujian/test_logs/2012-10-17/061138 on brent node.

Comment by Oleg Drokin [ 17/Oct/12 ]

I see that the 1.8 no-interop times really jump around too, but mostly on the lower end of the scale.

With interop the times also jump around, but at a somewhat higher scale. Also, this seems to be a long-standing thing, since there is this much of a difference between 1.8 and 2.3 in marking lock replay requests; 2.x non-interop mode should be much more consistent in having this test pass, for those reasons.

Additionally, there appears to be a test problem of some sort triggering at times, where I do not see test 52 dropping a lock replay request on the MDT at all.

As such, I think this does not warrant blocker priority for the 2.3 release (or possibly any other).

Comment by Jian Yu [ 19/Aug/13 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/32/

replay-single test_52 and test_62 also hit the same failure:

Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
 replay-single test_52: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_52.*.1376740815.log
Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
 replay-single test_62: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_62.*.1376741797.log

Maloo report: https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c

Comment by Jian Yu [ 05/Sep/13 ]

Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
Lustre server: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)

replay-single test 62 hit the same failure:
https://maloo.whamcloud.com/test_sets/0c773904-15c2-11e3-87cb-52540035b04c

Comment by Zhenyu Xu [ 05/Sep/13 ]

It looks more like an LU-1473 issue.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
