[LU-1036] 1.8.7<->2.1.54 Test failure on test suite replay-single (52) Created: 25/Jan/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | yuc2 |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5109 |
| Description |
|
This issue was created by Maloo for sarah <sarah@whamcloud.com>.

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3.

Client 1.8.7 <--> server 2.1.54. |
| Comments |
| Comment by Peter Jones [ 08/Feb/12 ] |
|
Bobi, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 08/Feb/12 ] |
|
replay-single test_52() tests MDS failover. There is no MDS log in the Maloo report; from the client's log:
== test 52: time out lock replay (3764) == 03:16:32
Filesystem 1K-blocks Used Available Use% Mounted on
fat-amd-3-ib@o2ib:/lustre
29528544 1349276 26679012 5% /mnt/lustre
fail_loc=0x8000030c
Failing mds on node fat-amd-3-ib
Stopping /mnt/mds (opts:)
affected facets: mds
df pid is 26116
Failover mds to fat-amd-3-ib
03:16:51 (1327490211) waiting for fat-amd-3-ib network 900 secs ...
03:16:51 (1327490211) network interface is UP
Starting mds: -o user_xattr,acl /dev/sdc1 /mnt/mds
fat-amd-3-ib: debug=0xb3f0405
fat-amd-3-ib: subsystem_debug=0xffb7efff
fat-amd-3-ib: debug_mb=48
Started lustre-MDT0000
client-15-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-15-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_52: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /tmp/test_logs/2012-01-24/195852/replay-single.test_52.*.1327490314.log
client-15-ib: Host key verification failed.
client-15-ib: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
client-15-ib: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
We can see the MDS successfully recovered and was then lost. It looks like the MDS node was disconnected from the network for unknown reasons (ssh could not verify the host key). In my opinion, this is not a Lustre issue.
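For context, the fail_loc value shown in the log above, 0x8000030c, combines a one-shot bit (0x80000000) with failure site 0x30c, which makes the server drop a single lock-replay reply after the failover (that is what test_52 exercises). Below is a minimal sketch of how such a one-shot fail_loc check behaves; it is an illustrative model, not the actual libcfs cfs_fail_check_set() code, and the constant names and mask are invented for the example.

```c
#include <stdio.h>

/* Toy model of the one-shot fail_loc used by the test (not the real
 * cfs_fail_check_set()).  In 0x8000030c the high bit asks for a single-shot
 * failure and the low bits select the failure site; 0x30c is the LDLM
 * lock-replay reply drop point exercised by replay-single test_52. */
#define FAIL_ONCE        0x80000000u
#define FAIL_SITE_MASK   0x0000ffffu
#define FAIL_LDLM_REPLY  0x0000030cu

static unsigned int fail_loc = FAIL_ONCE | FAIL_LDLM_REPLY;  /* 0x8000030c */

/* Return 1 when the failure should fire at this site. */
static int fail_check(unsigned int site)
{
        if ((fail_loc & FAIL_SITE_MASK) != site)
                return 0;
        if (fail_loc & FAIL_ONCE)
                fail_loc = 0;           /* disarm after the first hit */
        return 1;
}

int main(void)
{
        /* First lock-replay reply after the failover is dropped ... */
        printf("first reply dropped:  %d\n", fail_check(FAIL_LDLM_REPLY));
        /* ... but the fail_loc is one-shot, so later replies go through. */
        printf("second reply dropped: %d\n", fail_check(FAIL_LDLM_REPLY));
        return 0;
}
```
|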
| Comment by Sarah Liu [ 09/Feb/12 ] |
|
I will try to reproduce it and gather more logs for investigation. |
| Comment by Sarah Liu [ 12/Feb/12 ] |
|
Hit the same issue again with a 1.8.7 RHEL6 client and a 2.1.55 RHEL6 server: https://maloo.whamcloud.com/test_sets/1d1c9c72-55de-11e1-9aa8-5254004bbbd3. Please find the server log attached. |
| Comment by Sarah Liu [ 12/Feb/12 ] |
|
server logs |
| Comment by Zhenyu Xu [ 13/Feb/12 ] |
|
At 1329094435.2981 MDS recovery starts, setting the recovery timeout to 60 seconds (expiring at xxxxx4495.xxx). |
| Comment by Zhenyu Xu [ 13/Feb/12 ] |
|
|
| Comment by Peter Jones [ 16/Feb/12 ] |
|
Believed to be a duplicate of |
| Comment by Jian Yu [ 15/Oct/12 ] |
|
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/198
replay-single test_52 failed with the same issue: |
| Comment by Zhenyu Xu [ 16/Oct/12 ] |
|
Here is how I interpret the debug logs between the evicted client and the MDS for the latest Maloo report (https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c).

At 1350224248 the fail_loc was set on the MDS:
00000001:02000400:0.0:1350224248.707485:0:22372:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.778103:0:22396:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.852376:0:22419:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param fail_loc=0x8000030c

At 1350224255 the MDS was stopped:
00000020:02000400:0.0:1350224255.305890:0:22537:0:(obd_mount.c:1800:server_put_super()) server umount lustre-MDT0000 complete

At 1350224265 the MDS device was started again:
00000020:01000004:0.0:1350224265.767002:0:22687:0:(obd_mount.c:2554:lustre_fill_super()) Mounting server from /dev/mapper/lvm--MDS-P1

At 1350224266 the client started to stat the Lustre filesystem:
00000100:00100000:0:1350224266.514726:0:32437:0:(client.c:2095:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc stat:d72a06d1-9e9c-4600-4007-cde887950e68:32437:x1415810084000511:10.10.4.160@tcp:41
00000100:00080000:0:1350224266.514729:0:32437:0:(client.c:2112:ptlrpc_queue_wait()) @@@ "stat" waiting for recovery: (FULL != DISCONN) req@ffff88007741e000 x1415810084000511/t0 o41->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 192/528 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0

At 1350224337 the client reconnected to the server:
00000100:00100000:0:1350224337.260692:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000631:10.10.4.160@tcp:38

The server handled it:
00000100:00100000:0.0:1350224337.260952:0:22719:0:(service.c:1788:ptlrpc_server_handle_req_in()) got req x1415810084000631
00000100:00100000:0.0:1350224337.260963:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:0+-99:26771:x1415810084000631:12345-10.10.4.163@tcp:38
00010000:00080000:0.0:1350224337.260982:0:22719:0:(ldlm_lib.c:978:target_handle_connect()) lustre-MDT0000: connection from d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp t223338299393 exp (null) cur 1350224337 last 0
00000020:01000000:0.0:1350224337.261024:0:22719:0:(lprocfs_status.c:2007:lprocfs_exp_setup()) using hash ffff88007a6790c0
00000100:00100000:0.0:1350224337.261055:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+5:26771:x1415810084000631:12345-10.10.4.163@tcp:38 Request procesed in 96us (118us total) trans 0 rc 0/0
00000001:00100000:0.0:1350224337.294699:0:22693:0:(target.c:598:lut_cb_new_client()) lustre-MDT0000: committing for initial connect of d72a06d1-9e9c-4600-4007-cde887950e68

while the client thought the server had evicted it:
00000100:00080000:0:1350224337.261722:0:26771:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4) req@ffff88007cd49400 x1415810084000631/t0 o38->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 368/584 e 0 to 1 dl 1350224347 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0:1350224337.261726:0:26771:0:(import.c:911:ptlrpc_connect_interpret()) ffff880078705000 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED

The reply message flags contain only 4 == MSG_CONNECT_REPLAYABLE, so I guess there is a misunderstanding in the connect flags handling which caused the client to be evicted.
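To make the flag logic concrete, here is a minimal sketch of the decision the client makes when interpreting the connect reply, as the log above describes. It is a model, not the actual ptlrpc_connect_interpret() code; the flag values follow the usual lustre_idl.h definitions but should be treated as assumptions.

```c
#include <stdio.h>

/* Connect reply flags (values assumed from lustre_idl.h). */
#define MSG_CONNECT_RECOVERING  0x01
#define MSG_CONNECT_RECONNECT   0x02
#define MSG_CONNECT_REPLAYABLE  0x04

/* On a non-initial connect, if the server's reply carries neither the
 * RECONNECT nor the RECOVERING flag, the client concludes it was evicted.
 * In the failed run the reply flags were just 0x4 (REPLAYABLE), hence the
 * import state change CONNECTING -> EVICTED seen in the log. */
static const char *interpret_connect_reply(int initial_connect,
                                           unsigned int msg_flags)
{
        if (!initial_connect &&
            !(msg_flags & (MSG_CONNECT_RECONNECT | MSG_CONNECT_RECOVERING)))
                return "EVICTED";
        return "recovery continues";
}

int main(void)
{
        /* Reply flags observed in the failure: only MSG_CONNECT_REPLAYABLE. */
        printf("flags 0x4 -> %s\n",
               interpret_connect_reply(0, MSG_CONNECT_REPLAYABLE));
        /* What a reconnect during server recovery should have looked like. */
        printf("flags 0x5 -> %s\n",
               interpret_connect_reply(0, MSG_CONNECT_REPLAYABLE |
                                          MSG_CONNECT_RECOVERING));
        return 0;
}
```
|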
| Comment by Oleg Drokin [ 16/Oct/12 ] |
|
Thanks for the timeline, Bobi! The reason the client thinks the server evicted it is that this is exactly what happened:

00010000:02000400:0.0:1350224331.521211:0:22717:0:(ldlm_lib.c:1773:target_recovery_overseer()) lustre-MDT0000: recovery is timed out, evict stale exports

In fact the vm2 client reconnected much earlier and went through recovery as far as it could (until the REPLAY_LOCKS phase), until it hit the wall of the server waiting for the other client:

00000100:00080000:0:1350224277.260518:0:26771:0:(import.c:1287:ptlrpc_import_recovery_state_machine()) ffff880078705000 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS

Then it proceeds to replay locks:

00000100:00100000:0:1350224277.260584:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000532:10.10.4.160@tcp:101
00000100:00100000:0:1350224277.260621:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
00010000:00010000:0:1350224277.265317:0:26771:0:(ldlm_request.c:1924:replay_lock_interpret()) ### replayed lock: ns: lustre-MDT0000-mdc-ffff88007cd48400 lock: ffff88007b623a00/0xe8bdbcd6a2bd8f2f lrc: 2/0,0 mode: PR/PR res: 8589945617/84 bits 0x3 rrc: 1 type: IBT flags: 0x0 remote: 0x7d703d0db3f157a expref: -99 pid: 32215 timeout: 0
00000100:00100000:0:1350224277.265325:0:26771:0:(client.c:1452:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101

Note how x1415810084000533 completes right away, but x1415810084000532 does not. Switching to the server side, we did get the x1415810084000532 request there and sent the reply:

00000100:00100000:0.0:1350224277.260799:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+7:26771:x1415810084000532:12345-10.10.4.163@tcp:101
00010000:00080000:0.0:1350224277.260806:0:22719:0:(ldlm_lib.c:2245:target_queue_recovery_request()) @@@ not queueing req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/0 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/ffffffff rc 0/-1
...
00010000:02000000:0.0:1350224277.260846:0:22719:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=30c, val=2147483648***
00010000:00020000:0.0:1350224277.261517:0:22719:0:(ldlm_lib.c:2380:target_send_reply_msg()) @@@ dropping reply req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/384 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/0 rc 0/0
00000100:00100000:0.0:1350224277.264622:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+8:26771:x1415810084000532:12345-10.10.4.163@tcp:101 Request procesed in 3825us (3860us total) trans 0 rc 0/0

So req x1415810084000532 is "lost". The test is working as expected, timing out the lock replay some 60 seconds later on the client:

00000100:00000400:0:1350224337.260658:0:26771:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1415810084000532 sent from lustre-MDT0000-mdc-ffff88007cd48400 to NID 10.10.4.160@tcp 60s ago has timed out (60s prior to deadline).

By which time the server has already timed out the recovery.

So the problem at hand is that the server did not extend the deadline. Reading through the code, it appears that we only extend the timer on requests placed in the various recovery queues, but to get there a lock replay must be sent with the MSG_REQ_REPLAY_DONE flag set, and only 2.x clients set it. In the 1.8 server case there are no separate lock queues at all, and I suspect the initial timeout is just much higher (I don't have a log nearby, will try to look for something in Maloo shortly) and as such allows enough margin for the reconnect to still succeed. Hm, actually I just looked into the Maloo report and it seems the reconnect arrives ~50 seconds after the lost message instead of 60 seconds as in the 2.1 interop case. Sadly we don't collect any debug logs on success, so I cannot compare the results. I'll try to repeat the 1.8 test locally and see what's inside. The entire test is somewhat of a fringe case where we have a double failure of sorts: first the server dies and then one of the lock replies is lost, so in my opinion it does not warrant a blocker. But I am curious to see why the 1.8-to-1.8 case replays sooner than during interop, so I'll make a final decision after that.
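Below is a minimal sketch of the server-side behaviour described above, i.e. the recovery timer only being extended for lock replays that carry MSG_REQ_REPLAY_DONE. It is an illustrative model under the assumptions stated in the comment, not the actual target_queue_recovery_request() code; the flag value and helper names are invented for the example.

```c
#include <stdio.h>

/* Flag a 2.x client sets on lock replay RPCs; 1.8 clients do not set it
 * (the value here is arbitrary, for illustration only). */
#define MSG_REQ_REPLAY_DONE 0x2

struct recovery { long deadline; };

/* The recovery timer is only extended for requests that are placed in a
 * recovery queue, and a lock replay is only queued when MSG_REQ_REPLAY_DONE
 * is set.  A 1.8 client's lock replay is handled outside the queues ("not
 * queueing" in the log), so when its reply is dropped nothing pushes the
 * deadline out, and recovery expires before the client's 60 s resend. */
static void handle_lock_replay(struct recovery *r, long now,
                               unsigned int req_flags)
{
        if (req_flags & MSG_REQ_REPLAY_DONE) {
                r->deadline = now + 60;         /* queued: extend the timer */
                printf("queued, deadline extended to %ld\n", r->deadline);
        } else {
                printf("not queueing, deadline stays at %ld\n", r->deadline);
        }
}

int main(void)
{
        struct recovery r = { .deadline = 1350224331 };

        /* 2.x client: the flag is present, so the deadline keeps moving. */
        handle_lock_replay(&r, 1350224277, MSG_REQ_REPLAY_DONE);

        /* 1.8 client: no flag, so the dropped reply never extends the
         * timer and the server evicts the client at the old deadline. */
        handle_lock_replay(&r, 1350224277, 0);
        return 0;
}
```
|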
| Comment by Jian Yu [ 17/Oct/12 ] |
|
Lustre Version: 1.8.8-wc1
replay-single test 52 passed with debug logs gathered: |
| Comment by Oleg Drokin [ 17/Oct/12 ] |
|
Hm, only 7 seconds. Perhaps it needs a bit more ramp-up from previous tests, or it might just fluctuate wildly. A few more runs, this time with more preceding tests, would be useful to gauge that, I think. |
| Comment by Zhenyu Xu [ 17/Oct/12 ] |
|
Yujian, is it possible to run only this test once with a 1.8 client/2.1 server interop setup and once with 1.8 on both client and server, and collect -1 (full) debug logs, so we can compare them? |
| Comment by Jian Yu [ 17/Oct/12 ] |
Sure. Let me first finish the following testing required by Oleg: 1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with the default debug value. Oleg wanted to see whether the recovery times fluctuate and how big the fluctuations are. The testing is still ongoing. |
| Comment by Jian Yu [ 17/Oct/12 ] |
Test logs are in /home/yujian/test_logs/2012-10-17/020854 on the brent node.
The first run has finished; test 52 failed with the same issue. The second run is ongoing. Test logs are in /home/yujian/test_logs/2012-10-17/061138 on the brent node. |
| Comment by Oleg Drokin [ 17/Oct/12 ] |
|
I see that the 1.8 no-interop times really jump around too, but mostly at the lower end of the scale. With interop the times also jump around, but on a somewhat higher scale. Also, this seems to be a long-standing issue, given this much difference between 1.8 and 2.3 in marking lock replay requests; 2.x non-interop mode should be much more consistent in having this test pass, for those reasons. Additionally, there appears to be a test problem of some sort triggering at times, where I do not see test 52 dropping a lock replay request on the MDT at all. As such, I think this does not warrant a blocker priority for the 2.3 release (or possibly any other). |
| Comment by Jian Yu [ 19/Aug/13 ] |
|
Lustre client build: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)

replay-single test_52 and test_62 also hit the same failure:

Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_52: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_52.*.1376740815.log

Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_62: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_62.*.1376741797.log

Maloo report: https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c |
| Comment by Jian Yu [ 05/Sep/13 ] |
|
Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
replay-single test 62 hit the same failure: |
| Comment by Zhenyu Xu [ 05/Sep/13 ] |
|
it's more like |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |