Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Labels: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.9
Severity: 3
Rank: 5109
Description
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3.
client 1.8.7 <--> server 2.1.54
Activity
Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
Lustre server: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
replay-single test 62 hit the same failure:
https://maloo.whamcloud.com/test_sets/0c773904-15c2-11e3-87cb-52540035b04c
Lustre client build: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/32/
replay-single test_52 and test_62 also hit the same failure:
Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_52: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_52.*.1376740815.log

Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
replay-single test_62: @@@@@@ FAIL: post-failover df: 1
Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_62.*.1376741797.log
Maloo report: https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c
I see that the 1.8 non-interop times really jump around too, but mostly on the lower end of the scale.
With interop the times also jump around, but on a somewhat higher scale. This also seems to be a long-standing issue, given how much 1.8 and 2.3 differ in marking lock replay requests; 2.x non-interop mode should be much more consistent in passing this test for those reasons.
Additionally, there appears to be a test problem of some sort triggering at times, where I do not see test 52 dropping a lock replay request on the MDT at all.
As such, I think this does not warrant a blocker priority for the 2.3 release (or possibly any other).
1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value
Test logs are in /home/yujian/test_logs/2012-10-17/020854 on the brent node.
2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value
The first run has finished; test 52 failed with the same issue. The second run is ongoing. Test logs are in /home/yujian/test_logs/2012-10-17/061138 on the brent node.
Is it possible to run only this test with 1.8 client/2.1 server interop, and another time with 1.8 server/client, and collect -1 logs, so as to compare them?
Sure. Let me finish the following testing required by Oleg first:
1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value
2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value
Oleg wanted to see whether the recovery times fluctuate and how big the fluctuations are.
The testing is still ongoing.
Yujian,
Is it possible to run only this test with 1.8 client/2.1 server interop, and another time with 1.8 server/client, and collect -1 logs, so as to compare them?
Hm, only 7 seconds; perhaps it needs a bit more ramp-up from previous tests, or it might be that it just wildly fluctuates. A few more runs, this time with more preceding tests, would be useful to gauge that, I think.
Thank you!
Lustre Version: 1.8.8-wc1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/198
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)
ENABLE_QUOTA=yes
replay-single test 52 passed with debug logs gathered:
https://maloo.whamcloud.com/test_sets/555a95bc-181b-11e2-a6a7-52540035b04c
Thanks for the timeline, Bobi!
The reason the client thinks the server evicted it is that that is exactly what happened.
The recovery window closed on the server at 1350224331:
00010000:02000400:0.0:1350224331.521211:0:22717:0:(ldlm_lib.c:1773:target_recovery_overseer()) lustre-MDT0000: recovery is timed out, evict stale exports
In fact the vm2 client reconnected much earlier and went through recovery as far as it could (up to the REPLAY_LOCKS phase), until it hit the wall of the server waiting for the other client.
00000100:00080000:0:1350224277.260518:0:26771:0:(import.c:1287:ptlrpc_import_recovery_state_machine()) ffff880078705000 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS
Then it proceeds to replay locks:
00000100:00100000:0:1350224277.260584:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000532:10.10.4.160@tcp:101
00000100:00100000:0:1350224277.260621:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
00010000:00010000:0:1350224277.265317:0:26771:0:(ldlm_request.c:1924:replay_lock_interpret()) ### replayed lock: ns: lustre-MDT0000-mdc-ffff88007cd48400 lock: ffff88007b623a00/0xe8bdbcd6a2bd8f2f lrc: 2/0,0 mode: PR/PR res: 8589945617/84 bits 0x3 rrc: 1 type: IBT flags: 0x0 remote: 0x7d703d0db3f157a expref: -99 pid: 32215 timeout: 0
00000100:00100000:0:1350224277.265325:0:26771:0:(client.c:1452:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
Note how x1415810084000533 completes right away, but x1415810084000532 does not.
Switching to the server side, we did get the x1415810084000532 request there and attempted to send the reply:
00000100:00100000:0.0:1350224277.260799:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+7:26771:x1415810084000532:12345-10.10.4.163@tcp:101
00010000:00080000:0.0:1350224277.260806:0:22719:0:(ldlm_lib.c:2245:target_queue_recovery_request()) @@@ not queueing req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/0 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/ffffffff rc 0/-1
...
00010000:02000000:0.0:1350224277.260846:0:22719:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=30c, val=2147483648***
00010000:00020000:0.0:1350224277.261517:0:22719:0:(ldlm_lib.c:2380:target_send_reply_msg()) @@@ dropping reply req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/384 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/0 rc 0/0
00000100:00100000:0.0:1350224277.264622:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+8:26771:x1415810084000532:12345-10.10.4.163@tcp:101 Request procesed in 3825us (3860us total) trans 0 rc 0/0
So req x1415810084000532 is "lost"; the test is working as expected, timing out lock recovery.
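For readers not familiar with Lustre fault injection, below is a minimal toy model of how such a one-shot fail_loc drops exactly one reply. This is not the libcfs implementation; the meaning of the 0x80000000 "fire once" bit is an assumption inferred from the fail_loc=0x8000030c setting and the val=2147483648 value in the log above.

/* Toy model of a one-shot fail_loc check; NOT the libcfs code.
 * Assumption: the low bits select the failure site (0x30c, armed by the
 * test on the MDS) and 0x80000000 means "fire only once". */
#include <stdbool.h>
#include <stdio.h>

#define FAIL_ONCE_BIT 0x80000000u   /* assumed meaning of val=2147483648 */
#define FAIL_ID_MASK  0x0000ffffu

static unsigned int fail_loc;       /* what "lctl set_param fail_loc=..." would set */
static bool fired;

/* Returns true when the caller should simulate the failure (drop the reply). */
static bool fail_check(unsigned int site)
{
        if ((fail_loc & FAIL_ID_MASK) != site)
                return false;
        if ((fail_loc & FAIL_ONCE_BIT) && fired)
                return false;       /* one-shot: only the first hit fails */
        fired = true;
        return true;
}

int main(void)
{
        fail_loc = 0x8000030c;      /* as armed by the test */
        printf("first reply dropped:  %d\n", fail_check(0x30c));  /* 1 */
        printf("second reply dropped: %d\n", fail_check(0x30c));  /* 0 */
        return 0;
}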
Now, some 60 seconds later on the client:
00000100:00000400:0:1350224337.260658:0:26771:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1415810084000532 sent from lustre-MDT0000-mdc-ffff88007cd48400 to NID 10.10.4.160@tcp 60s ago has timed out (60s prior to deadline).
By which time (1350224337) the server had already timed out the recovery (at 1350224331).
So the problem at hand is that the server did not extend the deadline. Reading through the code, it appears that we only extend the timer for requests placed in the various recovery queues, but to get there a lock replay must be sent with the MSG_REQ_REPLAY_DONE flag set, and only 2.x clients set it.
1.8 clients don't set the flag and so the timer is not extended.
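To make that logic concrete, here is a minimal sketch of the queueing decision as described above. It is not the actual ldlm_lib.c code; the flag value is illustrative only (the real definition lives in lustre_idl.h), and the struct and function names are hypothetical.

/* Toy model of the recovery-queueing decision; NOT the ldlm_lib.c code. */
#include <stdbool.h>
#include <stdio.h>

#define MSG_REQ_REPLAY_DONE 0x0040     /* illustrative value; see lustre_idl.h */

struct fake_req {
        unsigned int msg_flags;        /* flags carried in the lustre_msg */
        unsigned long long transno;    /* 0 for lock replays */
};

/* Returns true when the request would be queued for recovery,
 * which is also when the recovery timer would be extended. */
static bool queued_for_recovery(const struct fake_req *req)
{
        if (req->msg_flags & MSG_REQ_REPLAY_DONE)
                return true;   /* 2.x client lock replay: lock-replay queue */
        if (req->transno != 0)
                return true;   /* transaction replay: request-replay queue */
        return false;          /* 1.8 lock replay: "not queueing", no timer bump */
}

int main(void)
{
        struct fake_req from_18 = { .msg_flags = 0,                   .transno = 0 };
        struct fake_req from_2x = { .msg_flags = MSG_REQ_REPLAY_DONE, .transno = 0 };

        printf("1.8 lock replay queued: %d\n", queued_for_recovery(&from_18)); /* 0 */
        printf("2.x lock replay queued: %d\n", queued_for_recovery(&from_2x)); /* 1 */
        return 0;
}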
Now, in the 1.8 server case there are no separate lock queues at all, and I suspect the initial timeout is just much higher (I don't have a log nearby; I will try to look for something in Maloo shortly), and as such it leaves enough margin for the reconnect to still succeed.
Hm, actually I just looked into the Maloo report, and it seems the reconnect there arrives ~50 seconds after the lost message, instead of 60 seconds as in the 2.1 interop case. Sadly we don't collect any debug logs on success, so I cannot compare the results. I'll try to repeat the 1.8 test locally and see what's inside.
The entire test is somewhat of a fringe case where we have a double failure of sorts: first the server dies, and then one of the lock replies is lost, so in my opinion it does not warrant blocker status. But I am curious to see why the 1.8-to-1.8 case replays sooner than during interop, so I'll make a final decision after that.
Here is how I interpret the debug logs between the evicted client and the MDS for the latest Maloo report (https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c).
At 1350224248, the fail_loc was set on the MDS:
00000001:02000400:0.0:1350224248.707485:0:22372:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.778103:0:22396:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mds REPLAY BARRIER on lustre-MDT0000
00000001:02000400:0.0:1350224248.852376:0:22419:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param fail_loc=0x8000030c
At 1350224255, the MDS was stopped:
00000020:02000400:0.0:1350224255.305890:0:22537:0:(obd_mount.c:1800:server_put_super()) server umount lustre-MDT0000 complete
At 1350224265, the MDS device was started again:
00000020:01000004:0.0:1350224265.767002:0:22687:0:(obd_mount.c:2554:lustre_fill_super()) Mounting server from /dev/mapper/lvm--MDS-P1
At 1350224266, the client started to stat the Lustre filesystem:
00000100:00100000:0:1350224266.514726:0:32437:0:(client.c:2095:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc stat:d72a06d1-9e9c-4600-4007-cde887950e68:32437:x1415810084000511:10.10.4.160@tcp:41
00000100:00080000:0:1350224266.514729:0:32437:0:(client.c:2112:ptlrpc_queue_wait()) @@@ "stat" waiting for recovery: (FULL != DISCONN) req@ffff88007741e000 x1415810084000511/t0 o41->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 192/528 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
At 1350224337, the client reconnected to the server:
00000100:00100000:0:1350224337.260692:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000631:10.10.4.160@tcp:38
The server handled it:
00000100:00100000:0.0:1350224337.260952:0:22719:0:(service.c:1788:ptlrpc_server_handle_req_in()) got req x1415810084000631
00000100:00100000:0.0:1350224337.260963:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:0+-99:26771:x1415810084000631:12345-10.10.4.163@tcp:38
00010000:00080000:0.0:1350224337.260982:0:22719:0:(ldlm_lib.c:978:target_handle_connect()) lustre-MDT0000: connection from d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp t223338299393 exp (null) cur 1350224337 last 0
00000020:01000000:0.0:1350224337.261024:0:22719:0:(lprocfs_status.c:2007:lprocfs_exp_setup()) using hash ffff88007a6790c0
00000100:00100000:0.0:1350224337.261055:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+5:26771:x1415810084000631:12345-10.10.4.163@tcp:38 Request procesed in 96us (118us total) trans 0 rc 0/0
00000001:00100000:0.0:1350224337.294699:0:22693:0:(target.c:598:lut_cb_new_client()) lustre-MDT0000: committing for initial connect of d72a06d1-9e9c-4600-4007-cde887950e68
while the client thinks the server evicted it:
00000100:00080000:0:1350224337.261722:0:26771:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4) req@ffff88007cd49400 x1415810084000631/t0 o38->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 368/584 e 0 to 1 dl 1350224347 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0:1350224337.261726:0:26771:0:(import.c:911:ptlrpc_connect_interpret()) ffff880078705000 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED
The reply msg flags contain only 4 == MSG_CONNECT_REPLAYABLE; I guess there is a misunderstanding in the connect flags setting, which caused the client to be evicted.
It looks more like the LU-1473 issue.
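To restate the check the client log points at, here is a minimal sketch of why a connect reply carrying only MSG_CONNECT_REPLAYABLE makes the client conclude it was evicted. This is not the actual ptlrpc_connect_interpret() code, and the flag values should be treated as assumptions (they are how I recall lustre_idl.h defining them).

/* Sketch of the client-side eviction decision behind the
 * "evicting (not initial connect and flags reconnect/recovering not set: 4)"
 * message above; NOT the actual ptlrpc code. Flag values are assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define MSG_CONNECT_RECOVERING 0x00000001
#define MSG_CONNECT_RECONNECT  0x00000002
#define MSG_CONNECT_REPLAYABLE 0x00000004

/* Returns true when the client would conclude the server evicted it. */
static bool looks_evicted(bool initial_connect, unsigned int reply_flags)
{
        if (initial_connect)
                return false;
        /* Neither "I remember you, this is a reconnect" nor "I am still in
         * recovery" was set, so the client assumes the server forgot it. */
        return !(reply_flags & (MSG_CONNECT_RECONNECT | MSG_CONNECT_RECOVERING));
}

int main(void)
{
        /* The reply in the log above carried only MSG_CONNECT_REPLAYABLE (4). */
        printf("evicted: %d\n", looks_evicted(false, MSG_CONNECT_REPLAYABLE)); /* prints 1 */
        return 0;
}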