Lustre / LU-1036

1.8.7<->2.1.54 Test failure on test suite replay-single (52)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.9
    • Severity: 3
    • 5109

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3.

      client 1.8.7 <--> server 2.1.54

    Activity

            bobijam Zhenyu Xu added a comment -

            It's more like the LU-1473 issue.

            yujian Jian Yu added a comment -

            Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
            Lustre server: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)

            replay-single test 62 hit the same failure: https://maloo.whamcloud.com/test_sets/0c773904-15c2-11e3-87cb-52540035b04c

            yujian Jian Yu added a comment -

            Lustre client build: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
            Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/32/

            replay-single test_52 and test_62 also hit the same failure:

            Started lustre-MDT0000
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
             replay-single test_52: @@@@@@ FAIL: post-failover df: 1
            Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_52.*.1376740815.log
            
            Started lustre-MDT0000
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
             replay-single test_62: @@@@@@ FAIL: post-failover df: 1
            Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_62.*.1376741797.log
            

            Maloo report: https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c

            green Oleg Drokin added a comment -

            I see that the 1.8 no-interop times really jump around too, but mostly on the lower end of the scale.

            With interop the times also jump around, but at a somewhat higher scale. Also, this seems to be a long-standing thing, since there is this much of a difference between 1.8 and 2.3 in marking lock replay requests; 2.x non-interop mode should be much more consistent in having this test pass, for those reasons.

            Additionally, there appears to be a test problem of some sort triggering at times, where I do not see test 52 dropping a lock replay request on the MDT at all.

            As such, I think this does not warrant blocker priority for the 2.3 release (or possibly any other).

            yujian Jian Yu added a comment (edited) -

            1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with the default debug value

            Test logs are in /home/yujian/test_logs/2012-10-17/020854 on brent node.

            2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with the default debug value

            The first run finished, and test 52 failed with the same issue. The second run is ongoing. Test logs are in /home/yujian/test_logs/2012-10-17/061138 on brent node.

            yujian Jian Yu added a comment -

            > Is it possible to run only this test with 1.8 client/2.1 server interop and another time with 1.8 server/client and collect -1 logs, so as to compare them?

            Sure. Let me finish the following testing requested by Oleg first:

            1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with the default debug value
            2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with the default debug value

            Oleg wanted to see whether the times to recover fluctuate and how big the fluctuations are.

            The testing is still ongoing.

            bobijam Zhenyu Xu added a comment -

            Yujian,

            Is it possible to run only this test with 1.8 client/2.1 server interop and another time with 1.8 server/client and collect -1 logs, so as to compare them?

            green Oleg Drokin added a comment -

            Hm, only 7 seconds. Perhaps it needs a bit more ramp-up from the previous tests, or maybe it just fluctuates wildly; a few more runs, this time with more preceding tests, would be useful to gauge that, I think.
            Thank you!

            yujian Jian Yu added a comment -

            Lustre Version: 1.8.8-wc1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/198
            Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)
            ENABLE_QUOTA=yes

            replay-single test 52 passed with debug logs gathered:
            https://maloo.whamcloud.com/test_sets/555a95bc-181b-11e2-a6a7-52540035b04c

            green Oleg Drokin added a comment -

            Thanks for the timeline, Bobi!

            The client thinks the server evicted it because that is exactly what happened.
            The recovery window closed on the server at 1350224331:

            00010000:02000400:0.0:1350224331.521211:0:22717:0:(ldlm_lib.c:1773:target_recovery_overseer()) lustre-MDT0000: recovery is timed out, evict stale exports
            

            In fact the vm2 client reconnected much earlier and went through recovery as far as it could (up to the REPLAY_LOCKS phase), until it hit the wall of the server waiting for the other client.

            00000100:00080000:0:1350224277.260518:0:26771:0:(import.c:1287:ptlrpc_import_recovery_state_machine()) ffff880078705000 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS
            

            Then it proceeds to replay locks:

            00000100:00100000:0:1350224277.260584:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000532:10.10.4.160@tcp:101
            00000100:00100000:0:1350224277.260621:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
            00010000:00010000:0:1350224277.265317:0:26771:0:(ldlm_request.c:1924:replay_lock_interpret()) ### replayed lock: ns: lustre-MDT0000-mdc-ffff88007cd48400 lock: ffff88007b623a00/0xe8bdbcd6a2bd8f2f lrc: 2/0,0 mode: PR/PR res: 8589945617/84 bits 0x3 rrc: 1 type: IBT flags: 0x0 remote: 0x7d703d0db3f157a expref: -99 pid: 32215 timeout: 0
            00000100:00100000:0:1350224277.265325:0:26771:0:(client.c:1452:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000533:10.10.4.160@tcp:101
            

            Note how x1415810084000533 completes right away, but x1415810084000532 does not.

            Switching to the server side, we did receive the x1415810084000532 request there and attempted to send the reply:

            00000100:00100000:0.0:1350224277.260799:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+7:26771:x1415810084000532:12345-10.10.4.163@tcp:101
            00010000:00080000:0.0:1350224277.260806:0:22719:0:(ldlm_lib.c:2245:target_queue_recovery_request()) @@@ not queueing  req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/0 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/ffffffff rc 0/-1
            ...
            00010000:02000000:0.0:1350224277.260846:0:22719:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=30c, val=2147483648***
            00010000:00020000:0.0:1350224277.261517:0:22719:0:(ldlm_lib.c:2380:target_send_reply_msg()) @@@ dropping reply  req@ffff8800689cf850 x1415810084000532/t0(0) o101->d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp:0/0 lens 296/384 e 0 to 0 dl 1350224332 ref 1 fl Interpret:/0/0 rc 0/0
            00000100:00100000:0.0:1350224277.264622:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+8:26771:x1415810084000532:12345-10.10.4.163@tcp:101 Request procesed in 3825us (3860us total) trans 0 rc 0/0
            

            So req x1415810084000532 is "lost" - the test is working as expected, timing out lock recovery.
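
            For reference, a minimal sketch of the fail_loc mechanism the test relies on - the 0x8000030c value and the 0x30c site are taken from the log lines above, while the constant names, mask and code shape below are illustrative placeholders rather than the real libcfs implementation:

            #include <stdbool.h>
            #include <stdio.h>

            /* Illustrative model only: the high bit is treated as a "fire once" flag
             * and the low bits select the fail site, mirroring what the log shows. */
            #define FAIL_ONCE_BIT  0x80000000u
            #define FAIL_SITE_MASK 0x0000ffffu

            static unsigned int fail_loc = 0x8000030c;   /* lctl set_param fail_loc=0x8000030c */

            static bool fail_check(unsigned int site)
            {
                if ((fail_loc & FAIL_SITE_MASK) != site)
                    return false;
                if (fail_loc & FAIL_ONCE_BIT)
                    fail_loc = 0;                        /* one-shot: disarm after first hit */
                return true;
            }

            static void target_send_reply(unsigned long long xid)
            {
                if (fail_check(0x30c)) {                 /* the "dropping reply" case above */
                    printf("dropping reply for x%llu\n", xid);
                    return;
                }
                printf("sending reply for x%llu\n", xid);
            }

            int main(void)
            {
                target_send_reply(1415810084000532ULL); /* reply dropped, client keeps waiting */
                target_send_reply(1415810084000533ULL); /* fail_loc already disarmed, reply sent */
                return 0;
            }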

            Now, some 60 seconds later, on the client:

            00000100:00000400:0:1350224337.260658:0:26771:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1415810084000532 sent from lustre-MDT0000-mdc-ffff88007cd48400 to NID 10.10.4.160@tcp 60s ago has timed out (60s prior to deadline).
            

            By which time the server has already timed out the recovery.

            So the problem at hand is that the server did not extend the deadline. Now, reading through the code, it appears that we only extend the timer on requests placed in the various recovery queues, but to get there the lock replay must be sent with the MSG_REQ_REPLAY_DONE flag set, and only 2.x clients set it.
            1.8 clients don't set the flag, so the timer is not extended.
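
            A minimal sketch of that difference, with a placeholder flag value and simplified types rather than the actual Lustre definitions:

            #include <stdbool.h>
            #include <stdio.h>

            #define MSG_REQ_REPLAY_DONE 0x0008           /* placeholder value for illustration */

            struct replay_req {
                unsigned int msg_flags;                  /* flags the client put on the lock replay */
            };

            /* The recovery timer is only refreshed for requests that reach the recovery
             * queues, and a lock replay is only queued when it carries MSG_REQ_REPLAY_DONE. */
            static bool extends_recovery_timer(const struct replay_req *req)
            {
                return (req->msg_flags & MSG_REQ_REPLAY_DONE) != 0;
            }

            int main(void)
            {
                struct replay_req from_2x  = { .msg_flags = MSG_REQ_REPLAY_DONE }; /* 2.x client */
                struct replay_req from_1_8 = { .msg_flags = 0 };                   /* 1.8 client */

                printf("2.x lock replay extends recovery timer: %s\n",
                       extends_recovery_timer(&from_2x) ? "yes" : "no");
                printf("1.8 lock replay extends recovery timer: %s\n",
                       extends_recovery_timer(&from_1_8) ? "yes" : "no");
                return 0;
            }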

            Now, in the 1.8 server case there are no separate lock queues at all, and I suspect the initial timeout is just much higher (I don't have a log nearby; I'll try to look for something in Maloo shortly), which leaves enough margin for the reconnect to still succeed.

            Hm, actually I just looked into the Maloo report, and it seems the reconnect there arrives ~50 seconds after the lost message, instead of 60 seconds as in the 2.1 interop case. Sadly we don't collect any debug logs on success, so I cannot compare the results. I'll try to repeat the 1.8 test locally and see what's inside.

            The entire test is somewhat of a fringe case where we have a double failure of sorts - first the server dies and then one of the lock replies is lost - so it does not warrant a blocker in my opinion. But I am curious to see why the 1.8-to-1.8 case replays sooner than during interop, so I'll make a final decision after that.

            bobijam Zhenyu Xu added a comment -

            Here is how I interpret the debug logs between the evicted client and the MDS for the latest Maloo report (https://maloo.whamcloud.com/test_sets/a728b006-169d-11e2-962d-52540035b04c):

            At 1350224248, the replay barrier was created and the fail_loc was set on the MDS:

            00000001:02000400:0.0:1350224248.707485:0:22372:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mds REPLAY BARRIER on lustre-MDT0000
            00000001:02000400:0.0:1350224248.778103:0:22396:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mds REPLAY BARRIER on lustre-MDT0000
            00000001:02000400:0.0:1350224248.852376:0:22419:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param fail_loc=0x8000030c
            

            At 1350224255, the MDS was stopped:

            00000020:02000400:0.0:1350224255.305890:0:22537:0:(obd_mount.c:1800:server_put_super()) server umount lustre-MDT0000 complete
            

            At 1350224265, the MDS device was mounted again:

            00000020:01000004:0.0:1350224265.767002:0:22687:0:(obd_mount.c:2554:lustre_fill_super()) Mounting server from /dev/mapper/lvm--MDS-P1
            

            At 1350224266, the client started to stat the Lustre filesystem:

            00000100:00100000:0:1350224266.514726:0:32437:0:(client.c:2095:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc stat:d72a06d1-9e9c-4600-4007-cde887950e68:32437:x1415810084000511:10.10.4.160@tcp:41
            00000100:00080000:0:1350224266.514729:0:32437:0:(client.c:2112:ptlrpc_queue_wait()) @@@ "stat" waiting for recovery: (FULL != DISCONN)  req@ffff88007741e000 x1415810084000511/t0 o41->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 192/528 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
            

            At 1350224337, the client reconnected to the server:

            00000100:00100000:0:1350224337.260692:0:26771:0:(client.c:1174:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:d72a06d1-9e9c-4600-4007-cde887950e68:26771:x1415810084000631:10.10.4.160@tcp:38
            

            The server handled it:

            00000100:00100000:0.0:1350224337.260952:0:22719:0:(service.c:1788:ptlrpc_server_handle_req_in()) got req x1415810084000631
            00000100:00100000:0.0:1350224337.260963:0:22719:0:(service.c:1966:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:0+-99:26771:x1415810084000631:12345-10.10.4.163@tcp:38
            00010000:00080000:0.0:1350224337.260982:0:22719:0:(ldlm_lib.c:978:target_handle_connect()) lustre-MDT0000: connection from d72a06d1-9e9c-4600-4007-cde887950e68@10.10.4.163@tcp t223338299393 exp (null) cur 1350224337 last 0
            00000020:01000000:0.0:1350224337.261024:0:22719:0:(lprocfs_status.c:2007:lprocfs_exp_setup()) using hash ffff88007a6790c0
            00000100:00100000:0.0:1350224337.261055:0:22719:0:(service.c:2010:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:d72a06d1-9e9c-4600-4007-cde887950e68+5:26771:x1415810084000631:12345-10.10.4.163@tcp:38 Request procesed in 96us (118us total) trans 0 rc 0/0
            00000001:00100000:0.0:1350224337.294699:0:22693:0:(target.c:598:lut_cb_new_client()) lustre-MDT0000: committing for initial connect of d72a06d1-9e9c-4600-4007-cde887950e68
            

            while the client thought the server had evicted it:

            00000100:00080000:0:1350224337.261722:0:26771:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4)  req@ffff88007cd49400 x1415810084000631/t0 o38->lustre-MDT0000_UUID@10.10.4.160@tcp:12/10 lens 368/584 e 0 to 1 dl 1350224347 ref 1 fl Interpret:RN/0/0 rc 0/0
            00000100:00080000:0:1350224337.261726:0:26771:0:(import.c:911:ptlrpc_connect_interpret()) ffff880078705000 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED
            

            The reply msg flag is only 4 == MSG_CONNECT_REPLAYABLE; I guess there is a misunderstanding in how the connect flags are set, which caused the client to be evicted.
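
            To make that concrete, a simplified sketch of the client-side check behind the "evicting (not initial connect and flags reconnect/recovering not set: 4)" message - MSG_CONNECT_REPLAYABLE == 4 comes from the text above, while the other flag values and the function shape are assumptions:

            #include <stdbool.h>
            #include <stdio.h>

            /* MSG_CONNECT_REPLAYABLE == 4 is taken from the analysis above; the other
             * two values are assumed for illustration. */
            #define MSG_CONNECT_RECOVERING 0x00000001
            #define MSG_CONNECT_RECONNECT  0x00000002
            #define MSG_CONNECT_REPLAYABLE 0x00000004

            /* Simplified model of the client-side check: on a non-initial connect, a
             * reply carrying neither RECONNECT nor RECOVERING means "you were evicted". */
            static bool client_sees_eviction(bool initial_connect, unsigned int reply_flags)
            {
                if (initial_connect)
                    return false;
                return !(reply_flags & (MSG_CONNECT_RECONNECT | MSG_CONNECT_RECOVERING));
            }

            int main(void)
            {
                /* The reply in the log above carried only MSG_CONNECT_REPLAYABLE (== 4). */
                printf("client considers itself evicted: %s\n",
                       client_sees_eviction(false, MSG_CONNECT_REPLAYABLE) ? "yes" : "no");
                return 0;
            }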


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
