1.8.7<->2.1.54 Test failure on test suite replay-single (52)

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.9
    • 3
    • 5109

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/be8e0662-4784-11e1-9a77-5254004bbbd3.

      client 1.8.7 <--> server 2.1.54

          Activity

            adilger Andreas Dilger added a comment - Close old ticket.
            bobijam Zhenyu Xu added a comment -

            It looks more like the LU-1473 issue.

            yujian Jian Yu added a comment -

            Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
            Lustre server: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)

            replay-single test 62 hit the same failure:
            https://maloo.whamcloud.com/test_sets/0c773904-15c2-11e3-87cb-52540035b04c

            yujian Jian Yu added a comment -

            Lustre client build: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
            Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/32/

            replay-single test_52 and test_62 also hit the same failure:

            Started lustre-MDT0000
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
             replay-single test_52: @@@@@@ FAIL: post-failover df: 1
            Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_52.*.1376740815.log
            
            Started lustre-MDT0000
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
            client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
             replay-single test_62: @@@@@@ FAIL: post-failover df: 1
            Dumping lctl log to /logdir/test_logs/2013-08-16/lustre-b2_4-el6-x86_64-vs-lustre-b1_8-el6-x86_64--full--1_1_1__17447__-70153028489720-184414/replay-single.test_62.*.1376741797.log
            

            Maloo report: https://maloo.whamcloud.com/test_sets/c4278c32-07a5-11e3-927d-52540035b04c
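
When comparing runs like the ones above, it can help to pull the failing subtests out of the console output automatically. The following is a minimal sketch, assuming the failure lines keep the `<suite> test_<N>: @@@@@@ FAIL: <reason>` shape seen in the log excerpt; the names `FAIL_RE`, `find_failures`, and `SAMPLE` are made up for illustration and are not part of the Lustre test framework.

```python
import re

# Assumed shape of a failure line, taken from the excerpt above, e.g.
# " replay-single test_52: @@@@@@ FAIL: post-failover df: 1"
FAIL_RE = re.compile(
    r"^\s*(?P<suite>[\w-]+) (?P<test>test_\w+): @+ FAIL: (?P<reason>.*)$"
)


def find_failures(log_text):
    """Return a list of (suite, test, reason) tuples, one per FAIL line."""
    failures = []
    for line in log_text.splitlines():
        match = FAIL_RE.match(line)
        if match:
            failures.append(
                (match.group("suite"), match.group("test"),
                 match.group("reason").strip())
            )
    return failures


# Lines copied from the log excerpt in this comment (shortened).
SAMPLE = """\
Started lustre-MDT0000
client-23-ib: stat: cannot read file system information for `/mnt/lustre': Interrupted system call
 replay-single test_52: @@@@@@ FAIL: post-failover df: 1
Started lustre-MDT0000
 replay-single test_62: @@@@@@ FAIL: post-failover df: 1
"""

if __name__ == "__main__":
    for suite, test, reason in find_failures(SAMPLE):
        print(f"{suite} {test}: {reason}")
```

Lines that do not match the assumed pattern, such as the `stat` errors, are simply skipped, so the same function could be pointed at a full console log.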

            green Oleg Drokin added a comment -

            I see that the 1.8 non-interop times really jump around too, but mostly at the lower end of the scale.

            With interop the times also jump around, but at a somewhat higher scale. This also seems to be a long-standing issue, since 1.8 and 2.3 differ this much in how they mark lock replay requests; for that reason, 2.x in non-interop mode should pass this test much more consistently.

            Additionally, there appears to be a test problem of some sort that triggers at times: in those runs I do not see test 52 dropping a lock replay request on the MDT at all.

            As such, I think this does not warrant blocker priority for the 2.3 release (or possibly any other).

            yujian Jian Yu added a comment - - edited

            1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value

            Test logs are in /home/yujian/test_logs/2012-10-17/020854 on brent node.

            2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value

            The first run has finished; test 52 failed with the same issue. The second run is ongoing. Test logs are in /home/yujian/test_logs/2012-10-17/061138 on brent node.

            yujian Jian Yu added a comment -

            Is it possible to run only this test with 1.8client/2.1 server interop and another time with 1.8 server/client and collect -1 logs? So as to compare them.

            Sure. Let me first finish the following testing requested by Oleg:

            1) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients and servers with default debug value
            2) run a full replay-single.sh multiple times on Lustre 1.8.8-wc1 clients with 2.3.0 servers with default debug value

            Oleg wanted to see whether the times to recover fluctuate and how big fluctuations are.

            The testing is still ongoing.

            bobijam Zhenyu Xu added a comment -

            Yujian,

            Is it possible to run only this test with 1.8client/2.1 server interop and another time with 1.8 server/client and collect -1 logs? So as to compare them.

            green Oleg Drokin added a comment -

            Hm, only 7 seconds. Perhaps it needs a bit more ramp-up from previous tests, or it might just fluctuate wildly. I think a few more runs, this time with more preceding tests, would be useful to gauge that.
            Thank you!

            yujian Jian Yu added a comment -

            Lustre Version: 1.8.8-wc1
            Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/198
            Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)
            ENABLE_QUOTA=yes

            replay-single test 52 passed with debug logs gathered:
            https://maloo.whamcloud.com/test_sets/555a95bc-181b-11e2-a6a7-52540035b04c


            People

              bobijam Zhenyu Xu
              maloo Maloo