
Failure on test suite replay-single, test_70b: dbench not found on some of the test nodes

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0, Lustre 2.1.6, Lustre 2.8.0
    • 3
    • 5743

    Description

      This issue was created by maloo for jay <jay@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c4620b6c-fc31-11e1-a4a6-52540035b04c.

      The sub-test test_70b failed with the following error:

      dbench not found on some of client-28vm1,client-28vm2.lab.whamcloud.com !

      Info required for matching: replay-single 70b

          Activity

            [LU-1897] Failure on test suite replay-single, test_70b: dbench not found on some of the test nodes
            sarah Sarah Liu added a comment -

            Closing this ticket, as the recent instances on master (tag-2.7.63 and tag-2.7.64) are caused by "No space left on device" errors:

            shadow-52vm6: [769] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 10033 (No space left on device)
            shadow-52vm6: (770) ERROR: handle 10033 was not found
            shadow-52vm6: Child failed with status 1
            shadow-52vm6: touch: missing file operand
            shadow-52vm6: Try 'touch --help' for more information.
            shadow-52vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 21 sec
            shadow-52vm1:    1       733     0.17 MB/sec  warmup 124 sec  latency 53560.001 ms
            shadow-52vm5:    1       791     0.17 MB/sec  warmup 124 sec  latency 54097.359 ms
            shadow-52vm1:    1       733     0.17 MB/sec  warmup 125 sec  latency 1340.522 ms
            shadow-52vm5:    1       794     0.17 MB/sec  warmup 125 sec  latency 54380.844 ms
            CMD: shadow-52vm1.shadow.whamcloud.com,shadow-52vm5,shadow-52vm6 killall -0 dbench
            shadow-52vm6: dbench: no process found
             replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of shadow-52vm1.shadow.whamcloud.com,shadow-52vm5,shadow-52vm6! 
            shadow-52vm1:    1       733     0.17 MB/sec  warmup 126 sec  latency 2340.713 ms
            shadow-52vm5:    1       794     0.17 MB/sec  warmup 126 sec  latency 1715.056 ms
            
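In the log above, dbench exits on one client after an open fails with "No space left on device", and the later `killall -0 dbench` check then reports it as stopped. One way to tell this case apart from a genuine recovery failure is to check free space before starting the load. The following is a minimal sketch under that assumption; `has_free_kb` and its arguments are hypothetical names, not part of the Lustre test framework.

```shell
# has_free_kb DIR MIN_KB
# Hypothetical helper: succeed if the filesystem holding DIR reports
# at least MIN_KB kilobytes available.
has_free_kb() {
    local dir="$1"
    local min_kb="$2"
    local avail_kb

    # df -P gives the portable one-line-per-filesystem layout and -k
    # reports 1024-byte units; field 4 of the second line is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')

    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}

# Example (the mount point and threshold are assumptions):
#   has_free_kb /mnt/lustre 1048576 || echo "skipping: less than 1 GiB free"
```

This only looks at the aggregate free blocks that `df` reports. It would likely not catch other causes of the same error, such as a single full OST or inode exhaustion.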

            standan Saurabh Tandan (Inactive) added a comment -

            master, build# 3266, 2.7.64 tag
            Hard Failover: EL7 Server/SLES11 SP3 Client
            https://testing.hpdd.intel.com/test_sets/a8b3fb9e-a077-11e5-8d69-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment -

            master, build# 3264, 2.7.64 tag
            Hard Failover: EL7 Server/Client - ZFS
            https://testing.hpdd.intel.com/test_sets/5ba6d7bc-9e20-11e5-91b0-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment -

            Instance found in recent tags 2.7.63, 2.7.64

            standan Saurabh Tandan (Inactive) added a comment -

            master, build# 3264, 2.7.64 tag
            Hard Failover: EL6.7 Server/Client
            https://testing.hpdd.intel.com/test_sets/80a20678-9edd-11e5-87a9-5254006e85c2

            yujian Jian Yu added a comment -

            The issue also exists on Lustre b2_1 branch: https://maloo.whamcloud.com/test_sets/a0806aa0-c5ce-11e3-9255-52540035b04c

            pjones Peter Jones added a comment -

            Landed for 2.4

            keith Keith Mannthey (Inactive) added a comment -

            http://review.whamcloud.com/5761 version 1 of the patch has been submitted.

            keith Keith Mannthey (Inactive) added a comment -

            It looks like dbench didn't start on client-26vm5 fast enough for the first check?
            https://maloo.whamcloud.com/test_logs/6cc6de7a-8eb0-11e2-81eb-52540035b04c

            12 seconds isn't long enough to start dbench on a remote client... Fun. I will submit a patch.

            This is the console from the main client.

            15:52:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started rundbench load pid=1931 ...
            15:52:52:Lustre: DEBUG MARKER: Started rundbench load pid=1931 ...
            15:53:03:Lustre: DEBUG MARKER: killall -0 dbench
            15:53:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70b: @@@@@@ FAIL: dbench not running on some of client-26vm5,client-26vm6.lab.whamcloud.com! 

            In the autotest log you can see dbench start on the other client some time after this. It does run, just not at the right time.
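The timing problem described above is a single `killall -0 dbench` check made roughly 12 seconds after the load was launched. A common way to avoid this kind of race is to poll with a timeout instead of checking once. The following is a minimal sketch of that idea, not the patch submitted for this ticket; `wait_until` and its parameters are hypothetical names, not part of the Lustre test framework.

```shell
# wait_until TIMEOUT INTERVAL CMD [ARGS...]
# Hypothetical helper: re-run CMD every INTERVAL seconds until it
# succeeds, or give up once TIMEOUT seconds have been waited.
# Returns 0 if CMD succeeded in time, 1 otherwise.
wait_until() {
    local timeout="$1"
    local interval="$2"
    shift 2
    local waited=0

    # The probe's own output is discarded; only its exit status matters.
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example (the 60-second limit is an assumption, not a value from the ticket):
#   wait_until 60 1 killall -0 dbench || echo "dbench did not start in time"
```

Passing the probe as a command keeps the helper independent of how the check is run. For remote clients, the same loop could wrap whatever remote-execution command the test already uses.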
            utopiabound Nathaniel Clark added a comment - I think this failed here: https://maloo.whamcloud.com/test_sets/43757b76-8eb0-11e2-81eb-52540035b04c

            keith Keith Mannthey (Inactive) added a comment -

            http://review.whamcloud.com/4973 has been merged. Please reopen if the problem still occurs.

            People

              keith Keith Mannthey (Inactive)
              maloo Maloo