Lustre / LU-10616

replay-single test_70b fails with 'rundbench load on <hostname(s)> failed!'

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.15.1
    • Severity: 3

    Description

      replay-single test_70b fails with two error messages. The first is:

      replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!
      

      and later

      replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 failed! 
      

      Looking at the suite_log, we see

      CMD: onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 killall -0 dbench
      onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
      onyx-31vm1: (4) ERROR: handle 16385 was not found
      onyx-31vm1: Child failed with status 1
      onyx-31vm1: dbench: no process found
      onyx-31vm1: dbench: no process found
       replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2! 
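      For readers unfamiliar with the check in the log above: `killall -0 dbench` sends signal 0, which delivers nothing and only tests whether a matching process exists, so "dbench: no process found" means dbench had already exited when the test polled it. The following is a minimal sketch of the same probe for a single PID, assuming a POSIX system; the helper name `is_running` is hypothetical and is not part of the Lustre test framework.

```python
import os


def is_running(pid: int) -> bool:
    """Probe a PID with signal 0 (hypothetical helper, POSIX only).

    Signal 0 is never delivered; the kernel only performs the
    existence and permission checks, like `killall -0` does by name.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no process with this PID exists any more.
        return False
    except PermissionError:
        # EPERM: the process exists but is owned by another user.
        return True
    return True


if __name__ == "__main__":
    # The current process is certainly alive while this runs.
    print(is_running(os.getpid()))
```

      A probe like this can only say that the process is gone; the reason it exited has to come from the dbench output itself.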
      

      The only suspicious entry in the console logs is on MDS1, 3:

      [ 5354.241985] Lustre: DEBUG MARKER: Started rundbench load pid=3403 ...
      [ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
      [ 5354.753146] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2! 
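      The `rc = -2` in the LustreError line is a negative errno value; errno 2 is ENOENT, the same "No such file or directory" condition that dbench reports on the client. The following is a small sketch for decoding such return codes from a console log line. The function name `decode_rc` and the regular expression are illustrative assumptions, and the message text comes from the local C library, so it is only expected to match the kernel's meaning for common codes on Linux.

```python
import errno
import os
import re

# Console line copied from the MDS log quoted above.
LINE = (
    "[ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) "
    "lustre-MDT0000: can't lookup: rc = -2"
)

# Assumed pattern: Lustre messages commonly end with "rc = <number>".
RC_RE = re.compile(r"rc = (-?\d+)")


def decode_rc(line):
    """Return (rc, errno name, message) for a log line, or None if no rc is present."""
    match = RC_RE.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)  # kernel code reports errors as negative errno values
    return rc, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    print(decode_rc(LINE))
```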
      

      This test has failed this way many times, so far only in full test sessions with DNE and ZFS configured:
      2.10.57 el7 build 3703 – https://testing.hpdd.intel.com/test_sets/46a0b60a-078f-11e8-bd00-52540065bddc
      2.10.57 el7 build 3702 – https://testing.hpdd.intel.com/test_sets/13cdeb9e-0352-11e8-a10a-52540065bddc
      2.10.57 el7 build 3700 – https://testing.hpdd.intel.com/test_sets/fa0a850e-014f-11e8-a6ad-52540065bddc
      2.10.57 el7 build 3697 – https://testing.hpdd.intel.com/test_sets/ebd4b25e-fd83-11e7-a7cd-52540065bddc
      2.10.57 el7 patchless build 59 – https://testing.hpdd.intel.com/test_sets/dee6191a-ffaf-11e7-a6ad-52540065bddc
      2.10.57 el7 patchless build 58 – https://testing.hpdd.intel.com/test_sets/16fa9310-fe7c-11e7-a6ad-52540065bddc
      2.10.56 el7 build 3693 – https://testing.hpdd.intel.com/test_sets/d309f58a-f77b-11e7-bd00-52540065bddc
      2.10.56 el7 patchless build 53 – https://testing.hpdd.intel.com/test_sets/38f48bae-f636-11e7-94c7-52540065bddc
      2.10.56 el7 patchless build 50 – https://testing.hpdd.intel.com/test_sets/c46aeb7c-f228-11e7-8c43-52540065bddc
      2.10.56 el7 build 3685 – https://testing.hpdd.intel.com/test_sets/6c00afc0-e7c0-11e7-8027-52540065bddc
      2.10.56 el7 patchless build 44 – https://testing.hpdd.intel.com/test_sets/53f8d684-e674-11e7-a066-52540065bddc

          Activity

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment - +1 on master https://testing.whamcloud.com/test_sets/ee775d6f-00db-41b2-ad02-d4ae7e31ce6c
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/1c452361-2846-41bd-af35-995e1de3fd99
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/b5f87cba-d087-45dc-85ef-e1005ef15186

            eaujames Etienne Aujames added a comment -

            Hello,

            I have the same kind of messages on: https://testing.whamcloud.com/sub_tests/a7c49599-b0e2-49e1-a4af-9111f676fdcf

            Except the message:

            trevis-9vm4: [341] open ./clients/client0/~dmtmp/PARADOX/COURSES.DB failed for handle 9977 (Stale file handle)

            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/9333fec4-2406-11ea-b1e8-52540065bddc
            hornc Chris Horn added a comment - +1 https://testing.whamcloud.com/sub_tests/13bce654-fc76-11e9-98f1-52540065bddc
            sarah Sarah Liu added a comment - +1 on master 2.11.51 failover https://testing.hpdd.intel.com/test_sets/7d85d5ce-492f-11e8-960d-52540065bddc

            jamesanunez James Nunez (Inactive) added a comment - From John Hammond, it looks like there is an issue with dbench start up as seen in the suite_log

            trevis-11vm1: running 'dbench 1 -t 300' on /mnt/lustre/d70b.replay-single/trevis-11vm1.trevis.hpdd.intel.com at Thu Feb  1 01:34:50 UTC 2018
            trevis-11vm1: dbench PID=30955
            trevis-11vm1: dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
            trevis-11vm1: 
            trevis-11vm1: Running for 300 seconds with load 'client.txt' and minimum warmup 60 secs
            trevis-11vm1: failed to create barrier semaphore 
            trevis-11vm1: 0 of 1 processes prepared for launch   0 sec
            trevis-11vm2: dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
            trevis-11vm2: 
            trevis-11vm2: Running for 300 seconds with load 'client.txt' and minimum warmup 60 secs
            trevis-11vm2: failed to create barrier semaphore 
            trevis-11vm2: 0 of 1 processes prepared for launch   0 sec
            CMD: trevis-11vm1.trevis.hpdd.intel.com,trevis-11vm2 killall -0 dbench
            trevis-11vm1: 1 of 1 processes prepared for launch   0 sec
            trevis-11vm1: releasing clients
            trevis-11vm2: 1 of 1 processes prepared for launch   0 sec
            trevis-11vm2: releasing clients
            trevis-11vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
            trevis-11vm1: (4) ERROR: handle 16385 was not found
            trevis-11vm1: Child failed with status 1
            trevis-11vm1: dbench: no process found
            trevis-11vm1: dbench: no process found
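
            When triaging further occurrences, it may help to summarize per client which dbench failure signatures appear in a suite_log. The following is a rough sketch, assuming each line carries a "host: " prefix as in the excerpt above. The pattern names and the `summarize` function are hypothetical and not part of any Lustre tooling, and the sample text is a subset of the lines quoted in this ticket.

```python
import re
from collections import defaultdict

# Subset of the suite_log lines quoted in this ticket.
SAMPLE = """\
trevis-11vm1: failed to create barrier semaphore
trevis-11vm1: 0 of 1 processes prepared for launch   0 sec
trevis-11vm2: failed to create barrier semaphore
CMD: trevis-11vm1.trevis.hpdd.intel.com,trevis-11vm2 killall -0 dbench
trevis-11vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
trevis-11vm1: Child failed with status 1
"""

# Illustrative signatures; extend as new failure modes are observed.
PATTERNS = {
    "barrier_semaphore": re.compile(r"failed to create barrier semaphore"),
    "open_failed": re.compile(r"open \S+ failed for handle \d+ \((.+)\)"),
    "child_failed": re.compile(r"Child failed with status \d+"),
}


def summarize(log_text):
    """Map each host prefix to the sorted list of signatures seen for it."""
    hits = defaultdict(set)
    for line in log_text.splitlines():
        host, sep, msg = line.partition(": ")
        if not sep or " " in host:
            continue  # not a "host: message" line
        for name, pattern in PATTERNS.items():
            if pattern.search(msg):
                hits[host].add(name)
    return {host: sorted(names) for host, names in hits.items()}


if __name__ == "__main__":
    for host, names in summarize(SAMPLE).items():
        print(host, names)
```

            A summary like this shows whether the barrier semaphore failure is present on every client that later loses its dbench process, or only on some of them.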

            People

              Assignee: laisiyao Lai Siyao
              Reporter: jamesanunez James Nunez (Inactive)