replay-single test_70b fails with 'rundbench load on <hostname(s)> failed!'

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.15.1
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      replay-single test_70b fails with two error messages. The first is:

      replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!
      

      and later:

      replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 failed! 
      

      The suite_log shows:

      CMD: onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 killall -0 dbench
      onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
      onyx-31vm1: (4) ERROR: handle 16385 was not found
      onyx-31vm1: Child failed with status 1
      onyx-31vm1: dbench: no process found
      onyx-31vm1: dbench: no process found
       replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2! 
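
When comparing many failed sessions, it can help to pull the dbench open failures out of each suite_log and group them by error string. The following is a minimal sketch, assuming the line format shown in the excerpt above. The function names and the `tag` field (the bracketed number, whose meaning is not stated in this report) are hypothetical and are not part of the Lustre test framework. The second sample line comes from a later comment on this ticket.

```python
import re
from collections import Counter

# Matches lines such as:
#   onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
# "tag" is the bracketed number; its meaning is an assumption left to the reader.
DBENCH_OPEN_FAIL = re.compile(
    r"^\s*(?P<host>[\w.-]+): \[(?P<tag>\d+)\] open (?P<path>\S+) "
    r"failed for handle (?P<handle>\d+) \((?P<error>[^)]+)\)\s*$"
)


def parse_open_failures(log_text):
    """Return one dict per dbench 'open ... failed' line found in log_text."""
    failures = []
    for raw in log_text.splitlines():
        match = DBENCH_OPEN_FAIL.match(raw)
        if match:
            record = match.groupdict()
            record["tag"] = int(record["tag"])
            record["handle"] = int(record["handle"])
            failures.append(record)
    return failures


def count_by_error(failures):
    """Count failures per error string, e.g. to separate ENOENT from ESTALE reports."""
    return Counter(f["error"] for f in failures)


# Sample lines copied from this ticket (two different test sessions).
SAMPLE = """\
CMD: onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 killall -0 dbench
onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
onyx-31vm1: (4) ERROR: handle 16385 was not found
onyx-31vm1: Child failed with status 1
trevis-9vm4: [341] open ./clients/client0/~dmtmp/PARADOX/COURSES.DB failed for handle 9977 (Stale file handle)
"""

if __name__ == "__main__":
    found = parse_open_failures(SAMPLE)
    for failure in found:
        print(failure["host"], failure["path"], failure["error"])
    print(count_by_error(found))
```

Grouping by error string keeps the "No such file or directory" and "Stale file handle" variants apart, since they may or may not share a cause.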
      

      The only suspicious entry in the console logs is on MDS1, 3:

      [ 5354.241985] Lustre: DEBUG MARKER: Started rundbench load pid=3403 ...
      [ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
      [ 5354.753146] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2! 
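
The `rc = -2` in the LustreError line is a negated errno: 2 is ENOENT ("No such file or directory"), the same error string that dbench reports on the client. That match is suggestive but does not by itself show the two events are related. The sketch below parses a console line of this shape and decodes the return code. It assumes the line layout shown above; the function names are hypothetical, and the decoding uses the errno table of the host running the script, which matches the kernel's values on Linux.

```python
import errno
import os
import re

# Matches console lines such as:
#   [ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
LUSTRE_ERROR = re.compile(
    r"^\s*\[\s*(?P<ts>\d+\.\d+)\] LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<lineno>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)
RC = re.compile(r"rc = (-?\d+)")


def decode_rc(rc):
    """Map a negative return code to (errno name, message); None if rc >= 0."""
    if rc >= 0:
        return None
    code = -rc
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


def parse_lustre_error(line):
    """Parse one LustreError console line; return None for any other line."""
    match = LUSTRE_ERROR.match(line)
    if not match:
        return None
    record = match.groupdict()
    record["ts"] = float(record["ts"])
    record["pid"] = int(record["pid"])
    record["lineno"] = int(record["lineno"])
    rc_match = RC.search(record["msg"])
    record["rc"] = int(rc_match.group(1)) if rc_match else None
    record["errno"] = decode_rc(record["rc"]) if record["rc"] is not None else None
    return record


# Sample line copied from the MDS console log in this ticket.
SAMPLE_LINE = (
    "[ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) "
    "lustre-MDT0000: can't lookup: rc = -2"
)

if __name__ == "__main__":
    parsed = parse_lustre_error(SAMPLE_LINE)
    print(parsed["func"], parsed["rc"], parsed["errno"])
```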
      

      This test has failed this way many times, so far only in full test sessions configured with DNE and ZFS:
      2.10.57 el7 build 3703 – https://testing.hpdd.intel.com/test_sets/46a0b60a-078f-11e8-bd00-52540065bddc
      2.10.57 el7 build 3702 – https://testing.hpdd.intel.com/test_sets/13cdeb9e-0352-11e8-a10a-52540065bddc
      2.10.57 el7 build 3700 – https://testing.hpdd.intel.com/test_sets/fa0a850e-014f-11e8-a6ad-52540065bddc
      2.10.57 el7 build 3697 – https://testing.hpdd.intel.com/test_sets/ebd4b25e-fd83-11e7-a7cd-52540065bddc
      2.10.57 el7 patchless build 59 – https://testing.hpdd.intel.com/test_sets/dee6191a-ffaf-11e7-a6ad-52540065bddc
      2.10.57 el7 patchless build 58 – https://testing.hpdd.intel.com/test_sets/16fa9310-fe7c-11e7-a6ad-52540065bddc
      2.10.56 el7 build 3693 – https://testing.hpdd.intel.com/test_sets/d309f58a-f77b-11e7-bd00-52540065bddc
      2.10.56 el7 patchless build 53 – https://testing.hpdd.intel.com/test_sets/38f48bae-f636-11e7-94c7-52540065bddc
      2.10.56 el7 patchless build 50 – https://testing.hpdd.intel.com/test_sets/c46aeb7c-f228-11e7-8c43-52540065bddc
      2.10.56 el7 build 3685 – https://testing.hpdd.intel.com/test_sets/6c00afc0-e7c0-11e7-8027-52540065bddc
      2.10.56 el7 patchless build 44 – https://testing.hpdd.intel.com/test_sets/53f8d684-e674-11e7-a066-52540065bddc

          Activity

            [LU-10616] replay-single test_70b fails with 'rundbench load on <hostname(s)> failed!'

            adilger Andreas Dilger added a comment:

            Lai, should replay-single test_70b be updated to add "stack_trap fail_abort_cleanup" so that it can clean up afterward? However, while the test does fail over (via test-framework.sh::fail()->facet_failover()), it doesn't look like this subtest is actually aborting recovery, so it shouldn't be seeing this kind of problem.

            This subtest is failing pretty regularly; could you please investigate why it is having problems during recovery? It should be possible to use "Test-Parameters: fortestonly testlist=replay-single env=ONLY=70b,ONLY_REPEAT=100 livedebug" to run 70b until the failure is hit and then leave the node in that state so you can log in and debug.
            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment - +1 https://testing.whamcloud.com/test_sets/15d4b4b3-8a48-4743-b935-bf96afb0e27d
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/d3c778e5-e533-4a1d-8dce-263b64809701
            qian_wc Qian Yingjin added a comment - +1 on master https://testing.whamcloud.com/test_sets/c154d88e-a784-4023-9c59-f40662559bea
            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment - +1 on master https://testing.whamcloud.com/test_sets/ee775d6f-00db-41b2-ad02-d4ae7e31ce6c
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/1c452361-2846-41bd-af35-995e1de3fd99
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/b5f87cba-d087-45dc-85ef-e1005ef15186

            eaujames Etienne Aujames added a comment:

            Hello,

            I see the same kind of messages in https://testing.whamcloud.com/sub_tests/a7c49599-b0e2-49e1-a4af-9111f676fdcf, except for this one:

            trevis-9vm4: [341] open ./clients/client0/~dmtmp/PARADOX/COURSES.DB failed for handle 9977 (Stale file handle)
            adilger Andreas Dilger added a comment - +1 on master: https://testing.whamcloud.com/test_sets/9333fec4-2406-11ea-b1e8-52540065bddc
            hornc Chris Horn added a comment - +1 https://testing.whamcloud.com/sub_tests/13bce654-fc76-11e9-98f1-52540065bddc

            People

              Assignee: laisiyao Lai Siyao
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 8