Details
Description
replay-single test_70b fails with two error messages:
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!
and later
replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 failed!
Looking at the suite_log, we see:
CMD: onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 killall -0 dbench
onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
onyx-31vm1: (4) ERROR: handle 16385 was not found
onyx-31vm1: Child failed with status 1
onyx-31vm1: dbench: no process found
onyx-31vm1: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!
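For context, "killall -0 dbench" in the CMD line above is a liveness probe: signal 0 is never delivered, but the command fails if no process named dbench exists, so it answers "is dbench still running?" on each client. A minimal sketch of that kind of check (not the actual test-framework code; the host list and ssh transport here are assumptions):
for client in onyx-31vm1 onyx-31vm2; do
    # killall -0 sends no signal; it only reports (via exit status)
    # whether a process named "dbench" exists on the node
    ssh "$client" killall -0 dbench ||
        { echo "dbench stopped on $client"; exit 1; }
done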
The only thing that looks suspicious in the console logs is a failed object lookup (rc = -2, i.e. -ENOENT) on the MDS1,3 node:
[ 5354.241985] Lustre: DEBUG MARKER: Started rundbench load pid=3403 ...
[ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
[ 5354.753146] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!
This test has failed in this way many times so far, but only in full test sessions configured with DNE and ZFS:
2.10.57 el7 build 3703 - https://testing.hpdd.intel.com/test_sets/46a0b60a-078f-11e8-bd00-52540065bddc
2.10.57 el7 build 3702 - https://testing.hpdd.intel.com/test_sets/13cdeb9e-0352-11e8-a10a-52540065bddc
2.10.57 el7 build 3700 - https://testing.hpdd.intel.com/test_sets/fa0a850e-014f-11e8-a6ad-52540065bddc
2.10.57 el7 build 3697 - https://testing.hpdd.intel.com/test_sets/ebd4b25e-fd83-11e7-a7cd-52540065bddc
2.10.57 el7 patchless build 59 - https://testing.hpdd.intel.com/test_sets/dee6191a-ffaf-11e7-a6ad-52540065bddc
2.10.57 el7 patchless build 58 - https://testing.hpdd.intel.com/test_sets/16fa9310-fe7c-11e7-a6ad-52540065bddc
2.10.56 el7 build 3693 - https://testing.hpdd.intel.com/test_sets/d309f58a-f77b-11e7-bd00-52540065bddc
2.10.56 el7 patchless build 53 - https://testing.hpdd.intel.com/test_sets/38f48bae-f636-11e7-94c7-52540065bddc
2.10.56 el7 patchless build 50 - https://testing.hpdd.intel.com/test_sets/c46aeb7c-f228-11e7-8c43-52540065bddc
2.10.56 el7 build 3685 - https://testing.hpdd.intel.com/test_sets/6c00afc0-e7c0-11e7-8027-52540065bddc
2.10.56 el7 patchless build 44 - https://testing.hpdd.intel.com/test_sets/53f8d684-e674-11e7-a066-52540065bddc
Issue Links
- is duplicated by:
  - LU-14791 replay-single: rundbench load on trevis-66vm1.trevis.whamcloud.com,trevis-66vm2 failed! (Resolved)
  - LU-14813 replay-single: test_70b dbench failed (Resolved)
- is related to:
  - LU-16336 LFSCK should fix inconsistencies caused by recovery abort (Open)
  - LU-16065 replay-single test_81a: rm remote dir failed (Open)
  - LU-15624 replay-single and ost-pools failed: rm: cannot remove 'd70b.replay-single': Directory not empty (Open)
Lai, should replay-single test_70b be updated to add "stack_trap fail_abort_cleanup" so that it can clean up after itself? However, while the test does perform failover (via test-framework.sh::fail()->facet_failover()), it doesn't look like this subtest actually aborts recovery, so it shouldn't be hitting this kind of problem.
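For illustration, a minimal sketch of that suggestion, assuming the stack_trap helper and a fail_abort_cleanup function as named above (the exact placement inside the real test_70b body is an assumption):
test_70b() {
    # register cleanup to run when the subtest exits, even on failure,
    # so any state left behind by an aborted recovery is removed
    # before the next subtest runs
    stack_trap fail_abort_cleanup

    # ... existing test body unchanged ...
}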
This subtest is failing pretty regularly; could you please investigate why it is having problems during recovery? It should be possible to use "Test-Parameters: fortestonly testlist=replay-single env=ONLY=70b,ONLY_REPEAT=100 livedebug" to run test 70b repeatedly until the failure is hit and then leave the nodes in that state for interactive login and debugging.
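For reference, roughly the same repeat loop can be run by hand on an already-configured test cluster; a sketch, assuming the standard Lustre tests install path and that this test-framework version honors ONLY and ONLY_REPEAT:
cd /usr/lib64/lustre/tests    # path varies by distro/install
# run only subtest 70b, repeated up to 100 times
ONLY=70b ONLY_REPEAT=100 ./replay-single.sh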