[LU-10616] replay-single test_70b fails with 'rundbench load on <hostname(s)> failed!' Created: 06/Feb/18 Updated: 24/Jan/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.15.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | dne, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_70b fails with two error messages:

replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!

and later

replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 failed!

Looking at the suite_log, we see

CMD: onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 killall -0 dbench
onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
onyx-31vm1: (4) ERROR: handle 16385 was not found
onyx-31vm1: Child failed with status 1
onyx-31vm1: dbench: no process found
onyx-31vm1: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!

The only thing that looks suspicious in the console logs is on MDS 1,3:

[ 5354.241985] Lustre: DEBUG MARKER: Started rundbench load pid=3403 ...
[ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
[ 5354.753146] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!

So far, this test has failed in this way many times, but only in full test sessions with DNE configured and ZFS: |
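For anyone trying to reproduce the failing load by hand, the following is a minimal sketch built only from what is visible in the suite_log above (the dbench invocation, the working directory, and the killall -0 liveness probe); it is not the actual rundbench wrapper used by the test framework.

# Sketch only: re-create the per-client load from the suite_log by hand.
# The directory name comes from the log above; adjust for the node under test.
TESTDIR=/mnt/lustre/d70b.replay-single/$(hostname)
mkdir -p "$TESTDIR" && cd "$TESTDIR"

# Same load the test reports: one dbench process for 300 seconds.
dbench 1 -t 300 &

# The framework's liveness check amounts to this probe on each client node;
# "killall -0" only tests for a running dbench process, it does not signal it.
killall -0 dbench && echo "dbench still running" || echo "dbench stopped"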
| Comments |
| Comment by James Nunez (Inactive) [ 09/Feb/18 ] |
|
From John Hammond: it looks like there is an issue with dbench start-up, as seen in the suite_log:

trevis-11vm1: running 'dbench 1 -t 300' on /mnt/lustre/d70b.replay-single/trevis-11vm1.trevis.hpdd.intel.com at Thu Feb 1 01:34:50 UTC 2018
trevis-11vm1: dbench PID=30955
trevis-11vm1: dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
trevis-11vm1:
trevis-11vm1: Running for 300 seconds with load 'client.txt' and minimum warmup 60 secs
trevis-11vm1: failed to create barrier semaphore
trevis-11vm1: 0 of 1 processes prepared for launch 0 sec
trevis-11vm2: dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
trevis-11vm2:
trevis-11vm2: Running for 300 seconds with load 'client.txt' and minimum warmup 60 secs
trevis-11vm2: failed to create barrier semaphore
trevis-11vm2: 0 of 1 processes prepared for launch 0 sec
CMD: trevis-11vm1.trevis.hpdd.intel.com,trevis-11vm2 killall -0 dbench
trevis-11vm1: 1 of 1 processes prepared for launch 0 sec
trevis-11vm1: releasing clients
trevis-11vm2: 1 of 1 processes prepared for launch 0 sec
trevis-11vm2: releasing clients
trevis-11vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
trevis-11vm1: (4) ERROR: handle 16385 was not found
trevis-11vm1: Child failed with status 1
trevis-11vm1: dbench: no process found
trevis-11vm1: dbench: no process found |
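The "failed to create barrier semaphore" lines suggest dbench could not set up its launch barrier. Assuming that barrier is a SysV semaphore (the message wording suggests it; the dbench source was not checked here), a quick look at the semaphore limits and any leftover sets on the client VMs would show whether earlier, uncleanly killed dbench runs have exhausted the resource:

# Sketch only: check SysV semaphore limits and leftovers on the client nodes.
cat /proc/sys/kernel/sem   # SEMMSL SEMMNS SEMOPM SEMMNI limits
ipcs -s                    # existing semaphore sets and their owners
ipcs -u                    # current usage summary against those limits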
| Comment by Sarah Liu [ 02/May/18 ] |
|
+1 on master 2.11.51 failover https://testing.hpdd.intel.com/test_sets/7d85d5ce-492f-11e8-960d-52540065bddc |
| Comment by Chris Horn [ 01/Nov/19 ] |
|
+1 https://testing.whamcloud.com/sub_tests/13bce654-fc76-11e9-98f1-52540065bddc |
| Comment by Andreas Dilger [ 07/Jan/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/9333fec4-2406-11ea-b1e8-52540065bddc |
| Comment by Etienne Aujames [ 30/Nov/20 ] |
|
Hello, I see the same kind of messages in https://testing.whamcloud.com/sub_tests/a7c49599-b0e2-49e1-a4af-9111f676fdcf, except that the failing open reports a different error:
trevis-9vm4: [341] open ./clients/client0/~dmtmp/PARADOX/COURSES.DB failed for handle 9977 (Stale file handle)
|
| Comment by Andreas Dilger [ 19/Jun/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/b5f87cba-d087-45dc-85ef-e1005ef15186 |
| Comment by Andreas Dilger [ 13/Aug/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/1c452361-2846-41bd-af35-995e1de3fd99 |
| Comment by Artem Blagodarenko (Inactive) [ 30/Nov/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/ee775d6f-00db-41b2-ad02-d4ae7e31ce6c |
| Comment by Qian Yingjin [ 27/Dec/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/c154d88e-a784-4023-9c59-f40662559bea |
| Comment by Andreas Dilger [ 18/Jan/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/d3c778e5-e533-4a1d-8dce-263b64809701 |
| Comment by Artem Blagodarenko (Inactive) [ 22/Jan/22 ] |
|
+1 https://testing.whamcloud.com/test_sets/15d4b4b3-8a48-4743-b935-bf96afb0e27d |
| Comment by Andreas Dilger [ 24/Jan/23 ] |
|
Lai, should replay-single test_70b be updated to add "stack_trap fail_abort_cleanup" so that it can clean up afterward? That said, while the test does failover (via test-framework.sh::fail()->facet_failover()), it doesn't look like this subtest actually aborts recovery, so it shouldn't be hitting this kind of problem. Since this subtest is failing pretty regularly, could you please investigate why it is having problems during recovery? It should be possible to use "Test-Parameters: fortestonly testlist=replay-single env=ONLY=70b,ONLY_REPEAT=100 livedebug" to run 70b until the failure is hit, then leave the node in that state to log in and debug. |
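A minimal sketch of where the suggested trap could go; the placement and the elided body of test_70b() are illustrative only, not a tested patch:

# In lustre/tests/replay-single.sh (sketch only):
test_70b() {
	# Suggested cleanup: undo aborted-recovery leftovers when the subtest
	# exits, whether it passes or fails.
	stack_trap fail_abort_cleanup

	# ... existing rundbench / fail() logic unchanged ...
}

A test-only patch carrying the Test-Parameters line quoted above would then repeat 70b until the failure reproduces and leave the nodes up for live debugging.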