[LU-10616] replay-single test_70b fails with 'rundbench load on <hostname(s)> failed!' Created: 06/Feb/18  Updated: 24/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.15.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: dne, zfs

Issue Links:
Duplicate
is duplicated by LU-14791 replay-single: rundbench load on trev... Resolved
is duplicated by LU-14813 replay-single: test_70b dbench failed Resolved
Related
is related to LU-15624 replay-single and ost-pools failed: r... Open
is related to LU-16336 LFSCK should fix inconsistencies caus... Open
is related to LU-16065 replay-single test_81a: rm remote dir... Open
Severity: 3

 Description   

replay-single test_70b fails with two error messages. First:

replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2!

and later

replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 failed! 

Looking at the suite_log, we see

CMD: onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2 killall -0 dbench
onyx-31vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
onyx-31vm1: (4) ERROR: handle 16385 was not found
onyx-31vm1: Child failed with status 1
onyx-31vm1: dbench: no process found
onyx-31vm1: dbench: no process found
 replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2! 
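
For context on the log above: `killall -0 dbench` sends signal 0, which delivers nothing and only reports whether a matching process exists. That is why "dbench: no process found" is followed directly by the "dbench stopped" failure. The same liveness check can be sketched for a single PID with `kill -0`. The helper name `is_alive` and the `sleep` stand-in for dbench are illustrative assumptions, not code from test-framework.sh.

```shell
# Hypothetical helper (not from test-framework.sh): succeeds only if the
# PID still exists and may be signalled. Signal 0 delivers nothing.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Stand-in for a dbench client: a background sleep.
sleep 30 &
pid=$!

# While the process runs, the signal-0 probe succeeds.
if is_alive "$pid"; then before=running; else before=gone; fi

# Terminate and reap it; afterwards the probe fails, which is the
# condition the test reports as "dbench stopped".
kill "$pid"
wait "$pid" 2>/dev/null || true
if is_alive "$pid"; then after=running; else after=gone; fi

echo "before=$before after=$after"
```

Note that the probe only says whether the process exists, not why it went away. Here the reason is in the preceding dbench output (the failed open of ./clients/client0).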

The only suspicious entry in the console logs is on MDS1, 3:

[ 5354.241985] Lustre: DEBUG MARKER: Started rundbench load pid=3403 ...
[ 5354.488828] LustreError: 12371:0:(osd_oi.c:978:osd_idc_find_or_init()) lustre-MDT0000: can't lookup: rc = -2
[ 5354.753146] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-31vm1.onyx.hpdd.intel.com,onyx-31vm2! 

This test has failed this way many times, so far only in full test sessions with DNE and ZFS configured:
2.10.57 el7 build 3703 – https://testing.hpdd.intel.com/test_sets/46a0b60a-078f-11e8-bd00-52540065bddc
2.10.57 el7 build 3702 – https://testing.hpdd.intel.com/test_sets/13cdeb9e-0352-11e8-a10a-52540065bddc
2.10.57 el7 build 3700 – https://testing.hpdd.intel.com/test_sets/fa0a850e-014f-11e8-a6ad-52540065bddc
2.10.57 el7 build 3697 – https://testing.hpdd.intel.com/test_sets/ebd4b25e-fd83-11e7-a7cd-52540065bddc
2.10.57 el7 patchless build 59 – https://testing.hpdd.intel.com/test_sets/dee6191a-ffaf-11e7-a6ad-52540065bddc
2.10.57 el7 patchless build 58 – https://testing.hpdd.intel.com/test_sets/16fa9310-fe7c-11e7-a6ad-52540065bddc
2.10.56 el7 build 3693 – https://testing.hpdd.intel.com/test_sets/d309f58a-f77b-11e7-bd00-52540065bddc
2.10.56 el7 patchless build 53 – https://testing.hpdd.intel.com/test_sets/38f48bae-f636-11e7-94c7-52540065bddc
2.10.56 el7 patchless build 50 – https://testing.hpdd.intel.com/test_sets/c46aeb7c-f228-11e7-8c43-52540065bddc
2.10.56 el7 build 3685 – https://testing.hpdd.intel.com/test_sets/6c00afc0-e7c0-11e7-8027-52540065bddc
2.10.56 el7 patchless build 44 – https://testing.hpdd.intel.com/test_sets/53f8d684-e674-11e7-a066-52540065bddc



 Comments   
Comment by James Nunez (Inactive) [ 09/Feb/18 ]

Per John Hammond, there appears to be a problem with dbench startup, as seen in the suite_log:

trevis-11vm1: running 'dbench 1 -t 300' on /mnt/lustre/d70b.replay-single/trevis-11vm1.trevis.hpdd.intel.com at Thu Feb  1 01:34:50 UTC 2018
trevis-11vm1: dbench PID=30955
trevis-11vm1: dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
trevis-11vm1: 
trevis-11vm1: Running for 300 seconds with load 'client.txt' and minimum warmup 60 secs
trevis-11vm1: failed to create barrier semaphore 
trevis-11vm1: 0 of 1 processes prepared for launch   0 sec
trevis-11vm2: dbench version 4.00 - Copyright Andrew Tridgell 1999-2004
trevis-11vm2: 
trevis-11vm2: Running for 300 seconds with load 'client.txt' and minimum warmup 60 secs
trevis-11vm2: failed to create barrier semaphore 
trevis-11vm2: 0 of 1 processes prepared for launch   0 sec
CMD: trevis-11vm1.trevis.hpdd.intel.com,trevis-11vm2 killall -0 dbench
trevis-11vm1: 1 of 1 processes prepared for launch   0 sec
trevis-11vm1: releasing clients
trevis-11vm2: 1 of 1 processes prepared for launch   0 sec
trevis-11vm2: releasing clients
trevis-11vm1: [3] open ./clients/client0 failed for handle 16385 (No such file or directory)
trevis-11vm1: (4) ERROR: handle 16385 was not found
trevis-11vm1: Child failed with status 1
trevis-11vm1: dbench: no process found
trevis-11vm1: dbench: no process found
Comment by Sarah Liu [ 02/May/18 ]

+1 on master 2.11.51 failover

https://testing.hpdd.intel.com/test_sets/7d85d5ce-492f-11e8-960d-52540065bddc

Comment by Chris Horn [ 01/Nov/19 ]

+1 https://testing.whamcloud.com/sub_tests/13bce654-fc76-11e9-98f1-52540065bddc

Comment by Andreas Dilger [ 07/Jan/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/9333fec4-2406-11ea-b1e8-52540065bddc

Comment by Etienne Aujames [ 30/Nov/20 ]

Hello,

I have the same kind of messages on: https://testing.whamcloud.com/sub_tests/a7c49599-b0e2-49e1-a4af-9111f676fdcf

The only difference is that the failing open reports a stale file handle:

trevis-9vm4: [341] open ./clients/client0/~dmtmp/PARADOX/COURSES.DB failed for handle 9977 (Stale file handle)
Comment by Andreas Dilger [ 19/Jun/21 ]

+1 on master https://testing.whamcloud.com/test_sets/b5f87cba-d087-45dc-85ef-e1005ef15186

Comment by Andreas Dilger [ 13/Aug/21 ]

+1 on master https://testing.whamcloud.com/test_sets/1c452361-2846-41bd-af35-995e1de3fd99

Comment by Artem Blagodarenko (Inactive) [ 30/Nov/21 ]

+1 on master https://testing.whamcloud.com/test_sets/ee775d6f-00db-41b2-ad02-d4ae7e31ce6c

Comment by Qian Yingjin [ 27/Dec/21 ]

+1 on master https://testing.whamcloud.com/test_sets/c154d88e-a784-4023-9c59-f40662559bea

Comment by Andreas Dilger [ 18/Jan/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/d3c778e5-e533-4a1d-8dce-263b64809701

Comment by Artem Blagodarenko (Inactive) [ 22/Jan/22 ]

+1 https://testing.whamcloud.com/test_sets/15d4b4b3-8a48-4743-b935-bf96afb0e27d

Comment by Andreas Dilger [ 24/Jan/23 ]

Lai, should replay-single test_70b be updated to add "stack_trap fail_abort_cleanup" so that it can clean up afterward? However, while the test is doing failover (via test-framework.sh::fail()->facet_failover()) it doesn't look like this subtest is actually aborting recovery, so it shouldn't be seeing this kind of problem.
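
As a rough illustration of the cleanup idea behind the `stack_trap` suggestion, the sketch below uses the plain shell `trap ... EXIT` builtin to show a handler running even when the test body fails. It does not use the real `stack_trap` or `fail_abort_cleanup` from test-framework.sh; the function `run_with_cleanup` and the echoed messages are hypothetical stand-ins.

```shell
# Hypothetical stand-in for a subtest with a registered cleanup handler.
# The body runs in a subshell so that the EXIT trap is scoped to it.
run_with_cleanup() {
    (
        # Registered up front, so it runs on success and on failure alike.
        trap 'echo "cleanup ran"' EXIT
        echo "test body"
        exit 1    # simulate the subtest failing partway through
    )
}

rc=0
out=$(run_with_cleanup) || rc=$?
echo "$out"
echo "rc=$rc"
```

The point of the pattern is that the handler is registered before the risky work starts, so a mid-test failure still leaves the filesystem in a state that later subtests can use.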

This subtest is failing pretty regularly. Could you please investigate why it is having problems during recovery? It should be possible to use "Test-Parameters: fortestonly testlist=replay-single env=ONLY=70b,ONLY_REPEAT=100 livedebug" to run 70b until the failure is hit, and then leave the node in that state so that someone can log in and debug.
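
The "repeat until it is hit" approach can be sketched generically as below. This is only an illustration of the idea, not how test-framework.sh implements ONLY_REPEAT. The helper `repeat_until_fail` and the `flaky` stand-in for the subtest are hypothetical names.

```shell
# Hypothetical helper: run a command up to $1 times and stop at the
# first failure, so the failing state is left in place for inspection.
repeat_until_fail() {
    local max=$1
    shift
    local i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "failed on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
    return 0
}

# Stand-in for an intermittently failing subtest: fails on its third run.
count=0
flaky() {
    count=$((count + 1))
    [ "$count" -lt 3 ]
}

rc=0
repeat_until_fail 100 flaky || rc=$?
echo "rc=$rc after $count runs"
```

Stopping at the first failure, rather than completing all iterations, is what makes the approach useful for live debugging: nothing runs afterward to disturb the failed state.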

Generated at Sat Feb 10 02:36:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.