[LU-2608] Failure on test suite recovery-small test_100: FAIL: LBUG/LASSERT detected Created: 12/Jan/13  Updated: 22/Apr/13  Resolved: 13/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: HB, zfs

Severity: 3
Rank (Obsolete): 6092

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9ff8d8c0-5ae9-11e2-b205-52540035b04c.

The sub-test test_100 failed with the following error:

test_100 returned 1

I found many test failures with a similar message, "FAIL: LBUG/LASSERT detected", during the ZFS OFED build testing, but cannot find any related LBUG or ASSERTION information in the dmesg or console logs.



 Comments   
Comment by Sarah Liu [ 12/Jan/13 ]

Similar error messages:
https://maloo.whamcloud.com/test_sets/9835100e-5ae9-11e2-b205-52540035b04c

https://maloo.whamcloud.com/test_sets/05f6b8c0-5ae7-11e2-b205-52540035b04c test_44c

Comment by Bruno Faccini (Inactive) [ 25/Jan/13 ]

I looked at the test_log for each of the 3 test failures you indicated, and I think the "FAIL: LBUG/LASSERT detected" message may be a false positive caused by IB communication errors while gathering information/messages from the remote nodes.

Each time, the log/error sequence looks like this:

.....
Resetting fail_loc on all nodes...done.  < but sometimes with errors already ... >
pdsh@client-X: client-Y-ib: read: protocol failure: timed out < at least for one node, but often multiples ... >
 <test> <subtest>: @@@@@@ FAIL: LBUG/LASSERT detected 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:3938:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:3961:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4205:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4231:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4104:run_test()
  = /usr/lib64/lustre/tests/<test>.sh:1391:main()
.....

So it seems that return values/failures from do_nodes() need to be checked more carefully in check_catastrophe().
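
To illustrate the kind of check meant here, below is a minimal sketch. It is not the actual test-framework.sh code and not the eventual patch. It assumes a hypothetical run_on_nodes() helper standing in for do_nodes(), assumes the remote command prints one "node: flag" line per reachable node, and uses FAKE_OUTPUT/FAKE_RC purely as illustration knobs. The idea is to report an LBUG only when a node actually returns a non-zero flag, and to report a separate "could not verify" status when the remote command itself fails.

```shell
#!/bin/bash
# Hypothetical stand-in for do_nodes(): prints one "node: flag" line per
# reachable node and returns non-zero if the remote command failed.
# FAKE_OUTPUT and FAKE_RC exist only for this illustration.
run_on_nodes() {
    [ -n "$FAKE_OUTPUT" ] && printf '%s\n' "$FAKE_OUTPUT"
    return "${FAKE_RC:-0}"
}

# Returns:
#   0 = every queried node reported a zero flag
#   1 = at least one node reported a non-zero flag (real LBUG/LASSERT)
#   2 = the remote query failed and no node reported a non-zero flag
#       (result unknown, should not be reported as an LBUG)
check_catastrophe_sketch() {
    local nodes=$1 output rc

    # Keep the assignment separate from "local" so $? is the helper's status.
    output=$(run_on_nodes "$nodes" 2>/dev/null)
    rc=$?

    # Positive evidence: some line ends in ": <non-zero number>".
    if printf '%s\n' "$output" | grep -Eq ':[[:space:]]*[1-9][0-9]*$'; then
        return 1
    fi

    # No positive evidence, but the transport failed (e.g. a pdsh timeout).
    [ "$rc" -ne 0 ] && return 2

    return 0
}
```

With this split, a caller could keep raising "LBUG/LASSERT detected" for status 1 only, and log a warning or retry for status 2.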

Comment by Nathaniel Clark [ 29/Jan/13 ]

http://review.whamcloud.com/5200

Comment by Jodi Levi (Inactive) [ 13/Feb/13 ]

Patch landed to master
