[LU-2608] Failure on test suite recovery-small test_100: FAIL: LBUG/LASSERT detected Created: 12/Jan/13 Updated: 22/Apr/13 Resolved: 13/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, zfs | ||
| Severity: | 3 |
| Rank (Obsolete): | 6092 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9ff8d8c0-5ae9-11e2-b205-52540035b04c. The sub-test test_100 failed with the following error:
I found many test failures with a similar message, "FAIL: LBUG/LASSERT detected", during the ZFS OFED build testing, but cannot find any related LBUG or ASSERTION information in the dmesg or console logs. |
| Comments |
| Comment by Sarah Liu [ 12/Jan/13 ] |
|
Similar error message in test_44c: https://maloo.whamcloud.com/test_sets/05f6b8c0-5ae7-11e2-b205-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 25/Jan/13 ] |
|
I looked at the test_log for each of the 3 test failures you indicated, and I think the "FAIL: LBUG/LASSERT detected" may be a false positive induced by IB communication errors while gathering info/messages from the remote nodes. Each time, the log/error sequence looks like:
.....
Resetting fail_loc on all nodes...done. < but sometimes with errors already ... >
pdsh@client-X: client-Y-ib: read: protocol failure: timed out < at least for one node, but often multiple ... >
<test> <subtest>: @@@@@@ FAIL: LBUG/LASSERT detected
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:3938:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:3961:error()
= /usr/lib64/lustre/tests/test-framework.sh:4205:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4231:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4104:run_test()
= /usr/lib64/lustre/tests/<test>.sh:1391:main()
.....
So it seems that return values/failures from do_nodes() must be checked more carefully in check_catastrophe(). |
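The idea in the comment above, telling "the remote query failed" apart from "the node really reported an LBUG", can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in test-framework.sh and not the patch that landed: `probe_node` and `check_nodes` are hypothetical names, the node names are made up, and `probe_node` merely simulates a remote command that prints a catastrophe flag (0 = healthy, 1 = LBUG/LASSERT hit) or fails with a transport error.

```shell
#!/bin/bash
# Hypothetical sketch; names and behaviour are assumptions for illustration.

# Stand-in for a remote query (e.g. one run over pdsh). Prints the node's
# catastrophe flag, or fails without output to simulate a timeout.
probe_node() {
    local node=$1
    case $node in
        bad-*)     echo 1 ;;      # node reports an LBUG/LASSERT
        timeout-*) return 255 ;;  # transport failure, nothing printed
        *)         echo 0 ;;      # healthy node
    esac
}

# Returns 0 if every node is healthy, 1 if any node reports an LBUG,
# 2 if some node could not be queried and none reported an LBUG.
check_nodes() {
    local node flag rc result=0
    for node in "$@"; do
        flag=$(probe_node "$node")
        rc=$?   # exit status of the query itself, checked before the flag
        if [ "$rc" -ne 0 ] || [ -z "$flag" ]; then
            echo "WARNING: cannot query $node (rc=$rc)" >&2
            if [ "$result" -eq 0 ]; then
                result=2
            fi
        elif [ "$flag" != "0" ]; then
            echo "LBUG/LASSERT detected on $node" >&2
            result=1
        fi
    done
    return "$result"
}
```

With this shape, a caller can treat return code 2 as "result unknown, retry or report a communication error" instead of reporting an LBUG that no log confirms.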
| Comment by Nathaniel Clark [ 29/Jan/13 ] |
| Comment by Jodi Levi (Inactive) [ 13/Feb/13 ] |
|
Patch landed to master |