[LU-7105] sanityn test_28 fails with 'error() without useful message, please fix' Created: 04/Sep/15  Updated: 18/Feb/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, easy, tests
Environment: autotest

Attachments: File test_28.patch    
Issue Links:
Related
is related to LU-7072 sanityn test_78: Expected set_param t... Resolved
is related to LU-9466 Calls to ‘error’ should have an error... Resolved
is related to LU-1443 Update Test Scripts to Replace all Bu... Closed
Severity: 3
Bugzilla ID: 9977
Rank (Obsolete): 9223372036854775807

 Description   

sanityn test 28 was recently removed from the ALWAYS_EXCEPT list by accident and is still failing. The test gives no real error message; the only output on failure is:

'error() without useful message, please fix' 

This test has failed many times recently, so there are many failure logs to choose from. Here are just a couple:
https://testing.hpdd.intel.com/test_sets/d0ec87b2-530f-11e5-8228-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7a462a70-5301-11e5-b798-5254006e85c2

From the test log output, it’s clear that this test needs to be updated; newdev was removed as an option to lctl many years ago:

== sanityn test 28: read/write/truncate file with lost stripes == 08:31:03 (1441355463)
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.0383377 s, 54.7 MB/s
No such command, type help
error: setup: Operation already in progress
error: destroy: invalid objid '12745:0'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>

Until we fix the obvious issues, we don’t really know if the original bug/reason for ALWAYS_EXCEPT test 28 is still valid.

In sanityn, this test was put on the ALWAYS_EXCEPT list because of bz=9977.
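The 'error() without useful message, please fix' output suggests that the test fails without passing any descriptive text to the error reporting path. The following is a minimal sketch of the difference a message makes. The functions error_sketch and check_cmd are hypothetical stand-ins written for illustration; they are not the real test-framework.sh code, and the fallback behaviour is an assumption based only on the output quoted in this ticket.

```shell
# Hypothetical stand-in for the framework's error(); NOT the real
# test-framework.sh implementation.
error_sketch() {
	local msg="$*"

	# Assumed behaviour: fall back to the placeholder seen in this
	# ticket when the caller supplies no text.
	[ -n "$msg" ] || msg="error() without useful message, please fix"
	echo "FAIL: $msg"
	return 1
}

# check_cmd DESCRIPTION COMMAND [ARGS...]
# Run COMMAND quietly and report DESCRIPTION plus the exit code on failure.
check_cmd() {
	local desc="$1"

	shift
	"$@" > /dev/null 2>&1 || error_sketch "$desc failed (rc=$?)"
}
```

With a wrapper like this, a failing step such as `check_cmd "initial write" dd if=/dev/zero of="$file" bs=1M count=2` would report which step failed instead of the placeholder text.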



 Comments   
Comment by Andreas Dilger [ 01/Oct/15 ]

It is possible to use fail_loc added for LFSCK to create files that are missing stripes. That would be a lot less heavyweight than configuring the echo_client to delete one object.

Comment by Mikhail Pershin [ 18/Feb/22 ]

While working on unrelated test fixes, I tried to revive test_28 by deleting an OST object with debugfs, but the test still fails. In general, the idea of the test is that reading from a missing stripe should return an error, but the stripe can be recreated by writing to it. The test name also mentions truncate, but the test does not actually truncate anything. Using debugfs, I removed stripe #2 of the file and got the following:

# read from stripe #1, successful
1048576 bytes (1,0 MB) copied, 0,00574064 s, 183 MB/s
# read from stripe #2 failed as expected
dd: cannot fstat '/mnt/lustre2/f28.sanityn': No such file or directory
# write to both stripes again fails also with ENOENT
dd: failed to open '/mnt/lustre/f28.sanityn': No such file or directory
 sanityn test_28: @@@@@@ FAIL: re-creating write failed 

I am not sure how this should actually work. Is it really an error that the write fails, or is the write not supposed to work, in which case the test is simply obsolete?

A patch for the test is attached to the ticket.
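For reference, the check sequence described in this comment (read from an intact stripe, expect a failed read from the lost stripe, then try to recreate it by writing) could be sketched roughly as follows. This is not the attached patch. The function names are hypothetical, and the sketch assumes a 2 MiB file with a 1 MiB stripe size, so that stripe N corresponds to the N-th 1 MiB chunk of the file. Whether the final write should succeed is exactly the open question above.

```shell
# Sketch only: function names are hypothetical, not taken from sanityn.sh.

# read_stripe FILE N [SIZE]
# Read only the N-th (1-based) SIZE-byte chunk of FILE and discard it.
read_stripe() {
	local file="$1" n="$2" size="${3:-1048576}"

	dd if="$file" of=/dev/null bs="$size" skip=$((n - 1)) count=1 2>/dev/null
}

# check_lost_stripe FILE GOOD BAD
# GOOD is a stripe expected to be readable, BAD the stripe that was removed.
check_lost_stripe() {
	local file="$1" good="$2" bad="$3"

	read_stripe "$file" "$good" ||
		{ echo "FAIL: read of stripe $good failed"; return 1; }
	if read_stripe "$file" "$bad"; then
		echo "FAIL: read of stripe $bad unexpectedly succeeded"
		return 1
	fi
	# Assumption: rewriting the whole 2 MiB file is what should
	# recreate the missing object.
	dd if=/dev/zero of="$file" bs=1048576 count=2 2>/dev/null ||
		{ echo "FAIL: re-creating write failed"; return 1; }
}
```

On a file with all stripes present (or on a non-Lustre file system), `check_lost_stripe "$file" 1 2` stops at the second step, because the read of stripe 2 succeeds.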
