[LU-17156] sanityn test_16j: timeout Created: 29/Sep/23 Updated: 18/Nov/23 Resolved: 18/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Arshad Hussain |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/aa93e973-8c6b-4568-a9ce-a2747abd4f7b
test_16j failed with the following error:
Timeout occurred after 418 minutes, last suite running was sanityn
Test session details:
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Arshad Hussain [ 19/Oct/23 ] |
|
From: https://testing.whamcloud.com/test_logs/6315cba4-8e7e-490d-ba12-d4fad74d3784/show_text
== sanityn test 16j: race dio with buffered i/o ========== 14:13:36 (1697552016)
1+0 records in
1+0 records out
26214400 bytes (26 MB, 25 MiB) copied, 0.294639 s, 89.0 MB/s
CMD: trevis-31vm3 /usr/sbin/lctl get_param -n osd-*.lustre-OST0000 lustre-OST0000_UUID.nonrotational
trevis-31vm3: error: get_param: param_path 'lustre-OST0000_UUID/nonrotational': No such file or directory
pdsh@trevis-31vm1: trevis-31vm3: ssh exited with exit code 2
/usr/lib64/lustre/tests/test-framework.sh: line 11492: [: : integer expression expected
bs: 1024, file_size 26214400
...
It looks like the call to zfs_or_rotational() -> ostname_from_index() returned "lustre-OST0000 lustre-OST0000_UUID" instead of just "lustre-OST0000". |
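The malformed command in the log can be reproduced without a Lustre node. The following is a minimal sketch, assuming only that the parameter path is built by interpolating the OST name into `osd-*.<name>.nonrotational`, as the CMD line suggests. `build_param` and `valid_ost_name` are hypothetical helpers written for this illustration; they are not functions from test-framework.sh.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not part of test-framework.sh.

# Build the parameter path the way the CMD line in the log suggests.
build_param() {
	local ost_name=$1
	echo "osd-*.${ost_name}.nonrotational"
}

# Accept only a single token: non-empty, no whitespace, no leftover _UUID.
valid_ost_name() {
	local ost_name=$1
	[[ -n "$ost_name" && "$ost_name" != *[[:space:]]* && "$ost_name" != *_UUID* ]]
}

good="lustre-OST0000"
bad="lustre-OST0000 lustre-OST0000_UUID"	# value seen in the failing run

good_param=$(build_param "$good")
bad_param=$(build_param "$bad")

echo "$good_param"
echo "$bad_param"
valid_ost_name "$bad" || echo "rejecting malformed OST name: '$bad'"
```

With the two-token name, the resulting path contains a space, so lctl would receive two separate arguments, which is consistent with the 'lustre-OST0000_UUID/nonrotational' path in the error message. A check like `valid_ost_name` is one possible way to fail early instead of passing an empty value on to a later integer comparison.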
| Comment by Gerrit Updater [ 03/Nov/23 ] |
|
"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52973 |
| Comment by Andreas Dilger [ 05/Nov/23 ] |
|
It looks like there is a typo in the expansion:
lctl get_param -n osd-*.lustre-OST0000 lustre-OST0000_UUID.nonrotational
There is a space in there, and "lustre-OST0000" is shown twice? |
| Comment by Arshad Hussain [ 05/Nov/23 ] |
|
Yes, definitely. This looks like a test case issue. I initially suspected that some unusual "lfs osts" output was leading to this, but I looked at the parsing and it all seems correct. So I pushed a patch to dump more logs on error. (I just saw your review on that; I will take care of it.)
There is a space in there, and "lustre-OST0000" is shown twice?
Yes. Also, the "_UUID" string should not be there. It should have been removed. |
| Comment by Andreas Dilger [ 05/Nov/23 ] |
|
The _UUID string is just a result of the "echo" command only removing the first instance of _UUID in the output. When only a single OST name is printed, this is not a problem. I've updated the patch with the fix.
It looks like this has been an issue with the zfs_or_rotational() function since it landed on 2023-08-24. It meant that tests intended not to run on ZFS (because DIO on ZFS is very slow) were being run anyway, sometimes passing and sometimes timing out. |
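The "first instance only" behaviour described above is standard bash pattern substitution. The following is a minimal sketch, assuming the UUID was printed twice, one per line; the input value is hypothetical, and the snippet does not claim to show what the merged patch does.

```shell
#!/bin/bash
# Assumed input: the same UUID listed twice, one per line (hypothetical).
uuids=$'lustre-OST0000_UUID\nlustre-OST0000_UUID'

# ${var/pat/} removes only the first match. The unquoted expansion is then
# word-split, and echo joins the pieces with single spaces.
first_only=$(echo ${uuids/_UUID/})

# ${var//pat/} removes every match, but still leaves two names.
all_removed=$(echo ${uuids//_UUID/})

# Keeping only the first line before stripping _UUID yields a single name.
first_line=${uuids%%$'\n'*}
single=${first_line/_UUID/}

echo "first match only: $first_only"
echo "all matches:      $all_removed"
echo "first line only:  $single"
```

The first result, "lustre-OST0000 lustre-OST0000_UUID", matches the string seen in the failing lctl command. Stripping every _UUID would hide the symptom but still pass two names, so limiting the input to a single entry is the more robust of the two options sketched here.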
| Comment by Patrick Farrell [ 05/Nov/23 ] |
|
To be clear, if these tests run the full version on ZFS (or maybe even just on a rotational disk), they will time out. It just takes too long. So an error in zfs_or_rotational is enough to explain timeouts here. |
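One way to see why a failed lookup leads to the full test being run: when the nonrotational value comes back empty, an integer comparison in `[` fails with "integer expression expected" and a non-zero status, so an `if` around it takes the else branch. The sketch below assumes that 0 means a rotational (slow) device and that the slow case selects a reduced test size; `is_slow_device` and `choose_size` are hypothetical stand-ins, not the real zfs_or_rotational().

```shell
#!/bin/bash
# Hypothetical stand-ins for illustration; not the real zfs_or_rotational().

# Returns 0 (slow device) when the nonrotational value is 0.
is_slow_device() {
	local nonrotat=$1
	# With an empty value, [ reports "integer expression expected" and
	# returns status 2, so the caller sees "not slow". stderr is silenced
	# here only to keep the demo output clean.
	[ "$nonrotat" -eq 0 ] 2>/dev/null
}

# Pick the test size based on the device check (sizes are assumptions).
choose_size() {
	if is_slow_device "$1"; then
		echo "reduced"
	else
		echo "full"
	fi
}

rotational=$(choose_size 0)
flash=$(choose_size 1)
lookup_failed=$(choose_size "")

echo "nonrotational=0  -> $rotational"
echo "nonrotational=1  -> $flash"
echo "nonrotational='' -> $lookup_failed"
```

In this sketch the failed lookup is indistinguishable from a fast device, which is the kind of silent fall-through that would let the full-size test run on slow storage and time out.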
| Comment by Arshad Hussain [ 06/Nov/23 ] |
|
<snip> I've updated the patch with the fix... <snip>
Andreas, yes, saw that. Thanks again!
... It just takes too long. So an error in zfs_or_rotational is enough to explain timeouts here.
Correct. Understood. |
| Comment by Gerrit Updater [ 18/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52973/ |
| Comment by Peter Jones [ 18/Nov/23 ] |
|
Landed for 2.16 |