[LU-17156] sanityn test_16j: timeout Created: 29/Sep/23  Updated: 18/Nov/23  Resolved: 18/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Arshad Hussain
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13805 i/o path: Unaligned direct i/o Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/aa93e973-8c6b-4568-a9ce-a2747abd4f7b

test_16j failed with the following error:

Timeout occurred after 418 minutes, last suite running was sanityn

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/99043 - 4.18.0-477.21.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/99043 - 4.18.0-477.21.1.el8_lustre.x86_64

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanityn test_16j - Timeout occurred after 418 minutes, last suite running was sanityn



 Comments   
Comment by Arshad Hussain [ 19/Oct/23 ]

From: https://testing.whamcloud.com/test_logs/6315cba4-8e7e-490d-ba12-d4fad74d3784/show_text

== sanityn test 16j: race dio with buffered i/o ========== 14:13:36 (1697552016)
1+0 records in
1+0 records out
26214400 bytes (26 MB, 25 MiB) copied, 0.294639 s, 89.0 MB/s
CMD: trevis-31vm3 /usr/sbin/lctl get_param -n osd-*.lustre-OST0000 lustre-OST0000_UUID.nonrotational
trevis-31vm3: error: get_param: param_path 'lustre-OST0000_UUID/nonrotational': No such file or directory
pdsh@trevis-31vm1: trevis-31vm3: ssh exited with exit code 2
/usr/lib64/lustre/tests/test-framework.sh: line 11492: [: : integer expression expected
bs: 1024, file_size 26214400
...

It looks like the call chain zfs_or_rotational() -> ostname_from_index() returned "lustre-OST0000 lustre-OST0000_UUID" instead of just "lustre-OST0000".
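
The failure mode can be reproduced outside Lustre. The sketch below is a minimal illustration, not the code in test-framework.sh: count_args is a hypothetical helper, and the variable names are assumptions. It shows that an OST name containing a space splits the parameter path into two arguments, as in the CMD line above, and that an integer test on the resulting empty value is an error rather than a clean true or false.

```shell
# Hypothetical helper (not part of test-framework.sh): print the number
# of arguments it receives.
count_args() { echo "$#"; }

# Value taken from the log above: two names instead of one.
ost_name="lustre-OST0000 lustre-OST0000_UUID"
param="osd-*.${ost_name}.nonrotational"

set -f                          # keep "osd-*" literal for this demo
n_args=$(count_args $param)     # unquoted on purpose: word splitting applies
set +f
echo "parameter path splits into $n_args arguments"

# get_param failed, so the captured value is empty. An integer test on an
# empty string is an error, not a plain "false".
nonrotat=""
status=0
[ "$nonrotat" -eq 1 ] 2>/dev/null || status=$?
echo "integer test exit status: $status"
```

Because the integer test errors out instead of evaluating to true, a caller that only checks for success would treat the OST as neither ZFS nor rotational and run the test.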

Comment by Gerrit Updater [ 03/Nov/23 ]

"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52973
Subject: LU-17156 tests: Improve zfs_or_rotational()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4bc1ad986f58e1d29956e274f76a76dc093e0c5

Comment by Andreas Dilger [ 05/Nov/23 ]

It looks like there is a typo in the expansion:

lctl get_param -n osd-*.lustre-OST0000 lustre-OST0000_UUID.nonrotational

There is a space in there, and "lustre-OST0000" is shown twice?

Comment by Arshad Hussain [ 05/Nov/23 ]

Yes, definitely. This looks like a test case issue. I initially suspected that some unusual "lfs osts" output was causing this, but I looked at the parsing and it all seems correct. So I pushed a patch to dump more logs on error. (I just saw your review on that and will take care of it.)

There is a space in there, and "lustre-OST0000" is shown twice?

Yes. Also, we should not have the "_UUID" string there. It should have been removed.

Comment by Andreas Dilger [ 05/Nov/23 ]

The _UUID string is just a result of the "echo" command only removing the first instance of _UUID in the output. When there is only a single OST name printed this will not be a problem.
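
A small sketch of the first-instance behaviour described here. It uses sed as a stand-in for the substitution in the test framework, which is an assumption made for portability; the input string is modelled on the log output.

```shell
# Input modelled on the log: the same OST UUID listed twice.
uuids="lustre-OST0000_UUID lustre-OST0000_UUID"

# Without the "g" flag, sed replaces only the first match on the line.
first_only=$(printf '%s\n' "$uuids" | sed 's/_UUID//')

# With the "g" flag, every match is replaced.
all=$(printf '%s\n' "$uuids" | sed 's/_UUID//g')

echo "first only: $first_only"
echo "all:        $all"
```

The first result matches the string seen in the failing get_param command. Stripping every instance would only hide the underlying problem, since two names would still be returned where one is expected.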

I've updated the patch with the fix. It looks like this has been an issue with the zfs_or_rotational() function since it landed on 2023-08-24. It meant that tests intended not to run on ZFS (because DIO on ZFS is very slow) were being run anyway, sometimes passing and sometimes timing out.

Comment by Patrick Farrell [ 05/Nov/23 ]

To be clear, if these tests run the full version on ZFS (or maybe even just on a rotational disk), they will time out.  It just takes too long.  So an error in zfs_or_rotational is enough to explain timeouts here.
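
One way to keep such a check from failing silently is to validate the value before using it. The function below is a hypothetical sketch, not the landed patch: the name should_skip_slow_dio, its arguments, and its return codes are assumptions for illustration.

```shell
# Hypothetical guard: return 0 to skip the slow test, 1 to run it,
# 2 if the nonrotational value is malformed.
should_skip_slow_dio() {
	# $1: backend filesystem type, $2: value read from "nonrotational"
	case "$2" in
	0|1) ;;
	*)
		echo "unexpected nonrotational value: '$2'" >&2
		return 2
		;;
	esac

	# Skip on ZFS or on rotational media (nonrotational == 0).
	[ "$1" = "zfs" ] || [ "$2" -eq 0 ]
}

rc_zfs=0;   should_skip_slow_dio zfs 1 || rc_zfs=$?
rc_ssd=0;   should_skip_slow_dio ldiskfs 1 || rc_ssd=$?
rc_hdd=0;   should_skip_slow_dio ldiskfs 0 || rc_hdd=$?
rc_empty=0; should_skip_slow_dio ldiskfs "" 2>/dev/null || rc_empty=$?
echo "zfs=$rc_zfs ssd=$rc_ssd hdd=$rc_hdd empty=$rc_empty"
```

Returning a distinct code for malformed input lets the caller stop with an error instead of falling through to the full-length test.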

Comment by Arshad Hussain [ 06/Nov/23 ]

 

<snip> 
I've updated the patch with the fix...
<snip>

Andreas, yes saw that. Thanks again!

 

... It just takes too long.  So an error in zfs_or_rotational is enough to explain timeouts here.

Correct. Understood.

Comment by Gerrit Updater [ 18/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52973/
Subject: LU-17156 tests: Improve zfs_or_rotational()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b0262a00714f2d77b6d6ba745169c8fa18d38b32

Comment by Peter Jones [ 18/Nov/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:33:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.