[LU-16541] sanity test_64f: buffered io, not write rpc: grants mismatch: 12656640, expected 4218880 Created: 09/Feb/23  Updated: 20/Dec/23  Resolved: 25/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Patrick Farrell
Resolution: Fixed Votes: 0
Labels: arm

Attachments: PNG File Screen Shot 2023-08-21 at 14.07.18.png    
Issue Links:
Related
is related to LU-16673 sanity test_125: failures with aarch6... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Arshad <arshad.hussain@aeoncomputing.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a0c04b66-87b8-4b06-aee8-8ae97f9e229d

test_64f failed with the following error:

buffered io, not write rpc: grants mismatch: 12656640, expected 4218880
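
As a quick check on the numbers in this message, the reported grant is an exact integer multiple of the expected one. The sketch below only does that arithmetic; the variable names are illustrative, and the ticket does not say what causes the factor.

```python
# Values copied from the test_64f error message above.
reported_grant = 12656640
expected_grant = 4218880

# divmod returns (quotient, remainder); a zero remainder means an exact multiple.
ratio, remainder = divmod(reported_grant, expected_grant)
print(f"reported = {ratio} x expected, remainder {remainder}")
```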

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/92173 - 4.18.0-372.32.1.el8_6.aarch64
servers: https://build.whamcloud.com/job/lustre-reviews/92173 - 4.18.0-348.23.1.el8_lustre.x86_64

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_64f - buffered io, not write rpc: grants mismatch: 12656640, expected 4218880



 Comments   
Comment by Chris Horn [ 02/Mar/23 ]

+1 on master https://testing.whamcloud.com/test_sets/3612a72a-bde4-4fcd-9cb9-4eb4ed7ceab8

Comment by Nikitas Angelinas [ 16/Aug/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/6649c5ad-a4db-40ba-abe4-67822a7227c2

Comment by Aurelien Degremont [ 17/Aug/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/e4028aaf-757e-46c5-9ea9-440a7bda4e21

Comment by James A Simmons [ 21/Aug/23 ]

Sadly, it's not just ARM. Searching for "grant mismatch", you will find a bunch of duplicate tickets.

Comment by Andreas Dilger [ 21/Aug/23 ]

Patrick, could you please take a look at this. This subtest is now the top cause of failures, when it previously was only failing on aarch64. Since it is running as part of sanity, this subtest is run 9x per patch review test (unless run with 'trivial') so with a failure rate around 1/16 runs it is almost guaranteed to affect every patch.
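
To put a rough number on that, the following sketch computes the chance that at least one of the 9 executions fails. It assumes the 1/16 rate applies to each execution independently, which the comment does not state; the function name is made up for illustration.

```python
def p_at_least_one_failure(per_run_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent executions fails."""
    return 1.0 - (1.0 - per_run_rate) ** runs


# Figures from the comment above: ~1/16 failure rate, 9 executions per review.
# ASSUMPTION: executions are independent and share the same failure rate.
p_session = p_at_least_one_failure(1 / 16, 9)
print(f"{p_session:.2f}")  # about 0.44 under the independence assumption
```

Under these assumptions a single review session hits the failure a bit under half the time, and the odds rise with every additional patch set or retest.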

It would make sense to run test result searches on a per-week basis to see if you can identify when the subtest first started failing on x86, and then use that to identify culprit patches that landed in that time period:

https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&horizon=518400&status%5B%5D=FAIL&test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=2a009de0-e0af-11e8-89f8-52540065bddc&source=sub_tests#redirect

Doing a quick search showed that all of the failures on master in the week of 2023-04-16 were for your LU-13805 unaligned DIO patch series at that time, but none of those patches have landed, unless something was split out into a separate patch. However, it may be possible to do some differential analysis between the start of the failures on master vs. b_es6_0 to see when particular patches landed to each branch.
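
The search link in this comment encodes its filters as query parameters. As an illustration, this sketch splits that URL with Python's standard library so the individual filters are visible. Reading `horizon` as a number of seconds is an assumption, not something the ticket states.

```python
from urllib.parse import parse_qs, urlsplit

# Search URL copied verbatim from the comment above.
url = (
    "https://testing.whamcloud.com/search"
    "?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af"
    "&horizon=518400"
    "&status%5B%5D=FAIL"
    "&test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae"
    "&sub_test_script_id=2a009de0-e0af-11e8-89f8-52540065bddc"
    "&source=sub_tests#redirect"
)

# parse_qs percent-decodes keys, so "status%5B%5D" becomes "status[]".
params = parse_qs(urlsplit(url).query)
for name, values in params.items():
    print(f"{name} = {values[0]}")

# ASSUMPTION: horizon is expressed in seconds (86400 seconds per day).
horizon_days = int(params["horizon"][0]) / 86400
print(f"horizon = {horizon_days} days (if the unit is seconds)")
```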

Comment by Patrick Farrell [ 21/Aug/23 ]

Yeah, I actually started running this locally and couldn't reproduce, but I'm happy to give this a try.  I will see if I can figure something out from the landings, but I may also just go directly at the bug as well.

Comment by Andreas Dilger [ 21/Aug/23 ]

It looks like the spike in subtest failures on master started on 2023-08-02.

Comment by Patrick Farrell [ 21/Aug/23 ]

Ah, thank you.

Comment by Gerrit Updater [ 21/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52022
Subject: LU-16541 tests: bisect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a360210bff4e076ae59171fe32c3656184a5901

Comment by Gerrit Updater [ 21/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52023
Subject: LU-16541 tests: bisect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a042bf7d62ec99b8b6cc299be998408286dd2b2a

Comment by Oleg Drokin [ 21/Aug/23 ]

The other annoyance with this test failure, btw, is that if it hits, it takes 1 hour for the test to finish, which makes it a timeout on janitor:

sanity.test_64f.test_log.oleg346-client.log  2023-08-19 04:46  1.6K
sanity.test_64g.test_log.oleg346-client.log  2023-08-19 05:46  1.6K

Example here, but they are all like this. I wonder if the unbounded wait just waits for something else that happens to run for an hour? Nothing obvious in the test output:

http://testing.linuxhacker.ru/lustre-reports/33659/testresults/sanity2-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

The one-hour duration also holds for the Maloo runs.

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52040
Subject: LU-16541 tests: Improve test 64f
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 203bdc67a292bb1daff50d379d7dbfe5a86cc89c

Comment by Gerrit Updater [ 25/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52040/
Subject: LU-16541 tests: Improve test 64f
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 33e4d86a480b860e0a3b4b51c7c6da6ec0159e51

Comment by Peter Jones [ 25/Aug/23 ]

Merged for 2.16

Comment by Gerrit Updater [ 25/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52096
Subject: LU-16541 tests: Improve test 64f
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 52d69926bb180b4d8e7ebf30f5d27431da544c17

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52096/
Subject: LU-16541 tests: Improve test 64f
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 478ca310b4204b4354245e6261edae6fef0ae497

Generated at Sat Feb 10 03:27:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.