[LU-13706] sanity test_119d: the read rpcs have not completed in 2s
Created: 23/Jun/20  Updated: 01/May/23  Resolved: 01/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Patrick Farrell
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related: is related to LU-16636 "some tests don't hit fail_loc" (Open)
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1d6568f0-1550-4b32-8826-d698df594715

test_119d failed with the following error:

the read rpcs have not completed in 2s

Maybe the 2 second timeout is too short?

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_119d - the read rpcs have not completed in 2s



Comments
Comment by Patrick Farrell [ 24/Apr/23 ]

So, I've hit this again recently...

This test is from the 1.8 era, which is fine by itself, but the fail_loc this test uses was removed in:

commit fbf5870b9848929d352460f1f005b79c0b5ccc5a
Author: nikita <nikita>
Date:   Fri Nov 7 23:54:43 2008 +0000

    land clio.
    b=14166

(Which, without checking, might be the largest single commit in the repo?  It's certainly a candidate.)

Anyway, I agree this is likely a timeout issue because of occasional delays in the hardware environment, and we haven't really been testing anything here since 1.x.

However, I can see where the fail_loc should go in the new code; I recognize the snippet even though everything's been renamed.

But the bug is impossible in new versions: the problem was that we weren't waking grant waiters after DIO, and we now have unified handling of sync and async BRW RPCs, so DIO no longer has a separate path that could skip the wake-up.
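
To illustrate the failure mode described above, here is a minimal userspace sketch. It is not Lustre code: `grant_model`, `model_reserve`, `model_wake_waiters` and `model_brw_complete` are hypothetical names, and the model assumes that a completed BRW makes some grant available again. The `wake` flag stands in for the old DIO path (`false`, no wake-up) versus a unified completion path (`true`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names are Lustre APIs. */
struct grant_model {
        size_t avail_grant;     /* bytes of grant currently available */
        size_t need;            /* bytes the blocked writer is waiting for */
        bool   waiter_blocked;  /* true while a writer is stuck waiting */
};

/* A writer asks for grant; it becomes a waiter if there is not enough. */
static bool model_reserve(struct grant_model *m, size_t bytes)
{
        assert(m != NULL);
        if (m->avail_grant >= bytes) {
                m->avail_grant -= bytes;
                return true;
        }
        m->need = bytes;
        m->waiter_blocked = true;
        return false;
}

/* Wake the blocked writer if the available grant now covers its request. */
static void model_wake_waiters(struct grant_model *m)
{
        assert(m != NULL);
        if (m->waiter_blocked && m->avail_grant >= m->need) {
                m->avail_grant -= m->need;
                m->need = 0;
                m->waiter_blocked = false;
        }
}

/*
 * BRW completion: grant becomes available again (an assumption of this
 * model). With wake == false the waiter is never re-checked, which is the
 * shape of the old bug; with wake == true every completion wakes waiters.
 */
static void model_brw_complete(struct grant_model *m, size_t returned,
                               bool wake)
{
        assert(m != NULL);
        m->avail_grant += returned;
        if (wake)
                model_wake_waiters(m);
}
```

In this model, a writer that blocked before a `wake == false` completion stays blocked even though enough grant is available, which matches the "major performance problems if not a total hang" symptom in spirit only.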

So the test seems irrelevant to me now. The fail_loc doesn't make a lot of sense any more, and the test checks for something we'd catch anyway: if the bug somehow recurred, it would cause major performance problems, if not a total hang. Since (imo) it's not really adding anything with or without the fail_loc (and it's definitely not a useful test without it), I'm going to remove it.

By the way, we had grant consumption for sync/DIO writes in 1.x; the original bug here was in the handling of it.  We lost that functionality in the clio transition (doesn't look intentional, just lost it) and added it back, in, like, 2019:

-        /* Consume write credits even if doing a sync write -
-         * otherwise we may run out of space on OST due to grant. */
-        if (cmd == OBD_BRW_WRITE) {
-                spin_lock(&cli->cl_loi_list_lock);
-                for (i = 0; i < page_count; i++) {
-                        if (cli->cl_avail_grant >= CFS_PAGE_SIZE)
-                                osc_consume_write_grant(cli, pga[i]); 
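
The removed loop above is truncated, so the following is only a rough, self-contained model of the idea it shows: each page of a sync write consumes one page of grant while enough grant remains. It is not the Lustre implementation; `grant_client`, `MODEL_PAGE_SIZE` and `model_consume_write_grant` are hypothetical names, the page size is assumed to be 4096 bytes, and the locking done by the original (`cl_loi_list_lock`) is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Hypothetical stand-in for the client's grant accounting. */
struct grant_client {
        size_t avail_grant;     /* bytes of grant still available */
};

/*
 * Consume one page of grant per written page while grant remains.
 * Returns how many of the page_count pages were covered by grant.
 */
static size_t model_consume_write_grant(struct grant_client *c,
                                        size_t page_count)
{
        size_t covered = 0;

        assert(c != NULL);
        for (size_t i = 0; i < page_count; i++) {
                if (c->avail_grant >= MODEL_PAGE_SIZE) {
                        c->avail_grant -= MODEL_PAGE_SIZE;
                        covered++;
                }
        }
        return covered;
}
```

For example, a client holding three pages of grant that writes five pages gets three pages covered and ends with no grant left; the remaining two pages go out without grant in this model.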

 

Comment by Gerrit Updater [ 24/Apr/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50731
Subject: LU-13706 tests: remove test 119d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c2d3ea7998bdc3a01069d8a1b4dd0092df03145f

Comment by Gerrit Updater [ 01/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50731/
Subject: LU-13706 tests: remove test 119d
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 59d5bb1558b281d75f1fd4bb360498454228afd7

Comment by Peter Jones [ 01/May/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:03:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.