[LU-13706] sanity test_119d: the read rpcs have not completed in 2s Created: 23/Jun/20 Updated: 01/May/23 Resolved: 01/May/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Patrick Farrell |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1d6568f0-1550-4b32-8826-d698df594715

test_119d failed with the following error:

the read rpcs have not completed in 2s

Maybe the 2 second timeout is too short?

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Patrick Farrell [ 24/Apr/23 ] |
|
So, I've hit this again recently...

This test is from the 1.8 era, which is fine by itself, but the fail loc this test uses was removed in:

commit fbf5870b9848929d352460f1f005b79c0b5ccc5a
land clio.

(Which, without checking, might be the largest single commit in the repo? It's certainly a candidate.)

Anyway, I agree this is likely a timeout issue because of occasional delays in the hardware environment, and we haven't really been testing anything here since 1.x.

However, I can see where the fail loc should go in the new code; I recognize the snippet even though everything's been renamed. But the bug is impossible in new versions - the problem was we weren't waking grant waiters after DIO, but we've got unified handling of sync and async BRW RPCs now.

So the test seems irrelevant to me now - the fail loc doesn't make a lot of sense any more, and it's testing for something we'd catch - if it somehow reoccurred, it would cause major performance problems if not a total hang.

So since (imo) it's not really adding anything with or without the fail_loc (it's definitely not a useful test without the fail loc), I'm going to remove it.

By the way, we had grant consumption for sync/DIO writes in 1.x; the original bug here was in the handling of it. We lost that functionality in the clio transition (doesn't look intentional, just lost it) and added it back, in, like, 2019:

-	/* Consume write credits even if doing a sync write -
-	 * otherwise we may run out of space on OST due to grant. */
-	if (cmd == OBD_BRW_WRITE) {
-		spin_lock(&cli->cl_loi_list_lock);
-		for (i = 0; i < page_count; i++) {
-			if (cli->cl_avail_grant >= CFS_PAGE_SIZE)
-				osc_consume_write_grant(cli, pga[i]);
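
For readers unfamiliar with the grant accounting discussed above, here is a rough toy model of it. It is not Lustre code: every `sketch_` name, the fixed page size, and the assumption that a completed write makes its grant available again are hypothetical simplifications. It shows one page of grant being consumed per page written, and how skipping the wakeup on completion (the old DIO bug) leaves a waiter stuck even though grant is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page size; stands in for CFS_PAGE_SIZE in the snippet above. */
#define SKETCH_PAGE_SIZE 4096L

/* Toy model of per-client write grant; not the real struct client_obd. */
struct sketch_client {
	long avail_grant; /* bytes of grant currently available */
	int waiters;      /* writers blocked waiting for grant */
	int woken;        /* waiters woken so far */
};

/* Consume one page of grant for each page written, stopping short when
 * grant runs out. Returns the number of pages that obtained grant. */
static int sketch_consume_grant(struct sketch_client *cli, int page_count)
{
	int granted = 0;

	for (int i = 0; i < page_count; i++) {
		if (cli->avail_grant >= SKETCH_PAGE_SIZE) {
			cli->avail_grant -= SKETCH_PAGE_SIZE;
			granted++;
		}
	}
	return granted;
}

/* Completion path (assumed): the write's grant becomes available again,
 * and each waiter that can now get a page of grant is woken.
 * wake_waiters == false models the old bug after a direct I/O write. */
static void sketch_write_done(struct sketch_client *cli, int granted_pages,
			      bool wake_waiters)
{
	cli->avail_grant += granted_pages * SKETCH_PAGE_SIZE;
	if (!wake_waiters)
		return;

	while (cli->waiters > 0 && cli->avail_grant >= SKETCH_PAGE_SIZE) {
		cli->avail_grant -= SKETCH_PAGE_SIZE;
		cli->waiters--;
		cli->woken++;
	}
}
```

In this model, calling `sketch_write_done()` with `wake_waiters` set to false leaves `waiters` unchanged while `avail_grant` is non-zero, which corresponds to the stall the original test was written to detect.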
|
| Comment by Gerrit Updater [ 24/Apr/23 ] |
|
"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50731 |
| Comment by Gerrit Updater [ 01/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50731/ |
| Comment by Peter Jones [ 01/May/23 ] |
|
Landed for 2.16 |