[LU-15086] replay-dual: test_10: @@@@@@ FAIL: test_10 failed with 2 Created: 12/Oct/21 Updated: 20/Nov/21 Resolved: 20/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Vladimir Saveliev <vlaidimir.saveliev@hpe.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2f1edc30-7ddf-4a6c-bc7a-fffbf58435b8 This is relatively regular failure: Error: 'test_10 failed with 2' Failure Rate: 6.00% of most recent 100 runs, 0 skipped (all branches) |
| Comments |
| Comment by Andreas Dilger [ 12/Oct/21 ] |
|
It looks like this has been failing intermittently (once per 4-5 days on full sessions only) since at least 2021-01-01 (the earliest I checked). Since 2021-10-10 it has also failed on 4 review test sessions, and as such is failing more frequently (6x in 2 days). It would be worthwhile to check what changes landed on 10-10 that might have triggered this. |
| Comment by Alex Zhuravlev [ 20/Oct/21 ] |
|
my bisection points to one patch: COMMIT TESTED PASSED FAILED COMMIT DESCRIPTION 8a770616a5 5 4 1 BAD LU-14797 sec: add projid to nodemap bbfdc7c167 17 15 2 BAD LU-14739 quota: fix quota with root squash enabled da1d93513f 7 6 1 BAD LU-14475 log: Rewrite some log messages 885b494632 18 17 1 BAD LU-12362 ptlrpc: use wait_woken() in ptlrpcd() 9ec5e2329b 90 90 0 LU-6142 lod: return pools_hash_params to being static. 2a24b6ec67 90 90 0 LU-14734 ldiskfs: improve message for large_dir f7f31f8f96 90 90 0 LU-12567 ptlrpc: handle reply and resend reorder 1a409a3e6a 90 90 0 LU-14711 osc: Do not attempt sending empty pages I think there are two potential issues in - if (timeout == 0) - wait_event_idle(set->set_waitq, - ptlrpcd_check(&env, pc)); - else if (wait_event_idle_timeout(set->set_waitq, - ptlrpcd_check(&env, pc), - cfs_time_seconds(timeout)) - == 0) + + add_wait_queue(&set->set_waitq, &wait); + while (!ptlrpcd_check(&env, pc)) { + int ret; + + if (timeout == 0) + ret = wait_woken(&wait, TASK_IDLE, + MAX_SCHEDULE_TIMEOUT); + else + ret = wait_woken(&wait, TASK_IDLE, + cfs_time_seconds(timeout)); + if (ret != 0) + continue; + /* Timed out */ ptlrpc_expired_set(set); + break; + } + remove_wait_queue(&set->set_waitq, &wait); + 1) timeout is reused after spurious wakeup, so this can last longer than expected |
| Comment by Alex Zhuravlev [ 20/Oct/21 ] |
|
added a simple check to pltrpcd to catch too long waiting: LustreError: 5019:0:(ptlrpcd.c:515:ptlrpcd()) ASSERTION( timeout == 0 || end - start < timeout + 4 ) failed: timeout 10, end 97, start 83, diff 14, count 3 i.e. 10 seconds wait was requested, but actually spent 14 seconds waiting, wait_woken() was called 3 times. yet another interesting example from failed replay-dual: LustreError: 5009:0:(ptlrpcd.c:513:ptlrpcd()) timeout 19, end 711, start 620, diff 91, count 9 |
| Comment by Gerrit Updater [ 20/Oct/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45308 |
| Comment by Peter Jones [ 20/Oct/21 ] |
|
Would we be better off reverting |
| Comment by Alex Zhuravlev [ 20/Oct/21 ] |
|
with the patch above I can't reproduce the problem. would like to hear Neil's opinion. |
| Comment by Gerrit Updater [ 20/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45308/ |
| Comment by Peter Jones [ 20/Nov/21 ] |
|
Landed for 2.15 |