[LU-16362] landing async readahead caused an "uncovered page" panic during sanityN runs Created: 02/Dec/22 Updated: 10/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Alexey Lyashkov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Async readahead skips the "locking" phase: c2791674260 (Wang Shilong 2019-01-21 20:23:47 +0800 658) io->ci_state = CIS_LOCKED; But a separate cl io is created, so pages are sent outside of the original page lock, and a parallel blocking AST then causes the "uncovered page" panic. |
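The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical and none of these types are the real Lustre `cl_io` or lock structures. It assumes that an io which goes through the locking phase pins the lock, while an io whose state is set to "locked" directly pins nothing, so a blocking AST is free to cancel the lock underneath it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types, not the real Lustre structures. */
enum toy_io_state { TOY_IO_INIT, TOY_IO_LOCKED };

struct toy_lock {
	int  holders;   /* ios that pinned the lock in the locking phase */
	bool cancelled; /* set once a blocking AST has cancelled the lock */
};

struct toy_io {
	enum toy_io_state state;
	bool              pinned; /* true only if the locking phase ran */
};

/* Normal path: the locking phase pins the lock, then marks the io locked. */
void toy_io_lock(struct toy_io *io, struct toy_lock *lock)
{
	assert(!lock->cancelled);
	lock->holders++;
	io->pinned = true;
	io->state = TOY_IO_LOCKED;
}

/* Shortcut path: the state is set directly and nothing pins the lock. */
void toy_io_skip_lock(struct toy_io *io)
{
	io->pinned = false;
	io->state = TOY_IO_LOCKED;
}

/* Blocking AST: the lock can be cancelled only when nobody holds it. */
bool toy_blocking_ast(struct toy_lock *lock)
{
	if (lock->holders > 0)
		return false;
	lock->cancelled = true;
	return true;
}

/* A page is covered at submit time only while the lock is still alive. */
bool toy_page_covered(const struct toy_lock *lock)
{
	return !lock->cancelled;
}
```

In this model both ios report the same "locked" state, but only the one that called `toy_io_lock()` keeps `toy_blocking_ast()` from cancelling the lock before its pages are submitted.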
| Comments |
| Comment by Alexander Zarochentsev [ 09/Jan/23 ] |
|
LU-16332 is a similar issue. |
| Comment by Gerrit Updater [ 14/Jul/23 ] |
|
"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51677 |
| Comment by Alexey Lyashkov [ 26/Oct/23 ] |
|
After the first patch version ran into issues, I spent more time investigating it. The application started to read data from a file. It is a multi-stripe file, but the read wants only the first stripe, and the cl io was created for the offset range assigned to the first stripe.

@@ -1195,22 +1198,27 @@ static int lov_io_read_ahead(const struct lu_env *env,
 	if (unlikely(!r0->lo_sub[stripe]))
 		RETURN(-EIO);

 	sub = lov_sub_get(env, lio, lov_comp_index(index, stripe));
 	if (IS_ERR(sub))
 		RETURN(PTR_ERR(sub));

 	/* no RA outside of active stripe */
+	if (sub->sub_io.ci_state != CIS_IO_GOING &&
+	    sub->sub_io.ci_state != CIS_LOCKED)
+		LBUG();

I'm not sure what is best in this situation: just disable RA outside of the active region, or improve lock pinning to avoid submitting pages without the lock held. |
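The first option mentioned above (no RA outside of the active region) could look roughly like the following sketch. It is not taken from either patch: the `toy_*` names are hypothetical, the layout is assumed to be plain round-robin (RAID0) striping, and extents are inclusive byte ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inclusive byte range, not a Lustre type. */
struct toy_extent {
	uint64_t start;
	uint64_t end;
};

/* Stripe index holding a file offset, assuming round-robin striping. */
unsigned int toy_stripe_index(uint64_t off, uint64_t stripe_size,
			      unsigned int stripe_count)
{
	assert(stripe_size != 0 && stripe_count != 0);
	return (unsigned int)((off / stripe_size) % stripe_count);
}

/*
 * Clamp a read-ahead window to the range the io was created for.
 * Returns 1 if some part of the window survives, 0 if the window
 * lies entirely outside the io range and RA should be skipped.
 */
int toy_ra_clamp(const struct toy_extent *io, struct toy_extent *ra)
{
	if (ra->start > io->end || ra->end < io->start)
		return 0;
	if (ra->start < io->start)
		ra->start = io->start;
	if (ra->end > io->end)
		ra->end = io->end;
	return 1;
}
```

With a 1 MiB stripe size and an io covering only the first stripe chunk, a window that runs into the second chunk is cut at the chunk boundary, and a window that starts in the second chunk is rejected.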
| Comment by Alexey Lyashkov [ 29/Nov/23 ] |
|
It looks like I finally understand the issue. |
| Comment by Alexey Lyashkov [ 30/Nov/23 ] |
|
It might be an artifact of the PG_writeback change. Before it, the page was not unlocked until the IO was done, so the panic did not hit. |
| Comment by Gerrit Updater [ 10/Jan/24 ] |
|
"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53635 |