[LU-15643] do not loop on OI Scrub on same FID Created: 12/Mar/22 Updated: 22/Sep/23 Resolved: 06/Dec/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | hxr |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
OI Scrub should not re-trigger on the same FID repeatedly. This currently happens for various reasons: the same one or two FIDs keep causing OI Scrub to run, but running Scrub multiple times for them isn't useful, and it should just ignore those FIDs. For example, from these console messages:

[Tue Jul 6 03:29:22 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013fd7:0x10a7c:0x0]/1254916266 with flags 0x4a: rc = 0
[Tue Jul 6 04:26:01 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1ada1:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 05:18:33 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1888c:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 06:12:26 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013fd7:0x10a7c:0x0]/1254916266 with flags 0x4a: rc = 0
[Tue Jul 6 07:04:40 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1888c:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 07:56:45 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1ada1:0x0]/148637947 with flags 0x4a: rc = 0

it looks like it has found 3 different FIDs and is looping, but similar situations have been hit in many different cases.

What should be done is to keep track of the FIDs that have triggered auto-scrub (in memory is probably enough), and not re-trigger auto-scrub on those same FIDs. This is useful regardless of what the root cause of the OI scrub looping is.

If the "bad FID list" is kept in memory only, then there would still be an OI Scrub triggered when the MDS restarts, so the problem wouldn't be swept under the rug completely, but at least it wouldn't turn a small problem with one file into a major spike in server load and block access to files on the server. |
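The in-memory "bad FID list" proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not the patch that landed. The names `struct fid`, `struct bad_fid_list`, `BAD_FID_MAX`, and `scrub_should_trigger()` are hypothetical, and the fixed-size ring with a cap of 64 entries is an assumed policy for keeping memory bounded. Locking is omitted; a real server would presumably need to serialize access to the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a FID printed as [seq:oid:ver] in the log. */
struct fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Assumed cap so the list cannot grow without bound. */
#define BAD_FID_MAX 64

/*
 * In-memory only: the list is lost on restart, so a scrub can still be
 * triggered once per FID after the server comes back up.
 */
struct bad_fid_list {
	struct fid fids[BAD_FID_MAX];
	unsigned int count;	/* number of valid entries */
	unsigned int next;	/* slot to overwrite next (ring order) */
};

static inline bool fid_equal(const struct fid *a, const struct fid *b)
{
	return a->seq == b->seq && a->oid == b->oid && a->ver == b->ver;
}

/*
 * Return true if auto-scrub should be triggered for @fid, and remember it.
 * Return false if @fid is already on the list, i.e. it has triggered a
 * scrub before and has not been evicted since.
 */
static inline bool scrub_should_trigger(struct bad_fid_list *list,
					const struct fid *fid)
{
	unsigned int i;

	assert(list != NULL && fid != NULL);

	for (i = 0; i < list->count; i++)
		if (fid_equal(&list->fids[i], fid))
			return false;

	/* Not seen before: record it, evicting the oldest entry if full. */
	list->fids[list->next] = *fid;
	list->next = (list->next + 1) % BAD_FID_MAX;
	if (list->count < BAD_FID_MAX)
		list->count++;

	return true;
}
```

With the three FIDs from the log above, the first message for each FID would still trigger a scrub, while the later repeats would be skipped until the MDS restarts or the entry is evicted from the ring.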
| Comments |
| Comment by Gerrit Updater [ 17/Mar/22 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46852 |
| Comment by Xing Huang [ 09/Sep/22 ] |
|
2022-09-10: Oleg confirmed that the fix patch introduces 100% failure in sanity-scrub tests |
| Comment by Xing Huang [ 22/Oct/22 ] |
|
2022-10-22: The patch passed Maloo tests, and is being worked on to address Janitor failures. |
| Comment by Gerrit Updater [ 25/Oct/22 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48940 |
| Comment by Xing Huang [ 26/Nov/22 ] |
|
2022-11-26: The patch passed Maloo tests and Janitor tests, and is being reviewed. |
| Comment by Xing Huang [ 03/Dec/22 ] |
|
2022-12-03: The patch passed Maloo tests and Janitor tests, and is ready to land to master. |
| Comment by Gerrit Updater [ 06/Dec/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46852/ |
| Comment by Peter Jones [ 06/Dec/22 ] |
|
Landed for 2.16 |
| Comment by Alex Zhuravlev [ 19/Dec/22 ] |
|
I think this patch causes |