[LU-15643] do not loop on OI Scrub on same FID
Created: 12/Mar/22  Updated: 22/Sep/23  Resolved: 06/Dec/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Major
Reporter: Andreas Dilger
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: hxr

Issue Links:
Related
is related to LU-14831 the OI scrub is triggered repeatedly Resolved
is related to LU-16411 convert scrub irreparable list to rha... Open
is related to LU-16380 conf-sanity test_108b: timeout at rea... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

OI Scrub should not be re-triggered repeatedly by the same FID. This currently happens for a variety of reasons: the same one or two FIDs keep causing OI Scrub to run, but running Scrub multiple times for them is not useful, and those FIDs should simply be ignored. For example, from LU-14831:

[Tue Jul 6 03:29:22 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013fd7:0x10a7c:0x0]/1254916266 with flags 0x4a: rc = 0
[Tue Jul 6 04:26:01 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1ada1:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 05:18:33 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1888c:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 06:12:26 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013fd7:0x10a7c:0x0]/1254916266 with flags 0x4a: rc = 0
[Tue Jul 6 07:04:40 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1888c:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 07:56:45 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1ada1:0x0]/148637947 with flags 0x4a: rc = 0

It looks like it has found 3 different FIDs and is looping over them (each appears twice in the six triggers above), and similar situations have been hit in many different cases.
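The repetition in the excerpt above can be checked mechanically. The following is a minimal sketch, not part of Lustre: the regular expression and the helper name `count_scrub_triggers` are assumptions made for illustration, based only on the message format shown in these six log lines.

```python
import re
from collections import Counter

# The six console messages quoted above from LU-14831.
LOG = """\
[Tue Jul 6 03:29:22 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013fd7:0x10a7c:0x0]/1254916266 with flags 0x4a: rc = 0
[Tue Jul 6 04:26:01 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1ada1:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 05:18:33 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1888c:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 06:12:26 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013fd7:0x10a7c:0x0]/1254916266 with flags 0x4a: rc = 0
[Tue Jul 6 07:04:40 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1888c:0x0]/148637947 with flags 0x4a: rc = 0
[Tue Jul 6 07:56:45 2021] Lustre: lustre-MDT0001: trigger OI scrub by RPC for [0x240013957:0x1ada1:0x0]/148637947 with flags 0x4a: rc = 0
"""

# Assumed pattern: matches the bracketed FID in a "trigger OI scrub" message.
FID_RE = re.compile(
    r"trigger OI scrub by RPC for (\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\])"
)


def count_scrub_triggers(log_text):
    """Return a Counter mapping each FID string to its number of scrub triggers."""
    return Counter(FID_RE.findall(log_text))


counts = count_scrub_triggers(LOG)
# FIDs that triggered OI Scrub more than once are the looping candidates.
repeated = {fid: n for fid, n in counts.items() if n > 1}
for fid, n in sorted(repeated.items()):
    print(fid, n)
```

Any FID that shows up in `repeated` is one that re-triggered the scrub, which is the pattern this ticket asks to suppress.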

What should be done is to keep track of the FIDs that have triggered auto-scrub (in memory is probably enough), and not re-trigger auto-scrub on those same FIDs. This is useful regardless of the root cause of the OI Scrub looping. If the "bad FID list" is kept in memory only, an OI Scrub would still be triggered when the MDS restarts, so the problem would not be swept under the rug completely, but at least it would not turn a small problem with one file into a major spike in server load that blocks access to files on the server.
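The proposal above can be sketched as a small in-memory filter. This is an illustration of the idea only, not the code from the landed patch (which lives in osd-ldiskfs): the class name `ScrubTriggerFilter`, the method `should_trigger`, and the representation of a FID as a `(seq, oid, ver)` tuple are all hypothetical.

```python
class ScrubTriggerFilter:
    """Hypothetical in-memory "bad FID list" for suppressing repeat auto-scrub."""

    def __init__(self):
        # Lives only in memory, so a fresh instance (standing in for an MDS
        # restart) starts empty and allows one new trigger per FID.
        self._seen = set()

    def should_trigger(self, fid):
        """Return True the first time a FID is seen, False on every repeat."""
        if fid in self._seen:
            return False
        self._seen.add(fid)
        return True


# The six trigger events from the LU-14831 excerpt, as (seq, oid, ver) tuples.
events = [
    (0x240013FD7, 0x10A7C, 0x0),
    (0x240013957, 0x1ADA1, 0x0),
    (0x240013957, 0x1888C, 0x0),
    (0x240013FD7, 0x10A7C, 0x0),
    (0x240013957, 0x1888C, 0x0),
    (0x240013957, 0x1ADA1, 0x0),
]

scrub_filter = ScrubTriggerFilter()
# Only the first occurrence of each FID is allowed to start a scrub.
triggered = [fid for fid in events if scrub_filter.should_trigger(fid)]
print(len(triggered), "of", len(events), "events would trigger OI Scrub")

# Simulated restart: the list is gone, so the same FID can trigger once more.
after_restart = ScrubTriggerFilter()
retriggered = after_restart.should_trigger(events[0])
```

With this filter the six events collapse to one scrub per FID, and the simulated restart shows why an in-memory list does not hide the underlying problem permanently. The sketch leaves the set unbounded; limiting its size is a separate design question not addressed here.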



 Comments   
Comment by Gerrit Updater [ 17/Mar/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46852
Subject: LU-15643 osd-ldiskfs: don't trigger scrub on irreparable FIDs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 65e10ed5237d1604c0bc75ec0e116989a80554d0

Comment by Xing Huang [ 09/Sep/22 ]

2022-09-10: Oleg confirmed that the fix patch introduces 100% failure in sanity-scrub tests

Comment by Xing Huang [ 22/Oct/22 ]

2022-10-22: The patch passed Maloo tests, and is being worked on to address Janitor failures.

Comment by Gerrit Updater [ 25/Oct/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48940
Subject: LU-15643 test: collect log for sanity-scrub test_8
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc960131852b668238d251979bb42313ee94f0c9

Comment by Xing Huang [ 26/Nov/22 ]

2022-11-26: The patch passed Maloo tests and Janitor tests, and is being reviewed.

Comment by Xing Huang [ 03/Dec/22 ]

2022-12-03: The patch passed Maloo tests and Janitor tests, and is ready to land to master.

Comment by Gerrit Updater [ 06/Dec/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46852/
Subject: LU-15643 osd-ldiskfs: don't trigger scrub on irreparable FIDs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 558784caad491be50e93ae60a31d4219a1e038bc

Comment by Peter Jones [ 06/Dec/22 ]

Landed for 2.16

Comment by Alex Zhuravlev [ 19/Dec/22 ]

I think this patch causes LU-16380

Generated at Sat Feb 10 03:20:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.