[LU-8714] too many update logs during soak-test. Created: 17/Oct/16  Updated: 30/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Di Wang Assignee: Di Wang
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Related
is related to LU-8250 MDT recovery stalled on secondary node Resolved
is related to LU-8794 update_log_dir consuming 1.1TB on MDT... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In the last DNE soak test, recovery was stuck for a very long time (> 4 hours). It looks like too many update log records were left over during recovery.

Each MDT has around 80k-100k records, which seems like too many:

wangdi-mac01:~ wangdi$ grep -r "mdt_index 1" /tmp/records  | wc
   76104  913248 15353624
wangdi-mac01:~ wangdi$ grep -r "mdt_index 0" /tmp/records  | wc
   91589 1099068 19376239
wangdi-mac01:~ wangdi$ grep -r "mdt_index 2" /tmp/records  | wc
  102798 1233576 21763151
wangdi-mac01:~ wangdi$ grep -r "mdt_index 3" /tmp/records  | wc
   98332 1179984 20821847

Unfortunately, there are not enough logs to help me understand why so many update records were left behind.

But it seems we can make cancellation smarter. In the current implementation, when one batchid is committed, it only cancels the update records for that batchid, but we could actually cancel all update records whose batchid < the current committed batchid. Then even if some update records are left behind for some reason, they can still be deleted by a later batchid commit.
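
A minimal sketch of the difference between the two cancellation policies, in Python rather than the Lustre C code. The UpdateLog class and its method names are hypothetical and only model the idea; the sketch also assumes batchids are committed in increasing order, so any record with a lower batchid is safe to cancel.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UpdateLog:
    """Toy in-memory stand-in for one MDT's update log (hypothetical)."""
    records: Dict[int, List[str]] = field(default_factory=dict)

    def add(self, batchid: int, record: str) -> None:
        # Store the record under the batchid it belongs to.
        self.records.setdefault(batchid, []).append(record)

    def cancel_exact(self, committed: int) -> int:
        # Current behaviour as described: cancel only the committed batchid.
        return len(self.records.pop(committed, []))

    def cancel_up_to(self, committed: int) -> int:
        # Proposed behaviour: cancel the committed batchid and every
        # record with a smaller batchid (assumes in-order commits).
        stale = [b for b in self.records if b <= committed]
        return sum(len(self.records.pop(b)) for b in stale)

    def remaining(self) -> int:
        # Number of records still left in the log.
        return sum(len(r) for r in self.records.values())


def demo() -> tuple:
    """Commit batchid 3 after the cancels for batchids 1 and 2 were missed."""
    exact, smart = UpdateLog(), UpdateLog()
    for log in (exact, smart):
        for batchid in (1, 2, 3):
            log.add(batchid, "update for batch %d" % batchid)
    exact.cancel_exact(3)
    smart.cancel_up_to(3)
    return exact.remaining(), smart.remaining()
```

With exact-match cancellation, the records for batchids 1 and 2 stay in the log indefinitely once their own cancel is missed; with the range-based variant, the next commit cleans them up.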


Generated at Sat Feb 10 02:19:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.