Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.10.0
None
3
9223372036854775807
Description
In the last DNE soak test, we found recovery stuck for a very long time (> 4 hours). It looks like too many update log records were left over during recovery.
Each MDT has around 80k-100k records, which seems like too many:
wangdi-mac01:~ wangdi$ grep -r "mdt_index 1" /tmp/records | wc
76104 913248 15353624
wangdi-mac01:~ wangdi$ grep -r "mdt_index 0" /tmp/records | wc
91589 1099068 19376239
wangdi-mac01:~ wangdi$ grep -r "mdt_index 2" /tmp/records | wc
102798 1233576 21763151
wangdi-mac01:~ wangdi$ grep -r "mdt_index 3" /tmp/records | wc
98332 1179984 20821847
Unfortunately, there are not enough logs to help me understand why so many update records were left behind.
But it seems we can make cancellation smarter. In the current implementation, when a batchid is committed, only the update records for that batchid are cancelled, but we could actually cancel all update records whose batchid < the current committed batchid. Then, even if some update records are left behind for some reason, they can still be deleted by a later batchid commit.
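To illustrate the difference between the two cancellation policies, here is a minimal Python sketch. It is not Lustre code: the UpdateLog class and its add, cancel_exact and cancel_up_to methods are hypothetical names made up for this example. It also assumes that batchids are committed in increasing order, so that any record with a smaller batchid is safe to drop.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UpdateLog:
    """Hypothetical in-memory stand-in for one MDT's update log."""

    # Maps batchid -> update records still waiting to be cancelled.
    records: Dict[int, List[str]] = field(default_factory=dict)

    def add(self, batchid: int, record: str) -> None:
        self.records.setdefault(batchid, []).append(record)

    def cancel_exact(self, committed: int) -> int:
        """Current behaviour as described above: cancel only the
        records that belong to the committed batchid."""
        return len(self.records.pop(committed, []))

    def cancel_up_to(self, committed: int) -> int:
        """Proposed behaviour: cancel the committed batchid and every
        older one (assumes batchids commit in increasing order)."""
        stale = [b for b in self.records if b <= committed]
        return sum(len(self.records.pop(b)) for b in stale)

    def pending(self) -> int:
        return sum(len(r) for r in self.records.values())


if __name__ == "__main__":
    log = UpdateLog()
    for batchid in (1, 2, 3):
        log.add(batchid, "update for batch %d" % batchid)

    # Batch 2 commits; its cancel runs, but batch 1's cancel was lost.
    log.cancel_exact(2)
    print("left after exact cancel:", log.pending())

    # A later commit of batch 3 also sweeps up the leftover batch 1 record.
    log.cancel_up_to(3)
    print("left after range cancel:", log.pending())
```

With the exact-match policy a record whose cancel is missed stays in the log forever; with the range policy the next commit removes it, which is the self-healing property the proposal is after.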