Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
None
-
None
-
3
-
9223372036854775807
Description
On my local system, sanity test 160s frequently fails. I haven't seen it fail when I run it on its own, but it commonly fails if I run it together with other changelog tests:
$ for i in {0..20}; do MDSCOUNT=2 ./lustre/tests/auster -s sanity --only 160p,160s 2>/dev/null | if grep -q "sanity returned 1" ; then echo "fail" ; else echo "pass"; fi ; done | sort | uniq -c 11 fail 10 pass
I believe the explanation for the failure is that the garbage collection thread that 160s tries to launch does not get launched (when expected) – in fact, it does get launched, but too soon.
It would appear that the first metadata operation that the test performs:
test_mkdir -c $MDSCOUNT -H all_char $DIR/$tdir
sometimes results in the garbage collection thread being launched.
This results in mc_gc_time being set to "now" in mdd_changelog_store():
int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd, struct llog_changelog_rec *rec, struct thandle *th) { ... spin_lock(&mdd->mdd_cl.mc_lock); if (likely(mdd->mdd_changelog_gc && mdd->mdd_cl.mc_gc_task == MDD_CHLG_GC_NONE && ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time > mdd->mdd_changelog_min_gc_interval)) { ... /* next check in mdd_changelog_min_gc_interval anyway */ mdd->mdd_cl.mc_gc_time = ktime_get_real_seconds(); }
Then, when the metadata operations that are intended to trigger the GC thread occur:
# Generate one more changelog to trigger GC for ((i = 0; i < MDSCOUNT; i++)); do $LFS mkdir -i $i $DIR/$tdir/d$i.3 $DIR/$tdir/d$i.4 || error "create $DIR/$tdir/d$i.3 failed" done
... the thread is NOT launched, because the comparison mdd->mdd_changelog_min_gc_interval >= ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time fails:
int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd, struct llog_changelog_rec *rec, struct thandle *th) { ... /* time to recover some space ?? */ if (likely(!mdd->mdd_changelog_gc || mdd->mdd_cl.mc_gc_task != MDD_CHLG_GC_NONE || mdd->mdd_changelog_min_gc_interval >= ktime_get_real_seconds() - mdd->mdd_cl.mc_gc_time)) /* save a spin_lock trip */ goto out_put; if (CFS_FAIL_PRECHECK(OBD_FAIL_MDS_CHANGELOG_IDX_PUMP)) { spin_lock(&mdd->mdd_cl.mc_lock); mdd->mdd_cl.mc_index += cfs_fail_val; spin_unlock(&mdd->mdd_cl.mc_lock); } ... }
and the function exits before hitting the fault injection and attempting to launch a GC thread.
I'm not sure why the GC thread sometimes gets triggered too soon, and sometimes not. But it would seem that the simplest solution here is to make the test sleep for an extra second, so that even if the GC thread is triggered at the beginning of the test, enough time will still elapse for it to trigger again 2 seconds later:
$ git diff
diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index 4f825ce824..beae714095 100755
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh
@@ -20947,7 +20947,7 @@ test_160s() {
do_nodes $mdts $LCTL set_param fail_loc=0x16d fail_val=500000000 # ensure we are past the previous changelog_min_gc_interval set above
- local sleep2=$((start + 2 - SECONDS))
+ local sleep2=$((start + 3 - SECONDS))
(( sleep2 > 0 )) && echo "sleep $sleep2 for interval" && sleep $sleep2 # Generate one more changelog to trigger GC
With that change in place, the test reliably passes for me locally:
$ for i in {0..20}; do MDSCOUNT=2 ./lustre/tests/auster -s sanity --only 160p,160s 2>/dev/null | if grep -q "sanity returned 1" ; then echo "fail" ; else echo "pass"; fi ; done | sort | uniq -c 21 pass