Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.11.0, Lustre 2.12.4
-
3
Description
While investigating a customer issue, we found that the original trigger for the problem is a compatibility issue between Lustre 2.11 and older Lustre versions. The code introduced by LU-7787 to "clean up orphan object handling" was incomplete. The format of orphan names in the PENDING dir was changed in Lustre 2.11. The old-format names are not recognized by mdd_orphan_destroy() in Lustre 2.11, which leads to an endless loop. mdd_orphan_delete() has a check for the old-format name, but that check was not included in mdd_orphan_destroy().
Here is the relevant code segment from mdd_orphan_delete():
        rc = dt_delete(env, mdd->mdd_orphans, key, th);
        if (rc == -ENOENT) {
                key = mdd_orphan_key_fill_20(env, mdo2fid(obj));
                rc = dt_delete(env, mdd->mdd_orphans, key, th);
        }
The same ENOENT fallback sequence should be included in mdd_orphan_destroy().
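To illustrate the fallback pattern outside the kernel, here is a minimal, self-contained sketch. It is not Lustre code: the pending_dir table, the pending_insert()/pending_delete() helpers, orphan_delete_compat(), and both key formats are hypothetical stand-ins chosen for illustration, and the real orphan name formats are different. It only shows the idea of trying the new-format key first and retrying with the old-format key on -ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MAX_ORPHANS 8
#define KEY_LEN     64

/* Toy stand-in for the PENDING directory: a flat table of entry names. */
struct pending_dir {
        char names[MAX_ORPHANS][KEY_LEN];
        int  used[MAX_ORPHANS];
};

/* Hypothetical "new" key format (assumption, not the real Lustre format). */
static void key_fill_new(char *buf, size_t len, unsigned long long id)
{
        snprintf(buf, len, "%llx", id);
}

/* Hypothetical "old" key format (assumption, not the real Lustre format). */
static void key_fill_old(char *buf, size_t len, unsigned long long id)
{
        snprintf(buf, len, "%016llx:old", id);
}

/* Add an entry name to the first free slot; -ENOSPC if the table is full. */
static int pending_insert(struct pending_dir *dir, const char *key)
{
        assert(dir != NULL && key != NULL);
        for (int i = 0; i < MAX_ORPHANS; i++) {
                if (!dir->used[i]) {
                        snprintf(dir->names[i], KEY_LEN, "%s", key);
                        dir->used[i] = 1;
                        return 0;
                }
        }
        return -ENOSPC;
}

/* Remove an entry by exact name; -ENOENT if no such name exists. */
static int pending_delete(struct pending_dir *dir, const char *key)
{
        assert(dir != NULL && key != NULL);
        for (int i = 0; i < MAX_ORPHANS; i++) {
                if (dir->used[i] && strcmp(dir->names[i], key) == 0) {
                        dir->used[i] = 0;
                        return 0;
                }
        }
        return -ENOENT;
}

/* Try the new-format key first; on -ENOENT retry with the old-format key. */
static int orphan_delete_compat(struct pending_dir *dir, unsigned long long id)
{
        char key[KEY_LEN];
        int rc;

        key_fill_new(key, sizeof(key), id);
        rc = pending_delete(dir, key);
        if (rc == -ENOENT) {
                key_fill_old(key, sizeof(key), id);
                rc = pending_delete(dir, key);
        }
        return rc;
}
```

In this sketch, an entry that was inserted under the old-format name is still removed by orphan_delete_compat(), while a lookup that uses only the new-format key would keep returning -ENOENT for it. A caller that retries on failure without the fallback would never make progress on such an entry.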
It looks like LU-11418 is trying to solve this problem, but it only removes the symptoms, not the root cause.
Hello Yang,
The same situation applies in mdd_orphan_destroy(). mdd_orphan_key_fill() is executed in the mdd_orphan_destroy() -> mdd_orphan_declare_delete() code path, and this "filled" name is then used.
The first symptom of this issue we noticed was the message:
I believe ENOENT is returned because the wrong FID is parsed from the filename: the filename is in the old format, so mdd_orphan_key_fill_20() needs to be used.
Best regards,
Artem Blagodarenko.