While investigating the customer issue, we found that the original trigger for the problem is a compatibility issue between Lustre 2.11 and older Lustre versions. The code introduced by LU-7787 to "clean up orphan object handling" was incomplete. The format of orphan names in the PENDING directory changed in Lustre 2.11. mdd_orphan_destroy() in Lustre 2.11 does not recognize old-format names, which leads to an endless loop. mdd_orphan_delete() contains a check for the old-format name, but that check was not added to mdd_orphan_destroy().
Here is the relevant code segment from mdd_orphan_delete():
This same ENOENT sequence should be included in mdd_orphan_destroy().
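To illustrate the pattern being proposed, here is a minimal userspace sketch of a delete that retries with the old-format name when the new-format name returns ENOENT. It is not the Lustre code: struct pending_dir, key_new(), key_old(), pending_insert(), pending_delete() and orphan_delete_compat() are hypothetical names, and the "new-"/"old-" name formats are placeholders, not the real PENDING entry formats.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define PENDING_MAX 8
#define NAME_LEN    64

/* Toy stand-in for the PENDING directory: a fixed table of entry names. */
struct pending_dir {
	char names[PENDING_MAX][NAME_LEN];
	int  used[PENDING_MAX];
};

/* Placeholder for the post-2.11 name format (assumption, not the real one). */
static void key_new(char *buf, size_t len, unsigned long long seq,
		    unsigned int oid)
{
	snprintf(buf, len, "new-%llx-%x", seq, oid);
}

/* Placeholder for the pre-2.11 name format (assumption, not the real one). */
static void key_old(char *buf, size_t len, unsigned long long seq,
		    unsigned int oid)
{
	snprintf(buf, len, "old-%llx-%x", seq, oid);
}

/* Add an entry; returns 0 on success or -ENOSPC if the table is full. */
static int pending_insert(struct pending_dir *d, const char *name)
{
	int i;

	for (i = 0; i < PENDING_MAX; i++) {
		if (!d->used[i]) {
			strncpy(d->names[i], name, NAME_LEN - 1);
			d->names[i][NAME_LEN - 1] = '\0';
			d->used[i] = 1;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Remove an entry by exact name; returns -ENOENT if it is not present. */
static int pending_delete(struct pending_dir *d, const char *name)
{
	int i;

	for (i = 0; i < PENDING_MAX; i++) {
		if (d->used[i] && strcmp(d->names[i], name) == 0) {
			d->used[i] = 0;
			return 0;
		}
	}
	return -ENOENT;
}

/*
 * Try the new-format name first. If that lookup fails with -ENOENT,
 * retry with the old-format name so entries created by an older
 * release are still removed instead of being left behind.
 */
static int orphan_delete_compat(struct pending_dir *d,
				unsigned long long seq, unsigned int oid)
{
	char key[NAME_LEN];
	int rc;

	key_new(key, sizeof(key), seq, oid);
	rc = pending_delete(d, key);
	if (rc == -ENOENT) {
		key_old(key, sizeof(key), seq, oid);
		rc = pending_delete(d, key);
	}
	return rc;
}
```

In this sketch, an entry inserted under the old-format name is removed by orphan_delete_compat() on the second attempt. Without the retry the entry would remain in the table, so a cleanup pass that keeps scanning until the directory is empty would see it again on every iteration.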
It looks like LU-11418 is trying to solve the problem, but it addresses the symptoms, not the root cause.