When upgrading file systems from Cray's 1.8.6 to 2.4.1, we've seen exactly this bug.
Today, I tested upgrading a file system from WC 1.8.9 (the latest as of a month ago) to WC 2.4.0 as released, and I'm seeing this bug.
The patch (http://review.whamcloud.com/#/c/5179/) is definitely present in both cases, but the ".." directory entry is not being placed correctly.
We cause this problem simply by formatting a file system under 1.8, creating a directory with at least one file in it, stopping the file system and adding the dirdata attribute to the MDT, then starting the same file system with 2.4/2.4.1 servers and moving that directory to a new location. (We have not tried 2.5 or 2.6 yet; we plan to test 2.5 just in case.)
We see exactly the errors described above when we run e2fsck. Looking at the directory block before and after the rename (the move of the directory to a new location), we see that the old ".." entry is still present, but the record length of the "." entry has been extended to cover it, making that effectively unused space. The new ".." entry, with the FID included in it, is placed in the first location where enough space is found. (In our simple test case, it's the last entry in the directory. On real file systems, it ends up at some intermediate point that was previously unused space.)
When we run the same test but instead make several files in the directory and delete the first one, that leaves a larger space after the "." entry available for the new ".." entry. In that case, the new ".." entry is placed in that location and is the second entry, so everything works fine.
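To make the two cases above concrete, here is a toy Python model of first-fit placement within a single directory block. This is not the ldiskfs code, just a sketch of the behaviour we observe. The 4096-byte block size, the 18 bytes assumed for the FID payload, and every function name are assumptions made for illustration.

```python
BLOCK_SIZE = 4096   # assumed block size
FID_EXTRA = 18      # ASSUMPTION: extra bytes a dirdata ".." needs for the FID

def needed(name, extra=0):
    """Bytes an entry needs: 8-byte header + name (+ extra), rounded up to 4."""
    return (8 + len(name) + extra + 3) & ~3

def make_dir(names):
    """Build a 1.8-style block: ".", "..", then files; last entry fills the block."""
    entries = [{"name": n, "rec_len": needed(n), "used": needed(n)}
               for n in [".", ".."] + names]
    entries[-1]["rec_len"] = BLOCK_SIZE - sum(e["rec_len"] for e in entries[:-1])
    return entries

def delete(entries, name):
    """Delete an entry the ext-style way: the previous entry absorbs its rec_len."""
    i = next(k for k, e in enumerate(entries) if e["name"] == name)
    entries[i - 1]["rec_len"] += entries[i]["rec_len"]
    del entries[i]

def move_dotdot_first_fit(entries):
    """Model the observed behaviour: "." swallows the old "..", then the new,
    larger ".." goes into the first entry with enough slack. Returns its index."""
    entries[0]["rec_len"] += entries[1]["rec_len"]   # old ".." becomes dead space
    del entries[1]
    want = needed("..", FID_EXTRA)
    for i, e in enumerate(entries):
        slack = e["rec_len"] - e["used"]
        if slack >= want:
            e["rec_len"] = e["used"]
            entries.insert(i + 1, {"name": "..", "rec_len": slack, "used": want})
            return i + 1
    raise RuntimeError("no room in this block")

if __name__ == "__main__":
    simple = make_dir(["file1"])
    print("simple case: '..' lands at index", move_dotdot_first_fit(simple))

    roomy = make_dir(["first_file.dat", "file2", "file3"])
    delete(roomy, "first_file.dat")
    print("after deleting first file: '..' lands at index",
          move_dotdot_first_fit(roomy))
```

In the simple case the slack after "." is only the 12 bytes of the old ".." entry, which is too small under these assumptions, so the new ".." falls through to the end of the block. After deleting the first file, the slack after "." is large enough and ".." stays second.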
It is very much as though the 5179 patch isn't there at all. (Remember that this problem is seen going between WC 1.8.9 and WC 2.4.0; this is not a problem just in the Cray source.)
My examination of the code hasn't helped me figure out what's wrong there, other than that the ldiskfs_update_dotdot code is extremely complex. I may post some questions about that as well.
When we use e2fsck to 'fix' this, it simply forces the new ".." entry into the second place, overwriting the first actual 'file' entry in the directory. The result is a consistent directory that will survive further usage, and a file dumped to lost+found.
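For anyone wanting to check a dumped directory block by hand, here is a minimal Python sketch of the invariant e2fsck is enforcing: "." must be the first entry and ".." the second. It assumes a linear (non-htree) directory block with the standard ext-style entry header (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file_type, little-endian) and a block size small enough that rec_len is stored literally. The helper names are hypothetical, and pack_dirent exists only to build test blocks.

```python
import struct

HEADER = "<IHBB"  # inode, rec_len, name_len, file_type

def pack_dirent(inode, name, rec_len, file_type=0):
    """Build one entry padded to rec_len (test helper only)."""
    raw = name.encode()
    body = struct.pack(HEADER, inode, rec_len, len(raw), file_type) + raw
    if len(body) > rec_len:
        raise ValueError("rec_len too small for name")
    return body.ljust(rec_len, b"\0")

def iter_dirents(block):
    """Walk the rec_len chain and yield (offset, name, rec_len) per live entry."""
    off = 0
    while off + 8 <= len(block):
        inode, rec_len, name_len, _ftype = struct.unpack_from(HEADER, block, off)
        if rec_len < 8 or rec_len % 4 or off + rec_len > len(block):
            raise ValueError(f"bad rec_len {rec_len} at offset {off}")
        if inode != 0:  # inode 0 marks an unused entry
            name = block[off + 8: off + 8 + name_len].decode("utf-8", "replace")
            yield off, name, rec_len
        off += rec_len

def dotdot_is_second(block):
    """True if the first two live entries are "." and "..", in that order."""
    names = [name for _off, name, _len in iter_dirents(block)]
    return names[:2] == [".", ".."]
```

A block laid out like our simple test case after the rename ("." with an extended rec_len, then the file, then "..") fails this check, which matches what e2fsck complains about.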
Additionally, this bug can cause the MDT to go read-only. It's not quite clear to us what's causing that, but the reported errors from e2fsck all look related, and after running e2fsck, the MDT can be mounted again.
To make clear the importance of this:
It looks like all 1.8 file systems upgraded to 2.4+ (haven't tested 2.5 or 2.6 yet) will experience directory corruption upon enabling the dirdata feature and moving old directories. This will not affect all old directories, but it will affect many of them.
Rick - You can keep this from happening to any more directories by unmounting your MDT, turning off dirdata, and remounting. However, directories that are already damaged stay damaged; running e2fsck and recovering files from lost+found is your only option for those. The software fix* will prevent further damage, but it won't help with existing damaged directories.
*About that fix: You're almost certainly seeing the closely related https://jira.hpdd.intel.com/browse/LU-5626, rather than LU-2638. LU-2638 was fixed before the release of 2.4, so unless you updated to 2.1 from 1.8 and are still running 2.1 (or 2.2 or 2.3, I guess), it's LU-5626. LU-5626 is similar but subtly different, and did get into 2.4 and 2.5 as released. Note also that if you're hitting this situation and running without the fix, there's also the possibility of a kernel panic on your MDS, so the workaround of turning off dirdata temporarily is a very good idea (if you can't get a software update quickly).