The situation should be like this:
1) When you upgraded your MDS with the patch http://review.whamcloud.com/#patch,sidebyside,3467,7,lustre/osd-ldiskfs/osd_handler.c, the 64bithash/32bithash issue was introduced on your system, because the "osd_thread_info" is reused without being fully reset when a service thread switches from processing one RPC to the next (see the first sketch after this list).
2) For an old client, whether 32-bit or 64-bit, as long as it did NOT claim the OBD_CONNECT_64BITHASH flag when connecting to the MDS, a readdir RPC from that client would cause "osd_thread_info::oti_it_ea::oie_file::f_mode" to be set to "FMODE_32BITHASH". Once such a readdir RPC had happened, the "FMODE_32BITHASH" flag on the related "osd_thread_info" would not be cleared until the RPC service was restarted on the MDS.
3) As long as "FMODE_32BITHASH" was set, the directory hash produced by that RPC service thread would use the major hash only (32 bits). That is why we saw 32-bit dir hashes returned.
4) Not all of the RPC service threads had "osd_thread_info::oti_it_ea::oie_file::f_mode" set to "FMODE_32BITHASH"; it depends on whether old clients triggered those RPC service threads to serve readdir RPCs. If an RPC service thread did not have "FMODE_32BITHASH" set, it would generate 64-bit hashes, which is why we also saw some 64-bit dir hashes returned.
5) A readdir RPC from one client can be served by any RPC (readpage) service thread. So sometimes a readdir RPC was served by a thread with "FMODE_32BITHASH" set, and sometimes by a thread without it. For a large directory, one "ls -l dir" command may trigger several readdir RPCs. If those RPCs were handled by different RPC service threads, some with "FMODE_32BITHASH" set and some without, then when the client sent a 32-bit hash back to an RPC service thread that did NOT have "FMODE_32BITHASH", that thread would interpret the 32-bit hash (from the client) as "major = 0, minor = 32bithash", which is wrong, so it could not locate the right position (see the second sketch after this list).
6) For a 2.x client, one readdir RPC fetches back at most 256 pages, but for a 1.8 client, only a single page per RPC. So the number of readdir RPCs for the same-sized directory is different, and more readdir RPCs means more chances of failure. That is why the failure is easier to reproduce on a 1.8 client than on a 2.x client.
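
Below is a minimal standalone sketch (not the actual osd_handler.c code; the struct layouts and flag values are illustrative assumptions) of the mechanism in points 1) and 2): the per-thread info is reused across RPCs and f_mode is only ever OR-ed in, never cleared, so one readdir from an old client taints the thread for every later request it serves.

#include <stdio.h>

#define FMODE_32BITHASH        0x200   /* illustrative value only */
#define OBD_CONNECT_64BITHASH  0x1     /* illustrative value only */

struct oie_file        { unsigned int f_mode; };
struct osd_it_ea       { struct oie_file oie_file; };
struct osd_thread_info { struct osd_it_ea oti_it_ea; };

/* One reused per-thread context, as described in point 1). */
static struct osd_thread_info info;

static void serve_readdir(unsigned long long connect_flags)
{
        /* An old client that did not claim OBD_CONNECT_64BITHASH gets the
         * 32-bit hash mode set on the reused per-thread file handle ... */
        if (!(connect_flags & OBD_CONNECT_64BITHASH))
                info.oti_it_ea.oie_file.f_mode |= FMODE_32BITHASH;

        /* ... but nothing clears FMODE_32BITHASH afterwards, so the next
         * RPC served by this thread inherits it. */
        printf("f_mode & FMODE_32BITHASH = %d\n",
               !!(info.oti_it_ea.oie_file.f_mode & FMODE_32BITHASH));
}

int main(void)
{
        serve_readdir(OBD_CONNECT_64BITHASH); /* new client: prints 0 */
        serve_readdir(0);                     /* old client: prints 1 */
        serve_readdir(OBD_CONNECT_64BITHASH); /* new client, thread now tainted: prints 1 */
        return 0;
}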
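
And here is a hypothetical sketch of the cookie mismatch in point 5); the exact hash/cookie encoding in ldiskfs may differ, but it shows why a 32-bit cookie handed back to a 64-bit-hash thread decodes to "major = 0, minor = 32bithash" and lands at the wrong directory position.

#include <stdio.h>
#include <stdint.h>

/* A 64-bit-hash thread returns both hash halves in one readdir cookie. */
static uint64_t encode_64bithash(uint32_t major, uint32_t minor)
{
        return ((uint64_t)major << 32) | minor;
}

/* A thread with FMODE_32BITHASH set returns the major hash only. */
static uint64_t encode_32bithash(uint32_t major)
{
        return major;
}

int main(void)
{
        uint32_t major = 0xdeadbeef, minor = 0x12345678;

        uint64_t ok_cookie  = encode_64bithash(major, minor); /* untainted thread */
        uint64_t bad_cookie = encode_32bithash(major);        /* tainted thread   */

        /* The client sends the tainted thread's cookie back, and an
         * untainted thread decodes it as a 64-bit cookie: */
        uint32_t got_major = bad_cookie >> 32;          /* 0          */
        uint32_t got_minor = bad_cookie & 0xffffffffU;  /* 0xdeadbeef */

        printf("64-bit thread cookie:  %#llx\n", (unsigned long long)ok_cookie);
        printf("tainted thread cookie: %#llx\n", (unsigned long long)bad_cookie);
        printf("decoded from tainted cookie: major=%#x minor=%#x (expected major=%#x)\n",
               got_major, got_minor, major);
        /* major comes back as 0, so the thread seeks to the wrong position. */
        return 0;
}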
Nice catch! We do have some clients running a 1.8.5-5chaos tag (old BG/P systems), and interestingly enough, they are only mounting the two filesystems that we see this issue on. So it all seems to add up, IMO. The 1.8.5-5chaos clients appear to have tainted a subset of the MDS threads, causing 64bit-enabled clients to see this issue when the readdir takes more than 1 RPC and the readdir is serviced by a mix of tainted and untainted MDS threads.