While watching the performance Robinhood reported as it scanned a test filesystem with 24 million files, I noticed something strange. Performance was gradually decreasing over time because the STAGE_GET_INFO_FS stage was taking longer and longer. That's not good.
After a little investigation, it was clear that the open("./lustre/fid", O_RDONLY) call was getting slower and slower on average. From what I observed, it could take as little as 0.1ms or as long as 15ms. The subsequent GETSTRIPE ioctl() was always fast.
The Lustre debug logs show that all the time is being spent in do_lookup() because there is never a valid dentry on the client. That means that for every open we attempt to enqueue a lock for this special file. That lock enqueue fails on the MDT with ELDLM_LOCK_ABORTED, but the open still seems to succeed. To make matters worse, the client serializes the enqueues because they are all IT_OPEN.
To get a handle on what was going on I wrote a trivial reproducer which just opens and closes a specified file repeatedly. If you run just a few iterations and grab the Lustre debug logs you can easily see the failing enqueues.
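For anyone who wants to try this themselves, here is a minimal sketch of that kind of open/close loop. It is not the reproducer I used, just an illustration of the same idea: the function names (time_open_close, summarize) and the command-line arguments are made up for this example, and it only times the open() call itself, not the ioctl().

```python
import os
import sys
import time


def time_open_close(path, iterations):
    """Open and close `path` read-only `iterations` times.

    Returns a list with the latency of each open() call in milliseconds.
    (Hypothetical helper for illustration; not part of Lustre or Robinhood.)
    """
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fd = os.open(path, os.O_RDONLY)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        os.close(fd)
    return latencies_ms


def summarize(latencies_ms):
    """Return count, min, average, and max of a non-empty latency list."""
    return {
        "count": len(latencies_ms),
        "min_ms": min(latencies_ms),
        "avg_ms": sum(latencies_ms) / len(latencies_ms),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    if len(sys.argv) != 3 or int(sys.argv[2]) < 1:
        sys.exit(f"usage: {sys.argv[0]} <path> <iterations>")
    stats = summarize(time_open_close(sys.argv[1], int(sys.argv[2])))
    print(
        f"opens={stats['count']} min={stats['min_ms']:.3f}ms "
        f"avg={stats['avg_ms']:.3f}ms max={stats['max_ms']:.3f}ms"
    )
```

Run it with the path to the file you want to open and an iteration count. Keeping the iteration count small makes it easier to match individual opens against the enqueues in the debug log.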
I suspect if the enqueues were allowed to succeed this would be a non-issue since we would have a valid read lock on the client. However, I haven't tested that and it's not at all clear to me what you guys are planning to do with that .lustre directory. Perhaps you can propose a fix.