I'm trying to get this bug moving forward, since it seems to be stalled at the moment but is probably the #2 or #3 reason for review-zfs test failures. Looking at the client stack traces, they all appear to have lfs stuck. Some are waiting on the MDS to reply to an RPC:
https://testing.hpdd.intel.com/test_sets/603bbaca-5eae-11e4-a2a3-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a62db834-4ffa-11e4-9892-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7c03803e-8529-11e4-8e46-5254006e85c2
though others appear to be waiting on the client-side statahead thread to start, so there may be two related issues here:
https://testing.hpdd.intel.com/test_sets/c75ba76e-5ee6-11e4-badb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6b3726e2-8516-11e4-985f-5254006e85c2
Unfortunately, I don't see anything in progress on the MDT at all, though the chance of catching an RPC in progress would be small. I also couldn't find the statahead thread in the client stack traces. I haven't looked at the client debug logs yet, but I expect that they will help narrow down exactly what is happening on the client: is it stuck during first access/statahead, or is there some long-running problem that only shows up once a large number of files have already been accessed?
So could you run with increased debugging, so that you can trace by RPC xid and see what happened with that request? Could it be that the MDS replied and the client missed it? Implausible, but it's a start toward seeing what's going on.
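For the xid tracing, here is a rough Python sketch of how a dumped client debug log could be filtered. It assumes rpctrace debugging was enabled and that the log contains "Sending RPC" and "Completed RPC" lines whose last field has the form pname:cluuid:pid:xid:nid:opc. That line shape, the sample lines, and the helper names `rpc_event`, `trace_xid` and `unanswered_xids` are all assumptions for illustration, so adjust the regex to whatever the real logs look like.

```python
import re

# Assumed line shape (hypothetical; adapt to the actual debug log):
#   "... Sending RPC pname:cluuid:pid:xid:nid:opc lfs:uuid-a:100:42:10.0.0.1@tcp:36"
RPC_RE = re.compile(
    r"(Sending|Completed) RPC pname:cluuid:pid:xid:nid:opc (\S+)"
)


def rpc_event(line):
    """Return (state, xid) for an RPC trace line, or None for other lines."""
    match = RPC_RE.search(line)
    if match is None:
        return None
    fields = match.group(2).split(":")
    if len(fields) < 6:
        return None
    # The xid is the fourth colon-separated field in the assumed format.
    return match.group(1), fields[3]


def trace_xid(lines, xid):
    """Collect every RPC trace line that belongs to the given xid."""
    result = []
    for line in lines:
        event = rpc_event(line)
        if event is not None and event[1] == xid:
            result.append(line)
    return result


def unanswered_xids(lines):
    """List xids that were sent but never seen as completed."""
    pending = {}
    for line in lines:
        event = rpc_event(line)
        if event is None:
            continue
        state, xid = event
        if state == "Sending":
            pending[xid] = line
        else:
            pending.pop(xid, None)
    return sorted(pending)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "Sending RPC pname:cluuid:pid:xid:nid:opc lfs:uuid-a:100:42:10.0.0.1@tcp:36",
        "Completed RPC pname:cluuid:pid:xid:nid:opc lfs:uuid-a:100:42:10.0.0.1@tcp:36",
        "Sending RPC pname:cluuid:pid:xid:nid:opc lfs:uuid-a:100:43:10.0.0.1@tcp:36",
    ]
    print(unanswered_xids(sample))
```

Any xid this reports as unanswered on the client can then be searched for in the MDS log to see whether the request arrived there and whether a reply was sent.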