Details
Description
Running a 1000-node user job (call it "ben") on the jade cluster results in 'lock callback timer expired' messages on the MDS console. Transactions then begin taking a very long time, or fail entirely when the client is evicted.
The first lock timeouts are seen within 5 minutes of starting the job.
After the MDS stops responding in ... The MDS is still up and debug logs can be dumped; I'll attach some.
There is no evidence of network issues: the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server. Network monitoring tools also indicate no network issues.