Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
None
-
None
-
Lustre 2.4.0-21chaos
-
3
-
12505
Description
The timeout system appears to be horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds.
The problems get worse than that: sometimes we'll see clients evicted by an mdt only after 9000+ seconds. Obviously, that isn't acceptable.
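To put those numbers in perspective, here is a rough sketch of the arithmetic. It is not part of any Lustre tooling: the function name `overrun_factor` and the labels are made up for illustration, and the only inputs are the figures quoted in this report. Because the observed waits are reported as "in excess of" these values, the factors are lower bounds.

```python
# Illustrative arithmetic only; nothing here queries a real Lustre system.
AT_MAX_SECONDS = 600  # at_max as stated in this report


def overrun_factor(observed_seconds, at_max=AT_MAX_SECONDS):
    """Return how many times larger than at_max an observed wait is."""
    return observed_seconds / at_max


# Lower bounds taken from the description above (hypothetical labels).
observed = {
    "mdt thread sleep warning": 1200,
    "client eviction delay": 9000,
}

for label, seconds in observed.items():
    print(f"{label}: {seconds}s is at least {overrun_factor(seconds):.1f}x at_max")
```

In other words, the thread sleeps are at least twice the configured ceiling, and the evictions arrive at least fifteen times later than at_max would suggest.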
The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.
We need to work out a plan to fix the timeouts in Lustre.