  Lustre / LU-4579

Timeout system horribly broken

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • None
    • Lustre 2.4.0-21chaos
    • 3
    • 12505

      It would appear that the timeout system is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see almost all of the mdt threads stuck waiting in ldlm_completion_ast(). The kernel prints warnings on the console that these threads have been sleeping for more than 1200 seconds, despite an at_max of 600 seconds.

      It gets worse than that: sometimes we see clients evicted by an MDT after 9000+ seconds. Obviously, that isn't acceptable.

      The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.

      We need to work out a plan to fix the timeouts in Lustre.

            Assignee: Oleg Drokin (green)
            Reporter: Christopher Morrone (morrone, Inactive)
            Votes: 0
            Watchers: 22
