Lustre / LU-4579

Timeout system horribly broken

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • None
    • Lustre 2.4.0-21chaos
    • 3
    • 12505

    Description

      It would appear that the timeout system is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds.

      The problems get worse than that: sometimes we'll see clients evicted by an mdt after 9000+ seconds. Obviously, that isn't acceptable.
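
To put the reported numbers side by side, here is a minimal sketch that compares the observed waits with the configured at_max. It assumes, as the report implies, that at_max is meant to be an upper bound on how long such waits last. The helper names (overrun_factor, exceeds_limit) are hypothetical and not part of Lustre; the durations are the ones quoted above.

```python
# Hypothetical helpers for illustration only; nothing here is Lustre code.
AT_MAX_SECONDS = 600  # at_max value quoted in the report


def overrun_factor(observed_seconds, at_max=AT_MAX_SECONDS):
    """Return how many times longer than at_max an observed wait lasted."""
    return observed_seconds / at_max


def exceeds_limit(observed_seconds, at_max=AT_MAX_SECONDS):
    """True when an observed wait is strictly longer than at_max."""
    return observed_seconds > at_max


# Durations taken from the report: thread sleep warnings and late evictions.
observations = [("mdt thread sleep", 1200), ("client eviction", 9000)]

for label, seconds in observations:
    print(f"{label}: {seconds}s is {overrun_factor(seconds):.1f}x at_max")
```

Under that assumption, the sleeping threads overran the limit by at least a factor of 2 and the late evictions by at least a factor of 15.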

      The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.

      We need to work out a plan to fix the timeouts in Lustre.

      Attachments

        1. sysrq-t.catalyst141.client.txt
          826 kB
          Ned Bass
        2. sysrq-t.cider-mds1.txt
          1.58 MB
          Ned Bass

            People

              green Oleg Drokin
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 22