  1. Lustre
  2. LU-17303

mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping"

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • None
    • Environment:
      server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64
      clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64
      TOSS 4.6-6
    • 3

    Description

      mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast. Stopping and starting Lustre does not fix the problem.

      Attachments

        1. asp.dmesg.logs.tgz
          797 kB
        2. orelic.lnet-diag.tgz
          33 kB
        3. ruby.lnet-diag.tgz
          101 kB
        4. ruby1066.log
          33 kB

        Activity

          defazio Gian-Carlo Defazio added a comment - We haven't seen this issue since applying the suggested lnet tunables to our relic clusters.
          charr Cameron Harr added a comment - Serguei, thank you very much for helping us nail down these LNet tunings. It's something we've wanted to do for a long time, but we had a hard time finding straightforward documentation on how to do so.
          charr Cameron Harr added a comment - Yes, we've been trying to move to 2.15 for over a year but have kept running into complex LNet issues with orelic and other routers. Talking internally this morning, we want to try 2.15 again in case these new tunings resolve the problems we were seeing.

          ssmirnov Serguei Smirnov added a comment - If you want further performance improvements, you may want to consider upgrading to a version that supports socklnd conns_per_peer, and/or check whether increasing socklnd nscheds helps.
          charr Cameron Harr added a comment - I'm not, though it would be interesting to see what difference there would be. My main purpose here was to clean up the Lustre fabric and connection issues, and things do look significantly cleaner since implementing those changes.

          ssmirnov Serguei Smirnov added a comment - Cameron, are you planning to run any performance tests to compare "before" and "after"?
          charr Cameron Harr added a comment - Since setting those values yesterday, at least one of the nodes (asp4) that was logging lots of errors cleaned up immediately and has remained clean since.

          People

            green Oleg Drokin
            defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 9
