[LU-17303] mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Environment:
      server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64
      clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64
      TOSS 4.6-6
    • Severity: 3

    Description

      mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem.

      Attachments

        1. ruby1066.log
          33 kB
        2. asp.dmesg.logs.tgz
          797 kB
        3. ruby.lnet-diag.tgz
          101 kB
        4. orelic.lnet-diag.tgz
          33 kB

        Activity

          defazio Gian-Carlo Defazio added a comment - We haven't seen this issue since applying the suggested lnet tunables to our relic clusters.

          charr Cameron Harr added a comment - Serguei, thank you very much for helping us nail down these LNet tunings. It's something we've wanted to do for a long time, but we had a hard time finding straightforward documentation on how to do so.

          charr Cameron Harr added a comment - Yes, we've been trying to move to 2.15 for over a year but have kept running into complex LNet issues with orelic and other routers. Talking internally this morning, we want to try 2.15 again in case these new tunings resolve the problems we were seeing.

          ssmirnov Serguei Smirnov added a comment - If you find you want further performance improvements, you may want to consider upgrading to a version that supports socklnd conns_per_peer and/or checking whether increasing socklnd nscheds helps.

          charr Cameron Harr added a comment - I'm not, though it would be interesting to see what difference there is. My main purpose here was to clean up the Lustre fabric and connection issues, and things do look significantly cleaner since implementing those changes.

          ssmirnov Serguei Smirnov added a comment - Cameron, are you planning to run any performance tests to compare "before" and "after"?

          charr Cameron Harr added a comment - Since setting those values yesterday, at least one of the nodes that was logging lots of errors (asp4) cleaned up immediately and has remained clean since.

          ssmirnov Serguei Smirnov added a comment - I think 640 for orelic tcp credits is justified in your case. I was just confused by the "up from 8" bit in your previous comment, but that was probably just a copy-paste thing, if you can confirm that "credits" were indeed at 512 before, as the "lnetctl net show -v 4" output indicated. So I think it makes sense to test with the new settings you came up with.

          charr Cameron Harr added a comment - Yes, they were at 512, per the explicit setting in the modprobe file. Since that was close to what the math of peer_credits * peers came out to be, I only adjusted it slightly to account for more peers on zrelic. If you think I would benefit from raising those TCP credits further (to 1024, say), I'm happy to do so, but there just aren't many clients in the tcp0 network.
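The "peer_credits * peers" rule of thumb mentioned in the comment above can be sketched as a small calculation. This is only an illustration of the arithmetic: the function name is hypothetical, and the peer_credits and peer-count values below are assumptions chosen so the products match the 512 and 640 figures discussed in the thread, not values reported in this ticket.

```python
def suggested_credits(peer_credits: int, n_peers: int) -> int:
    """Return peer_credits * n_peers, the sizing rule of thumb from the thread."""
    if peer_credits <= 0:
        raise ValueError("peer_credits must be positive")
    if n_peers < 0:
        raise ValueError("n_peers must not be negative")
    return peer_credits * n_peers


# ASSUMED values for illustration only; the ticket does not state them.
assumed_peer_credits = 8
before = suggested_credits(assumed_peer_credits, n_peers=64)
after = suggested_credits(assumed_peer_credits, n_peers=80)
print(before, after)
```

With these assumed inputs, a modest increase in the number of peers accounts for a modest increase in the network-wide credits value, which is the kind of small adjustment the comment describes.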

          ssmirnov Serguei Smirnov added a comment (edited) - Weren't orelic tcp "credits" at 512?

          People

            Assignee: green Oleg Drokin
            Reporter: defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 8
