LU-17303: mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Environment:
      server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64
      clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64
      TOSS 4.6-6

    Description

      mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem.

      Attachments

        1. asp.dmesg.logs.tgz
          797 kB
        2. orelic.lnet-diag.tgz
          33 kB
        3. ruby.lnet-diag.tgz
          101 kB
        4. ruby1066.log
          33 kB

        Activity

          charr Cameron Harr added a comment -

          Yes, we've been trying to move to 2.15 for over a year but have kept running into complex LNet issues with orelic and other routers. Talking internally this morning, we want to try 2.15 again in case these new tunings resolve the problems we were seeing.


          ssmirnov Serguei Smirnov added a comment -

          If you want to achieve performance improvements, you may want to consider upgrading to a version that supports socklnd conns_per_peer and/or checking whether increasing socklnd nscheds helps.
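
          Assuming a router running a socklnd new enough to expose these as module parameters (conns_per_peer only appeared around 2.15), a minimal sketch of where the two knobs mentioned above would typically be set; the values are purely illustrative:

            # /etc/modprobe.d/ksocklnd.conf (illustrative values, not a recommendation)
            # nscheds: number of socklnd scheduler threads per CPT pool
            # conns_per_peer: TCP connections opened per peer (newer socklnd only)
            options ksocklnd nscheds=4 conns_per_peer=4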
          charr Cameron Harr added a comment -

          I'm not, though that would be interesting to see what difference there would be. My main purpose here was to clean up the Lustre fabric and connection issues, and things do look significantly cleaner since implementing those changes.


          ssmirnov Serguei Smirnov added a comment -

          Cameron, are you planning to run any performance tests to compare "before" and "after"?
          charr Cameron Harr added a comment -

          Since setting those values yesterday, at least one of the nodes (asp4) that was logging lots of errors cleaned up immediately and has remained clean since.


          ssmirnov Serguei Smirnov added a comment -

          I think 640 for orelic tcp credits is justified in your case. I was just confused by the "up from 8" bit in your previous comment, but that was probably just a copy-paste artifact, if you can confirm that "credits" were indeed at 512 before, as the "lnetctl net show -v 4" output indicated.

          So I think it makes sense to test with the new settings you came up with.
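
          For reference, the per-NI "credits" value being discussed can be confirmed with the same verbose listing mentioned above; the --net filter and grep pattern are only illustrative, since the verbose YAML also carries peer_credits and other similarly named fields:

            # show verbose NI tunables for the tcp network and pick out credit-related lines
            lnetctl net show -v 4 --net tcp | grep -E 'net type|credits'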

          Yes, they were at 512, per the explicit setting in the modprobe file. Since that was close to what the math of peer_credits * peers came out to be, I only adjusted it slightly to account for more peers on zrelic. If you think I would benefit from increasing those TCP credits further (e.g. to 1024), I'm happy to do so, but there just aren't many clients on the tcp0 network.

          ssmirnov Serguei Smirnov added a comment - edited

          Weren't orelic tcp "credits" at 512?

          charr Cameron Harr added a comment - edited

          Just a follow-up on this. It turns out zrelic had triple the number of local peers on o2ib100 (151), so I've further increased some of the numbers listed above to accommodate. New settings for credits and router buffers are below:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 4096 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 640 (16 * 34=544; up from 8)
          • Router buffers
            • (151 * 32) + (41 * 16) = 5488. I'm doubling buffer settings to  4096/32768/4096.
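
          Taken together, a hedged sketch of how the values above might look as module options on the router. The mapping of the bullet names onto ko2iblnd/ksocklnd/lnet module parameters is an assumption and should be verified against the installed release; conns_per_peer is left at its default of 1, as noted in the list:

            # /etc/modprobe.d/lnet-router.conf (sketch of the values listed above)
            # o2ib100 side
            options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64 credits=4096
            # tcp0 side (200 Gb link)
            options ksocklnd peer_credits=16 credits=640
            # router buffers: tiny/small/large
            options lnet tiny_router_buffers=4096 small_router_buffers=32768 large_router_buffers=4096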

          ssmirnov Serguei Smirnov added a comment -

          Yes, if the buffer numbers appear to be large enough, there's no need to change them. Let's hope the peer_credits/credits changes you are making can improve performance.

          charr Cameron Harr added a comment -

          Thank you for clarifying this is only for local peers. That drastically reduces the number, with 53 local peers on o2ib100 and the same 34 on tcp0. I dropped the credits number accordingly. The revised numbers would then be:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 2048 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 512 (16 * 34=544; up from 8)
          • Router buffers
            • (53 * 32) + (34 * 16) = 2240. That makes me think we could keep the current settings of 2048/16384/2048.

          By my calculations, that's only about 16GB of RAM. Since we have ~256GB on the node, is there a benefit to bumping up all these buffers or could that cause latency issues as the buffers wait to be cleared?

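          On a node with routing enabled, the router buffer pools discussed here can also be adjusted at runtime with lnetctl rather than through module options; a brief sketch using the "keep the current settings" numbers from this comment:

            # runtime adjustment of the router buffer pools (tiny/small/large)
            lnetctl set tiny_buffers 2048
            lnetctl set small_buffers 16384
            lnetctl set large_buffers 2048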

          People

            green Oleg Drokin
            defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 9
