Lustre / LU-17303

mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: None
    • Fix Version/s: None
    • Environment: server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64;
      clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64;
      TOSS 4.6-6
    • Severity: 3

    Description

      mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem.

      Attachments

        1. asp.dmesg.logs.tgz
          797 kB
        2. orelic.lnet-diag.tgz
          33 kB
        3. ruby.lnet-diag.tgz
          101 kB
        4. ruby1066.log
          33 kB

        Activity


          ssmirnov Serguei Smirnov added a comment -

          Cameron, are you planning to run any performance tests to compare "before" and "after"?
          charr Cameron Harr added a comment -

          Since setting those values yesterday, at least one of the nodes (asp4) that had been logging lots of errors cleaned up immediately and has remained clean since.


          ssmirnov Serguei Smirnov added a comment -

          I think 640 for orelic tcp credits is justified in your case. I was just confused by the "up from 8" bit in your previous comment, but that was probably just a copy-paste thing, if you can confirm that "credits" were indeed at 512 before, as the "lnetctl net show -v 4" output indicated.

          So I think it makes sense to test with the new settings you came up with.

          charr Cameron Harr added a comment -

          Yes, they were at 512, per the explicit setting in the modprobe file. Since that was close to what the math of peer_credits * peers came out to, I only adjusted it slightly to account for the larger number of peers on zrelic. If you think I would benefit from increasing those TCP credits higher (e.g. to 1024), I'm happy to do so, but there just aren't many clients on the tcp0 network.
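          For reference, here is a minimal sketch of how the live credit values and the persistent modprobe setting discussed above could be cross-checked on a router. The modprobe.d path and the example option line are illustrative assumptions, not the actual orelic configuration:

              # Show the live LNet tunables (peer_credits, credits, etc.) per network
              lnetctl net show -v 4

              # Find the persistent ksocklnd option line (exact file name may differ)
              grep -r ksocklnd /etc/modprobe.d/
              # e.g.: options ksocklnd credits=512 peer_credits=8
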
          ssmirnov Serguei Smirnov added a comment (edited) -

          Weren't orelic tcp "credits" at 512?

          charr Cameron Harr added a comment (edited) -

          Just a follow-up on this. It turns out zrelic had triple the number of local peers on o2ib100 (151), so I've further increased some of the numbers listed above to accommodate. The new settings for credits and router buffers are below (a sketch of the corresponding modprobe options follows this list):

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 4096 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 640 (16 * 34=544; up from 8)
          • Router buffers
            • (151 * 32) + (41 * 16) = 5488. I'm doubling the buffer settings to 4096/32768/4096.
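          For illustration only, the values above might translate into a modprobe options file along these lines. The file name, and the assumption that the 4096/32768/4096 triple maps to tiny/small/large router buffers in that order, are mine and not confirmed in this ticket; conns_per_peer is simply left at its default of 1:

              # Hypothetical /etc/modprobe.d/lnet-router.conf on the routers
              options ko2iblnd peer_credits=32 credits=4096 concurrent_sends=64 peercredits_hiw=16
              options ksocklnd peer_credits=16 credits=640
              options lnet tiny_router_buffers=4096 small_router_buffers=32768 large_router_buffers=4096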

          ssmirnov Serguei Smirnov added a comment -

          Yes, if the buffer numbers appear to be large enough, there's no need to change them. Let's hope the peer_credits/credits changes you are making improve performance.

          charr Cameron Harr added a comment -

          Thank you for clarifying that this is only for local peers. That drastically reduces the number, with 53 local peers on o2ib100 and the same 34 on tcp0. I dropped the credits number accordingly. The revised numbers would then be:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 2048 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 512 (16 * 34 = 544; up from 8)
          • Router buffers
            • (53 * 32) + (34 * 16) = 2240. That makes me think we could keep the current settings of 2048/16384/2048.

          By my calculations, that's only about 16 GB of RAM. Since we have ~256 GB on the node, is there a benefit to bumping up all these buffers, or could that cause latency issues as the buffers wait to be cleared?

          ssmirnov Serguei Smirnov added a comment -

          Cameron,

          Some clarifications:

          • Does orelic actually have 2200 local peers? "credits" should be calculated using the number of local peers, i.e. peers on the same LNet. If you have a client which is separated from the servers by a bunch of routers, then you'd multiply "peer_credits" by the number of routers to get "credits", because on the local LNet for which we're tuning, the client only talks to the routers.
          • The same logic applies when calculating router buffer sizes: multiply the number of local peers by the peer_credits separately for each LNet being routed, then add the results (see the worked sketch after this comment).
          • After calculating the new buffer numbers for the routers, please check that there's enough memory on the router nodes to actually create the new pools. You may need to reduce the pool sizes if there isn't enough memory.

          Thanks,

          Serguei
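          A worked sketch of the sizing rules above, plugging in the local peer counts Cameron reports in this ticket (53 o2ib peers, 34 tcp peers); the rounding choices are only examples:

              # o2ib: credits >= local o2ib peers * o2ib peer_credits = 53 * 32 = 1696  (e.g. round up to 2048)
              # tcp:  credits >= local tcp peers  * tcp  peer_credits = 34 * 16 = 544   (e.g. 640)
              # Router buffers: sum the products over the routed LNets:
              #   (53 * 32) + (34 * 16) = 1696 + 544 = 2240
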
          charr Cameron Harr added a comment (edited) -

          Thanks for the tips, Serguei. Based on what you wrote, I did a little math and came up with some suggestions.

          On each orelic router, I have around 2200 peers (based on `lnetctl peer show`; see the counting sketch after this comment). Of those, only 34 peers are on tcp, while the rest are on various o2ib networks. Following my understanding of your recommendations, I would make the following changes on orelic:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 65536 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 512 (16 * 34=544; up from 8)
          • Router buffers
            • (# o2ib peers * o2ib peer_credits) + (# tcp peers * tcp peer credits)
            • (2166 * 32) + (34 * 16) = 69856, so round down to 65536 (up from 2048/16384/2048)

          Do those bold numbers look sane? Should I set all 3 buffers to 65536? Note these orelic routers have 256 GB of RAM.

          Thanks!

          Cameron

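          As a rough illustration of the peer counting mentioned above, something along these lines could be run on a router. The exact lnetctl output format varies between Lustre versions, so treat the grep patterns as assumptions:

              # Count peers per LNet network as seen by this router
              lnetctl peer show | grep "primary nid" | grep -c "@tcp"
              lnetctl peer show | grep "primary nid" | grep -c "@o2ib100"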

          ssmirnov Serguei Smirnov added a comment -

          Cameron,

          I don't think 2.12.9 has the socklnd conns_per_peer feature, so the orelics won't have it either.

          However, there's likely still room to experiment with peer_credits and routing buffer numbers. In case you are willing to experiment, I recently put together some guidelines for LNet routing tuning: https://wiki.whamcloud.com/display/LNet/LNet+Routing+Setup+Verification+and+Tuning

          This page has some suggestions for what "peer_credits" should be, for "credits" as a function of "peer_credits" and the number of local peers, and similar suggestions for router buffer numbers.

          Thanks,

          Serguei
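          A quick, hedged way to confirm the conns_per_peer point above on a given router: if the loaded socklnd module exposes the parameter, it shows up under /sys/module; on a 2.12.9 node it should be absent.

              # Check whether the running ksocklnd supports conns_per_peer
              ls /sys/module/ksocklnd/parameters/ 2>/dev/null | grep conns_per_peer \
                  || echo "conns_per_peer not available in this socklnd"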

          People

            Assignee: Oleg Drokin (green)
            Reporter: Gian-Carlo Defazio (defazio)
            Votes: 0
            Watchers: 9