
LU-17303: mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Severity: 3
    • Environment: server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64
      clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64
      TOSS 4.6-6

    Description

      mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem.

      Attachments

        1. asp.dmesg.logs.tgz
          797 kB
        2. orelic.lnet-diag.tgz
          33 kB
        3. ruby.lnet-diag.tgz
          101 kB
        4. ruby1066.log
          33 kB

        Activity


          ssmirnov Serguei Smirnov added a comment -

          Cameron,

          Some clarifications:

          • Does orelic actually have 2200 local peers? "credits" should be calculated using the number of local peers, i.e. peers on the same LNet. If you have a client which is separated from the servers by a bunch of routers, then you'd multiply "peer_credits" by the number of routers to get "credits", because on the local LNet for which we're tuning, the client only talks to the routers.
          • The same logic applies when calculating router buffer sizes: multiply the number of local peers by the peer_credits separately for each LNet being routed, then add the results (see the worked example after this list).
          • After calculating the new buffer numbers for the routers, please check whether there's enough memory on the router nodes to actually create the new pools. You may need to reduce the pool sizes if there isn't enough memory.
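
          For illustration, a worked sketch of this calculation with made-up peer counts (treat the numbers as placeholders, not recommendations; the option names are the same ko2iblnd/ksocklnd/lnet module parameters quoted elsewhere in this ticket):

          # Suppose a router has 100 local o2ib peers and 30 local tcp peers,
          # tuned to peer_credits=32 on o2ib and peer_credits=16 on tcp.
          # Per-NI "credits" = peer_credits * number of local peers on that LNet:
          #   o2ib: 32 * 100 = 3200
          #   tcp:  16 * 30  =  480
          options ko2iblnd peer_credits=32 credits=3200
          options ksocklnd peer_credits=16 credits=480
          # Router buffer target = sum over the routed LNets of (local peers * peer_credits):
          #   (100 * 32) + (30 * 16) = 3680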

          Thanks,

          Serguei

          charr Cameron Harr added a comment - - edited

          Thanks for the tips, Serguei. Based on what you wrote, I did a little math and came up with some suggestions.

          On each orelic router, I have around 2200 peers (based on `lnetctl peer show`). Of those, only 34 peers are on tcp, while the remainder are on various o2ib networks. Following my understanding of your recommendations, I would make the following changes on orelic:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 65536 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 512 (16 * 34=544; up from 8)
          • Router buffers
            • (# o2ib peers * o2ib peer_credits) + (# tcp peers * tcp peer credits)
            • (2166 * 32) + (34 * 16) = 69856, so round down to 65536 (up from 2048/16384/2048)

          Do those bold numbers look sane? Should I set all 3 buffers to 65536? Note these orelic routers have 256 GB of RAM.

          Thanks!

          Cameron
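
          A sketch of how the values proposed above might look as module options, assuming the standard ko2iblnd/ksocklnd/lnet parameter names (peer_credits_hiw being the spelling used in the ruby configuration quoted elsewhere in this ticket); this is illustrative only, not a tested configuration:

          # Proposed orelic o2ib settings (conns_per_peer omitted: Cameron proposes leaving
          # it at 1, and 2.12.9 may not support it anyway, as noted in the next comment)
          options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64 credits=65536
          # Proposed orelic tcp settings
          options ksocklnd peer_credits=16 credits=512
          # Router buffer pools; whether all three should get the same count is the open
          # question above
          options lnet tiny_router_buffers=65536 small_router_buffers=65536 large_router_buffers=65536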


          ssmirnov Serguei Smirnov added a comment -

          Cameron,

          I don't think 2.12.9 has the socklnd conns_per_peer feature, so the orelics won't have it either.

          However, there's likely still room to experiment with peer_credits and router buffer numbers. In case you are willing to experiment, I recently put together some guidelines for LNet routing tuning: https://wiki.whamcloud.com/display/LNet/LNet+Routing+Setup+Verification+and+Tuning

          This page has some suggestions for what the "peer_credits" should be, "credits" as a function of "peer_credits" and number of local peers, and similar suggestions for router buffer numbers.
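
          For reference, the current values can be inspected, and the router buffer pools adjusted at runtime, with lnetctl; the subcommands below are the standard ones, but exact output fields vary by release, so treat this as a sketch:

          lnetctl net show -v 4             # per-NI tunables: peer_credits, peer_credits_hiw, credits, ...
          lnetctl peer show                 # enumerate the peers known to this node
          lnetctl routing show              # confirm forwarding is enabled on the routers
          lnetctl set small_buffers 16384   # example runtime change to one router buffer pool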

          Thanks,

          Serguei

           

          charr Cameron Harr added a comment -

          Serguei, both clients and orelic routers are running 2.12.9. Orelic is running an older OS and configuration management system and using the default peer credits value, whereas clusters like Ruby that are on the newer OS are setting peer credits higher. I don't know of any reason why we couldn't raise credits on orelic and will look into changing those values today.


          ssmirnov Serguei Smirnov added a comment -

          Cameron,

          My question is specifically about o2ib100 LNet.

          Ruby has two IB NIDs, one on o2ib100 and another on o2ib39. I'm guessing one of them is MLNX and the other is OPA? (You can only have one set of o2ib tunables, and it looks like the OPA set is used for both NIDs.)

          Orelic has a tcp NID and an IB NID on o2ib100, and its o2ib100 tunings do not match the settings seen on ruby's o2ib100 NID (when comparing "lnetctl net show -v 4" outputs).

          So the question is, can you remember a specific reason for orelic using lower o2ib peer_credit settings? If there's no such reason, I'd recommend matching ruby's settings on orelic. 

          Which version is orelic running? Does it make use of socklnd conns_per_peer?
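
          One way to do the comparison mentioned above for just that network (the --net filter is a standard lnetctl net show option; the output path is a placeholder):

          # Run on a ruby router and on an orelic router, then diff the two files:
          lnetctl net show --net o2ib100 -v 4 > /tmp/o2ib100-tunables-$(hostname -s).yaml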

          Thanks,

          Serguei

           


          charr Cameron Harr added a comment -

          Serguei,

          Thanks for the feedback, and sorry for the slow response. In answer to your question, orelic is running the TOSS 3 OS (based on RHEL 7), and we are using the default LNet credit settings with the exception of:

          ko2iblnd credits=1024
          ksocklnd credits=512

          Additional LNet-related tunings on orelic are:

          lustre_common.conf:options libcfs libcfs_panic_on_lbug=1
          lustre_common.conf:options libcfs libcfs_debug=0x3060580
          lustre_common.conf:options ptlrpc at_min=45
          lustre_common.conf:options ptlrpc at_max=600
          lustre_common.conf:options ksocklnd keepalive_count=100
          lustre_common.conf:options ksocklnd keepalive_idle=30
          lustre_common.conf:options lnet check_routers_before_use=1
          lustre_common.conf:options lnet lnet_peer_discovery_disabled=1
          lustre_common.lustre212.conf:options lnet lnet_retry_count=0
          lustre_common.lustre212.conf:options lnet lnet_health_sensitivity=0
          lustre_router.conf:options lnet forwarding="enabled"
          lustre_router.conf:options lnet tiny_router_buffers=2048
          lustre_router.conf:options lnet small_router_buffers=16384
          lustre_router.conf:options lnet large_router_buffers=2048

          Ruby, and nearly every other system in the center, is running the TOSS 4 OS (based on RHEL 8), and they also have extra tunings in addition to the two above. The routers on ruby are setting the following, which appears to take effect for both the IB and OPA interfaces:

          ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
          ko2iblnd credits=1024
          lustre_router.conf:options ksocklnd credits=512

          In case it's helpful, the other LNet-related settings on Ruby routers are:

          lustre_common.conf:options libcfs libcfs_panic_on_lbug=1
          lustre_common.conf:options libcfs libcfs_debug=0x3060580
          lustre_common.conf:options ptlrpc at_min=45
          lustre_common.conf:options ptlrpc at_max=600
          lustre_common.conf:options ksocklnd keepalive_count=100
          lustre_common.conf:options ksocklnd keepalive_idle=30
          lustre_common.conf:options lnet check_routers_before_use=1
          lustre_common.conf:options lnet lnet_health_sensitivity=0
          lustre_common.conf:options lnet lnet_peer_discovery_disabled=1
          lustre_router.conf:options lnet forwarding="enabled"
          lustre_router.conf:options lnet tiny_router_buffers=2048
          lustre_router.conf:options lnet small_router_buffers=16384
          lustre_router.conf:options lnet large_router_buffers=2048
          

          ssmirnov Serguei Smirnov added a comment -

          Hi Cameron,

          Before getting into the router buffer number calculations, one thing that stands out in the provided outputs is the mismatch of o2iblnd tunable values on the o2ib100 NIDs of ruby and orelic: 32/16/64 vs 8/4/8 for peer_credits/peer_credits_hiw/concurrent_sends. When they talk to each other, they will try to negotiate down to 8/4 for peer_credits/peer_credits_hiw, which may actually end up being 8/7 on ruby. I'd recommend making sure these match; 32/16/64 are the recommended settings unless there's a reason to throttle the flow in your case by using lower values. Are you using 8/4/8 on orelic to match the TCP NID settings?
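
          If the decision is to match ruby, a hypothetical orelic-side setting would be the following (same ko2iblnd parameter names ruby already uses, keeping orelic's existing credits=1024; shown only as a sketch):

          options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64 credits=1024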

          Thanks,

          Serguei.


          charr Cameron Harr added a comment -

          Serguei, sorry for the delay. I've updated the output from the client cluster routers (ruby) and the router cluster routers (orelic). Let me know if you need anything else.

          green Oleg Drokin added a comment -

          yes, the backport seems to be correct


          ofaaland Olaf Faaland added a comment -

          Oops, looks like I did that some time in the past:
          https://review.whamcloud.com/c/fs/lustre-release/+/47547

          Is my backport correct/sufficient?

          thanks


          ofaaland Olaf Faaland added a comment -

          Hi,

          Our clients are running a 2.12.9-based stack, with not many patches on top. I see https://review.whamcloud.com/c/fs/lustre-release/+/40052 is against master. Can you push it to b2_12, even if you don't plan to land it, so we get your backport and it goes through testing?

          I realize it may be a trivial backport; I ask because Eric is new and can't evaluate it on his own yet.

          Thanks


          People

            green Oleg Drokin
            defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 9
