  Lustre / LU-15234

LNet high peer reference counts inconsistent with queue


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".
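
      If a reference really is being leaked, the imbalance should show up as peers whose "refs" value sits far above any plausible queue depth. A minimal check (a sketch only: it assumes the "nid refs state last max rtr min tx min queue" column layout this version prints, and the slack of 100 is an arbitrary threshold):

       # Flag peers whose refcount is far above their reported queue depth.
       lctl get_param -n peers 2>/dev/null |
           awk '$1 ~ /@/ && $2 > $NF + 100 {print $1, "refs=" $2, "queue=" $NF}'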

      The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows 4 ruby routers with refs > 46,000 for the route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days since the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
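
      For reference, a minimal way to watch the divergence over time (again a sketch, with the same column-layout assumption as above; the one-minute interval is arbitrary) is to sample refs and queue for the affected NID and confirm that refs climbs steadily while queue keeps returning to 0:

       # Sample refs and queue for one peer once a minute.
       while sleep 60; do
           lctl get_param -n peers 2>/dev/null |
               awk -v t="$(date +%s)" '$1 == "172.19.2.24@o2ib100" {print t, "refs=" $2, "queue=" $NF}'
       done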
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).

      Attachments

        1. 2022-jun-21.tgz
          267 kB
        2. debug_refcount_01.patch
          19 kB
        3. dk.orelic2.1654723678.txt
          7 kB
        4. dk.orelic2.1654723686.txt
          2 kB
        5. dk.orelic2.1654724730.txt
          27 kB
        6. dk.orelic2.1654724740.txt
          2 kB
        7. dk.orelic2.1654724745.txt
          2 kB
        8. dk.orelic2.1654724751.txt
          2 kB
        9. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        10. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        11. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        12. lctl.version.orelic4.1637616867.txt
          0.0 kB
        13. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        14. lnet.parameters.orelic4.1637617458.txt
          2 kB
        15. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        16. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        17. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        18. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        19. orelic4.debug_refcount_01.tar.gz
          27 kB
        20. orelic4-lustre212-20211216.tgz
          2 kB
        21. params_20211213.tar.gz
          6 kB
        22. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        23. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        24. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        25. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        26. peer.show.ruby1016.1637103254.txt
          1 kB
        27. peer.show.ruby1016.1637103865.txt
          1 kB
        28. peer status orelic4 with discovery race patch v3.png
          407 kB
        29. stats.show.ruby1016.1637103254.txt
          0.6 kB
        30. stats.show.ruby1016.1637103865.txt
          0.6 kB
        31. toss-5305 queue 2021-11-15.png
          62 kB
        32. toss-5305 refs 2021-11-15.png
          66 kB


            People

              Assignee: Serguei Smirnov (ssmirnov)
              Reporter: Olaf Faaland (ofaaland)
              Votes: 0
              Watchers: 10
