
LU-15234: LNet high peer reference counts inconsistent with queue


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64,
      kernel 3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3
    • 9223372036854775807

    Description

      I believe that peer reference counts may not be decremented in some LNet error paths, or that the queue size is not accurately reported by "lctl get_param peers".
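
      For illustration only, here is a minimal userspace C sketch of the kind of error-path imbalance suspected here. The names (struct peer, peer_get, peer_put, send_to_peer) are hypothetical stand-ins, not LNet APIs; the point is that if the reference taken when a message is queued is never dropped on a failed send, "refs" climbs without bound while "queue" still drains back to 0.

       #include <stdatomic.h>
       #include <stdio.h>

       /* Hypothetical peer with a reference count and a queue depth,
        * loosely mirroring the "refs" and "queue" columns of
        * "lctl get_param peers". */
       struct peer {
               atomic_int refs;
               atomic_int queue;
       };

       static void peer_get(struct peer *p) { atomic_fetch_add(&p->refs, 1); }
       static void peer_put(struct peer *p) { atomic_fetch_sub(&p->refs, 1); }

       /* Suspected bug pattern: on failure the message still leaves the
        * queue (so "queue" returns to 0), but the reference taken at
        * enqueue time is never released, so "refs" only ever grows. */
       static int send_to_peer(struct peer *p, int fail)
       {
               peer_get(p);
               atomic_fetch_add(&p->queue, 1);

               atomic_fetch_sub(&p->queue, 1);  /* dequeued either way */
               if (fail)
                       return -1;               /* error path: peer_put() missed */

               peer_put(p);                     /* success path balances the ref */
               return 0;
       }

       int main(void)
       {
               struct peer p = { 0 };

               for (int i = 0; i < 46957; i++)
                       send_to_peer(&p, 1);

               /* Mirrors the reported symptom: refs huge, queue 0. */
               printf("refs=%d queue=%d\n", atomic_load(&p.refs),
                      atomic_load(&p.queue));
               return 0;
       }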

      The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows 4 routers on the ruby cluster with refs > 46,000 for the route to 172.19.2.24@o2ib100 even though the reported queue is 0 (the columns are nid, refs, state, last, max, rtr, min, tx, min, queue). This is just a little over 6 days after the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
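
      One way to narrow down such a leak (the attached debug_refcount_01.patch takes its own approach; this is only a sketch of the general idea, with hypothetical names) is to count gets and puts per call site, so that the unbalanced site stands out:

       #include <stdatomic.h>
       #include <stdio.h>

       /* Hypothetical per-call-site accounting: a site whose get/put
        * delta keeps growing points at the leaking path. */
       #define MAX_SITES 8

       static atomic_long site_gets[MAX_SITES];
       static atomic_long site_puts[MAX_SITES];

       struct peer { atomic_int refs; };

       #define PEER_GET(p, site) do {                  \
               atomic_fetch_add(&site_gets[site], 1);  \
               atomic_fetch_add(&(p)->refs, 1);        \
       } while (0)

       #define PEER_PUT(p, site) do {                  \
               atomic_fetch_add(&site_puts[site], 1);  \
               atomic_fetch_sub(&(p)->refs, 1);        \
       } while (0)

       static void dump_sites(void)
       {
               for (int i = 0; i < MAX_SITES; i++) {
                       long g = atomic_load(&site_gets[i]);
                       long q = atomic_load(&site_puts[i]);
                       if (g || q)
                               printf("site %d: gets=%ld puts=%ld delta=%ld\n",
                                      i, g, q, g - q);
               }
       }

       int main(void)
       {
               struct peer p = { 0 };

               PEER_GET(&p, 0);
               PEER_PUT(&p, 0);        /* site 0 balanced */
               PEER_GET(&p, 1);        /* site 1 shows delta=1: the leak */
               dump_sites();
               return 0;
       }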
      

      The ruby routers have an intermittent LNet communication problem; the fabric itself appears healthy according to several tests, so the underlying cause is still under investigation.

      Attachments

        1. toss-5305 refs 2021-11-15.png
          66 kB
        2. toss-5305 queue 2021-11-15.png
          62 kB
        3. stats.show.ruby1016.1637103865.txt
          0.6 kB
        4. stats.show.ruby1016.1637103254.txt
          0.6 kB
        5. peer status orelic4 with discovery race patch v3.png
          407 kB
        6. peer.show.ruby1016.1637103865.txt
          1 kB
        7. peer.show.ruby1016.1637103254.txt
          1 kB
        8. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        9. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        10. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        11. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        12. params_20211213.tar.gz
          6 kB
        13. orelic4-lustre212-20211216.tgz
          2 kB
        14. orelic4.debug_refcount_01.tar.gz
          27 kB
        15. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        16. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        17. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        18. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        19. lnet.parameters.orelic4.1637617458.txt
          2 kB
        20. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        21. lctl.version.orelic4.1637616867.txt
          0.0 kB
        22. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        23. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        24. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        25. dk.orelic2.1654724751.txt
          2 kB
        26. dk.orelic2.1654724745.txt
          2 kB
        27. dk.orelic2.1654724740.txt
          2 kB
        28. dk.orelic2.1654724730.txt
          27 kB
        29. dk.orelic2.1654723686.txt
          2 kB
        30. dk.orelic2.1654723678.txt
          7 kB
        31. debug_refcount_01.patch
          19 kB
        32. 2022-jun-21.tgz
          267 kB


            People

              Assignee: Serguei Smirnov (ssmirnov)
              Reporter: Olaf Faaland (ofaaland)
              Votes: 0
              Watchers: 10
