LNet high peer reference counts inconsistent with queue

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • None
    • lustre-2.12.7_2.llnl-2.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
    • 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

      The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops to 0 again. The output below shows 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days after the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
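
      The same check can be sketched in Python for offline analysis of saved output. This is a rough illustration, not part of the original report: it assumes the "peers" columns are nid, refs, state, last, max, rtr, min, tx, min, queue (the report itself only names "refs" and "queue"), and the names PeerRow, parse_peer_line, suspicious and the refs_threshold value are hypothetical.

```python
from typing import List, NamedTuple


class PeerRow(NamedTuple):
    # Assumed column order of "lctl get_param peers":
    # nid refs state last max rtr min tx min queue
    nid: str
    refs: int
    state: str
    last: int
    max_credits: int
    rtr: int
    rtr_min: int
    tx: int
    tx_min: int
    queue: int


def parse_peer_line(line: str) -> PeerRow:
    """Parse one data row (with any 'hostname:' prefix already stripped)."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
    nid, refs, state, *numbers = fields
    return PeerRow(nid, int(refs), state, *(int(n) for n in numbers))


def suspicious(rows: List[PeerRow], refs_threshold: int = 1000) -> List[PeerRow]:
    """Peers holding many references while nothing is queued for them."""
    return [r for r in rows if r.refs > refs_threshold and r.queue == 0]


# The four rows quoted in the report.
SAMPLE = """
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
 172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
 172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
"""

rows = [parse_peer_line(line) for line in SAMPLE.splitlines() if line.strip()]
for row in suspicious(rows):
    # refs + rtr is printed only as an observation about the sample data.
    print(row.nid, row.refs, row.queue, row.refs + row.rtr)
```

      Under the assumed column order, all four sample rows are flagged, and refs + rtr comes to 12 in each of them.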
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).

      Attachments

        1. 2022-jun-21.tgz
          267 kB
        2. debug_refcount_01.patch
          19 kB
        3. dk.orelic2.1654723678.txt
          7 kB
        4. dk.orelic2.1654723686.txt
          2 kB
        5. dk.orelic2.1654724730.txt
          27 kB
        6. dk.orelic2.1654724740.txt
          2 kB
        7. dk.orelic2.1654724745.txt
          2 kB
        8. dk.orelic2.1654724751.txt
          2 kB
        9. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        10. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        11. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        12. lctl.version.orelic4.1637616867.txt
          0.0 kB
        13. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        14. lnet.parameters.orelic4.1637617458.txt
          2 kB
        15. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        16. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        17. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        18. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        19. orelic4.debug_refcount_01.tar.gz
          27 kB
        20. orelic4-lustre212-20211216.tgz
          2 kB
        21. params_20211213.tar.gz
          6 kB
        22. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        23. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        24. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        25. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        26. peer.show.ruby1016.1637103254.txt
          1 kB
        27. peer.show.ruby1016.1637103865.txt
          1 kB
        28. peer status orelic4 with discovery race patch v3.png
          407 kB
        29. stats.show.ruby1016.1637103254.txt
          0.6 kB
        30. stats.show.ruby1016.1637103865.txt
          0.6 kB
        31. toss-5305 queue 2021-11-15.png
          62 kB
        32. toss-5305 refs 2021-11-15.png
          66 kB

          Activity

            [LU-15234] LNet high peer reference counts inconsistent with queue
            pjones Peter Jones added a comment -

            Landed for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48566/
            Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 950e59ced18d49e9fdd31c1e9de43b89a0bc1c1d

            ssmirnov Serguei Smirnov added a comment -

            No, I would prefer to address the comments and land this patch. Even though it is not fixing anything for this ticket (it is a debugging enhancement), it happens to have been created as a result of investigating this issue.

            pjones Peter Jones added a comment -

            I think it is really a call for ssmirnov. Do you still think that there is value in landing https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ or do you intend to abandon it in light of the review comments?

            ofaaland Olaf Faaland added a comment -

            >  The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ ?

            No opinion from me.

            Thanks for getting this fixed.

            pjones Peter Jones added a comment -

            The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/?


            gerrit Gerrit Updater added a comment -

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48566
            Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: dc704df0be48fc9f933e6f2c6fede3c5991a951a

            pjones Peter Jones added a comment -

            Yes, I think that we can mark this ticket as a duplicate of LU-12739 once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.

            ofaaland Olaf Faaland added a comment -

            As far as I'm concerned, this will be resolved when the patch lands to b2_12.  Do you agree?  If so, what is the plan for that?

            thanks


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.

            Olaf

            People

              ssmirnov Serguei Smirnov
              ofaaland Olaf Faaland