
LNet high peer reference counts inconsistent with queue

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • None
    • lustre-2.12.7_2.llnl-2.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
    • 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

      The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops to 0 again. The output below shows 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is a little over 6 days after the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
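
The check described above (a high "refs" value while "queue" is 0) can be sketched in Python. This is a minimal sketch, not part of the ticket. It assumes each data line has the NID first, the "refs" value second, and the "queue" value last, as in the output above, and that any pdsh host prefix has already been stripped (the sed step in the command does this). The names PeerLine, parse_peer_line, and suspicious are hypothetical. The threshold of 20 mirrors the awk filter, and the 6-day uptime is the approximate figure from the description, so the per-day rate is only a rough estimate.

```python
from typing import List, NamedTuple


class PeerLine(NamedTuple):
    """Subset of one `lctl get_param peers` data line (hypothetical type)."""
    nid: str
    refs: int
    queue: int


def parse_peer_line(line: str) -> PeerLine:
    """Parse one data line.

    Assumption: NID is the first field, refs the second, queue the last,
    matching the sample output in the ticket. A pdsh "host:" prefix is
    not handled here.
    """
    fields = line.split()
    return PeerLine(nid=fields[0], refs=int(fields[1]), queue=int(fields[-1]))


def suspicious(peers: List[PeerLine], refs_threshold: int = 20) -> List[PeerLine]:
    """Return peers whose refs exceed the threshold while queue is 0."""
    return [p for p in peers if p.refs > refs_threshold and p.queue == 0]


# Sample lines copied from the ticket description.
SAMPLE = """\
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
 172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
 172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
"""

if __name__ == "__main__":
    peers = [parse_peer_line(line) for line in SAMPLE.splitlines() if line.strip()]
    uptime_days = 6  # approximate uptime quoted in the description
    for p in suspicious(peers):
        rate = p.refs / uptime_days
        print(f"{p.nid} refs={p.refs} queue={p.queue} approx {rate:.0f} refs/day")
```

Sampling the same output at two points in time and comparing the refs values would give a better growth estimate than dividing by uptime.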
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).

      Attachments

        1. toss-5305 refs 2021-11-15.png
          66 kB
        2. toss-5305 queue 2021-11-15.png
          62 kB
        3. stats.show.ruby1016.1637103865.txt
          0.6 kB
        4. stats.show.ruby1016.1637103254.txt
          0.6 kB
        5. peer status orelic4 with discovery race patch v3.png
          407 kB
        6. peer.show.ruby1016.1637103865.txt
          1 kB
        7. peer.show.ruby1016.1637103254.txt
          1 kB
        8. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        9. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        10. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        11. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        12. params_20211213.tar.gz
          6 kB
        13. orelic4-lustre212-20211216.tgz
          2 kB
        14. orelic4.debug_refcount_01.tar.gz
          27 kB
        15. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        16. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        17. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        18. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        19. lnet.parameters.orelic4.1637617458.txt
          2 kB
        20. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        21. lctl.version.orelic4.1637616867.txt
          0.0 kB
        22. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        23. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        24. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        25. dk.orelic2.1654724751.txt
          2 kB
        26. dk.orelic2.1654724745.txt
          2 kB
        27. dk.orelic2.1654724740.txt
          2 kB
        28. dk.orelic2.1654724730.txt
          27 kB
        29. dk.orelic2.1654723686.txt
          2 kB
        30. dk.orelic2.1654723678.txt
          7 kB
        31. debug_refcount_01.patch
          19 kB
        32. 2022-jun-21.tgz
          267 kB

          Activity

            [LU-15234] LNet high peer reference counts inconsistent with queue
            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48566/
            Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 950e59ced18d49e9fdd31c1e9de43b89a0bc1c1d

            ssmirnov Serguei Smirnov added a comment -

            No, I would prefer to address the comments and land this patch. Even though it is not fixing anything for this ticket (it is a debugging enhancement), it happens to have been created as a result of investigating this issue.

            pjones Peter Jones added a comment -

            I think it is really a call for ssmirnov. Do you still think that there is value in landing https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ or do you intend to abandon it in light of the review comments?

            ofaaland Olaf Faaland added a comment -

            >  The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ ?

            No opinion from me.

            Thanks for getting this fixed.

            pjones Peter Jones added a comment -

            The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/?

            gerrit Gerrit Updater added a comment -

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48566
            Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: dc704df0be48fc9f933e6f2c6fede3c5991a951a

            pjones Peter Jones added a comment -

            Yes, I think that we can mark this ticket as a duplicate of LU-12739 once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.

            ofaaland Olaf Faaland added a comment -

            As far as I'm concerned, this will be resolved when the patch lands to b2_12. Do you agree? If so, what is the plan for that?

            Thanks


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.

            Olaf

            People

              ssmirnov Serguei Smirnov
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 10
