Lustre / LU-15234

LNet high peer reference counts inconsistent with queue

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Labels: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

      The reference counts reported as "refs" by "lctl get_param peers" increase linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows 4 routers on ruby with refs > 46,000 for the route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days after the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).
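      To confirm the pattern over time (refs climbing steadily while queue keeps returning to 0), a sampling loop along the following lines can be left running on a router. This is only a sketch: it assumes the 2.12-era column layout of the output above, with "refs" in the second field and "queue" in the last field of each peer line.

       #!/bin/bash
       # Sample "refs" and "queue" for one peer NID once a minute.
       # Assumed column order (from the 2.12-era peers output above):
       #   nid  refs  state  last  max  rtr  min  tx  min  queue
       NID=${1:-172.19.2.24@o2ib100}
       while true; do
           ts=$(date '+%F %T')
           lctl get_param -n peers 2>/dev/null |
               awk -v nid="$NID" -v ts="$ts" \
                   '$1 == nid { printf "%s refs=%s queue=%s\n", ts, $2, $NF }'
           sleep 60
       done

      Run under pdsh (as in the command above), the same loop gives a per-router view; on a router without the leak, refs should track queue rather than growing without bound.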

      Attachments

        1. 2022-jun-21.tgz
          267 kB
        2. debug_refcount_01.patch
          19 kB
        3. dk.orelic2.1654723678.txt
          7 kB
        4. dk.orelic2.1654723686.txt
          2 kB
        5. dk.orelic2.1654724730.txt
          27 kB
        6. dk.orelic2.1654724740.txt
          2 kB
        7. dk.orelic2.1654724745.txt
          2 kB
        8. dk.orelic2.1654724751.txt
          2 kB
        9. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        10. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        11. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        12. lctl.version.orelic4.1637616867.txt
          0.0 kB
        13. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        14. lnet.parameters.orelic4.1637617458.txt
          2 kB
        15. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        16. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        17. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        18. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        19. orelic4.debug_refcount_01.tar.gz
          27 kB
        20. orelic4-lustre212-20211216.tgz
          2 kB
        21. params_20211213.tar.gz
          6 kB
        22. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        23. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        24. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        25. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        26. peer.show.ruby1016.1637103254.txt
          1 kB
        27. peer.show.ruby1016.1637103865.txt
          1 kB
        28. peer status orelic4 with discovery race patch v3.png
          407 kB
        29. stats.show.ruby1016.1637103254.txt
          0.6 kB
        30. stats.show.ruby1016.1637103865.txt
          0.6 kB
        31. toss-5305 queue 2021-11-15.png
          62 kB
        32. toss-5305 refs 2021-11-15.png
          66 kB

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Yes, I think that we can mark this ticket as a duplicate of LU-12739 once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.

            ofaaland Olaf Faaland added a comment -

            As far as I'm concerned, this will be resolved when the patch lands on b2_12. Do you agree? If so, what is the plan for that?

            thanks


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.

            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            2.12.9 + change 48190 held up well overnight, which is far longer than we have needed to wait for symptoms in the past. If you can get someone to perform a second review on the patch in gerrit, that would be great.

            I'll deploy more widely and update here early next week.

            thanks,
            Olaf


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I tested 2.12.9 + change 48190 today and results so far are promising. I'll test it further and post here in the next couple of days.


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I ported Chris's fix for LU-12739 to b2_12: https://review.whamcloud.com/#/c/48190/

            Please give this patch a try. It aims to eliminate a race condition with effects potentially similar to what is seen in the coredump you provided.

            Thanks,

            Serguei.

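            For anyone wanting to reproduce the test build, the usual Gerrit workflow for pulling change 48190 onto a local b2_12 tree looks roughly like the following; the patch-set number <PS> is a placeholder, and the exact fetch URL and ref can be copied from the download box on the review page.

             # Check out b2_12 and cherry-pick Gerrit change 48190 (patch set <PS>).
             git clone git://git.whamcloud.com/fs/lustre-release.git
             cd lustre-release
             git checkout b2_12
             git fetch https://review.whamcloud.com/fs/lustre-release \
                 refs/changes/90/48190/<PS>
             git cherry-pick FETCH_HEAD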

            ssmirnov Serguei Smirnov added a comment -

            Chris,

            Yes indeed, it looks very much like LU-12739

            I'll port these changes.

            Thanks,

            Serguei.

            hornc Chris Horn added a comment -

            Sounds like https://jira.whamcloud.com/browse/LU-12739 ?

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            While examining the core I found that messages causing the delay are waiting to be sent: they are listed on lp_dc_pendq of the destination peer.

            At the same time, the destination peer is not queued to be discovered, so it appears that there's no good reason for the messages to be delayed.

            I pushed a test patch in order to rule out a race condition which somehow enables a thread to queue a message for a peer which is not (or no longer) going to be discovered. The patch attempts to recognize this situation on discovery completion, print an error, and handle any messages which are still pending. This should help locate the race condition if it is actually occurring. If this is the only cause, then with this patch we should see the error message "Peer X msg list not empty on disc comp" and no further refcount increase.

            Otherwise, I'll have to look for other possible causes.

            Thanks,

            Serguei.

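            While the test patch is deployed, a simple way to catch the new error is to poll the console log for it and snapshot the LNet debug buffer when it first fires. This is only a sketch, not part of the patch; it assumes the message reaches the console log and reuses the dk.<host>.<epoch>.txt naming of the attachments on this ticket.

             #!/bin/bash
             # Poll for the test patch's "msg list not empty on disc comp" error;
             # when it first appears, dump the LNet debug log for later analysis.
             while true; do
                 if dmesg | grep -q "msg list not empty on disc comp"; then
                     echo "$(date '+%F %T') disc comp error seen on $(hostname)"
                     lctl dk > "dk.$(hostname).$(date +%s).txt" 2>/dev/null
                     break    # one snapshot is enough
                 fi
                 sleep 300
             done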

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48163
            Subject: LU-15234 lnet: test for race when completing discovery
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 0eb36b2ace98b0c57595098a3a6d9f5de8e6045c


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I had too many distractions and haven't finished looking at the core yet.

            Basically, I believe that what I see in the core so far does confirm the idea that messages are not getting finalized, but I still haven't understood why. In the LNet layer the number of queued messages on the problem peer looks consistent with the high refcount, but I still need to dig more at the LND level and examine the message queues there.

            Thanks,

            Serguei.

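            For the core-dump side of this, the pending-message count and the peer refcount can be compared directly in a crash(8) session. A rough sketch, assuming the 2.12 field names lp_primary_nid, lp_refcount, lp_dc_pendq and lnet_msg.msg_list (verify against the debuginfo for the running build); <peer_addr> stands in for the struct lnet_peer address of the problem peer, and <pendq_addr> for the address of its lp_dc_pendq list head as printed by the struct command.

             # crash vmlinux vmcore
             crash> mod -s lnet
             crash> struct lnet_peer.lp_primary_nid,lp_refcount,lp_dc_pendq <peer_addr>
             crash> list -H <pendq_addr> -o lnet_msg.msg_list | wc -l

            If, as described above, each message still queued on lp_dc_pendq is what is pinning the peer, the list length should roughly match the excess of "refs" over "queue".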

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: