
LU-15234: LNet high peer reference counts inconsistent with queue

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64,
      kernel 3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

      The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days since the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).
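
      For reference, a minimal sketch of how "refs" and "queue" could be sampled over time for a single peer on one router. The NID matches the output above; the interval and output format are arbitrary, and the field positions follow the peers table shown (2nd field is refs, last field is queue):

       # Append one "<epoch> <refs> <queue>" line per minute for the suspect peer
       while true; do
           lctl get_param -n peers |
               awk -v now="$(date +%s)" '$1 == "172.19.2.24@o2ib100" {print now, $2, $10}'
           sleep 60
       done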

      Attachments

        1. 2022-jun-21.tgz
          267 kB
        2. debug_refcount_01.patch
          19 kB
        3. dk.orelic2.1654723678.txt
          7 kB
        4. dk.orelic2.1654723686.txt
          2 kB
        5. dk.orelic2.1654724730.txt
          27 kB
        6. dk.orelic2.1654724740.txt
          2 kB
        7. dk.orelic2.1654724745.txt
          2 kB
        8. dk.orelic2.1654724751.txt
          2 kB
        9. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        10. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        11. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        12. lctl.version.orelic4.1637616867.txt
          0.0 kB
        13. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        14. lnet.parameters.orelic4.1637617458.txt
          2 kB
        15. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        16. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        17. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        18. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        19. orelic4.debug_refcount_01.tar.gz
          27 kB
        20. orelic4-lustre212-20211216.tgz
          2 kB
        21. params_20211213.tar.gz
          6 kB
        22. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        23. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        24. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        25. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        26. peer.show.ruby1016.1637103254.txt
          1 kB
        27. peer.show.ruby1016.1637103865.txt
          1 kB
        28. peer status orelic4 with discovery race patch v3.png
          407 kB
        29. stats.show.ruby1016.1637103254.txt
          0.6 kB
        30. stats.show.ruby1016.1637103865.txt
          0.6 kB
        31. toss-5305 queue 2021-11-15.png
          62 kB
        32. toss-5305 refs 2021-11-15.png
          66 kB


          Activity

            [LU-15234] LNet high peer reference counts inconsistent with queue

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            Do you have any updates?
            thanks,
            Olaf

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I found these files:

            -rw-r--r--  1 sdsmirnov  staff   469346936 12 Jul 11:36 kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
            -rw-r--r--  1 sdsmirnov  staff    65354176 12 Jul 11:37 kernel-debuginfo-common-x86_64-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
            -rw-r--r--  1 sdsmirnov  staff    19370216 12 Jul 11:37 lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm
            -rw-r--r--  1 sdsmirnov  staff  1270395238 12 Jul 11:34 vmcore
            -rw-r--r--  1 sdsmirnov  staff      148855 12 Jul 11:34 vmcore-dmesg.txt

            and copied them over to my machine. I'll take a look and keep you updated.

            Thanks,

            Serguei.
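
            For reference, a minimal sketch of opening such a dump with the crash utility, assuming the debuginfo rpms are unpacked next to the vmcore rather than installed (exact paths depend on the packaging):

                # Unpack the matching debuginfo rpms alongside the vmcore
                rpm2cpio kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm | cpio -idm
                rpm2cpio lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm | cpio -idm

                # Open the dump against the matching vmlinux
                crash usr/lib/debug/lib/modules/3.10.0-1160.66.1.1chaos.ch6.x86_64/vmlinux vmcore

                # Inside crash, load the LNet module debug symbols before inspecting peer state;
                # locate lnet.ko.debug first, since its path varies with the packaging:
                #   find usr/lib/debug -name 'lnet.ko.debug'
                #   crash> mod -s lnet <path-to-lnet.ko.debug>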

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            I've uploaded the dump and debuginfos via ftp. Please confirm you received them.
            thanks,
            Olaf

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I examined the traces you provided. It still looks like some messages are just not getting finalized. One idea I have is that they might have gotten stuck in the resend queue somehow.

            Could you please give me access to the crash dump if you still have it, along with the debuginfo rpms?

            Thanks,

            Serguei.
            ofaaland Olaf Faaland added a comment -

            Hi Serguei, do you have any update on this?
            Thanks,
            Olaf

            ofaaland Olaf Faaland added a comment - edited

            Hi Serguei,

            I was able to gather detailed counts over time, remove the affected node from all routes so no messages should have been coming in to be routed, attempt to stop lnet, and obtain a crash dump. The node that ran 2.12 with the debug patch was "orelic2".

            The detailed counts and debug logs are attached:  2022-jun-21.tgz

            To provide context:
            2022-06-21 14:36:56 LNet started with debug patch
            2022-06-21 14:55:00 Removed routes on other clusters where gateway == orelic2. (time approximate)
            2022-06-21 15:21:34 issued "lnetctl lnet unconfigure"
            2022-06-21 15:26:21 crashed orelic2 to gather the dump

            The timestamps on the files in the tarball will tell you when counts, debug logs, etc. were gathered.

            Before removing routes, the refcounts continued to climb.
            After removing routes, the refcounts plateaued at 82.
            The "lnetctl lnet unconfigure" command hung.

            I've also included debug logs for the period. I changed the debug mask to -1 after removing routes but before issuing "lnetctl lnet unconfigure".

            I can send you the crash dump.

            Thanks,
            Olaf

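            For reference, a minimal sketch of the capture sequence described above, as run on the router being drained (output file names are illustrative, and the crash step assumes kdump is configured):

                # Widen the debug mask once routing traffic has been redirected away
                lctl set_param debug=-1

                # Attempt to stop LNet (backgrounded here, since in this case it hung)
                lnetctl lnet unconfigure &

                # Dump the kernel debug buffer for later analysis
                lctl debug_kernel /tmp/dk.$(hostname).$(date +%s).txt

                # Force a crash dump for offline inspection
                echo c > /proc/sysrq-trigger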

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            After discussing with ashehata, I wonder if we could revisit testing with the "detailed peer refcount summary" patch https://review.whamcloud.com/46364

            I'd like to clarify the following:

            1) How do the "detailed" counts change over time (for a peer whose refcount is steadily increasing)? This means taking more than one snapshot of the lnetctl output, e.g. at refcount = 100 and refcount = 500.

            2) The increasing peer refcount appears to be associated with a negative number of router credits, i.e. we're slow routing messages from this peer. What happens if the corresponding route is removed from the peer?

            Not sure if it is easy enough to arrange, but for "2" it should be possible to remove the route dynamically using lnetctl. After the route is removed, we should stop receiving traffic from this peer. We would finish forwarding whatever messages we had queued up and rtr_credits should return to a normal value. In order to avoid issues with "symmetry", it would be best to remove the route from all peers. Then we can check what happened to the peer refcount: dump the "detailed" counts again and try to delete the peer using lnetctl (this won't work if there's actually a leak). Maybe dump a core, too.

            Thanks,

            Serguei.
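
            A minimal sketch of that proposed sequence, using the peer NID from the description (the destination net and gateway NID in the route command are hypothetical, and the output file names are illustrative):

                # On the router: snapshot the detailed per-peer counts while the refcount is climbing
                lnetctl peer show -v 4 --nid 172.19.2.24@o2ib100 > peer.before.yaml

                # On each cluster routing through the affected router: remove the route entry
                # whose gateway is that router, so no further traffic arrives to be forwarded
                lnetctl route del --net o2ib5 --gateway 172.19.2.101@o2ib100

                # Back on the router, once rtr credits recover: snapshot again and try to
                # delete the peer (expected to fail if references are actually leaked)
                lnetctl peer show -v 4 --nid 172.19.2.24@o2ib100 > peer.after.yaml
                lnetctl peer del --prim_nid 172.19.2.24@o2ib100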

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            So far, from looking at the logs you provided, I haven't seen any output with abnormal stats for any of the peers you dumped, which may mean that the problem is not reflected at the LND level.

            If you do reproduce again, you could try using "lnetctl peer show -v 4" (vs. just "lnetctl peer show"). To reduce the amount of output this produces, you can use the "--nid" option to dump a specific peer only.

            In the meantime, I'm looking at how the instrumentation can be extended to yield more useful info.

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            I don't have a record of which peer NID was given as the argument for the above debug sessions. Do you need me to reproduce this and keep track of that?
            thanks,
            Olaf
            ofaaland Olaf Faaland added a comment - edited

            Hi Serguei,

            We reproduced the issue on orelic2, with https://review.whamcloud.com/47460, under Lustre 2.12.8. 

            There were 4 peers with high refcounts, with NIDs 172.16.70.6[2-5]@tcp. I captured the debug information multiple times for some of those peers, but I may not be able to identify which peer a set of debug output is for. I'll post that mapping if I find it. The debug information, as well as the output of "lnetctl peer show --details", is attached.

            lnetctl.peer.show.orelic2.1654723542.txt
            lnetctl.peer.show.orelic2.1654724780.txt
            dk.orelic2.1654723678.txt
            dk.orelic2.1654723686.txt
            dk.orelic2.1654724730.txt
            dk.orelic2.1654724740.txt
            dk.orelic2.1654724745.txt
            dk.orelic2.1654724751.txt

            Thanks,
            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Routers with this symptom recently have FW 16.29.2002.  We don't have any routers running xxx.30.xxxx.

            thanks


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 10
