
LU-15234: LNet high peer reference counts inconsistent with queue

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64,
      kernel 3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

      The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days since the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).
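
      For reference, a minimal sketch of how "refs" and "queue" could be sampled over time for a single peer on one router. The NID matches the output above; the interval and output format are arbitrary, and the field positions follow the peers table shown (2nd field is refs, last field is queue):

       # Append one "<epoch> <refs> <queue>" line per minute for the suspect peer
       while true; do
           lctl get_param -n peers |
               awk -v now="$(date +%s)" '$1 == "172.19.2.24@o2ib100" {print now, $2, $10}'
           sleep 60
       done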

      Attachments

        1. 2022-jun-21.tgz
          267 kB
        2. debug_refcount_01.patch
          19 kB
        3. dk.orelic2.1654723678.txt
          7 kB
        4. dk.orelic2.1654723686.txt
          2 kB
        5. dk.orelic2.1654724730.txt
          27 kB
        6. dk.orelic2.1654724740.txt
          2 kB
        7. dk.orelic2.1654724745.txt
          2 kB
        8. dk.orelic2.1654724751.txt
          2 kB
        9. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        10. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        11. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        12. lctl.version.orelic4.1637616867.txt
          0.0 kB
        13. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        14. lnet.parameters.orelic4.1637617458.txt
          2 kB
        15. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        16. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        17. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        18. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        19. orelic4.debug_refcount_01.tar.gz
          27 kB
        20. orelic4-lustre212-20211216.tgz
          2 kB
        21. params_20211213.tar.gz
          6 kB
        22. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        23. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        24. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        25. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        26. peer.show.ruby1016.1637103254.txt
          1 kB
        27. peer.show.ruby1016.1637103865.txt
          1 kB
        28. peer status orelic4 with discovery race patch v3.png
          407 kB
        29. stats.show.ruby1016.1637103254.txt
          0.6 kB
        30. stats.show.ruby1016.1637103865.txt
          0.6 kB
        31. toss-5305 queue 2021-11-15.png
          62 kB
        32. toss-5305 refs 2021-11-15.png
          66 kB


          Activity

            [LU-15234] LNet high peer reference counts inconsistent with queue

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            Do you have any updates?
            thanks,
            Olaf

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I found these files:

            -rw-r--r--  1 sdsmirnov  staff   469346936 12 Jul 11:36 kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
            -rw-r--r--  1 sdsmirnov  staff    65354176 12 Jul 11:37 kernel-debuginfo-common-x86_64-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
            -rw-r--r--  1 sdsmirnov  staff    19370216 12 Jul 11:37 lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm
            -rw-r--r--  1 sdsmirnov  staff  1270395238 12 Jul 11:34 vmcore
            -rw-r--r--  1 sdsmirnov  staff      148855 12 Jul 11:34 vmcore-dmesg.txt

            and copied them over to my machine. I'll take a look and keep you updated.

            Thanks,

            Serguei.
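
            For reference, a minimal sketch of opening such a dump with the crash utility, assuming the debuginfo rpms are unpacked next to the vmcore rather than installed (exact paths depend on the packaging):

                # Unpack the matching debuginfo rpms alongside the vmcore
                rpm2cpio kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm | cpio -idm
                rpm2cpio lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm | cpio -idm

                # Open the dump against the matching vmlinux
                crash usr/lib/debug/lib/modules/3.10.0-1160.66.1.1chaos.ch6.x86_64/vmlinux vmcore

                # Inside crash, load the LNet module debug symbols before inspecting peer state;
                # locate lnet.ko.debug first, since its path varies with the packaging:
                #   find usr/lib/debug -name 'lnet.ko.debug'
                #   crash> mod -s lnet <path-to-lnet.ko.debug>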

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            I've uploaded the dump and debuginfos via ftp. Please confirm you received them.
            thanks,
            Olaf

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I examined the traces you provided. It still looks like some messages are just not getting finalized. One idea I have is that they might have gotten stuck in the resend queue somehow.

            Could you please give me access to the crash dump if you still have it, along with the debuginfo rpms?

            Thanks,

            Serguei.
            ofaaland Olaf Faaland added a comment -

            Hi Serguei, do you have any update on this?
            Thanks,
            Olaf

            ofaaland Olaf Faaland added a comment - edited

            Hi Serguei,

            I was able to gather detailed counts over time, remove the affected node from all routes so no messages should have been coming in to be routed, attempt to stop lnet, and obtain a crash dump. The node that ran 2.12 with the debug patch was "orelic2".

            The detailed counts and debug logs are attached:  2022-jun-21.tgz

            To provide context:
            2022-06-21 14:36:56 LNet started with debug patch
            2022-06-21 14:55:00 Removed routes on other clusters where gateway == orelic2. (time approximate)
            2022-06-21 15:21:34 issued "lnetctl lnet unconfigure"
            2022-06-21 15:26:21 crashed orelic2 to gather the dump

            The timestamps on the files in the tarball will tell you when counts, debug logs, etc. were gathered.

            Before removing routes, the refcounts continued to climb.
            After removing routes, the refcounts plateaued at 82.
            The "lnetctl lnet unconfigure" command hung.

            I've also included debug logs for the period. I changed the debug mask to -1 after removing routes but before issuing "lnetctl lnet unconfigure".

            I can send you the crash dump.

            Thanks,
            Olaf

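            For reference, a minimal sketch of the capture sequence described above, as run on the router being drained (output file names are illustrative, and the crash step assumes kdump is configured):

                # Widen the debug mask once routing traffic has been redirected away
                lctl set_param debug=-1

                # Attempt to stop LNet (backgrounded here, since in this case it hung)
                lnetctl lnet unconfigure &

                # Dump the kernel debug buffer for later analysis
                lctl debug_kernel /tmp/dk.$(hostname).$(date +%s).txt

                # Force a crash dump for offline inspection
                echo c > /proc/sysrq-trigger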

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            After discussing with ashehata, I wonder if we could revisit testing with the "detailed peer refcount summary" patch https://review.whamcloud.com/46364

            I'd like to clarify the following:

            1) How do the "detailed" counts change over time (for a peer whose refcount is steadily increasing)? This means taking more than one snapshot of the lnetctl output, e.g. at refcount = 100 and refcount = 500.

            2) The increasing peer refcount appears to be associated with a negative number of router credits, i.e. we're slow routing messages from this peer. What happens if the corresponding route is removed from the peer?

            Not sure if it is easy enough to arrange, but for "2" it should be possible to remove the route dynamically using lnetctl. After the route is removed, we should stop receiving traffic from this peer. We would finish forwarding whatever messages we had queued up and rtr_credits should return to a normal value. In order to avoid issues with "symmetry", it would be best to remove the route from all peers. Then we can check what happened to the peer refcount: dump the "detailed" counts again and try to delete the peer using lnetctl (this won't work if there's actually a leak). Maybe dump a core, too.

            Thanks,

            Serguei.
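
            A minimal sketch of that proposed sequence, using the peer NID from the description (the destination net and gateway NID in the route command are hypothetical, and the output file names are illustrative):

                # On the router: snapshot the detailed per-peer counts while the refcount is climbing
                lnetctl peer show -v 4 --nid 172.19.2.24@o2ib100 > peer.before.yaml

                # On each cluster routing through the affected router: remove the route entry
                # whose gateway is that router, so no further traffic arrives to be forwarded
                lnetctl route del --net o2ib5 --gateway 172.19.2.101@o2ib100

                # Back on the router, once rtr credits recover: snapshot again and try to
                # delete the peer (expected to fail if references are actually leaked)
                lnetctl peer show -v 4 --nid 172.19.2.24@o2ib100 > peer.after.yaml
                lnetctl peer del --prim_nid 172.19.2.24@o2ib100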

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            So far, from looking at the logs you provided, I haven't seen any output with abnormal stats for any of the peers you dumped, which may mean that the problem is not reflected at the LND level.

            If you do reproduce again, you could try using "lnetctl peer show -v 4" (vs. just "lnetctl peer show"). To reduce the amount of output this produces, you can use the "--nid" option to dump a specific peer only.

            In the meantime, I'm looking at how the instrumentation can be extended to yield more useful info.

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            I don't have a record of which peer NID was given as the argument for the above debug sessions. Do you need me to reproduce this and keep track of that?
            thanks,
            Olaf
            ofaaland Olaf Faaland added a comment - edited

            Hi Serguei,

            We reproduced the issue on orelic2, with https://review.whamcloud.com/47460, under Lustre 2.12.8. 

            There were 4 peers with high refcounts, with NIDs 172.16.70.6[2-5]@tcp. I captured the debug information multiple times for some of those peers, but I may not be able to identify which peer a set of debug output is for. I'll post that mapping if I find it. The debug information, as well as the output of "lnetctl peer show --details", is attached.

            lnetctl.peer.show.orelic2.1654723542.txt
            lnetctl.peer.show.orelic2.1654724780.txt
            dk.orelic2.1654723678.txt
            dk.orelic2.1654723686.txt
            dk.orelic2.1654724730.txt
            dk.orelic2.1654724740.txt
            dk.orelic2.1654724745.txt
            dk.orelic2.1654724751.txt

            Thanks,
            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Routers with this symptom recently have FW 16.29.2002.  We don't have any routers running xxx.30.xxxx.

            thanks


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 10
