Lustre / LU-15234

LNet high peer reference counts inconsistent with queue

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Labels: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3

    Description

      I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

      The reference counts reported as "refs" by "lctl get_param peers" increase linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows 4 routers on ruby with refs > 46,000 for the route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days after the ruby routers were rebooted during an update.

      [root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
       172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
       172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
       172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
       172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
      

      The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).
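      To confirm the pattern over time (refs climbing steadily while queue keeps returning to 0), a sampling loop along the following lines can be left running on a router. This is only a sketch: it assumes the 2.12-era column layout of the output above, with "refs" in the second field and "queue" in the last field of each peer line.

       #!/bin/bash
       # Sample "refs" and "queue" for one peer NID once a minute.
       # Assumed column order (from the 2.12-era peers output above):
       #   nid  refs  state  last  max  rtr  min  tx  min  queue
       NID=${1:-172.19.2.24@o2ib100}
       while true; do
           ts=$(date '+%F %T')
           lctl get_param -n peers 2>/dev/null |
               awk -v nid="$NID" -v ts="$ts" \
                   '$1 == nid { printf "%s refs=%s queue=%s\n", ts, $2, $NF }'
           sleep 60
       done

      Run under pdsh (as in the command above), the same loop gives a per-router view; on a router without the leak, refs should track queue rather than growing without bound.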

      Attachments

        1. 2022-jun-21.tgz
          267 kB
        2. debug_refcount_01.patch
          19 kB
        3. dk.orelic2.1654723678.txt
          7 kB
        4. dk.orelic2.1654723686.txt
          2 kB
        5. dk.orelic2.1654724730.txt
          27 kB
        6. dk.orelic2.1654724740.txt
          2 kB
        7. dk.orelic2.1654724745.txt
          2 kB
        8. dk.orelic2.1654724751.txt
          2 kB
        9. dk.ruby1016.1637103254.txt.bz2
          8.58 MB
        10. ko2iblnd.parameters.orelic4.1637617473.txt
          1 kB
        11. ksocklnd.parameters.orelic4.1637617487.txt
          1 kB
        12. lctl.version.orelic4.1637616867.txt
          0.0 kB
        13. lctl.version.ruby1016.1637616519.txt
          0.0 kB
        14. lnet.parameters.orelic4.1637617458.txt
          2 kB
        15. lnetctl.net-show.orelic4.1637616889.txt
          2 kB
        16. lnetctl.net-show.ruby1016.1637616206.txt
          5 kB
        17. lnetctl.peer.show.orelic2.1654723542.txt
          1.20 MB
        18. lnetctl.peer.show.orelic2.1654724780.txt
          1.20 MB
        19. orelic4.debug_refcount_01.tar.gz
          27 kB
        20. orelic4-lustre212-20211216.tgz
          2 kB
        21. params_20211213.tar.gz
          6 kB
        22. peer.show.172.16.70.62_at_tcp.orelic4.1644951836
          2 kB
        23. peer.show.172.16.70.63_at_tcp.orelic4.1644951836
          2 kB
        24. peer.show.172.16.70.64_at_tcp.orelic4.1644951836
          2 kB
        25. peer.show.172.16.70.65_at_tcp.orelic4.1644951836
          2 kB
        26. peer.show.ruby1016.1637103254.txt
          1 kB
        27. peer.show.ruby1016.1637103865.txt
          1 kB
        28. peer status orelic4 with discovery race patch v3.png
          407 kB
        29. stats.show.ruby1016.1637103254.txt
          0.6 kB
        30. stats.show.ruby1016.1637103865.txt
          0.6 kB
        31. toss-5305 queue 2021-11-15.png
          62 kB
        32. toss-5305 refs 2021-11-15.png
          66 kB

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Yes, I think that we can mark this ticket as a duplicate of LU-12739 once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.

            ofaaland Olaf Faaland added a comment -

            As far as I'm concerned, this will be resolved when the patch lands on b2_12. Do you agree? If so, what is the plan for that?

            thanks


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.

            Olaf

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            2.12.9 + change 48190 held up well overnight, which is far longer than we have needed to wait for symptoms in the past. If you can get someone to perform a second review on the patch in gerrit, that would be great.

            I'll deploy more widely and update here early next week.

            thanks,
            Olaf


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I tested 2.12.9 + change 48190 today and results so far are promising. I'll test it further and post here in the next couple of days.


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I ported Chris's fix for LU-12739 to b2_12: https://review.whamcloud.com/#/c/48190/

            Please give this patch a try. It aims to eliminate a race condition with effects potentially similar to what is seen in the coredump you provided.

            Thanks,

            Serguei.

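            For anyone wanting to reproduce the test build, the usual Gerrit workflow for pulling change 48190 onto a local b2_12 tree looks roughly like the following; the patch-set number <PS> is a placeholder, and the exact fetch URL and ref can be copied from the download box on the review page.

             # Check out b2_12 and cherry-pick Gerrit change 48190 (patch set <PS>).
             git clone git://git.whamcloud.com/fs/lustre-release.git
             cd lustre-release
             git checkout b2_12
             git fetch https://review.whamcloud.com/fs/lustre-release \
                 refs/changes/90/48190/<PS>
             git cherry-pick FETCH_HEAD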

            ssmirnov Serguei Smirnov added a comment -

            Chris,

            Yes indeed, it looks very much like LU-12739

            I'll port these changes.

            Thanks,

            Serguei.

            hornc Chris Horn added a comment -

            Sounds like https://jira.whamcloud.com/browse/LU-12739 ?

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            While examining the core I found that messages causing the delay are waiting to be sent: they are listed on lp_dc_pendq of the destination peer.

            At the same time, the destination peer is not queued to be discovered, so it appears that there's no good reason for the messages to be delayed.

            I pushed a test patch in order to rule out a race condition which somehow enables a thread to queue a message for a peer which is not (or no longer) going to be discovered. The patch attempts to recognize this situation on discovery completion, print an error, and handle any messages which are still pending. This should help locate the race condition if it is actually occurring. If this is the only cause, then with this patch we should see the error message "Peer X msg list not empty on disc comp" and no further refcount increase.

            Otherwise, I'll have to look for other possible causes.

            Thanks,

            Serguei.

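            While the test patch is deployed, a simple way to catch the new error is to poll the console log for it and snapshot the LNet debug buffer when it first fires. This is only a sketch, not part of the patch; it assumes the message reaches the console log and reuses the dk.<host>.<epoch>.txt naming of the attachments on this ticket.

             #!/bin/bash
             # Poll for the test patch's "msg list not empty on disc comp" error;
             # when it first appears, dump the LNet debug log for later analysis.
             while true; do
                 if dmesg | grep -q "msg list not empty on disc comp"; then
                     echo "$(date '+%F %T') disc comp error seen on $(hostname)"
                     lctl dk > "dk.$(hostname).$(date +%s).txt" 2>/dev/null
                     break    # one snapshot is enough
                 fi
                 sleep 300
             done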

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48163
            Subject: LU-15234 lnet: test for race when completing discovery
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 0eb36b2ace98b0c57595098a3a6d9f5de8e6045c


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I had too many distractions and haven't finished looking at the core yet.

            Basically, I believe that what I see in the core so far does confirm the idea that messages are not getting finalized, but I still haven't understood why. In the LNet layer the number of queued messages on the problem peer looks consistent with the high refcount, but I still need to dig more at the LND level and examine the message queues there.

            Thanks,

            Serguei.

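            For the core-dump side of this, the pending-message count and the peer refcount can be compared directly in a crash(8) session. A rough sketch, assuming the 2.12 field names lp_primary_nid, lp_refcount, lp_dc_pendq and lnet_msg.msg_list (verify against the debuginfo for the running build); <peer_addr> stands in for the struct lnet_peer address of the problem peer, and <pendq_addr> for the address of its lp_dc_pendq list head as printed by the struct command.

             # crash vmlinux vmcore
             crash> mod -s lnet
             crash> struct lnet_peer.lp_primary_nid,lp_refcount,lp_dc_pendq <peer_addr>
             crash> list -H <pendq_addr> -o lnet_msg.msg_list | wc -l

            If, as described above, each message still queued on lp_dc_pendq is what is pinning the peer, the list length should roughly match the excess of "refs" over "queue".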

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: