[LU-15234] LNet high peer reference counts inconsistent with queue - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.16.0
Affects Version/s: None
Labels:
- llnl
Environment:
lustre-2.12.7_2.llnl-2.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64

Severity:
3

Description

I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".

The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops to 0 again. The output below shows 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days after the ruby routers were rebooted during an update.

[root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
 172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
 172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
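
The check made by the pipeline above can also be sketched in Python. This is a minimal sketch and not part of the ticket. It assumes each peers row has ten whitespace-separated fields, with "refs" in the second column and "queue" in the last, as in the sample output, and that any pdsh host prefix has already been stripped (as the sed step does). The names PeerRow, parse_peer_line, and suspicious_peers are hypothetical. The threshold of 20 is taken from the awk filter; the extra queue == 0 condition is an addition that reflects the symptom described here.

```python
from typing import List, NamedTuple, Optional


class PeerRow(NamedTuple):
    """Subset of one row of "lctl get_param peers" output (assumed layout)."""
    nid: str
    refs: int
    state: str
    queue: int


def parse_peer_line(line: str) -> Optional[PeerRow]:
    """Parse one peers row; return None for headers or malformed lines."""
    fields = line.split()
    if len(fields) != 10:  # assumed column count, matching the sample rows
        return None
    try:
        return PeerRow(nid=fields[0], refs=int(fields[1]),
                       state=fields[2], queue=int(fields[9]))
    except ValueError:
        return None  # non-numeric refs/queue, e.g. a header line


def suspicious_peers(text: str, min_refs: int = 20) -> List[PeerRow]:
    """Return rows whose refs exceed min_refs while queue is 0."""
    rows = (parse_peer_line(line) for line in text.splitlines())
    return [r for r in rows if r and r.refs > min_refs and r.queue == 0]


# Sample rows copied from the output above.
SAMPLE = """\
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
 172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
 172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
"""

if __name__ == "__main__":
    for row in suspicious_peers(SAMPLE):
        print(f"{row.nid}: refs={row.refs} queue={row.queue}")
```

Run periodically against saved output, a check like this would make the linear growth in refs visible without plotting.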

The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).

Attachments

- 2022-jun-21.tgz (267 kB, 21/Jun/22 10:45 PM)
- debug_refcount_01.patch (19 kB, 28/Jan/22 12:31 AM)
- dk.orelic2.1654723678.txt (7 kB, 09/Jun/22 12:43 AM)
- dk.orelic2.1654723686.txt (2 kB, 09/Jun/22 12:43 AM)
- dk.orelic2.1654724730.txt (27 kB, 09/Jun/22 12:43 AM)
- dk.orelic2.1654724740.txt (2 kB, 09/Jun/22 12:43 AM)
- dk.orelic2.1654724745.txt (2 kB, 09/Jun/22 12:43 AM)
- dk.orelic2.1654724751.txt (2 kB, 09/Jun/22 12:43 AM)
- dk.ruby1016.1637103254.txt.bz2 (8.58 MB, 16/Nov/21 11:12 PM)
- ko2iblnd.parameters.orelic4.1637617473.txt (1 kB, 22/Nov/21 9:48 PM)
- ksocklnd.parameters.orelic4.1637617487.txt (1 kB, 22/Nov/21 9:48 PM)
- lctl.version.orelic4.1637616867.txt (0.0 kB, 22/Nov/21 9:48 PM)
- lctl.version.ruby1016.1637616519.txt (0.0 kB, 22/Nov/21 9:48 PM)
- lnet.parameters.orelic4.1637617458.txt (2 kB, 22/Nov/21 9:48 PM)
- lnetctl.net-show.orelic4.1637616889.txt (2 kB, 22/Nov/21 9:48 PM)
- lnetctl.net-show.ruby1016.1637616206.txt (5 kB, 22/Nov/21 9:48 PM)
- lnetctl.peer.show.orelic2.1654723542.txt (1.20 MB, 09/Jun/22 12:43 AM)
- lnetctl.peer.show.orelic2.1654724780.txt (1.20 MB, 09/Jun/22 12:43 AM)
- orelic4.debug_refcount_01.tar.gz (27 kB, 01/Feb/22 7:39 PM)
- orelic4-lustre212-20211216.tgz (2 kB, 16/Dec/21 11:14 PM)
- params_20211213.tar.gz (6 kB, 14/Dec/21 2:01 AM)
- peer.show.172.16.70.62_at_tcp.orelic4.1644951836 (2 kB, 15/Feb/22 7:44 PM)
- peer.show.172.16.70.63_at_tcp.orelic4.1644951836 (2 kB, 15/Feb/22 7:44 PM)
- peer.show.172.16.70.64_at_tcp.orelic4.1644951836 (2 kB, 15/Feb/22 7:44 PM)
- peer.show.172.16.70.65_at_tcp.orelic4.1644951836 (2 kB, 15/Feb/22 7:44 PM)
- peer.show.ruby1016.1637103254.txt (1 kB, 16/Nov/21 11:12 PM)
- peer.show.ruby1016.1637103865.txt (1 kB, 16/Nov/21 11:12 PM)
- peer status orelic4 with discovery race patch v3.png (407 kB, 09/Dec/21 1:13 AM)
- stats.show.ruby1016.1637103254.txt (0.6 kB, 16/Nov/21 11:12 PM)
- stats.show.ruby1016.1637103865.txt (0.6 kB, 16/Nov/21 11:12 PM)
- toss-5305 queue 2021-11-15.png (62 kB, 16/Nov/21 12:05 AM)
- toss-5305 refs 2021-11-15.png (66 kB, 16/Nov/21 12:05 AM)

Issue Links

duplicates: LU-12739 Race with discovery thread completion and message queueing (Resolved)

is related to: LU-15453 MDT shutdown hangs on mutex_lock, possibly cld_lock (Open)

Activity

Peter Jones added a comment - 25/Oct/22 7:09 PM

Landed for 2.16

Gerrit Updater added a comment - 25/Oct/22 5:25 PM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48566/
Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 950e59ced18d49e9fdd31c1e9de43b89a0bc1c1d

Serguei Smirnov added a comment - 03/Oct/22 12:43 AM

No, I would prefer to address the comments and land this patch. Even though it does not fix anything for this ticket (it is a debugging enhancement), it was created as a result of investigating this issue.

Peter Jones added a comment - 01/Oct/22 6:42 AM

I think it is really a call for ssmirnov. Do you still think that there is value in landing https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ or do you intend to abandon it in light of the review comments?

Olaf Faaland added a comment - 27/Sep/22 9:16 PM

> The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/ ?

No opinion from me.

Thanks for getting this fixed.

Peter Jones added a comment - 20/Sep/22 12:21 PM

The LU-12739 fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/?

Gerrit Updater added a comment - 15/Sep/22 10:52 PM

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48566
Subject: LU-15234 lnet: add mechanism for dumping lnd peer debug info
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dc704df0be48fc9f933e6f2c6fede3c5991a951a

Peter Jones added a comment - 12/Sep/22 6:38 PM

Yes, I think that we can mark this ticket as a duplicate of LU-12739 once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.

Olaf Faaland added a comment - 12/Sep/22 5:59 PM

As far as I'm concerned, this will be resolved when the patch lands to b2_12. Do you agree? If so, what is the plan for that?

thanks

Olaf Faaland added a comment - 31/Aug/22 12:54 AM

Hi Serguei,

2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.

Olaf

People

Assignee: Serguei Smirnov

Reporter: Olaf Faaland

Votes: 0

Watchers: 10

Dates

Created: 16/Nov/21 12:05 AM

Updated: 17/Dec/22 2:28 AM

Resolved: 25/Oct/22 7:09 PM