Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
lustre-2.12.7_2.llnl-2.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64
-
3
-
9223372036854775807
Description
I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".
The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops back to 0. The output below shows four routers on ruby that have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is a little over 6 days after the ruby routers were rebooted during an update.
[root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
172.19.2.24@o2ib100 46957 up 5 8 -46945 -46945 8 -13 0
172.19.2.24@o2ib100 47380 up 1 8 -47368 -47368 8 -23 0
172.19.2.24@o2ib100 48449 up 15 8 -48437 -48437 8 -17 0
172.19.2.24@o2ib100 49999 up 3 8 -49987 -49987 8 -7 0
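The same check can be sketched in Python for anyone who wants to track the symptom over time. This is only an illustration, not part of the original report: it assumes each peers row has ten whitespace-separated fields with "refs" in the second column and "queue" in the last, which matches the rows above. The helper names (PeerRow, parse_peer_line, find_suspect_peers) are hypothetical, and the threshold of 20 simply mirrors the awk filter.

```python
from typing import List, NamedTuple


class PeerRow(NamedTuple):
    """Subset of one row of `lctl get_param peers` output (assumed layout)."""
    nid: str
    refs: int
    state: str
    queue: int


def parse_peer_line(line: str) -> PeerRow:
    """Parse one peers row with the hostname prefix already stripped.

    Assumption: ten whitespace-separated fields, refs second, queue last.
    """
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    return PeerRow(nid=fields[0], refs=int(fields[1]),
                   state=fields[2], queue=int(fields[9]))


def find_suspect_peers(rows: List[PeerRow], refs_threshold: int = 20) -> List[PeerRow]:
    """Return rows whose refs exceed the threshold while the queue is empty."""
    return [r for r in rows if r.refs > refs_threshold and r.queue == 0]


# Rows copied from the router output in this report.
SAMPLE = """\
172.19.2.24@o2ib100 46957 up 5 8 -46945 -46945 8 -13 0
172.19.2.24@o2ib100 47380 up 1 8 -47368 -47368 8 -23 0
172.19.2.24@o2ib100 48449 up 15 8 -48437 -48437 8 -17 0
172.19.2.24@o2ib100 49999 up 3 8 -49987 -49987 8 -7 0
"""

rows = [parse_peer_line(line) for line in SAMPLE.splitlines() if line.strip()]
suspects = find_suspect_peers(rows)

# Rough growth rate, treating uptime as exactly 6 days (the report says
# "a little over 6 days", so the true rate is slightly lower).
ASSUMED_UPTIME_DAYS = 6.0
for peer in suspects:
    rate = peer.refs / ASSUMED_UPTIME_DAYS
    print(f"{peer.nid}: refs={peer.refs} queue={peer.queue} (~{rate:.0f} refs/day)")
```

Running this periodically and comparing refs between runs would show whether the count keeps climbing while queue stays at 0.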
The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).
Activity
Link | Original: This issue is related to JFC-21 [ JFC-21 ]
Fix Version/s | New: Lustre 2.16.0 [ 15190 ]
Resolution | New: Fixed [ 1 ]
Status | Original: Reopened [ 4 ] | New: Resolved [ 5 ]
Labels | Original: llnl topllnl | New: llnl
Landed for 2.16