Details
Type: Bug
Resolution: Fixed
Priority: Minor
Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64, kernel 3.10.0-1160.45.1.1chaos.ch6.x86_64
Severity: 3
Description
I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by "lctl get_param peers".
The reference counts reported as "refs" by "lctl get_param peers" are increasing linearly with time. This is in contrast with "queue", which periodically spikes but then drops to 0 again. The output below shows 4 routers on ruby which have refs > 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days after the ruby routers were rebooted during an update.
[root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2>/dev/null | awk '$3 > 20 {print}' | sed 's/^.*://' | sort -V -u
172.19.2.24@o2ib100 46957 up 5 8 -46945 -46945 8 -13 0
172.19.2.24@o2ib100 47380 up 1 8 -47368 -47368 8 -23 0
172.19.2.24@o2ib100 48449 up 15 8 -48437 -48437 8 -17 0
172.19.2.24@o2ib100 49999 up 3 8 -49987 -49987 8 -17 0
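For anyone who wants to check the same condition on saved output, here is a minimal Python sketch that parses rows like the ones above and flags peers with high refs but an empty queue. It assumes the columns follow the usual peers header order (nid, refs, state, last, max, rtr, min, tx, min, queue). The names PeerRow, parse_peer_row and suspicious_peers are made up for this illustration and are not part of lctl or LNet.

```python
from typing import List, NamedTuple


class PeerRow(NamedTuple):
    # Assumed column order: nid refs state last max rtr min tx min queue
    nid: str
    refs: int
    state: str
    last: int
    max_credits: int
    rtr: int
    min_rtr: int
    tx: int
    min_tx: int
    queue: int


def parse_peer_row(line: str) -> PeerRow:
    """Parse one whitespace-separated peers row (hostname prefix already stripped)."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    return PeerRow(fields[0], int(fields[1]), fields[2], *map(int, fields[3:]))


def suspicious_peers(rows: List[PeerRow], refs_threshold: int = 20) -> List[PeerRow]:
    """Return peers whose refs exceed the threshold while the queue is empty."""
    return [r for r in rows if r.refs > refs_threshold and r.queue == 0]


# Sample rows copied from the output in this ticket.
SAMPLE = """\
172.19.2.24@o2ib100 46957 up 5 8 -46945 -46945 8 -13 0
172.19.2.24@o2ib100 47380 up 1 8 -47368 -47368 8 -23 0
172.19.2.24@o2ib100 48449 up 15 8 -48437 -48437 8 -17 0
172.19.2.24@o2ib100 49999 up 3 8 -49987 -49987 8 -7 0
"""

rows = [parse_peer_row(line) for line in SAMPLE.splitlines()]
for row in suspicious_peers(rows):
    # refs + rtr shows how closely refs tracks the negative rtr credit count.
    print(row.nid, row.refs, row.rtr, row.refs + row.rtr)
```

On the four sample rows, refs + rtr comes out to the same constant (12), so refs appears to track the negative "rtr" credit value rather than "queue". That is only an observation about this sample, not a diagnosis.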
The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).