[LU-12886] A lot of LNetError: lnet_peer_ni_add_to_recoveryq_locked() messages Created: 19/Oct/19  Updated: 10/Apr/20  Resolved: 10/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 1
Labels: llnl
Environment:

2.12.3 RC1 (vanilla) on servers, CentOS 7.6, patched kernel; 2.12.0 + patches on clients


Attachments: Text File fir-io1-s1.log     Text File lnet_recoveryq.log    
Issue Links:
Related
is related to LU-13071 LNet Health: reduce log severity Resolved
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

After upgrading our servers on Fir (Sherlock's /scratch) to Lustre 2.12.3 RC1, we are noticing a lot of these messages on all Lustre servers:

LNetError: 49537:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.201@o2ib7 added to recovery queue. Health = 900

The NIDs reported are those of our Lustre routers, which are still running 2.12.0 + patches (they sit on the client clusters).

Attaching logs from all servers as lnet_recoveryq.log

This doesn't seem to have an impact on production and so far 2.12.3 RC1 has been just great for us (and we run it without additional patches now!). Thanks!

Stéphane



 Comments   
Comment by Stephane Thiell [ 19/Oct/19 ]

I forgot to mention that we're now using Mellanox OFED 4.7 on IB HDR100.

Comment by Peter Jones [ 20/Oct/19 ]

Amir

Could you please advise

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 20/Oct/19 ]

That's the health feature. When a send to a peer fails, the health value of that interface is decremented. If multiple interfaces are available, the healthiest one is selected for subsequent sends. The degraded interface is added to a recovery queue and pinged every second until it is fully healthy again.
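For reference, the health state that drives these messages can be inspected from userspace with lnetctl. A minimal diagnostic sketch, assuming a 2.12.x lnetctl (option output varies slightly between versions, so verify against your installed release):

```shell
# Show the global health tunables (health_sensitivity, retry_count,
# transaction_timeout, etc.) currently in effect
lnetctl global show

# Show per-peer-NI state; adding verbosity includes the health values
# and send/retry statistics for each peer NI
lnetctl peer show -v
```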

Are these particular routers down? Or did you see a burst of these messages early on and then they stopped?

Comment by Stephane Thiell [ 20/Oct/19 ]

Thanks for checking, Amir! No, I just see a constant (slow) flow of these messages on the server side, but all our routers are up and running. Attached kernel logs of an OSS as fir-io1-s1.log. The traffic on the routers can be pretty high though (close to 8-10 GB/s per EDR/HDR100 router right now). We're not using multi-rail LNet. According to lnet/peers, we might have run out of rtr credits, but at the moment it seems OK. BTW, is there a way to reset the counters of /sys/kernel/debug/lnet/peers? That would be helpful.

# clush -w @rtr_fir_[1-2] -b "cat /sys/kernel/debug/lnet/peers  | egrep '^nid|o2ib7'"
---------------
sh-rtr-fir-1-1
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           3    up    36     8     6   -63     8 -1210 0
10.0.10.110@o2ib7           1    up    26     8     8     8     8     6 0
10.0.10.102@o2ib7           4    up    32     8     5   -70     8 -1157 0
10.0.10.107@o2ib7           1    up    94     8     8   -45     8 -1249 0
10.0.10.52@o2ib7            1    up   168     8     8   -16     8 -1645 0
10.0.10.104@o2ib7           2    up   170     8     7   -51     8 -1195 0
10.0.10.54@o2ib7            1    up    57     8     8    -8     8   -39 0
10.0.10.101@o2ib7           3    up    52     8     6   -54     8 -1150 0
10.0.10.106@o2ib7           1    up     7     8     8   -43     8 -1221 0
10.0.10.51@o2ib7            1    up    13     8     8   -16     8 -3546 0
10.0.10.103@o2ib7           4    up   152     8     5   -70     8 -1244 0
10.0.10.116@o2ib7           1    up    33     8     8     8     8     6 0
10.0.10.108@o2ib7           1    up    12     8     8   -43     8 -1151 0
10.0.10.53@o2ib7            1    up   165     8     8    -8     8   -46 0
---------------
sh-rtr-fir-1-2
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           1    up   166     8     8    -8     8   -39 0
10.0.10.110@o2ib7           1    up    82     8     8     8     8     6 0
10.0.10.102@o2ib7           1    up    83     8     8    -8     8   -73 0
10.0.10.107@o2ib7           1    up    32     8     8    -8     8   -30 0
10.0.10.52@o2ib7            1    up    61     8     8    -8     8  -741 0
10.0.10.104@o2ib7           3    up    45     8     6    -8     8   -33 0
10.0.10.54@o2ib7            1    up    26     8     8    -8     8   -18 0
10.0.10.101@o2ib7           1    up   120     8     8    -8     8   -19 0
10.0.10.106@o2ib7           1    up    35     8     8    -8     8   -34 0
10.0.10.51@o2ib7            1    up   100     8     8    -8     8  -123 0
10.0.10.103@o2ib7           2    up    24     8     7    -8     8   -28 0
10.0.10.116@o2ib7           1    up    82     8     8     8     8     6 0
10.0.10.108@o2ib7           1    up    18     8     8    -8     8   -72 0
10.0.10.53@o2ib7            2    up   111     8     7    -8     8   -51 0
---------------
sh-rtr-fir-1-3
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           2    up   179     8     7  -549     8 -1016 0
10.0.10.110@o2ib7           1    up   177     8     8     8     8     6 0
10.0.10.102@o2ib7           2    up     6     8     7  -793     8  -807 0
10.0.10.107@o2ib7           3    up    60     8     6  -664     8  -808 0
10.0.10.52@o2ib7            1    up    96     8     8 -1151     8 -1394 0
10.0.10.104@o2ib7           4    up   138     8     6  -789     8 -1020 0
10.0.10.54@o2ib7            1    up    58     8     8    -8     8   -36 0
10.0.10.101@o2ib7           2    up    17     8     7  -826     8  -827 0
10.0.10.106@o2ib7           1    up   176     8     8  -773     8  -782 0
10.0.10.51@o2ib7            1    up    33     8     8    -8     8 -3907 0
10.0.10.103@o2ib7           1    up   170     8     8 -1383     8 -1052 0
10.0.10.116@o2ib7           1    up     3     8     8     8     8     6 0
10.0.10.108@o2ib7           2    up   154     8     7  -688     8  -787 0
10.0.10.53@o2ib7            1    up   144     8     8    -8     8   -23 0
---------------
sh-rtr-fir-1-4
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   175     8     8     8     8     6 0
10.0.10.105@o2ib7           4    up   127     8     5    -8     8   -59 0
10.0.10.110@o2ib7           1    up    49     8     8     8     8     6 0
10.0.10.102@o2ib7           3    up   153     8     6    -8     8   -91 0
10.0.10.107@o2ib7           1    up    98     8     8    -8     8   -28 0
10.0.10.52@o2ib7            1    up    83     8     8    -8     8 -1007 0
10.0.10.104@o2ib7           3    up   122     8     6    -8     8   -42 0
10.0.10.54@o2ib7            1    up    43     8     8    -8     8   -39 0
10.0.10.101@o2ib7           2    up   160     8     7    -8     8   -22 0
10.0.10.114@o2ib7           1  down  9999     8     8     8     8     7 0
10.0.10.106@o2ib7           2    up   115     8     7    -8     8   -50 0
10.0.10.51@o2ib7            1    up     8     8     8    -8     8  -252 0
10.0.10.103@o2ib7           3    up   120     8     6    -8     8   -59 0
10.0.10.116@o2ib7           1    up     0     8     8     8     8     6 0
10.0.10.108@o2ib7           2    up    80     8     7    -8     8   -89 0
10.0.10.53@o2ib7            1    up   126     8     8    -8     8   -42 0
---------------
sh-rtr-fir-2-1
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           3    up    43     8     6   -16     8  -243 0
10.0.10.110@o2ib7           1    up   157     8     8     8     8     6 0
10.0.10.102@o2ib7           1    up     6     8     8   -16     8  -289 0
10.0.10.107@o2ib7           2    up    62     8     8   -16     7  -261 560
10.0.10.52@o2ib7            1    up    81     8     8    -8     8   -23 0
10.0.10.104@o2ib7           1    up     9     8     8    -8     8  -197 0
10.0.10.54@o2ib7            1    up    73     8     8    -8     8   -75 0
10.0.10.101@o2ib7           1    up    48     8     8   -16     8  -228 0
10.0.10.106@o2ib7           1    up    34     8     8   -16     8  -284 0
10.0.10.51@o2ib7            1    up   157     8     8    -8     8  -685 0
0.0.10.51@o2ib7             1  down  9999     8     8     8     8     7 0
10.0.10.103@o2ib7           2    up    38     8     7   -16     8  -259 0
10.0.10.116@o2ib7           1    up   158     8     8     8     8     6 0
10.0.10.108@o2ib7           1    up    37     8     8   -16     8  -376 0
10.0.10.53@o2ib7            1    up    87     8     8    -8     8  -410 0
---------------
sh-rtr-fir-2-2
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           3    up    58     8     6   -16     8  -237 0
10.0.10.110@o2ib7           1    up    12     8     8     8     8     6 0
10.0.10.102@o2ib7           1    up    56     8     8   -16     8  -297 0
10.0.10.107@o2ib7           3    up    84     8     6   -16     8  -229 0
10.0.10.52@o2ib7            1    up   132     8     8    -8     8   -14 0
10.0.10.104@o2ib7           4    up    95     8     6    -8     7  -217 72
10.0.10.54@o2ib7            1    up   130     8     8    -8     8   -72 0
10.0.10.101@o2ib7           1    up    99     8     8   -16     8  -250 0
10.0.10.106@o2ib7           2    up    58     8     7   -16     8  -308 0
10.0.10.51@o2ib7            1    up   141     8     8    -8     8  -353 0
10.0.10.103@o2ib7           2    up    35     8     7   -16     8  -313 0
10.0.10.116@o2ib7           1    up    15     8     8     8     8     6 0
10.0.10.108@o2ib7           2    up    37     8     7   -16     8  -267 0
10.0.10.53@o2ib7            1    up   117     8     8    -8     8  -323 0
---------------
sh-rtr-fir-2-3
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           2    up    42     8     7   -16     8  -212 0
10.0.10.110@o2ib7           1    up   107     8     8     8     8     6 0
10.0.10.102@o2ib7           2    up    59     8     7   -16     8  -338 0
10.0.10.107@o2ib7           2    up    91     8     7   -16     8  -215 0
10.0.10.52@o2ib7            1    up     1     8     8    -8     8   -17 0
10.0.10.104@o2ib7           2    up    42     8     7    -8     8  -226 0
10.0.10.54@o2ib7            1    up    76     8     8    -8     8   -86 0
10.0.10.101@o2ib7           1    up    55     8     8   -16     8  -248 0
10.0.10.106@o2ib7           1    up    21     8     8   -16     8  -358 0
10.0.10.51@o2ib7            1    up   136     8     8    -8     8  -424 0
10.0.10.103@o2ib7           2    up    13     8     7   -16     8  -259 0
10.0.10.116@o2ib7           1    up   107     8     8     8     8     6 0
10.0.10.108@o2ib7           1    up    24     8     8   -16     8  -396 0
10.0.10.53@o2ib7            1    up   157     8     8    -8     8  -549 0
---------------
sh-rtr-fir-2-4
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           3    up    44     8     6   -16     8  -292 0
10.0.10.110@o2ib7           1    up    52     8     8     8     8     6 0
10.0.10.102@o2ib7           1    up    28     8     8   -16     8  -318 0
10.0.10.107@o2ib7           1    up    29     8     8   -16     8  -271 0
10.0.10.52@o2ib7            1    up     1     8     8    -8     8   -19 0
10.0.10.104@o2ib7           3    up    48     8     6   -16     8  -294 0
10.0.10.54@o2ib7            1    up    28     8     8    -8     8   -53 0
10.0.10.101@o2ib7           1    up    19     8     8   -16     8  -254 0
10.0.10.106@o2ib7           1    up    37     8     8   -16     8  -342 0
10.0.10.51@o2ib7            1    up    89     8     8    -8     8  -470 0
10.0.10.103@o2ib7           1    up    31     8     8   -16     8  -311 0
10.0.10.116@o2ib7           1    up    53     8     8     8     8     6 0
10.0.10.108@o2ib7           2    up     9     8     7   -16     8  -261 0
10.0.10.53@o2ib7            1    up    20     8     8    -8     8  -332 0
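
In the dump above, a negative value in either "min" column is the low-water mark of the corresponding credit pool, i.e. at some point messages had to queue waiting for rtr or tx credits. A small sketch for flagging such peers from /sys/kernel/debug/lnet/peers output (sample rows copied from the dump above; the column layout is taken from the header line):

```python
# Parse /sys/kernel/debug/lnet/peers output and flag peers whose credit
# low-water marks ("min" columns) went negative, i.e. senders had to queue
# waiting for router (rtr) or transmit (tx) credits at some point.
SAMPLE = """\
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
10.0.10.105@o2ib7           3    up    36     8     6   -63     8 -1210 0
10.0.10.110@o2ib7           1    up    26     8     8     8     8     6 0
"""

def starved_peers(text):
    """Return [(nid, min_rtr, min_tx)] for peers with a negative low-water mark."""
    out = []
    for line in text.splitlines():
        fields = line.split()
        # 10 whitespace-separated columns per data row; skip headers/dividers
        if len(fields) != 10 or fields[0] == "nid":
            continue
        nid, min_rtr, min_tx = fields[0], int(fields[6]), int(fields[8])
        if min_rtr < 0 or min_tx < 0:
            out.append((nid, min_rtr, min_tx))
    return out

print(starved_peers(SAMPLE))  # → [('10.0.10.105@o2ib7', -63, -1210)]
```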

Example of traffic on sh-rtr-fir-2-1 (10.0.10.209@o2ib7) at 10:03 AM this morning:

[root@sh-rtr-fir-2-1 ~]# collectl -sx
waiting for 1 second sample...
#<-----------InfiniBand----------->
#   KBIn  PktIn   KBOut PktOut Errs 
  377711  2493K 9336463  2488K    0 
  397428  2070K 7594350  2066K    0 
  545513  2403K 8775817  2395K    0 
  521123  2781K  10089K  2775K    0 
  527162  2831K  10251K  2824K    0 
  416422  2516K 9368186  2511K    0 
  691311  2339K 8384500  2330K    0 

At about the same time:

fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.209@o2ib7 added to recovery queue. Health = 900
fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 79 previous similar messages

On the router sh-rtr-fir-2-1 itself, no new messages today (these messages are from yesterday). Occasional client timeouts on the cluster fabric (o2ib4 here) are normal, I think:

Oct 19 13:26:59 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.106.54@o2ib4 (103): c: 7, oc: 0, rc: 8
Oct 19 13:27:49 sh-rtr-fir-2-1.int kernel: LNetError: 10498:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 13:27:56 sh-rtr-fir-2-1.int kernel: LNetError: 10501:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.105.12@o2ib4 (107): c: 7, oc: 0, rc: 8
Oct 19 17:34:53 sh-rtr-fir-2-1.int kernel: LNetError: 10496:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 17:35:07 sh-rtr-fir-2-1.int kernel: LNetError: 10495:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.107.3@o2ib4 (107): c: 8, oc: 0, rc: 8

Our setup:

  • sh-rtr-fir-1-[1-4] are our o2ib6 (Sherlock 1 cluster FDR fabric) / o2ib7 (scratch storage, HDR100 fabric) routers (10.0.10.[201-204]@o2ib7)
  • sh-rtr-fir-2-[1-4] are our o2ib4 (Sherlock 2 cluster EDR fabric) / o2ib7 (scratch storage, HDR100 fabric) routers (10.0.10.[209-212]@o2ib7)
Comment by Luis Silva [ 03/Jan/20 ]

We are seeing the same errors on our clients. Our setup uses a single IB interface everywhere; we are not using multi-rail anywhere. My understanding is that this health check is designed for multi-rail setups. Is that correct? If so, is there a way to tell Lustre to skip these health checks? Thanks in advance. -Luis

Comment by Amir Shehata (Inactive) [ 03/Jan/20 ]

Hi Luis,

You can turn off health on your setup:

lnetctl set health_sensitivity 0
lnetctl set transaction_timeout 50 # or some value you'd like 
lnetctl set retry_count 0
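
These lnetctl settings are runtime-only and will not survive a reboot by themselves. One way to persist them (a sketch, assuming the stock lnet systemd service that imports /etc/lnet.conf at startup; check how your distribution's lnet unit is set up):

```shell
# Apply the runtime changes...
lnetctl set health_sensitivity 0
lnetctl set retry_count 0

# ...then capture the running LNet configuration so the lnet service
# re-applies it on the next boot
lnetctl export > /etc/lnet.conf
```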

We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.

Comment by Olaf Faaland [ 30/Jan/20 ]

We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.

Hi Amir,
Which patch?
Thanks

Comment by Peter Jones [ 30/Jan/20 ]

LU-11981 I assume. It will be in 2.12.4

Comment by Olaf Faaland [ 30/Jan/20 ]

It looks to me like LU-11981 patch addresses only:

"Msg is in inconsistent state, don't perform health checking"

not

"added to recovery queue"

Comment by Olaf Faaland [ 19/Feb/20 ]

Hi Amir,

What 2.12.4 patch(es) reduce the severity of the "added to recovery queue" messages?

thanks

Comment by Amir Shehata (Inactive) [ 20/Feb/20 ]

Hi Olaf,

It doesn't look like this patch made it into 2.12.4. It still hasn't landed and needs an extra review:

https://review.whamcloud.com/#/c/37002/

The "inconsistent" message was reduced here:

f549927ea633b910a8c788fa970af742b3bf10c1 LU-11981 lnet: clean up error message

thanks

amir

Comment by Stephane Thiell [ 27/Feb/20 ]

Hello, we would also appreciate a backport to b2_12 of this patch (which just landed on master), as these messages are very verbose. Thanks!!

Comment by Olaf Faaland [ 27/Feb/20 ]

Hi Stephane,

They've pushed a backport for b2_12 to gerrit, and it's been reviewed and passed tests, just needs to land:

https://review.whamcloud.com/#/c/37718/

Comment by Peter Jones [ 01/Apr/20 ]

The https://review.whamcloud.com/#/c/37718/ fix landed to b2_12 some weeks back so I think that it should be ok to close this ticket. Any objections?

Comment by Olaf Faaland [ 10/Apr/20 ]

Agreed
