[LU-12886] A lot of LNetError: lnet_peer_ni_add_to_recoveryq_locked() messages Created: 19/Oct/19 Updated: 10/Apr/20 Resolved: 10/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | llnl |
| Environment: | 2.12.3 RC1 (vanilla) on servers, CentOS 7.6, patched kernel; 2.12.0 + patches on clients |
| Attachments: |
| Issue Links: |
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After upgrading our servers on Fir (Sherlock's /scratch) to Lustre 2.12.3 RC1, we are noticing a lot of these messages on all Lustre servers:
LNetError: 49537:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.201@o2ib7 added to recovery queue. Health = 900
The NIDs reported are our Lustre routers, which are still running 2.12.0 + patches (they are on the client clusters). Attaching logs from all servers as lnet_recoveryq.log.
This doesn't seem to have an impact on production, and so far 2.12.3 RC1 has been just great for us (and we run it without additional patches now!).
Thanks!
Stéphane |
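A quick way to see which peer NIDs are triggering these messages, and how often, on a given server (a convenience sketch; it reads the kernel ring buffer, so adjust the log source, e.g. /var/log/messages, as needed):

    # Tally "added to recovery queue" events per peer NID
    dmesg -T | awk '/added to recovery queue/ { for (i = 1; i <= NF; i++) if ($i ~ /@o2ib/) print $i }' \
        | sort | uniq -c | sort -rn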
| Comments |
| Comment by Stephane Thiell [ 19/Oct/19 ] |
|
I forgot to mention that we're now using Mellanox OFED 4.7 on IB HDR100. |
| Comment by Peter Jones [ 20/Oct/19 ] |
|
Amir, could you please advise? Thanks, Peter |
| Comment by Amir Shehata (Inactive) [ 20/Oct/19 ] |
|
That's the health feature. When a send to a peer fails, the health value of that interface is decremented. If multiple interfaces are available, the healthiest one is selected for the next send. The unhealthy interface is added to a recovery queue and pinged every second until it is fully healthy again. Are these particular routers down? Or did you see a burst of these messages early on that then stopped? |
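The health values and recovery tunables described here can be inspected with lnetctl; a minimal sketch (the NID is the one from the original report, and the exact verbose output fields vary a little between 2.12.x releases):

    # Global health tunables (health_sensitivity, retry_count, transaction_timeout, ...)
    lnetctl global show
    # Per-NI statistics for the local nets and for one peer NI; with health enabled,
    # recent 2.12.x releases include the health values here (higher verbosity
    # levels such as "-v 3" print more detail)
    lnetctl net show -v
    lnetctl peer show --nid 10.0.10.201@o2ib7 -v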
| Comment by Stephane Thiell [ 20/Oct/19 ] |
|
Thanks for checking Amir! No, I just see a constant (slow) flow of these messages on the server side, but all our routers are up and running. Attached kernel logs of an OSS as fir-io1-s1.log # clush -w @rtr_fir_[1-2] -b "cat /sys/kernel/debug/lnet/peers | egrep '^nid|o2ib7'" --------------- sh-rtr-fir-1-1 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 3 up 36 8 6 -63 8 -1210 0 10.0.10.110@o2ib7 1 up 26 8 8 8 8 6 0 10.0.10.102@o2ib7 4 up 32 8 5 -70 8 -1157 0 10.0.10.107@o2ib7 1 up 94 8 8 -45 8 -1249 0 10.0.10.52@o2ib7 1 up 168 8 8 -16 8 -1645 0 10.0.10.104@o2ib7 2 up 170 8 7 -51 8 -1195 0 10.0.10.54@o2ib7 1 up 57 8 8 -8 8 -39 0 10.0.10.101@o2ib7 3 up 52 8 6 -54 8 -1150 0 10.0.10.106@o2ib7 1 up 7 8 8 -43 8 -1221 0 10.0.10.51@o2ib7 1 up 13 8 8 -16 8 -3546 0 10.0.10.103@o2ib7 4 up 152 8 5 -70 8 -1244 0 10.0.10.116@o2ib7 1 up 33 8 8 8 8 6 0 10.0.10.108@o2ib7 1 up 12 8 8 -43 8 -1151 0 10.0.10.53@o2ib7 1 up 165 8 8 -8 8 -46 0 --------------- sh-rtr-fir-1-2 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 1 up 166 8 8 -8 8 -39 0 10.0.10.110@o2ib7 1 up 82 8 8 8 8 6 0 10.0.10.102@o2ib7 1 up 83 8 8 -8 8 -73 0 10.0.10.107@o2ib7 1 up 32 8 8 -8 8 -30 0 10.0.10.52@o2ib7 1 up 61 8 8 -8 8 -741 0 10.0.10.104@o2ib7 3 up 45 8 6 -8 8 -33 0 10.0.10.54@o2ib7 1 up 26 8 8 -8 8 -18 0 10.0.10.101@o2ib7 1 up 120 8 8 -8 8 -19 0 10.0.10.106@o2ib7 1 up 35 8 8 -8 8 -34 0 10.0.10.51@o2ib7 1 up 100 8 8 -8 8 -123 0 10.0.10.103@o2ib7 2 up 24 8 7 -8 8 -28 0 10.0.10.116@o2ib7 1 up 82 8 8 8 8 6 0 10.0.10.108@o2ib7 1 up 18 8 8 -8 8 -72 0 10.0.10.53@o2ib7 2 up 111 8 7 -8 8 -51 0 --------------- sh-rtr-fir-1-3 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 2 up 179 8 7 -549 8 -1016 0 10.0.10.110@o2ib7 1 up 177 8 8 8 8 6 0 10.0.10.102@o2ib7 2 up 6 8 7 -793 8 -807 0 10.0.10.107@o2ib7 3 up 60 8 6 -664 8 -808 0 10.0.10.52@o2ib7 1 up 96 8 8 -1151 8 -1394 0 10.0.10.104@o2ib7 4 up 138 8 6 -789 8 -1020 0 10.0.10.54@o2ib7 1 up 58 8 8 -8 8 -36 0 10.0.10.101@o2ib7 2 up 17 8 7 -826 8 -827 0 10.0.10.106@o2ib7 1 up 176 8 8 -773 8 -782 0 10.0.10.51@o2ib7 1 up 33 8 8 -8 8 -3907 0 10.0.10.103@o2ib7 1 up 170 8 8 -1383 8 -1052 0 10.0.10.116@o2ib7 1 up 3 8 8 8 8 6 0 10.0.10.108@o2ib7 2 up 154 8 7 -688 8 -787 0 10.0.10.53@o2ib7 1 up 144 8 8 -8 8 -23 0 --------------- sh-rtr-fir-1-4 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 175 8 8 8 8 6 0 10.0.10.105@o2ib7 4 up 127 8 5 -8 8 -59 0 10.0.10.110@o2ib7 1 up 49 8 8 8 8 6 0 10.0.10.102@o2ib7 3 up 153 8 6 -8 8 -91 0 10.0.10.107@o2ib7 1 up 98 8 8 -8 8 -28 0 10.0.10.52@o2ib7 1 up 83 8 8 -8 8 -1007 0 10.0.10.104@o2ib7 3 up 122 8 6 -8 8 -42 0 10.0.10.54@o2ib7 1 up 43 8 8 -8 8 -39 0 10.0.10.101@o2ib7 2 up 160 8 7 -8 8 -22 0 10.0.10.114@o2ib7 1 down 9999 8 8 8 8 7 0 10.0.10.106@o2ib7 2 up 115 8 7 -8 8 -50 0 10.0.10.51@o2ib7 1 up 8 8 8 -8 8 -252 0 10.0.10.103@o2ib7 3 up 120 8 6 -8 8 -59 0 10.0.10.116@o2ib7 1 up 0 8 8 8 8 6 0 10.0.10.108@o2ib7 2 up 80 8 7 -8 8 -89 0 10.0.10.53@o2ib7 1 up 126 8 8 -8 8 -42 0 --------------- sh-rtr-fir-2-1 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 3 up 43 8 6 -16 8 -243 0 10.0.10.110@o2ib7 1 up 157 8 8 8 8 6 0 10.0.10.102@o2ib7 1 up 6 8 8 -16 8 -289 0 10.0.10.107@o2ib7 2 up 62 8 8 -16 7 -261 560 10.0.10.52@o2ib7 1 up 81 8 8 -8 8 -23 0 10.0.10.104@o2ib7 1 up 9 8 8 -8 8 -197 
0 10.0.10.54@o2ib7 1 up 73 8 8 -8 8 -75 0 10.0.10.101@o2ib7 1 up 48 8 8 -16 8 -228 0 10.0.10.106@o2ib7 1 up 34 8 8 -16 8 -284 0 10.0.10.51@o2ib7 1 up 157 8 8 -8 8 -685 0 0.0.10.51@o2ib7 1 down 9999 8 8 8 8 7 0 10.0.10.103@o2ib7 2 up 38 8 7 -16 8 -259 0 10.0.10.116@o2ib7 1 up 158 8 8 8 8 6 0 10.0.10.108@o2ib7 1 up 37 8 8 -16 8 -376 0 10.0.10.53@o2ib7 1 up 87 8 8 -8 8 -410 0 --------------- sh-rtr-fir-2-2 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 3 up 58 8 6 -16 8 -237 0 10.0.10.110@o2ib7 1 up 12 8 8 8 8 6 0 10.0.10.102@o2ib7 1 up 56 8 8 -16 8 -297 0 10.0.10.107@o2ib7 3 up 84 8 6 -16 8 -229 0 10.0.10.52@o2ib7 1 up 132 8 8 -8 8 -14 0 10.0.10.104@o2ib7 4 up 95 8 6 -8 7 -217 72 10.0.10.54@o2ib7 1 up 130 8 8 -8 8 -72 0 10.0.10.101@o2ib7 1 up 99 8 8 -16 8 -250 0 10.0.10.106@o2ib7 2 up 58 8 7 -16 8 -308 0 10.0.10.51@o2ib7 1 up 141 8 8 -8 8 -353 0 10.0.10.103@o2ib7 2 up 35 8 7 -16 8 -313 0 10.0.10.116@o2ib7 1 up 15 8 8 8 8 6 0 10.0.10.108@o2ib7 2 up 37 8 7 -16 8 -267 0 10.0.10.53@o2ib7 1 up 117 8 8 -8 8 -323 0 --------------- sh-rtr-fir-2-3 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 2 up 42 8 7 -16 8 -212 0 10.0.10.110@o2ib7 1 up 107 8 8 8 8 6 0 10.0.10.102@o2ib7 2 up 59 8 7 -16 8 -338 0 10.0.10.107@o2ib7 2 up 91 8 7 -16 8 -215 0 10.0.10.52@o2ib7 1 up 1 8 8 -8 8 -17 0 10.0.10.104@o2ib7 2 up 42 8 7 -8 8 -226 0 10.0.10.54@o2ib7 1 up 76 8 8 -8 8 -86 0 10.0.10.101@o2ib7 1 up 55 8 8 -16 8 -248 0 10.0.10.106@o2ib7 1 up 21 8 8 -16 8 -358 0 10.0.10.51@o2ib7 1 up 136 8 8 -8 8 -424 0 10.0.10.103@o2ib7 2 up 13 8 7 -16 8 -259 0 10.0.10.116@o2ib7 1 up 107 8 8 8 8 6 0 10.0.10.108@o2ib7 1 up 24 8 8 -16 8 -396 0 10.0.10.53@o2ib7 1 up 157 8 8 -8 8 -549 0 --------------- sh-rtr-fir-2-4 --------------- nid refs state last max rtr min tx min queue 10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0 10.0.10.105@o2ib7 3 up 44 8 6 -16 8 -292 0 10.0.10.110@o2ib7 1 up 52 8 8 8 8 6 0 10.0.10.102@o2ib7 1 up 28 8 8 -16 8 -318 0 10.0.10.107@o2ib7 1 up 29 8 8 -16 8 -271 0 10.0.10.52@o2ib7 1 up 1 8 8 -8 8 -19 0 10.0.10.104@o2ib7 3 up 48 8 6 -16 8 -294 0 10.0.10.54@o2ib7 1 up 28 8 8 -8 8 -53 0 10.0.10.101@o2ib7 1 up 19 8 8 -16 8 -254 0 10.0.10.106@o2ib7 1 up 37 8 8 -16 8 -342 0 10.0.10.51@o2ib7 1 up 89 8 8 -8 8 -470 0 10.0.10.103@o2ib7 1 up 31 8 8 -16 8 -311 0 10.0.10.116@o2ib7 1 up 53 8 8 8 8 6 0 10.0.10.108@o2ib7 2 up 9 8 7 -16 8 -261 0 10.0.10.53@o2ib7 1 up 20 8 8 -8 8 -332 0 Example of trafic on sh-rtr-fir-2-1 (10.0.10.209@o2ib7) at 10:03AM this morning: [root@sh-rtr-fir-2-1 ~]# collectl -sx waiting for 1 second sample... #<-----------InfiniBand-----------> # KBIn PktIn KBOut PktOut Errs 377711 2493K 9336463 2488K 0 397428 2070K 7594350 2066K 0 545513 2403K 8775817 2395K 0 521123 2781K 10089K 2775K 0 527162 2831K 10251K 2824K 0 416422 2516K 9368186 2511K 0 691311 2339K 8384500 2330K 0 At about the same time: fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.209@o2ib7 added to recovery queue. Health = 900 fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 79 previous similar messages On the router sh-rtr-fir-2-1 itself, no new messages today (these messages are from yesterday). 
Occasional client timeout on the cluster fabric (o2ib4 here) is normal, I think:
Oct 19 13:26:59 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.106.54@o2ib4 (103): c: 7, oc: 0, rc: 8
Oct 19 13:27:49 sh-rtr-fir-2-1.int kernel: LNetError: 10498:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 13:27:56 sh-rtr-fir-2-1.int kernel: LNetError: 10501:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.105.12@o2ib4 (107): c: 7, oc: 0, rc: 8
Oct 19 17:34:53 sh-rtr-fir-2-1.int kernel: LNetError: 10496:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 17:35:07 sh-rtr-fir-2-1.int kernel: LNetError: 10495:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.107.3@o2ib4 (107): c: 8, oc: 0, rc: 8
Our setup:
|
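A couple of entries in the peer tables above show state "down" (for example 10.0.10.114@o2ib7 and the odd-looking 0.0.10.51@o2ib7); a quick way to flag those across all routers, reusing the same clush group as above:

    # List any peer NI reported as "down" on each router
    clush -w @rtr_fir_[1-2] -b "grep -w down /sys/kernel/debug/lnet/peers"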
| Comment by Luis Silva [ 03/Jan/20 ] |
|
We are having the same errors on our clients. Our setup uses only a single IB interface everywhere; we are not using multi-rail at all. My understanding is that this health check is designed for multi-rail setups. Is that correct? If so, is there any way I can tell Lustre to skip these health checks? Thanks in advance. -Luis |
| Comment by Amir Shehata (Inactive) [ 03/Jan/20 ] |
|
Hi Luis,
You can turn off health on your setup:
lnetctl set health_sensitivity 0
lnetctl set transaction_timeout 50   # or some value you'd like
lnetctl set retry_count 0
We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console. |
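If you want those settings to survive a reboot, one possible approach is sketched below; it assumes you re-apply the saved configuration with lnetctl import whenever LNet is started (how that import is hooked in depends on your lnet service setup):

    # Apply the settings, then save the running LNet configuration
    lnetctl set health_sensitivity 0
    lnetctl set transaction_timeout 50
    lnetctl set retry_count 0
    lnetctl export > /etc/lnet.conf
    # The saved YAML should now include these values in its "global:" section;
    # re-apply it after LNet comes up with:
    lnetctl import < /etc/lnet.conf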
| Comment by Olaf Faaland [ 30/Jan/20 ] |
Hi Amir, |
| Comment by Peter Jones [ 30/Jan/20 ] |
|
|
| Comment by Olaf Faaland [ 30/Jan/20 ] |
|
It looks to me like that covers "Msg is in inconsistent state, don't perform health checking", not "added to recovery queue". |
| Comment by Olaf Faaland [ 19/Feb/20 ] |
|
Hi Amir, Which 2.12.4 patch(es) reduce the severity of the "added to recovery queue" messages? Thanks |
| Comment by Amir Shehata (Inactive) [ 20/Feb/20 ] |
|
Hi Olaf, It doesn't look like this patch made it into 2.12.4. It still hasn't landed and needs one more review: https://review.whamcloud.com/#/c/37002/ The "inconsistent" message was reduced by commit f549927ea633b910a8c788fa970af742b3bf10c1. Thanks, Amir |
| Comment by Stephane Thiell [ 27/Feb/20 ] |
|
Hello, we would also appreciate a backport of this patch (which just landed on master) to b2_12, as the message is very verbose. Thanks!! |
| Comment by Olaf Faaland [ 27/Feb/20 ] |
|
Hi Stephane, They've pushed a backport for b2_12 to gerrit; it has been reviewed and has passed tests, and just needs to land: |
| Comment by Peter Jones [ 01/Apr/20 ] |
|
The https://review.whamcloud.com/#/c/37718/ fix landed on b2_12 some weeks back, so I think it should be OK to close this ticket. Any objections? |
| Comment by Olaf Faaland [ 10/Apr/20 ] |
|
Agreed |