Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Labels: None
- Affects Version/s: Lustre 2.12.3
- Environment: 2.12.3 RC1 (vanilla) on servers, CentOS 7.6, patched kernel; 2.12.0 + patches on clients
- Severity: 4
Description
After upgrading our servers on Fir (Sherlock's /scratch) to Lustre 2.12.3 RC1, we are noticing a lot of these messages on all Lustre servers:
LNetError: 49537:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.201@o2ib7 added to recovery queue. Health = 900
The NIDs reported belong to our Lustre routers, which are still running 2.12.0 + patches (they sit on the client clusters).
Attaching logs from all servers as lnet_recoveryq.log
This doesn't seem to have an impact on production and so far 2.12.3 RC1 has been just great for us (and we run it without additional patches now!). Thanks!
Stéphane
Attachments
- fir-io1-s1.log (932 kB)
- lnet_recoveryq.log (20 kB)
Issue Links
- is related to LU-13071 "LNet Health: reduce log severity" (Resolved)
Activity
Hi Olaf,
It doesn't look like this patch made it into 2.12.4. It still hasn't landed and needs an extra review:
https://review.whamcloud.com/#/c/37002/
The "inconsistent" message was reduced here:
f549927ea633b910a8c788fa970af742b3bf10c1 LU-11981 lnet: clean up error message
thanks
amir
Hi Amir,
What 2.12.4 patch(es) reduce the severity of the "added to recovery queue" messages?
thanks
It looks to me like the LU-11981 patch addresses only:
"Msg is in inconsistent state, don't perform health checking"
not
"added to recovery queue"
"We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console."
Hi Amir,
Which patch?
Thanks
Hi Luis,
You can turn off health on your setup:
lnetctl set health_sensitivity 0
lnetctl set transaction_timeout 50   # or some value you'd like
lnetctl set retry_count 0
We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.
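As a side note on the commands above: if you want those values to survive a reboot, one option is to export the running configuration after changing it. This is a minimal sketch that assumes the packaged lnet.service re-imports /etc/lnet.conf at startup (the usual setup); adjust the path if your service loads a different file, and verify the exported YAML actually contains the global health settings on your version.
# confirm the new values took effect (2.12-era lnetctl)
lnetctl global show
# dump the running LNet configuration so it is re-applied the next time LNet is configured
lnetctl export > /etc/lnet.conf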
We are having the same errors on our clients. Our setup only uses a single IB interface everywhere; we are not using multi-rail anywhere. My understanding is that this health check is designed for multi-rail setups. Is that correct? If so, is there any way I can tell Lustre to skip these health checks? Thanks in advance. -Luis
Thanks for checking, Amir! No, I just see a constant (slow) flow of these messages on the server side, but all our routers are up and running. Attached kernel logs of an OSS as fir-io1-s1.log. The traffic on the routers can be pretty high though (close to 8-10 GB/s per EDR/HDR100 router right now). We're not using multi-rail LNet. According to lnet/peers, we might have run out of rtr credits, but at the moment it seems OK. BTW, is there a way to reset the counters in /sys/kernel/debug/lnet/peers? That would be helpful.
# clush -w @rtr_fir_[1-2] -b "cat /sys/kernel/debug/lnet/peers | egrep '^nid|o2ib7'"
--------------- sh-rtr-fir-1-1 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 3 up 36 8 6 -63 8 -1210 0
10.0.10.110@o2ib7 1 up 26 8 8 8 8 6 0
10.0.10.102@o2ib7 4 up 32 8 5 -70 8 -1157 0
10.0.10.107@o2ib7 1 up 94 8 8 -45 8 -1249 0
10.0.10.52@o2ib7 1 up 168 8 8 -16 8 -1645 0
10.0.10.104@o2ib7 2 up 170 8 7 -51 8 -1195 0
10.0.10.54@o2ib7 1 up 57 8 8 -8 8 -39 0
10.0.10.101@o2ib7 3 up 52 8 6 -54 8 -1150 0
10.0.10.106@o2ib7 1 up 7 8 8 -43 8 -1221 0
10.0.10.51@o2ib7 1 up 13 8 8 -16 8 -3546 0
10.0.10.103@o2ib7 4 up 152 8 5 -70 8 -1244 0
10.0.10.116@o2ib7 1 up 33 8 8 8 8 6 0
10.0.10.108@o2ib7 1 up 12 8 8 -43 8 -1151 0
10.0.10.53@o2ib7 1 up 165 8 8 -8 8 -46 0
--------------- sh-rtr-fir-1-2 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 1 up 166 8 8 -8 8 -39 0
10.0.10.110@o2ib7 1 up 82 8 8 8 8 6 0
10.0.10.102@o2ib7 1 up 83 8 8 -8 8 -73 0
10.0.10.107@o2ib7 1 up 32 8 8 -8 8 -30 0
10.0.10.52@o2ib7 1 up 61 8 8 -8 8 -741 0
10.0.10.104@o2ib7 3 up 45 8 6 -8 8 -33 0
10.0.10.54@o2ib7 1 up 26 8 8 -8 8 -18 0
10.0.10.101@o2ib7 1 up 120 8 8 -8 8 -19 0
10.0.10.106@o2ib7 1 up 35 8 8 -8 8 -34 0
10.0.10.51@o2ib7 1 up 100 8 8 -8 8 -123 0
10.0.10.103@o2ib7 2 up 24 8 7 -8 8 -28 0
10.0.10.116@o2ib7 1 up 82 8 8 8 8 6 0
10.0.10.108@o2ib7 1 up 18 8 8 -8 8 -72 0
10.0.10.53@o2ib7 2 up 111 8 7 -8 8 -51 0
--------------- sh-rtr-fir-1-3 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 2 up 179 8 7 -549 8 -1016 0
10.0.10.110@o2ib7 1 up 177 8 8 8 8 6 0
10.0.10.102@o2ib7 2 up 6 8 7 -793 8 -807 0
10.0.10.107@o2ib7 3 up 60 8 6 -664 8 -808 0
10.0.10.52@o2ib7 1 up 96 8 8 -1151 8 -1394 0
10.0.10.104@o2ib7 4 up 138 8 6 -789 8 -1020 0
10.0.10.54@o2ib7 1 up 58 8 8 -8 8 -36 0
10.0.10.101@o2ib7 2 up 17 8 7 -826 8 -827 0
10.0.10.106@o2ib7 1 up 176 8 8 -773 8 -782 0
10.0.10.51@o2ib7 1 up 33 8 8 -8 8 -3907 0
10.0.10.103@o2ib7 1 up 170 8 8 -1383 8 -1052 0
10.0.10.116@o2ib7 1 up 3 8 8 8 8 6 0
10.0.10.108@o2ib7 2 up 154 8 7 -688 8 -787 0
10.0.10.53@o2ib7 1 up 144 8 8 -8 8 -23 0
--------------- sh-rtr-fir-1-4 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 175 8 8 8 8 6 0
10.0.10.105@o2ib7 4 up 127 8 5 -8 8 -59 0
10.0.10.110@o2ib7 1 up 49 8 8 8 8 6 0
10.0.10.102@o2ib7 3 up 153 8 6 -8 8 -91 0
10.0.10.107@o2ib7 1 up 98 8 8 -8 8 -28 0
10.0.10.52@o2ib7 1 up 83 8 8 -8 8 -1007 0
10.0.10.104@o2ib7 3 up 122 8 6 -8 8 -42 0
10.0.10.54@o2ib7 1 up 43 8 8 -8 8 -39 0
10.0.10.101@o2ib7 2 up 160 8 7 -8 8 -22 0
10.0.10.114@o2ib7 1 down 9999 8 8 8 8 7 0
10.0.10.106@o2ib7 2 up 115 8 7 -8 8 -50 0
10.0.10.51@o2ib7 1 up 8 8 8 -8 8 -252 0
10.0.10.103@o2ib7 3 up 120 8 6 -8 8 -59 0
10.0.10.116@o2ib7 1 up 0 8 8 8 8 6 0
10.0.10.108@o2ib7 2 up 80 8 7 -8 8 -89 0
10.0.10.53@o2ib7 1 up 126 8 8 -8 8 -42 0
--------------- sh-rtr-fir-2-1 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 3 up 43 8 6 -16 8 -243 0
10.0.10.110@o2ib7 1 up 157 8 8 8 8 6 0
10.0.10.102@o2ib7 1 up 6 8 8 -16 8 -289 0
10.0.10.107@o2ib7 2 up 62 8 8 -16 7 -261 560
10.0.10.52@o2ib7 1 up 81 8 8 -8 8 -23 0
10.0.10.104@o2ib7 1 up 9 8 8 -8 8 -197 0
10.0.10.54@o2ib7 1 up 73 8 8 -8 8 -75 0
10.0.10.101@o2ib7 1 up 48 8 8 -16 8 -228 0
10.0.10.106@o2ib7 1 up 34 8 8 -16 8 -284 0
10.0.10.51@o2ib7 1 up 157 8 8 -8 8 -685 0
0.0.10.51@o2ib7 1 down 9999 8 8 8 8 7 0
10.0.10.103@o2ib7 2 up 38 8 7 -16 8 -259 0
10.0.10.116@o2ib7 1 up 158 8 8 8 8 6 0
10.0.10.108@o2ib7 1 up 37 8 8 -16 8 -376 0
10.0.10.53@o2ib7 1 up 87 8 8 -8 8 -410 0
--------------- sh-rtr-fir-2-2 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 3 up 58 8 6 -16 8 -237 0
10.0.10.110@o2ib7 1 up 12 8 8 8 8 6 0
10.0.10.102@o2ib7 1 up 56 8 8 -16 8 -297 0
10.0.10.107@o2ib7 3 up 84 8 6 -16 8 -229 0
10.0.10.52@o2ib7 1 up 132 8 8 -8 8 -14 0
10.0.10.104@o2ib7 4 up 95 8 6 -8 7 -217 72
10.0.10.54@o2ib7 1 up 130 8 8 -8 8 -72 0
10.0.10.101@o2ib7 1 up 99 8 8 -16 8 -250 0
10.0.10.106@o2ib7 2 up 58 8 7 -16 8 -308 0
10.0.10.51@o2ib7 1 up 141 8 8 -8 8 -353 0
10.0.10.103@o2ib7 2 up 35 8 7 -16 8 -313 0
10.0.10.116@o2ib7 1 up 15 8 8 8 8 6 0
10.0.10.108@o2ib7 2 up 37 8 7 -16 8 -267 0
10.0.10.53@o2ib7 1 up 117 8 8 -8 8 -323 0
--------------- sh-rtr-fir-2-3 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 2 up 42 8 7 -16 8 -212 0
10.0.10.110@o2ib7 1 up 107 8 8 8 8 6 0
10.0.10.102@o2ib7 2 up 59 8 7 -16 8 -338 0
10.0.10.107@o2ib7 2 up 91 8 7 -16 8 -215 0
10.0.10.52@o2ib7 1 up 1 8 8 -8 8 -17 0
10.0.10.104@o2ib7 2 up 42 8 7 -8 8 -226 0
10.0.10.54@o2ib7 1 up 76 8 8 -8 8 -86 0
10.0.10.101@o2ib7 1 up 55 8 8 -16 8 -248 0
10.0.10.106@o2ib7 1 up 21 8 8 -16 8 -358 0
10.0.10.51@o2ib7 1 up 136 8 8 -8 8 -424 0
10.0.10.103@o2ib7 2 up 13 8 7 -16 8 -259 0
10.0.10.116@o2ib7 1 up 107 8 8 8 8 6 0
10.0.10.108@o2ib7 1 up 24 8 8 -16 8 -396 0
10.0.10.53@o2ib7 1 up 157 8 8 -8 8 -549 0
--------------- sh-rtr-fir-2-4 ---------------
nid refs state last max rtr min tx min queue
10.0.10.3@o2ib7 1 up 176 8 8 8 8 6 0
10.0.10.105@o2ib7 3 up 44 8 6 -16 8 -292 0
10.0.10.110@o2ib7 1 up 52 8 8 8 8 6 0
10.0.10.102@o2ib7 1 up 28 8 8 -16 8 -318 0
10.0.10.107@o2ib7 1 up 29 8 8 -16 8 -271 0
10.0.10.52@o2ib7 1 up 1 8 8 -8 8 -19 0
10.0.10.104@o2ib7 3 up 48 8 6 -16 8 -294 0
10.0.10.54@o2ib7 1 up 28 8 8 -8 8 -53 0
10.0.10.101@o2ib7 1 up 19 8 8 -16 8 -254 0
10.0.10.106@o2ib7 1 up 37 8 8 -16 8 -342 0
10.0.10.51@o2ib7 1 up 89 8 8 -8 8 -470 0
10.0.10.103@o2ib7 1 up 31 8 8 -16 8 -311 0
10.0.10.116@o2ib7 1 up 53 8 8 8 8 6 0
10.0.10.108@o2ib7 2 up 9 8 7 -16 8 -261 0
10.0.10.53@o2ib7 1 up 20 8 8 -8 8 -332 0
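As a quick way to spot the peers in that output that have dipped into negative credits, here is a minimal sketch run locally on a router; it assumes the ten-column layout shown above (nid refs state last max rtr min tx min queue).
# print peers whose minimum rtr credits (column 7) or minimum tx credits (column 9) went negative
awk 'NR > 1 && ($7 < 0 || $9 < 0)' /sys/kernel/debug/lnet/peers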
Example of traffic on sh-rtr-fir-2-1 (10.0.10.209@o2ib7) at 10:03 AM this morning:
[root@sh-rtr-fir-2-1 ~]# collectl -sx
waiting for 1 second sample...
#<-----------InfiniBand----------->
# KBIn PktIn KBOut PktOut Errs
377711 2493K 9336463 2488K 0
397428 2070K 7594350 2066K 0
545513 2403K 8775817 2395K 0
521123 2781K 10089K 2775K 0
527162 2831K 10251K 2824K 0
416422 2516K 9368186 2511K 0
691311 2339K 8384500 2330K 0
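For scale, 9336463 KBOut per second works out to roughly 9.3 GB/s (about 9.6 GB/s if collectl is counting KiB), which is consistent with the 8-10 GB/s per-router figure mentioned above.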
At about the same time:
fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.209@o2ib7 added to recovery queue. Health = 900
fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 79 previous similar messages
On the router sh-rtr-fir-2-1 itself, there are no new messages today (the messages below are from yesterday). The occasional client timeout on the cluster fabric (o2ib4 here) is normal, I think:
Oct 19 13:26:59 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.106.54@o2ib4 (103): c: 7, oc: 0, rc: 8
Oct 19 13:27:49 sh-rtr-fir-2-1.int kernel: LNetError: 10498:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 13:27:56 sh-rtr-fir-2-1.int kernel: LNetError: 10501:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.105.12@o2ib4 (107): c: 7, oc: 0, rc: 8
Oct 19 17:34:53 sh-rtr-fir-2-1.int kernel: LNetError: 10496:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 17:35:07 sh-rtr-fir-2-1.int kernel: LNetError: 10495:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.107.3@o2ib4 (107): c: 8, oc: 0, rc: 8
Our setup:
- sh-rtr-fir-1-[1-4] are our o2ib6 (Sherlock 1 cluster, FDR fabric) / o2ib7 (scratch storage, HDR100 fabric) routers (10.0.10.[201-204]@o2ib7)
- sh-rtr-fir-2-[1-4] are our o2ib4 (Sherlock 2 cluster, EDR fabric) / o2ib7 (scratch storage, HDR100 fabric) routers (10.0.10.[209-212]@o2ib7)
That's the health feature. When sending to a peer fails, the health value of that interface is adjusted. If there are multiple interfaces available, the healthiest one is selected. The interface is added to a recovery queue and pinged every second until it's fully healthy again.
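To watch this in practice, a hedged sketch of the relevant commands (the "Health = 900" in the console message is the peer NI health value, reduced from the maximum by the health_sensitivity decrement, 100 here; the -v verbosity levels below are what 2.12-era lnetctl accepts, though the exact level that prints the health stats may vary by version):
# health tunables currently in effect (health_sensitivity, retry_count, transaction_timeout, ...)
lnetctl global show
# local NI state, including health and recovery information
lnetctl net show -v 3
# per-peer NI health values
lnetctl peer show -v 3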
Are these particular routers down? Or did you see a burst of these messages early on and then they stopped?
Hello, we would also appreciate a backport of this patch (which just landed on master) to b2_12, as these messages are very verbose. Thanks!!