Lustre / LU-12886

A lot of LNetError: lnet_peer_ni_add_to_recoveryq_locked() messages

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.12.3
    • 2.12.3 RC1 (vanilla) on servers, CentOS 7.6, patched kernel; 2.12.0 + patches on clients
    • 4
    • 9223372036854775807

    Description

      After upgrading our servers on Fir (Sherlock's /scratch) to Lustre 2.12.3 RC1, we are noticing a lot of these messages on all Lustre servers:

      LNetError: 49537:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.201@o2ib7 added to recovery queue. Health = 900
      

      The NIDs reported belong to our Lustre routers, which are still running 2.12.0 + patches (they are on the client clusters).

      Attaching logs from all servers as lnet_recoveryq.log

      This doesn't seem to have an impact on production and so far 2.12.3 RC1 has been just great for us (and we run it without additional patches now!). Thanks!

      Stéphane

      Attachments

        Issue Links

          Activity


            Hello, we would also appreciate a backport of this patch (which just landed on master) to b2_12, as these messages are very verbose. Thanks!!

            sthiell Stephane Thiell added a comment -

            Hi Olaf,

            It doesn't look like this patch made it into 2.12.4; it still hasn't landed and needs an extra review:

            https://review.whamcloud.com/#/c/37002/

            The "inconsistent" message was reduced here:

            f549927ea633b910a8c788fa970af742b3bf10c1 LU-11981 lnet: clean up error message

            thanks

            amir

            ashehata Amir Shehata (Inactive) added a comment -
            ofaaland Olaf Faaland added a comment -

            Hi Amir,

            What 2.12.4 patch(es) reduce the severity of the "added to recovery queue" messages?

            thanks


            It looks to me like LU-11981 patch addresses only:

            "Msg is in inconsistent state, don't perform health checking"

            not

            "added to recovery queue"

            ofaaland Olaf Faaland added a comment -
            pjones Peter Jones added a comment -

            LU-11981 I assume. It will be in 2.12.4


            We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.

            Hi Amir,
            Which patch?
            Thanks

            ofaaland Olaf Faaland added a comment -

            Hi Luis,

            You can turn off health checking on your setup:

            lnetctl set health_sensitivity 0
            lnetctl set transaction_timeout 50 # or some value you'd like 
            lnetctl set retry_count 0

            We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.
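
            For persistence across reboots, the same three settings can also be expressed as a global-settings fragment of an LNet YAML configuration. This is a sketch, assuming the file is loaded at boot with `lnetctl import` (e.g. from /etc/lnet.conf):

```yaml
# Sketch of an /etc/lnet.conf fragment equivalent to the lnetctl
# commands above (assumes it is loaded with "lnetctl import").
global:
    health_sensitivity: 0     # disable health-based handling
    transaction_timeout: 50   # or some value you'd like
    retry_count: 0            # no transmit retries
```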

            ashehata Amir Shehata (Inactive) added a comment -

            We are having the same errors on our clients. Our setup only uses a single IB interface everywhere. We are not using multi-rail anywhere. My understanding is that this health check is designed for multi-rail setups? Is that correct? If so, is there any way I can tell Lustre to skip these health checks? Thanks in advance. -Luis

            Luis_silva Luis Silva (Inactive) added a comment -

            Thanks for checking Amir! No, I just see a constant (slow) flow of these messages on the server side, but all our routers are up and running. Attached kernel logs of an OSS as fir-io1-s1.log. The traffic on the routers can be pretty high though (close to 8-10 GB/s per EDR/HDR100 router right now). We're not using multi-rail LNet.

            According to lnet/peers, we might have run out of rtr credits, but at the moment it seems OK. BTW, is there a way to reset the counters of /sys/kernel/debug/lnet/peers? That would be helpful.

            # clush -w @rtr_fir_[1-2] -b "cat /sys/kernel/debug/lnet/peers  | egrep '^nid|o2ib7'"
            ---------------
            sh-rtr-fir-1-1
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           3    up    36     8     6   -63     8 -1210 0
            10.0.10.110@o2ib7           1    up    26     8     8     8     8     6 0
            10.0.10.102@o2ib7           4    up    32     8     5   -70     8 -1157 0
            10.0.10.107@o2ib7           1    up    94     8     8   -45     8 -1249 0
            10.0.10.52@o2ib7            1    up   168     8     8   -16     8 -1645 0
            10.0.10.104@o2ib7           2    up   170     8     7   -51     8 -1195 0
            10.0.10.54@o2ib7            1    up    57     8     8    -8     8   -39 0
            10.0.10.101@o2ib7           3    up    52     8     6   -54     8 -1150 0
            10.0.10.106@o2ib7           1    up     7     8     8   -43     8 -1221 0
            10.0.10.51@o2ib7            1    up    13     8     8   -16     8 -3546 0
            10.0.10.103@o2ib7           4    up   152     8     5   -70     8 -1244 0
            10.0.10.116@o2ib7           1    up    33     8     8     8     8     6 0
            10.0.10.108@o2ib7           1    up    12     8     8   -43     8 -1151 0
            10.0.10.53@o2ib7            1    up   165     8     8    -8     8   -46 0
            ---------------
            sh-rtr-fir-1-2
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           1    up   166     8     8    -8     8   -39 0
            10.0.10.110@o2ib7           1    up    82     8     8     8     8     6 0
            10.0.10.102@o2ib7           1    up    83     8     8    -8     8   -73 0
            10.0.10.107@o2ib7           1    up    32     8     8    -8     8   -30 0
            10.0.10.52@o2ib7            1    up    61     8     8    -8     8  -741 0
            10.0.10.104@o2ib7           3    up    45     8     6    -8     8   -33 0
            10.0.10.54@o2ib7            1    up    26     8     8    -8     8   -18 0
            10.0.10.101@o2ib7           1    up   120     8     8    -8     8   -19 0
            10.0.10.106@o2ib7           1    up    35     8     8    -8     8   -34 0
            10.0.10.51@o2ib7            1    up   100     8     8    -8     8  -123 0
            10.0.10.103@o2ib7           2    up    24     8     7    -8     8   -28 0
            10.0.10.116@o2ib7           1    up    82     8     8     8     8     6 0
            10.0.10.108@o2ib7           1    up    18     8     8    -8     8   -72 0
            10.0.10.53@o2ib7            2    up   111     8     7    -8     8   -51 0
            ---------------
            sh-rtr-fir-1-3
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           2    up   179     8     7  -549     8 -1016 0
            10.0.10.110@o2ib7           1    up   177     8     8     8     8     6 0
            10.0.10.102@o2ib7           2    up     6     8     7  -793     8  -807 0
            10.0.10.107@o2ib7           3    up    60     8     6  -664     8  -808 0
            10.0.10.52@o2ib7            1    up    96     8     8 -1151     8 -1394 0
            10.0.10.104@o2ib7           4    up   138     8     6  -789     8 -1020 0
            10.0.10.54@o2ib7            1    up    58     8     8    -8     8   -36 0
            10.0.10.101@o2ib7           2    up    17     8     7  -826     8  -827 0
            10.0.10.106@o2ib7           1    up   176     8     8  -773     8  -782 0
            10.0.10.51@o2ib7            1    up    33     8     8    -8     8 -3907 0
            10.0.10.103@o2ib7           1    up   170     8     8 -1383     8 -1052 0
            10.0.10.116@o2ib7           1    up     3     8     8     8     8     6 0
            10.0.10.108@o2ib7           2    up   154     8     7  -688     8  -787 0
            10.0.10.53@o2ib7            1    up   144     8     8    -8     8   -23 0
            ---------------
            sh-rtr-fir-1-4
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   175     8     8     8     8     6 0
            10.0.10.105@o2ib7           4    up   127     8     5    -8     8   -59 0
            10.0.10.110@o2ib7           1    up    49     8     8     8     8     6 0
            10.0.10.102@o2ib7           3    up   153     8     6    -8     8   -91 0
            10.0.10.107@o2ib7           1    up    98     8     8    -8     8   -28 0
            10.0.10.52@o2ib7            1    up    83     8     8    -8     8 -1007 0
            10.0.10.104@o2ib7           3    up   122     8     6    -8     8   -42 0
            10.0.10.54@o2ib7            1    up    43     8     8    -8     8   -39 0
            10.0.10.101@o2ib7           2    up   160     8     7    -8     8   -22 0
            10.0.10.114@o2ib7           1  down  9999     8     8     8     8     7 0
            10.0.10.106@o2ib7           2    up   115     8     7    -8     8   -50 0
            10.0.10.51@o2ib7            1    up     8     8     8    -8     8  -252 0
            10.0.10.103@o2ib7           3    up   120     8     6    -8     8   -59 0
            10.0.10.116@o2ib7           1    up     0     8     8     8     8     6 0
            10.0.10.108@o2ib7           2    up    80     8     7    -8     8   -89 0
            10.0.10.53@o2ib7            1    up   126     8     8    -8     8   -42 0
            ---------------
            sh-rtr-fir-2-1
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           3    up    43     8     6   -16     8  -243 0
            10.0.10.110@o2ib7           1    up   157     8     8     8     8     6 0
            10.0.10.102@o2ib7           1    up     6     8     8   -16     8  -289 0
            10.0.10.107@o2ib7           2    up    62     8     8   -16     7  -261 560
            10.0.10.52@o2ib7            1    up    81     8     8    -8     8   -23 0
            10.0.10.104@o2ib7           1    up     9     8     8    -8     8  -197 0
            10.0.10.54@o2ib7            1    up    73     8     8    -8     8   -75 0
            10.0.10.101@o2ib7           1    up    48     8     8   -16     8  -228 0
            10.0.10.106@o2ib7           1    up    34     8     8   -16     8  -284 0
            10.0.10.51@o2ib7            1    up   157     8     8    -8     8  -685 0
            0.0.10.51@o2ib7             1  down  9999     8     8     8     8     7 0
            10.0.10.103@o2ib7           2    up    38     8     7   -16     8  -259 0
            10.0.10.116@o2ib7           1    up   158     8     8     8     8     6 0
            10.0.10.108@o2ib7           1    up    37     8     8   -16     8  -376 0
            10.0.10.53@o2ib7            1    up    87     8     8    -8     8  -410 0
            ---------------
            sh-rtr-fir-2-2
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           3    up    58     8     6   -16     8  -237 0
            10.0.10.110@o2ib7           1    up    12     8     8     8     8     6 0
            10.0.10.102@o2ib7           1    up    56     8     8   -16     8  -297 0
            10.0.10.107@o2ib7           3    up    84     8     6   -16     8  -229 0
            10.0.10.52@o2ib7            1    up   132     8     8    -8     8   -14 0
            10.0.10.104@o2ib7           4    up    95     8     6    -8     7  -217 72
            10.0.10.54@o2ib7            1    up   130     8     8    -8     8   -72 0
            10.0.10.101@o2ib7           1    up    99     8     8   -16     8  -250 0
            10.0.10.106@o2ib7           2    up    58     8     7   -16     8  -308 0
            10.0.10.51@o2ib7            1    up   141     8     8    -8     8  -353 0
            10.0.10.103@o2ib7           2    up    35     8     7   -16     8  -313 0
            10.0.10.116@o2ib7           1    up    15     8     8     8     8     6 0
            10.0.10.108@o2ib7           2    up    37     8     7   -16     8  -267 0
            10.0.10.53@o2ib7            1    up   117     8     8    -8     8  -323 0
            ---------------
            sh-rtr-fir-2-3
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           2    up    42     8     7   -16     8  -212 0
            10.0.10.110@o2ib7           1    up   107     8     8     8     8     6 0
            10.0.10.102@o2ib7           2    up    59     8     7   -16     8  -338 0
            10.0.10.107@o2ib7           2    up    91     8     7   -16     8  -215 0
            10.0.10.52@o2ib7            1    up     1     8     8    -8     8   -17 0
            10.0.10.104@o2ib7           2    up    42     8     7    -8     8  -226 0
            10.0.10.54@o2ib7            1    up    76     8     8    -8     8   -86 0
            10.0.10.101@o2ib7           1    up    55     8     8   -16     8  -248 0
            10.0.10.106@o2ib7           1    up    21     8     8   -16     8  -358 0
            10.0.10.51@o2ib7            1    up   136     8     8    -8     8  -424 0
            10.0.10.103@o2ib7           2    up    13     8     7   -16     8  -259 0
            10.0.10.116@o2ib7           1    up   107     8     8     8     8     6 0
            10.0.10.108@o2ib7           1    up    24     8     8   -16     8  -396 0
            10.0.10.53@o2ib7            1    up   157     8     8    -8     8  -549 0
            ---------------
            sh-rtr-fir-2-4
            ---------------
            nid                      refs state  last   max   rtr   min    tx   min queue
            10.0.10.3@o2ib7             1    up   176     8     8     8     8     6 0
            10.0.10.105@o2ib7           3    up    44     8     6   -16     8  -292 0
            10.0.10.110@o2ib7           1    up    52     8     8     8     8     6 0
            10.0.10.102@o2ib7           1    up    28     8     8   -16     8  -318 0
            10.0.10.107@o2ib7           1    up    29     8     8   -16     8  -271 0
            10.0.10.52@o2ib7            1    up     1     8     8    -8     8   -19 0
            10.0.10.104@o2ib7           3    up    48     8     6   -16     8  -294 0
            10.0.10.54@o2ib7            1    up    28     8     8    -8     8   -53 0
            10.0.10.101@o2ib7           1    up    19     8     8   -16     8  -254 0
            10.0.10.106@o2ib7           1    up    37     8     8   -16     8  -342 0
            10.0.10.51@o2ib7            1    up    89     8     8    -8     8  -470 0
            10.0.10.103@o2ib7           1    up    31     8     8   -16     8  -311 0
            10.0.10.116@o2ib7           1    up    53     8     8     8     8     6 0
            10.0.10.108@o2ib7           2    up     9     8     7   -16     8  -261 0
            10.0.10.53@o2ib7            1    up    20     8     8    -8     8  -332 0
            

            Example of traffic on sh-rtr-fir-2-1 (10.0.10.209@o2ib7) at 10:03 AM this morning:

            [root@sh-rtr-fir-2-1 ~]# collectl -sx
            waiting for 1 second sample...
            #<-----------InfiniBand----------->
            #   KBIn  PktIn   KBOut PktOut Errs 
              377711  2493K 9336463  2488K    0 
              397428  2070K 7594350  2066K    0 
              545513  2403K 8775817  2395K    0 
              521123  2781K  10089K  2775K    0 
              527162  2831K  10251K  2824K    0 
              416422  2516K 9368186  2511K    0 
              691311  2339K 8384500  2330K    0 
            

            At about the same time:

            fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.209@o2ib7 added to recovery queue. Health = 900
            fir-io3-s2: Oct 20 10:00:17 fir-io3-s2 kernel: LNetError: 61521:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 79 previous similar messages
            

            On the router sh-rtr-fir-2-1 itself, there are no new messages today (these messages are from yesterday). An occasional client timeout on the cluster fabric (o2ib4 here) is normal, I think:

            Oct 19 13:26:59 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.106.54@o2ib4 (103): c: 7, oc: 0, rc: 8
            Oct 19 13:27:49 sh-rtr-fir-2-1.int kernel: LNetError: 10498:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
            Oct 19 13:27:56 sh-rtr-fir-2-1.int kernel: LNetError: 10501:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
            Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            Oct 19 17:34:37 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.105.12@o2ib4 (107): c: 7, oc: 0, rc: 8
            Oct 19 17:34:53 sh-rtr-fir-2-1.int kernel: LNetError: 10496:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
            Oct 19 17:35:07 sh-rtr-fir-2-1.int kernel: LNetError: 10495:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
            Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            Oct 19 21:28:39 sh-rtr-fir-2-1.int kernel: LNetError: 10492:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.9.107.3@o2ib4 (107): c: 8, oc: 0, rc: 8
            

            Our setup:

            • sh-rtr-fir-1-[1-4] are our o2ib6 (Sherlock 1 cluster, FDR fabric) / o2ib7 (scratch storage, HDR100 fabric) routers (10.0.10.[201-204]@o2ib7)
            • sh-rtr-fir-2-[1-4] are our o2ib4 (Sherlock 2 cluster, EDR fabric) / o2ib7 (scratch storage, HDR100 fabric) routers (10.0.10.[209-212]@o2ib7)
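
            As a quick way to spot the rtr/tx credit exhaustion mentioned above, the peers table can be scanned for negative minimum credit values. A minimal sketch, assuming the 10-column layout shown above (nid, refs, state, last, max, rtr, min, tx, min, queue); `flag_credit_exhaustion` is a hypothetical helper name, not a Lustre tool:

```python
# Hypothetical triage helper: scan "cat /sys/kernel/debug/lnet/peers"
# output and flag peers whose minimum rtr or tx credits went negative,
# i.e. messages had to be queued because credits were exhausted.
# Assumes the 10-column layout shown above:
#   nid refs state last max rtr min tx min queue

def flag_credit_exhaustion(peers_text):
    flagged = []
    for line in peers_text.splitlines():
        fields = line.split()
        # Skip the header, separator lines and hostname lines.
        if len(fields) != 10 or fields[0] == "nid":
            continue
        nid = fields[0]
        rtr_min, tx_min = int(fields[6]), int(fields[8])
        if rtr_min < 0 or tx_min < 0:
            flagged.append((nid, rtr_min, tx_min))
    return flagged
```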
            sthiell Stephane Thiell added a comment -

            That's the health feature. When sending to a peer fails, the health of the interface is adjusted. If there are multiple interfaces available, the healthiest one is selected. The interface is also added to a recovery queue and pinged once per second until it is fully healthy.

            Are these particular routers down? Or did you see a burst of these messages early on and then they stopped?
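
            For illustration only (this is a toy model, not the Lustre source), the mechanism can be sketched as follows, assuming a maximum health value of 1000 and a fixed decrement of 100 per failure, which would match the "Health = 900" seen in the log after a single failed send:

```python
# Illustrative model (assumed constants, not the Lustre implementation)
# of LNet peer-NI health: a failed send decrements the health value,
# the peer NI is placed on a recovery queue, and a periodic (~1 s)
# recovery pass pings it until health climbs back to the maximum.

MAX_HEALTH = 1000        # assumed maximum health value
SENSITIVITY = 100        # assumed health lost/regained per event

class PeerNI:
    def __init__(self, nid):
        self.nid = nid
        self.health = MAX_HEALTH

def handle_send_failure(pni, recovery_queue):
    """Decrement health and queue the peer NI for recovery."""
    pni.health = max(0, pni.health - SENSITIVITY)
    if pni not in recovery_queue:
        recovery_queue.append(pni)
        # Roughly where lnet_peer_ni_add_to_recoveryq_locked() logs
        # "lpni ... added to recovery queue. Health = <value>".
        print(f"lpni {pni.nid} added to recovery queue. "
              f"Health = {pni.health}")

def recovery_tick(recovery_queue, ping_ok=True):
    """One pass of the recovery timer: ping each queued peer NI."""
    for pni in list(recovery_queue):
        if ping_ok:
            pni.health = min(MAX_HEALTH, pni.health + SENSITIVITY)
        if pni.health == MAX_HEALTH:
            recovery_queue.remove(pni)   # fully healthy: leave the queue
```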

            ashehata Amir Shehata (Inactive) added a comment -
            pjones Peter Jones added a comment -

            Amir

            Could you please advise

            Thanks

            Peter


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 1
              Watchers: 6
