Lustre / LU-13368

lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.3
    • Environment: CentOS 7.6
    • Severity: 3

    Description

      After removing a few LNet routes with lnetctl, we keep seeing the following messages on all Lustre servers on Fir:

      Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      

      All of the NIDs in these messages belong to routers that we removed from the LNet route configuration on the servers.
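
To see at a glance which peers are involved, the messages above can be tallied per NID. The following is only a minimal sketch: the function name `count_rejected_nids` is made up for illustration, and it assumes the log lines have exactly the `kiblnd_rejected()) <nid> rejected:` shape shown in the excerpt.

```shell
# Hypothetical helper: count kiblnd_rejected() messages per peer NID.
# Reads kernel log lines on stdin and prints "<count> <nid>", sorted by NID.
count_rejected_nids() {
    # Keep only the NID that follows "kiblnd_rejected()) " on matching lines.
    sed -n 's/.*kiblnd_rejected()) \([^ ]*\) rejected:.*/\1/p' |
        LC_ALL=C sort | uniq -c | awk '{ print $1, $2 }'
}
```

It could be run against the attached kernel log, for example `count_rejected_nids < fir-io7-s1_kern.log`.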

      We removed the routes using commands like:

      # clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
      

      The remaining active routes are:

      [root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
      route:
          - net: o2ib1
            gateway: 10.0.10.216@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.218@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.219@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.217@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.227@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.226@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.225@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.224@o2ib7
            hop: -1
            priority: 0
            state: up
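
The claim that the rejecting peers are no longer configured as gateways can be cross-checked mechanically. The sketch below is illustrative only: `stale_rejecting_nids` is a hypothetical helper, it expects the `lnetctl route show` output to have been saved to a file first, and it assumes the `gateway:` and `kiblnd_rejected()` line formats shown in this ticket.

```shell
# Hypothetical helper: list NIDs seen in kiblnd_rejected() messages that are
# NOT gateways in a saved copy of the `lnetctl route show` output.
#   $1: file containing `lnetctl route show` output
#   $2: kernel log file
stale_rejecting_nids() {
    awk '
        # First file: remember every configured gateway NID.
        FILENAME == ARGV[1] && $1 == "gateway:" { active[$2] = 1; next }
        # Second file: the NID is the field just before "rejected:".
        FILENAME == ARGV[2] && /kiblnd_rejected\(\)\)/ {
            for (i = 2; i <= NF; i++)
                if ($i == "rejected:" && !($(i-1) in active) && !seen[$(i-1)]++)
                    print $(i-1)
        }
    ' "$1" "$2"
}
```

For example, after `lnetctl route show > routes.txt` (the file name is arbitrary), `stale_rejecting_nids routes.txt fir-io7-s1_kern.log` would print each rejecting NID that is absent from the current route list, once, in order of first appearance.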
      

      Why is lnet trying to use the old routers?

      lctl dk shows:

      00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
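
On Linux, errno 111 is ECONNREFUSED, which is consistent with the peer rejecting the connection request. As a rough aid for reading larger dk dumps, the sketch below pulls the NID and status out of `kiblnd_connreq_done()` lines and names a few common connection-related errno values. The helper names are hypothetical, the errno table is deliberately partial, and the line format is assumed to match the one above.

```shell
# Hypothetical helper: map a (negative) Linux errno, as printed in
# status(...), to its symbolic name. Only a few values are covered.
errno_name() {
    case "${1#-}" in
        0)   echo OK ;;
        103) echo ECONNABORTED ;;
        104) echo ECONNRESET ;;
        110) echo ETIMEDOUT ;;
        111) echo ECONNREFUSED ;;
        113) echo EHOSTUNREACH ;;
        *)   echo "errno ${1#-}" ;;
    esac
}

# Hypothetical helper: print "<nid> <status> <name>" for each
# kiblnd_connreq_done() line read from stdin.
connreq_status() {
    sed -n 's/.*kiblnd_connreq_done()) \([^ ]*\): .*status(\(-\{0,1\}[0-9]*\)).*/\1 \2/p' |
    while read -r nid status; do
        echo "$nid $status $(errno_name "$status")"
    done
}
```

A typical use would be to decompress the attached dump and pipe it through, for example `zcat fir-io7-s1_dk.log.gz | connreq_status`.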
      

      I'm attaching a debug log (lctl dk) captured with +net as fir-io7-s1_dk.log.gz.

      Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.

       

       

      Attachments

        1. fir-io7-s1_dk.log.gz
          8.86 MB
          Stephane Thiell
        2. fir-io7-s1_kern.log
          1.33 MB
          Stephane Thiell
        3. fir-io7-s1_lnetctl_stats.txt
          0.6 kB
          Stephane Thiell
        4. fir-io7-s1_sysrq_t_LU-13368.txt
          4.37 MB
          Stephane Thiell

            People

              Assignee: Amir Shehata (Inactive)
              Reporter: Stephane Thiell