Lustre / LU-13368

lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.14.0
    • Affects Version: Lustre 2.12.3
    • Environment: CentOS 7.6
    • Severity: 3

    Description

      After removing a few lnet routes with lnetctl, we are constantly seeing these messages on all Lustre servers on Fir:

      Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      

      All of these messages come from routers that were removed from the lnet route configuration on the servers.

      We removed the routes using commands like:

      # clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
      

      The remaining active routes are:

      [root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
      route:
          - net: o2ib1
            gateway: 10.0.10.216@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.218@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.219@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.217@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.227@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.226@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.225@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.224@o2ib7
            hop: -1
            priority: 0
            state: up
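
The mismatch between the peers that reject connections and the gateways that are still configured can be checked mechanically. The following is a minimal sketch, assuming the kernel log and the `lnetctl route show -v` output have been saved as text and look like the excerpts quoted in this ticket. The function names and regular expressions are hypothetical helpers written for illustration, not part of Lustre or lnetctl.

```python
import re

# Matches the kiblnd_rejected() messages in the form quoted in this ticket.
REJECT_RE = re.compile(r"kiblnd_rejected\(\)\)\s+(\S+)\s+rejected:")
# Matches "gateway: <nid>" lines in the `lnetctl route show` output.
GATEWAY_RE = re.compile(r"^\s*gateway:\s*(\S+)\s*$", re.MULTILINE)


def rejected_nids(log_text):
    """Return the set of NIDs that appear in kiblnd_rejected() messages."""
    return set(REJECT_RE.findall(log_text))


def configured_gateways(route_show_text):
    """Return the set of gateway NIDs listed in the route configuration."""
    return set(GATEWAY_RE.findall(route_show_text))


def stale_peers(log_text, route_show_text):
    """Return NIDs that reject connections but are not configured gateways."""
    return sorted(rejected_nids(log_text) - configured_gateways(route_show_text))


if __name__ == "__main__":
    kern_log = (
        "Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:"
        "(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 "
        "rejected: consumer defined fatal error\n"
    )
    routes = "route:\n    - net: o2ib1\n      gateway: 10.0.10.216@o2ib7\n"
    print(stale_peers(kern_log, routes))
```

Any NID this reports is still being contacted even though no route references it, which is the symptom described here.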
      

      Why is lnet trying to use the old routers?

      lctl dk shows:

      00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
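
For reference, the `kiblnd_connreq_done()` line can be decoded with a short script. This is a sketch only: the field layout in the regular expression is inferred from the two lines quoted above rather than from a format specification, and `parse_connreq_done` is a hypothetical helper name. On Linux, errno 111 is ECONNREFUSED, which is how the script would label `status(-111)`.

```python
import errno
import re
from datetime import datetime, timezone

# Layout assumed from the debug lines quoted in this ticket:
# two 8-digit hex masks, a CPU field, then "seconds.microseconds".
CONNREQ_RE = re.compile(
    r"^\s*[0-9a-f]{8}:[0-9a-f]{8}:[^:]*:(?P<sec>\d+)\.(?P<usec>\d{6}):"
    r".*kiblnd_connreq_done\(\)\)\s+(?P<nid>\S+):\s+"
    r"active\((?P<active>\d+)\),\s+version\((?P<version>\d+)\),\s+"
    r"status\((?P<status>-?\d+)\)"
)


def parse_connreq_done(line):
    """Extract peer NID, status, and timestamp from one debug-log line.

    Returns None if the line does not match the assumed layout.
    """
    m = CONNREQ_RE.match(line)
    if m is None:
        return None
    status = int(m.group("status"))
    when = datetime.fromtimestamp(int(m.group("sec")), tz=timezone.utc)
    when = when.replace(microsecond=int(m.group("usec")))
    return {
        "nid": m.group("nid"),
        "status": status,
        # Map the negative kernel return code to a symbolic errno name.
        "errno_name": errno.errorcode.get(-status, "unknown"),
        "time_utc": when,
    }


if __name__ == "__main__":
    sample = (
        "00000800:00000200:16.0:1584479347.555118:0:4883:0:"
        "(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: "
        "active(1), version(12), status(-111)"
    )
    print(parse_connreq_done(sample))
```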
      

      I'm attaching a dk with +net as fir-io7-s1_dk.log.gz

      I am also attaching the kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.


      Attachments

        1. fir-io7-s1_dk.log.gz
          8.86 MB
        2. fir-io7-s1_kern.log
          1.33 MB
        3. fir-io7-s1_lnetctl_stats.txt
          0.6 kB
        4. fir-io7-s1_sysrq_t_LU-13368.txt
          4.37 MB

        Issue Links

          Activity

            [LU-13368] lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error
            adilger Andreas Dilger made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
            adilger Andreas Dilger added a comment - - edited

            simmonsja, please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference.

            Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.

            simmonsja James A Simmons made changes -
            Assignee Original: Amir Shehata [ ashehata ] New: Serguei Smirnov [ ssmirnov ]
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to DDN-2213 [ DDN-2213 ]
            hornc Chris Horn added a comment - - edited

            I suspect this patch may be responsible for the kernel panics we're seeing in LU-15125.

            Edit: never mind, found the regression in https://review.whamcloud.com/#/c/43419/

            ys Yang Sheng made changes -
            Link Original: This issue is related to DDN-966 [ DDN-966 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to EX-2590 [ EX-2590 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13638 [ LU-13638 ]

            adilger Andreas Dilger added a comment

            This caused problems with shutdown, see LU-14499.

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14499 [ LU-14499 ]

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell