Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13368

lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.12.3
    • CentOS 7.6
    • 3
    • 9223372036854775807

    Description

      After having removed a few lnet routes using lnetctl, we are seeing these constant messages on all Lustre servers on Fir:

      Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      

      All coming from removed routers from the lnet route config on the servers.

      We removed the routes using commands like:

      # clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
      

      The remaining active routes are:

      [root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
      route:
          - net: o2ib1
            gateway: 10.0.10.216@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.218@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.219@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.217@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.227@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.226@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.225@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.224@o2ib7
            hop: -1
            priority: 0
            state: up
      

      Why is lnet trying to use the old routers?

      lctl dk shows:

      00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
      

      I'm attaching a dk with +net as fir-io7-s1_dk.log.gz

      Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt .

       

       

      Attachments

        1. fir-io7-s1_dk.log.gz
          8.86 MB
        2. fir-io7-s1_kern.log
          1.33 MB
        3. fir-io7-s1_lnetctl_stats.txt
          0.6 kB
        4. fir-io7-s1_sysrq_t_LU-13368.txt
          4.37 MB

        Issue Links

          Activity

            [LU-13368] lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error
            adilger Andreas Dilger added a comment - - edited

            simmonsja, please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference.

            Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.

            adilger Andreas Dilger added a comment - - edited simmonsja , please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference. Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.
            hornc Chris Horn added a comment - - edited

            I suspect this patch may responsible for the kernel panics we're seeing in LU-15125

            Edit - nevermind, found the regression in https://review.whamcloud.com/#/c/43419/

            hornc Chris Horn added a comment - - edited I suspect this patch may responsible for the kernel panics we're seeing in LU-15125 Edit - nevermind, found the regression in https://review.whamcloud.com/#/c/43419/

            This caused problems with shutdown, see LU-14499.

            adilger Andreas Dilger added a comment - This caused problems with shutdown, see LU-14499 .
            pjones Peter Jones added a comment -

            Landed for 2.14

            pjones Peter Jones added a comment - Landed for 2.14

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: babf0232273467b7199ec9a7c36047b1968913df

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/ Subject: LU-13368 lnet: discard the callback Project: fs/lustre-release Branch: master Current Patch Set: Commit: babf0232273467b7199ec9a7c36047b1968913df

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a1b70741d3977cb5359b5b6797041477d5392aec

            gerrit Gerrit Updater added a comment - Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845 Subject: LU-13368 lnet: discard the callback Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a1b70741d3977cb5359b5b6797041477d5392aec

            I could see this when I dumped the tasks:

            [Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:^M
            [Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl         R  running task        0 11793  11738 0x00000080^M
            [Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:^M
            [Tue Mar 17 18:02:51 2020][8592655.547647]  [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]^M
            [Tue Mar 17 18:02:51 2020][8592655.547650]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f^M
            [Tue Mar 17 18:02:51 2020][8592655.547658]  [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]^M
            [Tue Mar 17 18:02:51 2020][8592655.547660]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f^M
            [Tue Mar 17 18:02:51 2020][8592655.547668]  [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]^M
            [Tue Mar 17 18:02:51 2020][8592655.547678]  [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]^M
            [Tue Mar 17 18:02:51 2020][8592655.547681]  [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230^M
            [Tue Mar 17 18:02:51 2020][8592655.547689]  [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]^M
            [Tue Mar 17 18:02:51 2020][8592655.547696]  [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]^M
            [Tue Mar 17 18:02:51 2020][8592655.547702]  [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]^M
            [Tue Mar 17 18:02:51 2020][8592655.547705]  [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0^M
            [Tue Mar 17 18:02:51 2020][8592655.547712]  [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]^M
            [Tue Mar 17 18:02:51 2020][8592655.547714]  [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0^M
            [Tue Mar 17 18:02:51 2020][8592655.547715]  [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280^M
            [Tue Mar 17 18:02:51 2020][8592655.547717]  [<ffffffffb8577f75>] ? auditsys+0x14/0x46^M
            

            I'm also attaching the full dump of the tasks when lnetctl show peer was taking 100% cpu as fir-io7-s1_sysrq_t_LU-13368.txt
            This morning it looks like the command has completed but I didn't run it in a screen and I lost the output as I'm working from home. I'll try again and let you know if the command does eventually complete. Thx!

            sthiell Stephane Thiell added a comment - I could see this when I dumped the tasks: [Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:^M [Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl R running task 0 11793 11738 0x00000080^M [Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:^M [Tue Mar 17 18:02:51 2020][8592655.547647] [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]^M [Tue Mar 17 18:02:51 2020][8592655.547650] [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f^M [Tue Mar 17 18:02:51 2020][8592655.547658] [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]^M [Tue Mar 17 18:02:51 2020][8592655.547660] [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f^M [Tue Mar 17 18:02:51 2020][8592655.547668] [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]^M [Tue Mar 17 18:02:51 2020][8592655.547678] [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]^M [Tue Mar 17 18:02:51 2020][8592655.547681] [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230^M [Tue Mar 17 18:02:51 2020][8592655.547689] [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]^M [Tue Mar 17 18:02:51 2020][8592655.547696] [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]^M [Tue Mar 17 18:02:51 2020][8592655.547702] [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]^M [Tue Mar 17 18:02:51 2020][8592655.547705] [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0^M [Tue Mar 17 18:02:51 2020][8592655.547712] [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]^M [Tue Mar 17 18:02:51 2020][8592655.547714] [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0^M [Tue Mar 17 18:02:51 2020][8592655.547715] [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280^M [Tue Mar 17 18:02:51 2020][8592655.547717] [<ffffffffb8577f75>] ? auditsys+0x14/0x46^M I'm also attaching the full dump of the tasks when lnetctl show peer was taking 100% cpu as  fir-io7-s1_sysrq_t_LU-13368.txt This morning it looks like the command has completed but I didn't run it in a screen and I lost the output as I'm working from home. I'll try again and let you know if the command does eventually complete. Thx!

            People

              ssmirnov Serguei Smirnov
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: