LU-13368: lnet may be trying to use deleted routes, leading to kiblnd_rejected() errors: 10.0.10.212@o2ib7 rejected: consumer defined fatal error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.3
    • Environment: CentOS 7.6
    • Severity: 3

    Description

      After removing a few LNet routes using lnetctl, we are constantly seeing these messages on all Lustre servers on Fir:

      Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      

      All of these messages reference routers that were removed from the LNet route configuration on the servers.

      We removed the routes using commands like:

      # clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
      
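      (The deletions can be verified per network afterwards; a minimal sketch using the same node groups and net as above, which should return an empty route list:)

      # clush -w@mds,@oss 'lnetctl route show --net o2ib4'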

      The remaining active routes are:

      [root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
      route:
          - net: o2ib1
            gateway: 10.0.10.216@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.218@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.219@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.217@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.227@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.226@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.225@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.224@o2ib7
            hop: -1
            priority: 0
            state: up
      

      Why is LNet still trying to use the old routers?

      lctl dk shows:

      00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
      

      I'm attaching a debug kernel log captured with +net as fir-io7-s1_dk.log.gz.
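
      (For reference, a log like this is typically captured roughly as follows; a sketch, not necessarily the exact commands used here:)

      # lctl set_param debug=+net     # enable LNet debug tracing, then reproduce the errors
      # lctl dk > /tmp/fir-io7-s1_dk.log && gzip /tmp/fir-io7-s1_dk.log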

      Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.


      Attachments

        1. fir-io7-s1_dk.log.gz
          8.86 MB
        2. fir-io7-s1_kern.log
          1.33 MB
        3. fir-io7-s1_lnetctl_stats.txt
          0.6 kB
        4. fir-io7-s1_sysrq_t_LU-13368.txt
          4.37 MB


          Activity

            adilger Andreas Dilger added a comment (edited)

            simmonsja, please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference.

            Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.

            hornc Chris Horn added a comment (edited)

            I suspect this patch may be responsible for the kernel panics we're seeing in LU-15125.

            Edit: never mind, found the regression in https://review.whamcloud.com/#/c/43419/


            adilger Andreas Dilger added a comment -

            This caused problems with shutdown, see LU-14499.
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: babf0232273467b7199ec9a7c36047b1968913df


            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a1b70741d3977cb5359b5b6797041477d5392aec


            sthiell Stephane Thiell added a comment -

            I could see this when I dumped the tasks:

            [Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:
            [Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl         R  running task        0 11793  11738 0x00000080
            [Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:
            [Tue Mar 17 18:02:51 2020][8592655.547647]  [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547650]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
            [Tue Mar 17 18:02:51 2020][8592655.547658]  [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547660]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
            [Tue Mar 17 18:02:51 2020][8592655.547668]  [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547678]  [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547681]  [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230
            [Tue Mar 17 18:02:51 2020][8592655.547689]  [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547696]  [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547702]  [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547705]  [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0
            [Tue Mar 17 18:02:51 2020][8592655.547712]  [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547714]  [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0
            [Tue Mar 17 18:02:51 2020][8592655.547715]  [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280
            [Tue Mar 17 18:02:51 2020][8592655.547717]  [<ffffffffb8577f75>] ? auditsys+0x14/0x46
            
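            (A dump like this can be produced via sysrq; a minimal sketch, assuming sysrq is enabled on the node:)

            # echo t > /proc/sysrq-trigger    # dump all task states to the kernel log
            # dmesg | tail -n 200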

            I'm also attaching the full dump of the tasks from when lnetctl peer show was taking 100% CPU as fir-io7-s1_sysrq_t_LU-13368.txt.
            This morning it looks like the command completed, but I didn't run it in a screen session and I lost the output, as I'm working from home. I'll try again and let you know whether the command does eventually complete. Thanks!
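
            (One way to keep the output of a long-running query even if the session drops is to detach it; a sketch, with a hypothetical output path:)

            # nohup lnetctl peer show > /tmp/peer_show.yaml 2>&1 &
            # tail -f /tmp/peer_show.yaml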


            ashehata Amir Shehata (Inactive) added a comment -

            Can you identify what is taking 100% of the CPU?

            Is it the user-space lnetctl process or some kernel thread?
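
            (A sketch for checking this, assuming the sysstat tools are installed:)

            # pidstat -t -p $(pidof lnetctl) 1    # per-thread user vs. system CPU split
            # top -H -p $(pidof lnetctl)          # or watch it interactively in thread mode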


            sthiell Stephane Thiell added a comment -

            Hi Amir,

            Thanks for your quick reply!

            Yes, you're right:

            [root@fir-io7-s1 ~]# lnetctl debug recovery --peer
            peer NI recovery:
                nid-0: 10.0.2.224@o2ib7
                nid-1: 10.0.10.235@o2ib7
                nid-2: 10.0.10.232@o2ib7
                nid-3: 10.0.2.237@o2ib7
                nid-4: 10.0.2.233@o2ib7
                nid-5: 10.0.10.202@o2ib7
                nid-6: 10.0.2.234@o2ib7
                nid-7: 10.0.2.232@o2ib7
                nid-8: 10.0.2.216@o2ib7
                nid-9: 10.0.10.233@o2ib7
                nid-10: 10.0.2.225@o2ib7
                nid-11: 10.0.2.235@o2ib7
                nid-12: 10.0.10.204@o2ib7
                nid-13: 10.0.10.239@o2ib7
                nid-14: 10.0.10.212@o2ib7
                nid-15: 10.0.2.239@o2ib7
                nid-16: 10.0.10.237@o2ib7
                nid-17: 10.0.10.209@o2ib7
                nid-18: 10.0.10.238@o2ib7
                nid-19: 10.0.10.236@o2ib7
                nid-20: 10.0.10.203@o2ib7
                nid-21: 10.0.2.218@o2ib7
                nid-22: 10.0.10.211@o2ib7
                nid-23: 10.0.2.236@o2ib7
                nid-24: 10.0.2.226@o2ib7
                nid-25: 10.0.10.210@o2ib7
                nid-26: 10.0.2.238@o2ib7
                nid-27: 10.0.2.219@o2ib7
                nid-28: 10.0.10.234@o2ib7
                nid-29: 10.0.2.227@o2ib7
                nid-30: 10.0.2.217@o2ib7
            

            Router NIDs like 10.0.10.212@o2ib7 have been removed but still show up there.

            The next command is unfortunately hanging and taking 100% CPU. It has been running for 30 minutes now.

            [root@fir-io7-s1 ~]# lnetctl peer show 
            <hung>
            
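            (To keep a hung query from tying up the terminal, the command can be bounded; a sketch:)

            # timeout 60 lnetctl peer show    # give the query 60 seconds, then kill it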

            I can't seem to remove the bad NIDs either at this point:

            [root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.212@o2ib7
            del:
                - peer_ni:
                      errno: -19
                      descr: "cannot del peer ni: No such device"
            [root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.211@o2ib7
            del:
                - peer_ni:
                      errno: -19
                      descr: "cannot del peer ni: No such device"
            

            A restart of the servers would, of course, resolve the issue, but I cannot do that right now. I just hope this won't trigger memory problems like the ones I previously reported in LU-13351; so far that doesn't seem to be the case. Thanks!


            ashehata Amir Shehata (Inactive) added a comment -

            I think I know what the problem is.

            It's probably that at some point LNet attempted to use these routers after they were taken offline, but before the routes were deleted. When LNet failed to reach them, it added them to the recovery queue, and they get pinged every recovery_interval. That's why you're seeing these errors. Can you confirm by doing this:

            lnetctl debug recovery --peer

            This should list all the peers on the recovery queue.

            They should probably be in the peer table as well. Confirm with:

            lnetctl peer show 

            What you can do to get rid of them is:

            lnetctl peer del --prim_nid 10.0.10.212@o2ib7

            Confirm that they are removed from the recovery queue:

            lnetctl debug recovery --peer

            We probably should get rid of the gateway peers when the route is removed. I need to think about the best way to do that.
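
            Putting those steps together for all of the stale gateways, the cleanup could look roughly like this (a sketch; the NID list comes from this report, and the lnet_recovery_interval module parameter name is an assumption):

            # lnetctl debug recovery --peer                  # list peers on the recovery queue
            # for nid in 10.0.10.212@o2ib7 10.0.10.209@o2ib7 10.0.10.202@o2ib7; do
            >     lnetctl peer del --prim_nid $nid           # drop each stale gateway peer
            > done
            # lnetctl debug recovery --peer                  # confirm they are gone
            # cat /sys/module/lnet/parameters/lnet_recovery_interval   # ping cadence (assumed name)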


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell