
LU-13368: lnet may be trying to use deleted routes, leading to errors: kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.3
    • Environment: CentOS 7.6
    • Severity: 3

    Description

      After removing a few LNet routes using lnetctl, we are constantly seeing these messages on all Lustre servers on Fir:

      Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      

      All of them reference routers that were removed from the LNet route configuration on the servers.

      We removed the routes using commands like:

      # clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
      
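      To double-check that the deletion took effect everywhere, the same fan-out can be combined with a filtered show (lnetctl route show accepts a --net filter):

      # clush -w@mds,@oss 'lnetctl route show --net o2ib4'
      

      Nodes where the delete succeeded should report no matching routes.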

      The remaining active routes are:

      [root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
      route:
          - net: o2ib1
            gateway: 10.0.10.216@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.218@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.219@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.217@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.227@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.226@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.225@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.224@o2ib7
            hop: -1
            priority: 0
            state: up
      

      Why is lnet trying to use the old routers?

      lctl dk shows:

      00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
      

      I'm attaching a kernel debug dump (lctl dk) taken with +net as fir-io7-s1_dk.log.gz.
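      For reference, a capture of this kind goes roughly as follows (file name illustrative):

      # lctl set_param debug=+net
      # lctl dk > /tmp/fir-io7-s1_dk.log && gzip /tmp/fir-io7-s1_dk.log
      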

      Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.


      Attachments

        1. fir-io7-s1_dk.log.gz
          8.86 MB
        2. fir-io7-s1_kern.log
          1.33 MB
        3. fir-io7-s1_lnetctl_stats.txt
          0.6 kB
        4. fir-io7-s1_sysrq_t_LU-13368.txt
          4.37 MB


          Activity

            adilger Andreas Dilger added a comment - - edited

            simmonsja, please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference.

            Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.

            hornc Chris Horn added a comment - - edited

            I suspect this patch may be responsible for the kernel panics we're seeing in LU-15125.

            Edit: nevermind, found the regression in https://review.whamcloud.com/#/c/43419/


            adilger Andreas Dilger added a comment -

            This caused problems with shutdown, see LU-14499.
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: babf0232273467b7199ec9a7c36047b1968913df

            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a1b70741d3977cb5359b5b6797041477d5392aec

            sthiell Stephane Thiell added a comment -

            I could see this when I dumped the tasks:

             [Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:
             [Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl         R  running task        0 11793  11738 0x00000080
             [Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:
             [Tue Mar 17 18:02:51 2020][8592655.547647]  [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]
             [Tue Mar 17 18:02:51 2020][8592655.547650]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
             [Tue Mar 17 18:02:51 2020][8592655.547658]  [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]
             [Tue Mar 17 18:02:51 2020][8592655.547660]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
             [Tue Mar 17 18:02:51 2020][8592655.547668]  [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]
             [Tue Mar 17 18:02:51 2020][8592655.547678]  [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]
             [Tue Mar 17 18:02:51 2020][8592655.547681]  [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230
             [Tue Mar 17 18:02:51 2020][8592655.547689]  [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]
             [Tue Mar 17 18:02:51 2020][8592655.547696]  [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]
             [Tue Mar 17 18:02:51 2020][8592655.547702]  [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
             [Tue Mar 17 18:02:51 2020][8592655.547705]  [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0
             [Tue Mar 17 18:02:51 2020][8592655.547712]  [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]
             [Tue Mar 17 18:02:51 2020][8592655.547714]  [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0
             [Tue Mar 17 18:02:51 2020][8592655.547715]  [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280
             [Tue Mar 17 18:02:51 2020][8592655.547717]  [<ffffffffb8577f75>] ? auditsys+0x14/0x46
            

            I'm also attaching the full dump of the tasks from when lnetctl peer show was taking 100% CPU, as fir-io7-s1_sysrq_t_LU-13368.txt.
            This morning it looks like the command completed, but I didn't run it in a screen session and lost the output since I'm working from home. I'll try again and let you know if the command does eventually complete. Thx!
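            For the record, a dump like this is produced via sysrq-t (hence the attachment name):

            # echo t > /proc/sysrq-trigger
            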


            ashehata Amir Shehata (Inactive) added a comment -

            Can you identify what is taking 100% of the CPU?

            Is it the user-space lnetctl or some kernel thread?
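            One quick way to tell, for example:

            # top -b -H -n 1 | grep -i lnet
            # pidstat -u -p $(pidof lnetctl) 1 5
            

            Mostly %usr would point at lnetctl itself; mostly %system, or a busy kernel thread in the top output, would point into the kernel.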


            sthiell Stephane Thiell added a comment -

            Hi Amir,

            Thanks for your quick reply!

            Yes, you're right:

            [root@fir-io7-s1 ~]# lnetctl debug recovery --peer
            peer NI recovery:
                nid-0: 10.0.2.224@o2ib7
                nid-1: 10.0.10.235@o2ib7
                nid-2: 10.0.10.232@o2ib7
                nid-3: 10.0.2.237@o2ib7
                nid-4: 10.0.2.233@o2ib7
                nid-5: 10.0.10.202@o2ib7
                nid-6: 10.0.2.234@o2ib7
                nid-7: 10.0.2.232@o2ib7
                nid-8: 10.0.2.216@o2ib7
                nid-9: 10.0.10.233@o2ib7
                nid-10: 10.0.2.225@o2ib7
                nid-11: 10.0.2.235@o2ib7
                nid-12: 10.0.10.204@o2ib7
                nid-13: 10.0.10.239@o2ib7
                nid-14: 10.0.10.212@o2ib7
                nid-15: 10.0.2.239@o2ib7
                nid-16: 10.0.10.237@o2ib7
                nid-17: 10.0.10.209@o2ib7
                nid-18: 10.0.10.238@o2ib7
                nid-19: 10.0.10.236@o2ib7
                nid-20: 10.0.10.203@o2ib7
                nid-21: 10.0.2.218@o2ib7
                nid-22: 10.0.10.211@o2ib7
                nid-23: 10.0.2.236@o2ib7
                nid-24: 10.0.2.226@o2ib7
                nid-25: 10.0.10.210@o2ib7
                nid-26: 10.0.2.238@o2ib7
                nid-27: 10.0.2.219@o2ib7
                nid-28: 10.0.10.234@o2ib7
                nid-29: 10.0.2.227@o2ib7
                nid-30: 10.0.2.217@o2ib7
            

            Router NIDs like 10.0.10.212@o2ib7 have been removed but still show up there.

            The next command unfortunately hangs, taking 100% CPU; it has been running for 30 minutes now.

            [root@fir-io7-s1 ~]# lnetctl peer show 
            <hung>
            
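            While it hangs, sampling the kernel stack of the stuck process may show where it is spinning (assumes root and a single lnetctl instance):

            [root@fir-io7-s1 ~]# cat /proc/$(pidof lnetctl)/stack
            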

            I can't seem to remove the bad NIDs either at this point:

            [root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.212@o2ib7
            del:
                - peer_ni:
                      errno: -19
                      descr: "cannot del peer ni: No such device"
            [root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.211@o2ib7
            del:
                - peer_ni:
                      errno: -19
                      descr: "cannot del peer ni: No such device"
            

            A restart of the servers would, of course, resolve the issue, but I cannot do that right now. I just hope this won't trigger memory problems like the ones I previously reported in LU-13351, but so far that doesn't seem to be the case. Thanks!


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 7
