LU-13368: lnet may be trying to use deleted routes, leading to kiblnd_rejected() errors: 10.0.10.212@o2ib7 rejected: consumer defined fatal error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.3
    • Environment: CentOS 7.6
    • Severity: 3

    Description

      After removing a few LNet routes using lnetctl, we are constantly seeing these messages on all Lustre servers on Fir:

      Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
      Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      

      All of these messages reference routers that were removed from the LNet route configuration on the servers.

      We removed the routes using commands like:

      # clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
      
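      (The deletions can be verified per network afterwards; a minimal sketch using the same node groups and net as above, which should return an empty route list:)

      # clush -w@mds,@oss 'lnetctl route show --net o2ib4'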

      The remaining active routes are:

      [root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
      route:
          - net: o2ib1
            gateway: 10.0.10.216@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.218@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.219@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib1
            gateway: 10.0.10.217@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.227@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.226@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.225@o2ib7
            hop: -1
            priority: 0
            state: up
          - net: o2ib2
            gateway: 10.0.10.224@o2ib7
            hop: -1
            priority: 0
            state: up
      

      Why is LNet still trying to use the old routers?

      lctl dk shows:

      00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
      00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
      

      I'm attaching a debug kernel log captured with +net as fir-io7-s1_dk.log.gz.
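
      (For reference, a log like this is typically captured roughly as follows; a sketch, not necessarily the exact commands used here:)

      # lctl set_param debug=+net     # enable LNet debug tracing, then reproduce the errors
      # lctl dk > /tmp/fir-io7-s1_dk.log && gzip /tmp/fir-io7-s1_dk.log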

      Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.


      Attachments

        1. fir-io7-s1_dk.log.gz
          8.86 MB
        2. fir-io7-s1_kern.log
          1.33 MB
        3. fir-io7-s1_lnetctl_stats.txt
          0.6 kB
        4. fir-io7-s1_sysrq_t_LU-13368.txt
          4.37 MB


          Activity

            adilger Andreas Dilger added a comment (edited)

            simmonsja, please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference.

            Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.

            hornc Chris Horn added a comment (edited)

            I suspect this patch may be responsible for the kernel panics we're seeing in LU-15125.

            Edit: never mind, found the regression in https://review.whamcloud.com/#/c/43419/


            adilger Andreas Dilger added a comment -

            This caused problems with shutdown, see LU-14499.
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: babf0232273467b7199ec9a7c36047b1968913df


            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845
            Subject: LU-13368 lnet: discard the callback
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a1b70741d3977cb5359b5b6797041477d5392aec


            sthiell Stephane Thiell added a comment -

            I could see this when I dumped the tasks:

            [Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:
            [Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl         R  running task        0 11793  11738 0x00000080
            [Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:
            [Tue Mar 17 18:02:51 2020][8592655.547647]  [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547650]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
            [Tue Mar 17 18:02:51 2020][8592655.547658]  [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547660]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
            [Tue Mar 17 18:02:51 2020][8592655.547668]  [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547678]  [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]
            [Tue Mar 17 18:02:51 2020][8592655.547681]  [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230
            [Tue Mar 17 18:02:51 2020][8592655.547689]  [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547696]  [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547702]  [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547705]  [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0
            [Tue Mar 17 18:02:51 2020][8592655.547712]  [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]
            [Tue Mar 17 18:02:51 2020][8592655.547714]  [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0
            [Tue Mar 17 18:02:51 2020][8592655.547715]  [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280
            [Tue Mar 17 18:02:51 2020][8592655.547717]  [<ffffffffb8577f75>] ? auditsys+0x14/0x46
            
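            (A dump like this can be produced via sysrq; a minimal sketch, assuming sysrq is enabled on the node:)

            # echo t > /proc/sysrq-trigger    # dump all task states to the kernel log
            # dmesg | tail -n 200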

            I'm also attaching the full dump of the tasks from when lnetctl peer show was taking 100% CPU as fir-io7-s1_sysrq_t_LU-13368.txt.
            This morning it looks like the command completed, but I didn't run it in a screen session and I lost the output, as I'm working from home. I'll try again and let you know whether the command does eventually complete. Thanks!
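
            (One way to keep the output of a long-running query even if the session drops is to detach it; a sketch, with a hypothetical output path:)

            # nohup lnetctl peer show > /tmp/peer_show.yaml 2>&1 &
            # tail -f /tmp/peer_show.yaml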


            ashehata Amir Shehata (Inactive) added a comment -

            Can you identify what is taking 100% of the CPU?

            Is it the user-space lnetctl process or some kernel thread?
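
            (A sketch for checking this, assuming the sysstat tools are installed:)

            # pidstat -t -p $(pidof lnetctl) 1    # per-thread user vs. system CPU split
            # top -H -p $(pidof lnetctl)          # or watch it interactively in thread mode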


            sthiell Stephane Thiell added a comment -

            Hi Amir,

            Thanks for your quick reply!

            Yes, you're right:

            [root@fir-io7-s1 ~]# lnetctl debug recovery --peer
            peer NI recovery:
                nid-0: 10.0.2.224@o2ib7
                nid-1: 10.0.10.235@o2ib7
                nid-2: 10.0.10.232@o2ib7
                nid-3: 10.0.2.237@o2ib7
                nid-4: 10.0.2.233@o2ib7
                nid-5: 10.0.10.202@o2ib7
                nid-6: 10.0.2.234@o2ib7
                nid-7: 10.0.2.232@o2ib7
                nid-8: 10.0.2.216@o2ib7
                nid-9: 10.0.10.233@o2ib7
                nid-10: 10.0.2.225@o2ib7
                nid-11: 10.0.2.235@o2ib7
                nid-12: 10.0.10.204@o2ib7
                nid-13: 10.0.10.239@o2ib7
                nid-14: 10.0.10.212@o2ib7
                nid-15: 10.0.2.239@o2ib7
                nid-16: 10.0.10.237@o2ib7
                nid-17: 10.0.10.209@o2ib7
                nid-18: 10.0.10.238@o2ib7
                nid-19: 10.0.10.236@o2ib7
                nid-20: 10.0.10.203@o2ib7
                nid-21: 10.0.2.218@o2ib7
                nid-22: 10.0.10.211@o2ib7
                nid-23: 10.0.2.236@o2ib7
                nid-24: 10.0.2.226@o2ib7
                nid-25: 10.0.10.210@o2ib7
                nid-26: 10.0.2.238@o2ib7
                nid-27: 10.0.2.219@o2ib7
                nid-28: 10.0.10.234@o2ib7
                nid-29: 10.0.2.227@o2ib7
                nid-30: 10.0.2.217@o2ib7
            

            Router NIDs like 10.0.10.212@o2ib7 have been removed but still show up there.

            The next command is unfortunately hanging and taking 100% CPU. It has been running for 30 minutes now.

            [root@fir-io7-s1 ~]# lnetctl peer show 
            <hung>
            
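            (To keep a hung query from tying up the terminal, the command can be bounded; a sketch:)

            # timeout 60 lnetctl peer show    # give the query 60 seconds, then kill it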

            I can't seem to remove the bad NIDs either at this point:

            [root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.212@o2ib7
            del:
                - peer_ni:
                      errno: -19
                      descr: "cannot del peer ni: No such device"
            [root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.211@o2ib7
            del:
                - peer_ni:
                      errno: -19
                      descr: "cannot del peer ni: No such device"
            

            A restart of the servers would, of course, resolve the issue, but I cannot do that right now. I just hope this won't trigger memory problems like the ones I previously reported in LU-13351; so far that doesn't seem to be the case. Thanks!


            ashehata Amir Shehata (Inactive) added a comment -

            I think I know what the problem is.

            It's probably that at some point LNet attempted to use these routers after they were taken offline, but before the routes were deleted. When LNet failed to reach them, it added them to the recovery queue, and they get pinged every recovery_interval. That's why you're seeing these errors. Can you confirm by doing this:

            lnetctl debug recovery --peer

            This should list all the peers on the recovery queue.

            They should probably be in the peer table as well. Confirm with:

            lnetctl peer show 

            What you can do to get rid of them is:

            lnetctl peer del --prim_nid 10.0.10.212@o2ib7

            Confirm that they are removed from the recovery queue:

            lnetctl debug recovery --peer

            We probably should get rid of the gateway peers when the route is removed. I need to think about the best way to do that.
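
            Putting those steps together for all of the stale gateways, the cleanup could look roughly like this (a sketch; the NID list comes from this report, and the lnet_recovery_interval module parameter name is an assumption):

            # lnetctl debug recovery --peer                  # list peers on the recovery queue
            # for nid in 10.0.10.212@o2ib7 10.0.10.209@o2ib7 10.0.10.202@o2ib7; do
            >     lnetctl peer del --prim_nid $nid           # drop each stale gateway peer
            > done
            # lnetctl debug recovery --peer                  # confirm they are gone
            # cat /sys/module/lnet/parameters/lnet_recovery_interval   # ping cadence (assumed name)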


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell