[LU-13368] lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error Created: 17/Mar/20 Updated: 10/Nov/21 Resolved: 09/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12 |
| Environment: | CentOS 7.6 |
| Attachments: | fir-io7-s1_dk.log.gz, fir-io7-s1_kern.log, fir-io7-s1_sysrq_t_LU-13368.txt |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After having removed a few LNet routes using lnetctl, we are constantly seeing these messages on all Lustre servers on Fir:

Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error

All of the rejecting NIDs are routers that were removed from the LNet route config on the servers. We removed the routes using commands like:

# clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'

The remaining active routes are:

[root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
route:
- net: o2ib1
gateway: 10.0.10.216@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib1
gateway: 10.0.10.218@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib1
gateway: 10.0.10.219@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib1
gateway: 10.0.10.217@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib2
gateway: 10.0.10.227@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib2
gateway: 10.0.10.226@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib2
gateway: 10.0.10.225@o2ib7
hop: -1
priority: 0
state: up
- net: o2ib2
gateway: 10.0.10.224@o2ib7
hop: -1
priority: 0
state: up
Why is LNet trying to use the old routers? lctl dk shows:

00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)

I'm attaching a debug log taken with +net enabled as fir-io7-s1_dk.log.gz. Also attaching kernel logs as fir-io7-s1_kern.log.
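
A quick way to cross-check which NIDs are still being rejected against the routes that remain configured is a sketch along these lines (illustrative only; the log path is an assumption, and the awk parsing relies on the route show output format above):

# NIDs named in the rejection messages (log path assumed to be /var/log/messages)
grep 'kiblnd_rejected' /var/log/messages | grep -o '[0-9.]\+@o2ib[0-9]*' | sort -u > /tmp/rejected_nids
# Gateways still configured as routes
lnetctl route show -v | awk '/gateway:/ {print $2}' | sort -u > /tmp/configured_gws
# Rejected NIDs with no corresponding route: candidates for stale peer state
comm -23 /tmp/rejected_nids /tmp/configured_gws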
|
| Comments |
| Comment by Amir Shehata (Inactive) [ 17/Mar/20 ] |
|
I think I know what the problem is. It's probably that at some point LNet attempted to use these routers after they were taken off line, but before the routes were deleted. When LNet failed to reach them, it added them to the recovery queue. They get pinged every recovery_interval. That's why you're seeing these errors. Can you confirm by doing this: lnetctl debug recovery --peer This should list all the peers on the recovery queue. They probably should be on the peer table as well. Confirm by: lnetctl peer show What you can do to get rid of them is: lnetctl peer del --prim_nid 10.0.10.212@o2ib7 Confirm that they are removed from the recovery queue: lnetctl debug recovery --peer We probably should get rid of the gateway peers when the route is removed. I need to think about the best way to do that. |
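
If many gateways are affected, the per-NID deletion can be scripted. A minimal sketch built only from the commands above, assuming the "nid-N: <NID>" output format that lnetctl debug recovery --peer produces (shown in the reply below):

# Remove every peer currently sitting on the recovery queue.
# Caution: this also drops peers that are only transiently unhealthy.
for nid in $(lnetctl debug recovery --peer | awk '/nid-[0-9]*:/ {print $2}'); do
    lnetctl peer del --prim_nid "$nid"
done
|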
| Comment by Stephane Thiell [ 18/Mar/20 ] |
|
Hi Amir,

Thanks for your quick reply! Yes, you're right:

[root@fir-io7-s1 ~]# lnetctl debug recovery --peer
peer NI recovery:
nid-0: 10.0.2.224@o2ib7
nid-1: 10.0.10.235@o2ib7
nid-2: 10.0.10.232@o2ib7
nid-3: 10.0.2.237@o2ib7
nid-4: 10.0.2.233@o2ib7
nid-5: 10.0.10.202@o2ib7
nid-6: 10.0.2.234@o2ib7
nid-7: 10.0.2.232@o2ib7
nid-8: 10.0.2.216@o2ib7
nid-9: 10.0.10.233@o2ib7
nid-10: 10.0.2.225@o2ib7
nid-11: 10.0.2.235@o2ib7
nid-12: 10.0.10.204@o2ib7
nid-13: 10.0.10.239@o2ib7
nid-14: 10.0.10.212@o2ib7
nid-15: 10.0.2.239@o2ib7
nid-16: 10.0.10.237@o2ib7
nid-17: 10.0.10.209@o2ib7
nid-18: 10.0.10.238@o2ib7
nid-19: 10.0.10.236@o2ib7
nid-20: 10.0.10.203@o2ib7
nid-21: 10.0.2.218@o2ib7
nid-22: 10.0.10.211@o2ib7
nid-23: 10.0.2.236@o2ib7
nid-24: 10.0.2.226@o2ib7
nid-25: 10.0.10.210@o2ib7
nid-26: 10.0.2.238@o2ib7
nid-27: 10.0.2.219@o2ib7
nid-28: 10.0.10.234@o2ib7
nid-29: 10.0.2.227@o2ib7
nid-30: 10.0.2.217@o2ib7
Router NIDs like 10.0.10.212@o2ib7 have been removed but still show up there.

The next command is unfortunately hanging, taking 100% CPU. It has been 30 minutes now:

[root@fir-io7-s1 ~]# lnetctl peer show
<hung>

[root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.212@o2ib7
del:
- peer_ni:
errno: -19
descr: "cannot del peer ni: No such device"
[root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.211@o2ib7
del:
- peer_ni:
errno: -19
descr: "cannot del peer ni: No such device"
A restart of the servers would, of course, resolve the issue, but I cannot do that right now. I just hope this won't trigger memory problems like those I previously reported in LU-13351; so far it doesn't seem to be the case. Thanks! |
| Comment by Amir Shehata (Inactive) [ 18/Mar/20 ] |
|
Can you identify what is taking 100% of the CPU? Is it the user-space lnetctl process or some kernel thread?
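
One way to tell the two cases apart (a generic sketch, not from the ticket):

# Snapshot of per-thread CPU usage (batch mode, one iteration)
top -b -H -n 1 | head -20
# Highest-CPU tasks; kernel threads appear with their name in [brackets] in the args column
ps -eo pid,stat,pcpu,args --sort=-pcpu | head
|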
| Comment by Stephane Thiell [ 18/Mar/20 ] |
|
I could see this when I dumped the tasks:

[Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:
[Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl R running task 0 11793 11738 0x00000080
[Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:
[Tue Mar 17 18:02:51 2020][8592655.547647] [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547650] [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
[Tue Mar 17 18:02:51 2020][8592655.547658] [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547660] [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
[Tue Mar 17 18:02:51 2020][8592655.547668] [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547678] [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547681] [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230
[Tue Mar 17 18:02:51 2020][8592655.547689] [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547696] [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547702] [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547705] [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0
[Tue Mar 17 18:02:51 2020][8592655.547712] [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547714] [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0
[Tue Mar 17 18:02:51 2020][8592655.547715] [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280
[Tue Mar 17 18:02:51 2020][8592655.547717] [<ffffffffb8577f75>] ? auditsys+0x14/0x46

I'm also attaching the full dump of the tasks taken while lnetctl peer show was consuming 100% CPU, as fir-io7-s1_sysrq_t_LU-13368.txt
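
For reference, a task dump like the one above can be triggered on demand through sysrq, assuming sysrq is enabled on the node:

echo t > /proc/sysrq-trigger   # dump the state of every task to the kernel log
dmesg | tail                   # the dump also lands in /var/log/messages
|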
| Comment by Gerrit Updater [ 05/Jun/20 ] |
|
Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845 |
| Comment by Gerrit Updater [ 09/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/ |
| Comment by Peter Jones [ 09/Dec/20 ] |
|
Landed for 2.14 |
| Comment by Andreas Dilger [ 09/Mar/21 ] |
|
This caused problems with shutdown, see LU-14499. |
| Comment by Chris Horn [ 20/Oct/21 ] |
|
I suspect this patch may be responsible for the kernel panics we're seeing in

Edit: never mind, found the regression in https://review.whamcloud.com/#/c/43419/ |