[LU-13368] lnet may be trying to use deleted routes leading to errors kiblnd_rejected(): 10.0.10.212@o2ib7 rejected: consumer defined fatal error Created: 17/Mar/20  Updated: 10/Nov/21  Resolved: 09/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: LTS12
Environment:

CentOS 7.6


Attachments: File fir-io7-s1_dk.log.gz     Text File fir-io7-s1_kern.log     Text File fir-io7-s1_lnetctl_stats.txt     Text File fir-io7-s1_sysrq_t_LU-13368.txt    
Issue Links:
Related
is related to LU-14499 o2iblnd: LU-13368 changes cause shutd... Open
is related to LU-13638 lnet: discard the callback Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After removing a few LNet routes using lnetctl, we are seeing these messages repeatedly on all Lustre servers on Fir:

Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error

All of the rejected NIDs correspond to routers that were removed from the LNet route configuration on the servers.
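One quick way to cross-check this (an illustrative sketch, not taken from this ticket; the log path is the CentOS default) is to tally the rejected NIDs from the kernel log and compare them against the gateways still configured:

# count rejections per NID, then list the gateways lnet still knows about
grep -o '[0-9.]\+@o2ib[0-9]* rejected' /var/log/messages | sort | uniq -c
lnetctl route show | grep gateway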

We removed the routes using commands like:

# clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
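As a follow-up sanity check (assumed, not recorded in this ticket), the same clush fan-out can confirm that no route through a removed gateway remains:

# should return no routes once the deletion has taken effect everywhere
clush -w@mds,@oss 'lnetctl route show --net o2ib4 --gateway 10.0.10.212@o2ib7'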

The remaining active routes are:

[root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
route:
    - net: o2ib1
      gateway: 10.0.10.216@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib1
      gateway: 10.0.10.218@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib1
      gateway: 10.0.10.219@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib1
      gateway: 10.0.10.217@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.227@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.226@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.225@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.224@o2ib7
      hop: -1
      priority: 0
      state: up

Why is lnet trying to use the old routers?

lctl dk shows:

00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)

I'm attaching an lctl dk dump taken with +net debugging enabled as fir-io7-s1_dk.log.gz.

Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.
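For context, a +net debug dump like the attached one is typically captured roughly as follows (a sketch; the exact commands used here are not recorded in the ticket, and the output path is illustrative):

lctl set_param debug=+net        # add LNet network debugging to the debug mask
lctl dk /tmp/fir-io7-s1_dk.log   # dump (and clear) the kernel debug buffer to a file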

 

 



 Comments   
Comment by Amir Shehata (Inactive) [ 17/Mar/20 ]

I think I know what the problem is.

Most likely, at some point LNet attempted to use these routers after they were taken offline but before the routes were deleted. When LNet failed to reach them, it added them to the recovery queue, and peers on that queue are pinged every recovery_interval. That's why you're seeing these errors. Can you confirm by running:

lnetctl debug recovery --peer

This should list all the peers on the recovery queue.

They should probably still be in the peer table as well. Confirm with:

lnetctl peer show 

What you can do to get rid of them is:

lnetctl peer del --prim_nid 10.0.10.212@o2ib7

Confirm that they are removed from the recovery queue:

lnetctl debug recovery --peer

We probably should get rid of the gateway peers when the route is removed. I need to think about the best way to do that.
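Putting those steps together, a rough cleanup sketch for one server could look like the following (the three gateway NIDs are examples taken from the messages above; adjust to whatever lnetctl debug recovery --peer actually reports):

# delete each stale gateway peer, then confirm the recovery queue is clean
for gw in 10.0.10.202@o2ib7 10.0.10.209@o2ib7 10.0.10.212@o2ib7; do
    lnetctl peer del --prim_nid "$gw"
done
lnetctl debug recovery --peer    # should no longer list the deleted gateways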

Comment by Stephane Thiell [ 18/Mar/20 ]

Hi Amir,

Thanks for your quick reply!

Yes, you're right:

[root@fir-io7-s1 ~]# lnetctl debug recovery --peer
peer NI recovery:
    nid-0: 10.0.2.224@o2ib7
    nid-1: 10.0.10.235@o2ib7
    nid-2: 10.0.10.232@o2ib7
    nid-3: 10.0.2.237@o2ib7
    nid-4: 10.0.2.233@o2ib7
    nid-5: 10.0.10.202@o2ib7
    nid-6: 10.0.2.234@o2ib7
    nid-7: 10.0.2.232@o2ib7
    nid-8: 10.0.2.216@o2ib7
    nid-9: 10.0.10.233@o2ib7
    nid-10: 10.0.2.225@o2ib7
    nid-11: 10.0.2.235@o2ib7
    nid-12: 10.0.10.204@o2ib7
    nid-13: 10.0.10.239@o2ib7
    nid-14: 10.0.10.212@o2ib7
    nid-15: 10.0.2.239@o2ib7
    nid-16: 10.0.10.237@o2ib7
    nid-17: 10.0.10.209@o2ib7
    nid-18: 10.0.10.238@o2ib7
    nid-19: 10.0.10.236@o2ib7
    nid-20: 10.0.10.203@o2ib7
    nid-21: 10.0.2.218@o2ib7
    nid-22: 10.0.10.211@o2ib7
    nid-23: 10.0.2.236@o2ib7
    nid-24: 10.0.2.226@o2ib7
    nid-25: 10.0.10.210@o2ib7
    nid-26: 10.0.2.238@o2ib7
    nid-27: 10.0.2.219@o2ib7
    nid-28: 10.0.10.234@o2ib7
    nid-29: 10.0.2.227@o2ib7
    nid-30: 10.0.2.217@o2ib7

Router NIDs like 10.0.10.212@o2ib7 have been removed but still show up there.

The next command is unfortunately hanging and consuming 100% CPU. It has been running for 30 minutes now.

[root@fir-io7-s1 ~]# lnetctl peer show 
<hung>

 
I also can't seem to remove the bad NIDs at this point:

[root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.212@o2ib7
del:
    - peer_ni:
          errno: -19
          descr: "cannot del peer ni: No such device"
[root@fir-io7-s1 ~]# lnetctl peer del --prim_nid 10.0.10.211@o2ib7
del:
    - peer_ni:
          errno: -19
          descr: "cannot del peer ni: No such device"

A restart of the servers would, of course, resolve the issue, but I cannot do that right now. I just hope this won't trigger memory problems like those I previously reported in LU-13351, but so far that doesn't seem to be the case. Thanks!

Comment by Amir Shehata (Inactive) [ 18/Mar/20 ]

Can you identify what is taking 100% of the cpu?

Is it the user space lnetctl or some kernel thread?
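For reference, one way to tell (a generic sketch, not from this ticket) is to compare user vs. system CPU time for the lnetctl process and, if needed, dump all task stacks to the kernel log:

pidstat -u -p $(pgrep -x lnetctl) 1    # %usr vs %system for the running lnetctl
echo t > /proc/sysrq-trigger           # sysrq-t: dump all task stacks to the kernel log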

Comment by Stephane Thiell [ 18/Mar/20 ]

I could see this when I dumped the tasks:

[Tue Mar 17 18:02:51 2020][8592655.547628] Task dump for CPU 8:
[Tue Mar 17 18:02:51 2020][8592655.547630] lnetctl         R  running task        0 11793  11738 0x00000080
[Tue Mar 17 18:02:51 2020][8592655.547631] Call Trace:
[Tue Mar 17 18:02:51 2020][8592655.547647]  [<ffffffffc0a83d07>] ? LNetCtl+0x5b7/0xe50 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547650]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
[Tue Mar 17 18:02:51 2020][8592655.547658]  [<ffffffffc0a820de>] ? LNetNIInit+0x7e/0xc10 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547660]  [<ffffffffb8569192>] ? mutex_lock+0x12/0x2f
[Tue Mar 17 18:02:51 2020][8592655.547668]  [<ffffffffc0a81499>] ? LNetNIFini+0x39/0x110 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547678]  [<ffffffffc0aa313b>] ? lnet_ioctl+0x6b/0x270 [lnet]
[Tue Mar 17 18:02:51 2020][8592655.547681]  [<ffffffffb801e6fb>] ? __kmalloc+0x1eb/0x230
[Tue Mar 17 18:02:51 2020][8592655.547689]  [<ffffffffc09fec8c>] ? libcfs_ioctl_getdata+0x22c/0x770 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547696]  [<ffffffffc0a0263b>] ? libcfs_ioctl+0x4b/0x4c0 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547702]  [<ffffffffc09fe797>] ? libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547705]  [<ffffffffb8057840>] ? do_vfs_ioctl+0x3a0/0x5a0
[Tue Mar 17 18:02:51 2020][8592655.547712]  [<ffffffffc0a02792>] ? libcfs_ioctl+0x1a2/0x4c0 [libcfs]
[Tue Mar 17 18:02:51 2020][8592655.547714]  [<ffffffffb8057ae1>] ? SyS_ioctl+0xa1/0xc0
[Tue Mar 17 18:02:51 2020][8592655.547715]  [<ffffffffb7f3a176>] ? __audit_syscall_exit+0x1e6/0x280
[Tue Mar 17 18:02:51 2020][8592655.547717]  [<ffffffffb8577f75>] ? auditsys+0x14/0x46

I'm also attaching the full dump of tasks, taken while lnetctl peer show was using 100% CPU, as fir-io7-s1_sysrq_t_LU-13368.txt.
This morning it looks like the command has completed, but I didn't run it in a screen and I lost the output, as I'm working from home. I'll try again and let you know if the command does eventually complete. Thx!

Comment by Gerrit Updater [ 05/Jun/20 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38845
Subject: LU-13368 lnet: discard the callback
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a1b70741d3977cb5359b5b6797041477d5392aec

Comment by Gerrit Updater [ 09/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38845/
Subject: LU-13368 lnet: discard the callback
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: babf0232273467b7199ec9a7c36047b1968913df

Comment by Peter Jones [ 09/Dec/20 ]

Landed for 2.14

Comment by Andreas Dilger [ 09/Mar/21 ]

This caused problems with shutdown, see LU-14499.

Comment by Chris Horn [ 20/Oct/21 ]

I suspect this patch may be responsible for the kernel panics we're seeing in LU-15125

Edit - never mind, found the regression in https://review.whamcloud.com/#/c/43419/
