Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.12.3
-
CentOS 7.6
Description
After having removed a few lnet routes using lnetctl, we are seeing these constant messages on all Lustre servers on Fir:
Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
All of these messages name routers that we removed from the LNet route configuration on the servers.
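To make it easier to see which peers are being rejected and how often, the log lines above can be tallied by NID. The following is a minimal sketch, not part of the original report: the regular expression and the helper name `count_rejections` are illustrative assumptions based only on the message format shown above.

```python
import re
from collections import Counter

# Sample lines copied from the kernel log quoted in this ticket.
LOG = """\
Mar 17 13:59:02 fir-io7-s1 kernel: LNetError: 115948:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
Mar 17 14:09:02 fir-io7-s1 kernel: LNetError: 4245:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:19:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.202@o2ib7 rejected: consumer defined fatal error
Mar 17 14:29:02 fir-io7-s1 kernel: LNetError: 5152:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.209@o2ib7 rejected: consumer defined fatal error
Mar 17 14:39:02 fir-io7-s1 kernel: LNetError: 5966:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
"""

# Hypothetical pattern: captures the peer NID that follows kiblnd_rejected()).
REJECTED = re.compile(r"kiblnd_rejected\(\)\)\s+(\S+@\S+)\s+rejected:")


def count_rejections(text):
    """Return a Counter mapping peer NID to the number of rejection messages."""
    return Counter(m.group(1) for m in REJECTED.finditer(text))


if __name__ == "__main__":
    for nid, n in count_rejections(LOG).most_common():
        print(nid, n)
```

In this sample only three NIDs appear, which is consistent with the statement that the rejections involve a small set of removed routers.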
We removed the routes using commands like:
# clush -w@mds,@oss 'lnetctl route del --net o2ib4 --gateway 10.0.10.212@o2ib7'
The remaining active routes are:
[root@fir-io7-s1 lnet_consumer]# lnetctl route show -v
route:
    - net: o2ib1
      gateway: 10.0.10.216@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib1
      gateway: 10.0.10.218@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib1
      gateway: 10.0.10.219@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib1
      gateway: 10.0.10.217@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.227@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.226@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.225@o2ib7
      hop: -1
      priority: 0
      state: up
    - net: o2ib2
      gateway: 10.0.10.224@o2ib7
      hop: -1
      priority: 0
      state: up
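To confirm that none of the rejected peers is still a configured gateway, the route listing above can be cross-checked against the NIDs from the error messages. This is a sketch under stated assumptions: it scans the text with a regular expression rather than a YAML parser, and the names `gateways_by_net` and `still_routed` are hypothetical helpers, not part of lnetctl.

```python
import re

# net/gateway pairs taken from the `lnetctl route show -v` output in this ticket
# (hop, priority and state fields omitted for brevity).
ROUTE_SHOW = (
    "route: "
    "- net: o2ib1 gateway: 10.0.10.216@o2ib7 "
    "- net: o2ib1 gateway: 10.0.10.218@o2ib7 "
    "- net: o2ib1 gateway: 10.0.10.219@o2ib7 "
    "- net: o2ib1 gateway: 10.0.10.217@o2ib7 "
    "- net: o2ib2 gateway: 10.0.10.227@o2ib7 "
    "- net: o2ib2 gateway: 10.0.10.226@o2ib7 "
    "- net: o2ib2 gateway: 10.0.10.225@o2ib7 "
    "- net: o2ib2 gateway: 10.0.10.224@o2ib7"
)

# NIDs named in the kiblnd_rejected() messages in this ticket.
REJECTED_PEERS = {
    "10.0.10.202@o2ib7",
    "10.0.10.209@o2ib7",
    "10.0.10.212@o2ib7",
}

# Assumed layout: each route entry has "net: X" followed by "gateway: Y".
GATEWAY = re.compile(r"net:\s*(\S+)\s+gateway:\s*(\S+)")


def gateways_by_net(text):
    """Map each remote net to the set of gateway NIDs configured for it."""
    routes = {}
    for net, gateway in GATEWAY.findall(text):
        routes.setdefault(net, set()).add(gateway)
    return routes


def still_routed(rejected, routes):
    """Return the rejected NIDs that are still listed as gateways."""
    active = set().union(*routes.values())
    return rejected & active


if __name__ == "__main__":
    routes = gateways_by_net(ROUTE_SHOW)
    print(sorted(still_routed(REJECTED_PEERS, routes)))
```

An empty result means the rejected peers are absent from the route table, so the connection attempts are not explained by the remaining route configuration.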
Why is lnet trying to use the old routers?
lctl dk shows:
00000800:00020000:16.0:1584479347.555117:0:4883:0:(o2iblnd_cb.c:2923:kiblnd_rejected()) 10.0.10.212@o2ib7 rejected: consumer defined fatal error
00000800:00000200:16.0:1584479347.555118:0:4883:0:(o2iblnd_cb.c:2307:kiblnd_connreq_done()) 10.0.10.212@o2ib7: active(1), version(12), status(-111)
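The status(-111) value in the kiblnd_connreq_done() line can be decoded into an errno name. The sketch below assumes the status is a negated Linux errno, which is the usual kernel convention but is not stated in this ticket; on Linux, errno 111 is ECONNREFUSED. The pattern and the helper names `parse_connreq` and `errno_name` are illustrative only.

```python
import errno
import re

# Debug log line copied from the lctl dk output in this ticket.
DK_LINE = (
    "00000800:00000200:16.0:1584479347.555118:0:4883:0:"
    "(o2iblnd_cb.c:2307:kiblnd_connreq_done()) "
    "10.0.10.212@o2ib7: active(1), version(12), status(-111)"
)

# Hypothetical pattern for the fields printed by kiblnd_connreq_done().
CONNREQ = re.compile(
    r"kiblnd_connreq_done\(\)\)\s+(\S+?):\s+"
    r"active\((\d+)\),\s+version\((\d+)\),\s+status\((-?\d+)\)"
)


def parse_connreq(line):
    """Extract NID, active flag, version and status, or None if no match."""
    m = CONNREQ.search(line)
    if m is None:
        return None
    nid, active, version, status = m.groups()
    return {
        "nid": nid,
        "active": int(active),
        "version": int(version),
        "status": int(status),
    }


def errno_name(status):
    """Map a negated errno to its symbolic name on the local platform."""
    return errno.errorcode.get(abs(status), "unknown")


if __name__ == "__main__":
    fields = parse_connreq(DK_LINE)
    # The printed name depends on the platform's errno table.
    print(fields["nid"], fields["status"], errno_name(fields["status"]))
```

Because errno numbering differs between operating systems, the name lookup is only meaningful when run on Linux, the platform that produced the log.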
I'm attaching a dk with +net as fir-io7-s1_dk.log.gz.
Also attaching kernel logs as fir-io7-s1_kern.log and the output of lnetctl stats show as fir-io7-s1_lnetctl_stats.txt.
simmonsja, please file a new ticket for whatever issue is being hit, rather than reopening a ticket that was closed 4 years ago. You can link to this ticket for reference.
Even if the failure symptoms are the same or similar, there is no guarantee that the root cause of the new issue is the same as this one, so piling on to this ticket will just muddy the waters. For tracking purposes it is also better to leave this old 2.14 ticket alone and start a new one.