Details
-
Bug
-
Resolution: Not a Bug
-
Major
-
None
-
Lustre 2.4.3
-
None
-
3
-
14961
Description
We have two IB fabrics connected by two Lustre routers. On one side, the fabric is connected to the routers via Obsidian Longbows; on the other side, the fabric is directly connected to the routers via a QDR switch.
Fabric1_o2ib233 <--->LONGBOW1<----<ROUTER1>---->QDR<--Fabric2_o2ib
Fabric1_o2ib233 <--->LONGBOW2<----<ROUTER2>----->QDR<--Fabric2_o2ib
We get router disconnects on the Fabric2_o2ib side, with errors like this on the routers:
LNet: 1310:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.74@o2ib
LNet: 1308:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.86@o2ib
LNet: 1312:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.25.242@o2ib
LNet: 1312:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.25.156@o2ib
LNet: 1314:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.80@o2ib
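To see which peers are involved most often, the "Conn race" messages can be tallied per NID. The following is a minimal sketch, not a Lustre tool: the function name count_conn_races and the regular expression are assumptions based only on the message format shown above, and the sample text is two of the lines from this ticket.

```python
import re
from collections import Counter

# Matches the peer NID at the end of a kiblnd_passive_connect "Conn race" message.
# The pattern is an assumption derived from the log lines quoted in this ticket.
CONN_RACE = re.compile(r"kiblnd_passive_connect\(\)\) Conn race ([\d.]+@\w+)")


def count_conn_races(log_text):
    """Return a Counter mapping peer NID -> number of 'Conn race' messages."""
    return Counter(CONN_RACE.findall(log_text))


sample = (
    "LNet: 1310:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) "
    "Conn race 10.151.27.74@o2ib\n"
    "LNet: 1308:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) "
    "Conn race 10.151.27.86@o2ib\n"
)

for nid, hits in count_conn_races(sample).most_common():
    print(nid, hits)
```

Because the pattern is applied with findall, it works whether the messages are one per line or run together on a single line.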
ROUTER MODULE SETTINGS
options lnet networks="o2ib(ib1),o2ib233(ib0)" forwarding=enabled
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options ko2iblnd timeout=150
options ko2iblnd retry_count=7
options ko2iblnd peer_timeout=0
options ptlrpc at_min=100
SERVERS SETTINGS
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options lnet networks=o2ib(ib1),o2ib100(ib1) routes="o2ib233 10.151.27.[58,93]@o2ib" dead_router_check_interval=60 live_router_check_interval=60
# Get rid of messages for missing, special-purpose hardware (LU-1599)
blacklist padlock-sha
options ko2iblnd timeout=150
options ko2iblnd retry_count=7
options ko2iblnd peer_timeout=0
options ptlrpc at_min=100
CLIENTS
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options lnet networks=o2ib233(ib1) routes="o2ib 10.153.27.[58,93]@o2ib233" dead_router_check_interval=60 live_router_check_interval=60
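The routes= options above name the gateways with a bracketed list in the last octet. As a rough illustration of how such a specification reads, the sketch below expands one bracket group into individual NIDs. It is a simplified assumption about the syntax (a single bracket group holding comma-separated numbers or lo-hi ranges, no stride), not LNet's actual parser, and the helper name expand_nids is hypothetical.

```python
import re


def expand_nids(spec):
    """Expand one [a,b] or [a-b] group in a NID spec into a list of NIDs.

    Simplified on purpose: handles a single bracket group with
    comma-separated numbers and inclusive lo-hi ranges only.
    """
    match = re.fullmatch(r"(.*)\[([\d,\-]+)\](.*)", spec)
    if match is None:
        return [spec]  # no bracket group: the spec is already a single NID
    prefix, body, suffix = match.groups()
    numbers = []
    for part in body.split(","):
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            numbers.extend(range(low, high + 1))
        else:
            numbers.append(int(part))
    return [f"{prefix}{n}{suffix}" for n in numbers]


for nid in expand_nids("10.153.27.[58,93]@o2ib233"):
    print(nid)
```

Read this way, each routes= line lists two gateway NIDs, which matches the two routers in the topology described above.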