[LU-5364] Lustre Router connection hangs one side of fabric Created: 17/Jul/14  Updated: 12/Aug/14  Resolved: 22/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 14961

 Description   

We have 2 IB fabrics connected by 2 Lustre routers. One fabric is connected to the routers via Obsidian Longbows, and the other fabric is directly connected to the routers via a QDR switch.

Fabric1_o2ib233 <--->LONGBOW1<----<ROUTER1>---->QDR<--Fabric2_o2ib
Fabric1_o2ib233 <--->LONGBOW2<----<ROUTER2>----->QDR<--Fabric2_o2ib

We get router disconnects on the Fabric2_o2ib side, with errors like this on the routers:

LNet: 1310:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.74@o2ib
LNet: 1308:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.86@o2ib
LNet: 1312:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.25.242@o2ib
LNet: 1312:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.25.156@o2ib
LNet: 1314:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.80@o2ib

ROUTER MODULE SETTINGS

options lnet networks="o2ib(ib1),o2ib233(ib0)" forwarding=enabled
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options ko2iblnd timeout=150
options ko2iblnd retry_count=7
options ko2iblnd peer_timeout=0
options ptlrpc at_min=100

SERVERS SETTINGS

options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options lnet networks=o2ib(ib1),o2ib100(ib1) routes="o2ib233 10.151.27.[58,93]@o2ib" dead_router_check_interval=60 live_router_check_interval=60
# Get rid of messages for missing, special-purpose hardware (LU-1599)
blacklist padlock-sha
options ko2iblnd timeout=150
options ko2iblnd retry_count=7
options ko2iblnd peer_timeout=0
options ptlrpc at_min=100

CLIENTS

options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options lnet networks=o2ib233(ib1) routes="o2ib 10.153.27.[58,93]@o2ib233" dead_router_check_interval=60 live_router_check_interval=60
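
To make the routes= entries above easier to compare, here is a minimal Python sketch that expands the bracketed gateway list into individual router NIDs. The helper name expand_route is hypothetical and not part of any Lustre tooling. It assumes only the comma-list form used in this ticket and does not handle ranges or other LNet route syntax.

```python
import re


def expand_route(route):
    """Expand 'NET A.B.C.[x,y]@GWNET' into (NET, [gateway NIDs]).

    Hypothetical helper for illustration; handles comma lists only.
    """
    net, gateway = route.split()
    match = re.fullmatch(r"(.*)\[([\d,]+)\](.*)", gateway)
    if match is None:
        # No bracket list: the gateway is already a single NID.
        return net, [gateway]
    prefix, items, suffix = match.groups()
    return net, [f"{prefix}{item}{suffix}" for item in items.split(",")]


# Route strings copied from the server and client settings in this ticket.
for route in ("o2ib233 10.151.27.[58,93]@o2ib",
              "o2ib 10.153.27.[58,93]@o2ib233"):
    print(expand_route(route))
```

Each route line names two gateway NIDs, one per router, on the network the node itself is attached to.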


 Comments   
Comment by Peter Jones [ 17/Jul/14 ]

Amir

Could you please assist with this one?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 18/Jul/14 ]

currently investigating. Will update when I have more information.

Comment by Amir Shehata (Inactive) [ 18/Jul/14 ]

The race the error message refers to occurs when the router receives an IB connect request but a peer with the same NID (of the destination) already exists in the connecting state. This happens if the router is already in the process of establishing a connection with that NID. In this case the incoming connection gets dropped. There are multiple scenarios where that occurs:
1. if the router is in the process of transmitting a message to the destination NID, and is currently connecting
2. if the router receives 2 consecutive connects from the same NID (although I'm not sure this is a possible case)
3. if the router is reconnecting to the peer when it gets another connection request

I'm still investigating the code more thoroughly to try and understand which scenario is more likely.

Would it be possible to grab syslog messages from the router so we can see the errors in context?

Also, do you hit this issue right away, or does the system work for a while before the problem is encountered?
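
The drop behavior described in this comment can be sketched as a toy model in Python. This is an illustration under stated assumptions, not the o2iblnd implementation: PeerState, handle_incoming_connect, and the peers dictionary are hypothetical names, and the model covers only the single rule given above (an incoming connect is dropped while a peer with the same NID is in the connecting state).

```python
from enum import Enum


class PeerState(Enum):
    CONNECTING = "connecting"  # a connection to this NID is being established
    CONNECTED = "connected"


def handle_incoming_connect(peers, nid):
    """Decide what to do with an inbound connect request from `nid`.

    `peers` maps NID strings to PeerState (hypothetical structure).
    """
    if peers.get(nid) is PeerState.CONNECTING:
        # A connection to this NID is already in progress, so the
        # incoming one is dropped and reported as a race.
        return f"Conn race {nid}"
    peers[nid] = PeerState.CONNECTED
    return "accepted"


# The router is already connecting to the first NID, but not the second.
peers = {"10.151.27.74@o2ib": PeerState.CONNECTING}
print(handle_incoming_connect(peers, "10.151.27.74@o2ib"))
print(handle_incoming_connect(peers, "10.151.27.86@o2ib"))
```

In this model the rejected side is not left without a path to the router, because the connection already in progress is the one expected to complete.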

Comment by Amir Shehata (Inactive) [ 21/Jul/14 ]

After examining the code some more, these "Conn race" messages should not result in a hang. The router only dropped the incoming connection from that side because another connection to the same side was already in progress.

Could you clarify whether the symptoms being experienced are temporary disconnects or permanent hangs with no recovery?

Also, as indicated in my previous comments, if we could get the logs from both sides of the router, as well as logs from the router, when this problem occurs, that'll help in giving context to the problem.

Comment by Mahmoud Hanafi [ 22/Jul/14 ]

Further testing showed that this may have been due to the IB fabric.

You may close this for now. I will reopen it when we have more data.

Comment by Peter Jones [ 22/Jul/14 ]

ok thanks Mahmoud!
