[LU-5364] Lustre Router connection hangs one side of fabric Created: 17/Jul/14 Updated: 12/Aug/14 Resolved: 22/Jul/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 14961 |
| Description |
|
We have 2 IB fabrics connected by 2 Lustre routers. One fabric is connected via Obsidian Longbows, and the other fabric is directly connected to the routers via a QDR switch.

Fabric1_o2ib233 <--->LONGBOW1<----<ROUTER1>---->QDR<--Fabric2_o2ib
Fabric1_o2ib233 <--->LONGBOW2<----<ROUTER2>----->QDR<--Fabric2_o2ib

We get router disconnects on the Fabric2_o2ib side, with errors like these on the routers:

LNet: 1310:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.74@o2ib
LNet: 1308:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.86@o2ib
LNet: 1312:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.25.242@o2ib
LNet: 1312:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.25.156@o2ib
LNet: 1314:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.80@o2ib

ROUTER MODULE SETTINGS
options lnet networks="o2ib(ib1),o2ib233(ib0)" forwarding=enabled
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options ko2iblnd timeout=150
options ko2iblnd retry_count=7
options ko2iblnd peer_timeout=0
options ptlrpc at_min=100

SERVERS SETTINGS
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options lnet networks=o2ib(ib1),o2ib100(ib1) routes="o2ib233 10.151.27.[58,93]@o2ib" dead_router_check_interval=60 live_router_check_interval=60
# Get rid of messages for missing, special-purpose hardware (LU-1599)
blacklist padlock-sha
options ko2iblnd timeout=150
options ko2iblnd retry_count=7
options ko2iblnd peer_timeout=0
options ptlrpc at_min=100

CLIENTS
options ko2iblnd require_privileged_port=0
options ko2iblnd use_privileged_port=0
options lnet networks=o2ib233(ib1) routes="o2ib 10.153.27.[58,93]@o2ib233" dead_router_check_interval=60 live_router_check_interval=60
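
To see which peers are affected most often, the "Conn race" messages quoted above can be tallied per NID. The following is a minimal sketch, assuming the router syslog lines have the same shape as the ones in this description; the function name `count_conn_races` is hypothetical and not part of any Lustre tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   LNet: 1310:0:(o2iblnd_cb.c:2360:kiblnd_passive_connect()) Conn race 10.151.27.74@o2ib
# The pattern is an assumption based only on the log lines quoted in this ticket.
CONN_RACE = re.compile(r"kiblnd_passive_connect\(\)\) Conn race (\S+)")


def count_conn_races(lines):
    """Return a Counter mapping peer NID -> number of 'Conn race' messages."""
    counts = Counter()
    for line in lines:
        match = CONN_RACE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a router's syslog (for example, an open file object) gives a per-NID count that can be compared between the two routers.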
|
| Comments |
| Comment by Peter Jones [ 17/Jul/14 ] |
|
Amir, could you please assist with this one? Thanks, Peter |
| Comment by Amir Shehata (Inactive) [ 18/Jul/14 ] |
|
Currently investigating. Will update when I have more information. |
| Comment by Amir Shehata (Inactive) [ 18/Jul/14 ] |
|
The race the error message refers to occurs when the router receives an IB connection request while a peer with the same NID (of the destination) already exists in the connecting state. This happens if the router is already in the process of establishing a connection with that NID. In this case the incoming connection gets dropped. There are multiple scenarios in which this can occur, and I'm still investigating the code more thoroughly to understand which scenario is most likely. Would it be possible to grab syslog messages from the router so we can see the errors in context? Also, do you hit this issue right away, or does the system work for a while before the problem is encountered? |
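
The race described in the comment above can be modeled roughly as follows. This is a simplified sketch of the stated rule only (an incoming connect is dropped when a peer with the same NID is already in the connecting state); it is not the o2iblnd implementation, which may apply further conditions, and the `Peer` and `Router` classes and their method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    nid: str
    connecting: int = 0  # outbound connection attempts currently in flight


@dataclass
class Router:
    peers: dict = field(default_factory=dict)

    def start_active_connect(self, nid):
        """The router itself begins connecting to `nid`."""
        peer = self.peers.setdefault(nid, Peer(nid))
        peer.connecting += 1

    def passive_connect(self, nid):
        """An incoming connect arrives from `nid`."""
        peer = self.peers.get(nid)
        if peer is not None and peer.connecting > 0:
            # A connection to this NID is already being established,
            # so the incoming one is dropped.
            return "rejected: Conn race " + nid
        self.peers.setdefault(nid, Peer(nid))
        return "accepted"


router = Router()
router.start_active_connect("10.151.27.74@o2ib")
first = router.passive_connect("10.151.27.74@o2ib")   # races with the active connect
second = router.passive_connect("10.151.27.86@o2ib")  # no connection in progress
```

In this model a rejection only happens while the router's own attempt to the same NID is still in flight, so a dropped incoming connect does not by itself leave the peer without a connection.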
| Comment by Amir Shehata (Inactive) [ 21/Jul/14 ] |
|
After examining the code some more, these "Conn race" messages should not result in hanging. The router dropped the connection from that side only because another connection to that side was already in progress. I just want to clarify: are the symptoms being experienced temporary disconnects, or permanent hangs with no recovery? Also, as indicated in my previous comment, if we could get the logs from both sides of the router, as well as the logs from the router itself, from when this problem occurs, that would help give context to the problem. |
| Comment by Mahmoud Hanafi [ 22/Jul/14 ] |
|
Further testing showed that this may have been due to the IB fabric. You may close this for now. I will reopen it when we have more data. |
| Comment by Peter Jones [ 22/Jul/14 ] |
|
ok thanks Mahmoud! |