[LU-5874] DLC: the ongoing traffic was interrupted after adding a new network interface Created: 05/Nov/14 Updated: 19/Jan/15 Resolved: 19/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sarah Liu | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16426 |
| Description |
|
1. setup the system and run sanity

== sanity test 27B: call setstripe on open unlinked file/rename victim == 12:18:00 (1415218680)
Lustre: DEBUG MARKER: == sanity test 27B: call setstripe on open unlinked file/rename victim == 12:18:00 (1415218680)
LNet: Added LNI 192.168.4.74@o2ib [8/256/0/180]
LNet: No route to 192.168.4.47@o2ib via from 10.2.4.74@tcp
Lustre: 4806:0:(client.c:1934:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1415218690/real 1415218690] req@ffff880824129000 x1483963522156376/t0(0) o400->lustre-MDT0000-mdc-ffff880434a40800@192.168.4.47@o2ib:12/10 lens 224/224 e 0 to 1 dl 1415218753 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4806:0:(client.c:1934:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff880434a40800: Connection to lustre-MDT0000 (at 192.168.4.47@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LNet: Skipped 5 previous similar messages
LustreError: 166-1: MGC192.168.4.47@o2ib: Connection to MGS (at 192.168.4.47@o2ib) was lost; in progress operations using this service will fail
LNet: Removed LNI 192.168.4.74@o2ib
Lustre: lustre-OST0000-osc-ffff880434a40800: Connection restored to lustre-OST0000 (at 192.168.4.47@o2ib)
Lustre: Skipped 2 previous similar messages
LL_IOC_LOV_SETSTRIPE: No such file or directory
LL_IOC_LOV_SETSTRIPE: No such file or directory
Resetting fail_loc on all nodes...done.
PASS 27B (26s) |
| Comments |
| Comment by Jodi Levi (Inactive) [ 06/Nov/14 ] |
|
Amir, |
| Comment by Amir Shehata (Inactive) [ 06/Nov/14 ] |
|
Can you please provide the exact steps used to reproduce this issue? Ideally, include the set of lnetctl commands used and any show output, so we can see the change. |
| Comment by Sarah Liu [ 08/Nov/14 ] |
|
1. Set up a Lustre filesystem with 1 MDT and 1 OST. The servers use o2ib, there is a router, and 1 client uses tcp. Mount the filesystem and run sanity on the client side; the errors above then appear on the client. |
| Comment by Amir Shehata (Inactive) [ 24/Nov/14 ] |
|
The issue here is that, once the network is added dynamically, both the client and the servers are on ib0. This makes the configuration invalid, because the route that bridges ib0 and tcp is still present. There are a couple of options to fix this: Currently investigating the best solution. |
| Comment by Isaac Huang (Inactive) [ 25/Nov/14 ] |
|
If ib0 is added on the client, I thought lnet would automatically switch to ib0 to talk with the servers in ib0. There shouldn't be those "No route to ......" errors. Maybe I missed something? It's not a good idea to remove the route, because if the ib0 interface is brought down later then the TCP client would not be able to talk to the servers any more (as the route was removed). |
| Comment by Amir Shehata (Inactive) [ 27/Nov/14 ] |
|
If I'm reading the following code correctly, from lnet_send():

    /* Is this for someone on a local network? */
    local_ni = lnet_net2ni_locked(LNET_NIDNET(dst_nid), cpt);
    if (local_ni != NULL) {
        if (src_ni == NULL) {
            src_ni = local_ni;
            src_nid = src_ni->ni_nid;
        } else if (src_ni == local_ni) {
            lnet_ni_decref_locked(local_ni, cpt);
        } else {
            lnet_ni_decref_locked(local_ni, cpt);
            lnet_ni_decref_locked(src_ni, cpt);
            lnet_net_unlock(cpt);
            LCONSOLE_WARN("No route to %s via from %s\n",
                          libcfs_nid2str(dst_nid),
                          libcfs_nid2str(src_nid));
            return -EINVAL;
        }

It seems to say that if you're trying to send to a local NI, which is the case in this test with the addition of ib0, then it expects local_ni and src_ni to be the same. However, the src_nid is still the @tcp NID, which as far as I can tell is stored in ptlrpc (not 100% sure yet). If so, what would trigger it to update the src_nid?

I'm also leaning towards rejecting the network addition. Basically, before adding an NI dynamically, I check whether its net is a remote net, and if so, I reject adding the NI. |
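
As a rough illustration of the "reject the addition" idea above, here is a minimal standalone sketch. It is not the code from the actual patch and does not use LNet's real data structures: `toy_route`, `toy_net_is_remote()` and `toy_add_ni()` are hypothetical names, the gateway NID in the usage note is made up, and the `-EEXIST` return value is an arbitrary choice for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the table of routed (remote) networks. */
struct toy_route {
	const char *remote_net;   /* net reached through a gateway, e.g. "o2ib" */
	const char *gateway_nid;  /* illustrative gateway NID, not used in the check */
};

/* Return 1 if 'net' is currently reached through a configured route. */
static int toy_net_is_remote(const struct toy_route *routes, size_t nroutes,
			     const char *net)
{
	size_t i;

	assert(routes != NULL || nroutes == 0);
	for (i = 0; i < nroutes; i++) {
		if (strcmp(routes[i].remote_net, net) == 0)
			return 1;
	}
	return 0;
}

/*
 * Decide whether a local NI with the given NID ("addr@net") may be added.
 * Returns 0 if allowed, -EINVAL for a malformed NID, and -EEXIST (an
 * assumption made for this sketch) if the net is already a remote net.
 */
static int toy_add_ni(const struct toy_route *routes, size_t nroutes,
		      const char *nid)
{
	const char *net = strchr(nid, '@');

	if (net == NULL)
		return -EINVAL;
	net++; /* skip the '@' */

	if (toy_net_is_remote(routes, nroutes, net))
		return -EEXIST;

	return 0;
}
```

With a single route to `o2ib` configured, `toy_add_ni()` would refuse `192.168.4.74@o2ib` but still accept another `@tcp` NID, which matches the intent of keeping the routed configuration valid.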
| Comment by Isaac Huang (Inactive) [ 01/Dec/14 ] |
|
Agree, and I'd suggest: |
| Comment by Amir Shehata (Inactive) [ 01/Dec/14 ] |
|
Some more details: mdc_setup() is only triggered on startup. It essentially picks the src_nid and sticks with it throughout. So the addition of a new "closer NI" doesn't retrigger updating the connection hash maintained in ptlrpc. Another option (but not as part of this bug): when a network is added, connections should be re-evaluated and updated if there exists an NI that creates a more preferred path to the destination. This would allow LNet to take the shortest path directly when updates occur. Ideally, however, it seems that NIDs shouldn't be visible outside of LNet. I realize, however, that this would be a major change. |
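
To make the pinned-source behaviour concrete, here is a small toy model, written under the assumption that the send path behaves like the lnet_send() branch quoted earlier in this ticket. Everything here is hypothetical (`toy_ni`, `toy_net2ni()`, `toy_send()`, `toy_reevaluate_src()`); it models only the local-net check, not routing, locking, or reference counting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical local network interface: a NID and the net it belongs to. */
struct toy_ni {
	const char *nid;
	const char *net;
};

/* Find a local NI on 'net', or NULL if that net is not local. */
static const struct toy_ni *toy_net2ni(const struct toy_ni *nis, size_t n,
					const char *net)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(nis[i].net, net) == 0)
			return &nis[i];
	}
	return NULL;
}

/*
 * Mirrors the quoted branch: if the destination net is local, the pinned
 * source NI must be unset or be that same local NI; otherwise fail.
 */
static int toy_send(const struct toy_ni *nis, size_t n,
		    const struct toy_ni *src_ni, const char *dst_net)
{
	const struct toy_ni *local_ni;

	assert(nis != NULL);
	local_ni = toy_net2ni(nis, n, dst_net);
	if (local_ni == NULL)
		return 0;        /* remote net: routed path, not modelled here */
	if (src_ni == NULL || src_ni == local_ni)
		return 0;        /* source matches the local NI */
	return -EINVAL;          /* the "No route to ... via from ..." case */
}

/* Hypothetical re-evaluation: prefer a local NI on the destination net. */
static const struct toy_ni *toy_reevaluate_src(const struct toy_ni *nis,
						size_t n,
						const struct toy_ni *pinned,
						const char *dst_net)
{
	const struct toy_ni *local_ni = toy_net2ni(nis, n, dst_net);

	return local_ni != NULL ? local_ni : pinned;
}
```

In this model, a client that pinned its `@tcp` NI at startup can send to `o2ib` while only the tcp NI exists, fails with `-EINVAL` as soon as an `o2ib` NI is added, and succeeds again only if the source is re-evaluated with something like `toy_reevaluate_src()`.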
| Comment by Gerrit Updater [ 03/Dec/14 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/12912 |
| Comment by Gerrit Updater [ 19/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12912/ |
| Comment by Peter Jones [ 19/Jan/15 ] |
|
Landed for 2.7 |