[LU-8943] Enable Multiple IB/OPA Endpoints Between Nodes Created: 15/Dec/16 Updated: 05/Dec/17 Resolved: 12/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Improvement | Priority: | Critical |
| Reporter: | Doug Oucharek (Inactive) | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lnet |
| Description |
|
OPA driver optimizations are based on the MPI model, where multiple endpoints are expected between two given nodes. To enable this optimization for Lustre, we need to make it possible, via an LND-specific tuneable, to create multiple endpoints and to balance the traffic over them. I have already created an experimental patch to test this theory. I was able to push OPA performance to 12.4 GB/s simply by having 2 QPs between the nodes and round-robining messages between them. This Jira ticket is for productizing my patch and testing it thoroughly for OPA and IB. Test results will be posted to this ticket. |
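For context, the experiment described above corresponds to a configuration like the following on both nodes (a sketch only: the tuneable name conns_per_peer comes from the later comments, and the file path is illustrative):

# /etc/modprobe.d/ko2iblnd.conf -- illustrative path
# two QPs per peer, with messages round-robined across them
options ko2iblnd conns_per_peer=2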
| Comments |
| Comment by Gerrit Updater [ 31/Jan/17 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/25168 |
| Comment by Doug Oucharek (Inactive) [ 14/Feb/17 ] |
|
To activate this patch, you need to use the following module option:

options ko2iblnd conns_per_peer=<n>

where <n> is the number of QPs you want per peer connection. At the moment, both sides of the connection must have the same setting (I need to fix this in the patch; only the client side should need this). I found that setting <n> to 6 gave me amazing performance.

Note: I have not tried this patch yet with the recommended hfi tunings. They "will" interfere with this patch and should initially be avoided.

Another note: I believe there is a race condition in the hfi driver that we trigger when there is too much parallelism. A couple of times while running this patch, I found the hfi driver "missed" an event. I am talking to the OPA developers about this. |
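A minimal way to apply and verify that setting, assuming the standard modprobe configuration mechanism (the sysfs path below is the stock kernel module-parameter interface, not something added by this patch):

echo "options ko2iblnd conns_per_peer=6" > /etc/modprobe.d/ko2iblnd.conf
modprobe ko2iblnd
# confirm the value the module actually picked up:
cat /sys/module/ko2iblnd/parameters/conns_per_peer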
| Comment by Doug Oucharek (Inactive) [ 28/Feb/17 ] |
|
The patch for this ticket is showing a lot of promise. To productize it so we can land it to master, I need to do the following:
In addition, testing needs to be done to see how much more CPU this feature consumes when it is activated. We need to measure the costs as well as the benefits. This all needs to be done with MLX hardware as well as OPA, just to see what happens if the feature is activated on MLX-based networks. |
| Comment by Doug Oucharek (Inactive) [ 19/Apr/17 ] |
|
I have attached an Excel spreadsheet showing the performance changes with different conns_per_peer settings for both OPA and MLX-QDR. For OPA, there is a tab showing the change both without any HFI1 tunings (i.e. just the defaults) and with the recommended HFI1 tunings.

Summary: using this patch with conns_per_peer of 3 and the recommended HFI1 tunings provides good and consistent performance.

Still to be done: testing this patch for backwards compatibility. |
| Comment by Doug Oucharek (Inactive) [ 24/Apr/17 ] |
|
Backwards-compatibility testing looks good. An upgraded node that initiates connections will create conns_per_peer connections, and the non-upgraded receiver node will allow that many connections to be created. However, the non-upgraded node will not "use" all the connections to send messages, only the first one, so performance will not improve. If things are reversed (non-upgraded initiator to upgraded receiver), the pair will work as if neither side were upgraded, because it is the initiator that decides how many connections to create, and in this case it will just be one.

So, to get the performance benefit, both sides of a connection need to be upgraded with this patch and the initiator needs to have conns_per_peer set > 1.

Based on the attached spreadsheet, I recommend OPA systems with many cores use conns_per_peer = 3 and these HFI1 parameters:

options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70

However, if you are on a VM or have a limited number of cores, change conns_per_peer to 4 and krcvqs to 4 in the HFI1 parameters. |
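Put together, the two recommended variants would look like this in a modprobe configuration file (file path illustrative; use one variant per node):

# /etc/modprobe.d/lnet-tunings.conf -- illustrative path
# OPA node with many cores:
options ko2iblnd conns_per_peer=3
options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70

# OPA node in a VM or with few cores:
options ko2iblnd conns_per_peer=4
options hfi1 krcvqs=4 piothreshold=0 sge_copy_mode=2 wss_threshold=70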
| Comment by Andreas Dilger [ 24/Apr/17 ] |
Are you going to add these to the /usr/sbin/ko2iblnd-probe script, or will they be set by default in some other manner, or will this be up to the user to discover and set? At a very minimum there should be an update to the Lustre User Manual (see https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual), but providing good performance out of the box is preferred. |
| Comment by Doug Oucharek (Inactive) [ 04/May/17 ] |
|
I did update the OPA defaults to set conns_per_peer to 4 when OPA is detected. I'll also update the manual. I bumped conns_per_peer to 4 from 3 because the OPA team is going to start recommending a krcvqs default of 4, especially for a low number of cores (e.g. VMs). Having a conns_per_peer of 4 helps to compensate for the lower krcvqs number, so we should work well out of the box whether krcvqs is 4 or 8. |
| Comment by Gerrit Updater [ 12/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25168/ |
| Comment by Peter Jones [ 12/May/17 ] |
|
Landed for 2.10 |
| Comment by Dmitry Eremin (Inactive) [ 13/May/17 ] |
|
I observed strange behavior. It looks like, after this commit, I cannot unload the ko2iblnd module. LNet stays busy even though everything unmounted successfully. Only a reboot helps.
|
| Comment by Andreas Dilger [ 14/May/17 ] |
|
Does "lctl network down" or "lnetctl lnet unconfigure" help? |
| Comment by Dmitry Eremin (Inactive) [ 14/May/17 ] |
|
No, as I mentioned before, only a reboot helps.

# lustre_rmmod
rmmod: ERROR: Module ko2iblnd is in use
# lsmod|less
Module Size Used by
ko2iblnd 233790 1
ptlrpc 1343928 0
obdclass 1744518 1 ptlrpc
lnet 483843 3 ko2iblnd,obdclass,ptlrpc
libcfs 416336 4 lnet,ko2iblnd,obdclass,ptlrpc
[...]
# lctl network down
LNET busy
lnetctl > lnet unconfigure
unconfigure:
- lnet:
errno: -16
descr: "LNet unconfigure error: Device or resource busy"
lnetctl > lnet unconfigure --all
unconfigure:
- lnet:
errno: -16
descr: "LNet unconfigure error: Device or resource busy"
# lustre_rmmod
rmmod: ERROR: Module ko2iblnd is in use
|
| Comment by Doug Oucharek (Inactive) [ 15/May/17 ] |
|
When I created the performance spreadsheet, I needed to keep changing conns_per_peer, and I had no problems taking down and bringing up LNet using these commands:

Up:
modprobe lnet
lctl network configure
modprobe lnet-selftest

Down:
rmmod lnet-selftest
lctl network down
rmmod ko2iblnd
rmmod lnet

There must be something different about what you are doing which is triggering ref counters to not be reduced. Are you using DLC? What is your environment? Are both nodes running the latest code with this patch? |
| Comment by Dmitry Eremin (Inactive) [ 15/May/17 ] |
|
I'm using a new Lustre client with this patch and old Lustre servers without this patch. So I just mount the Lustre FS, use it, and then try to unload the modules after umount. I don't use DLC. I have CentOS 7.3 on both sides.
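For reference, the sequence being described is essentially the following (the MGS NID, fsname, and mount point are placeholders):

mount -t lustre <mgsnode>@o2ib:/<fsname> /mnt/lustre
# ... use the file system ...
umount /mnt/lustre
lustre_rmmod    # fails with: Module ko2iblnd is in use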
|
| Comment by Doug Oucharek (Inactive) [ 15/May/17 ] |
|
That might be the reason. The client will create multiple connections, but the server will only have one that they are all talking to. When one connection on the client is closed, the connection on the server will be closed, and I suspect the remaining connections on the client then can't be closed. I'll have to look at the code to see what I can do in this situation. I suspect that if the server has the patch, you would not have a problem. |
| Comment by Doug Oucharek (Inactive) [ 16/May/17 ] |
|
I just tried to reproduce this with the passive node being unpatched and was not able to reproduce your issue. The "lctl network down" takes a long time, but does succeed. There must be something else going on here. Do you know if your parameters, like map_on_demand, are different? Is a reconnection happening to renegotiate the parameters? That is something I have not tried. |
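If it helps with that comparison, the ko2iblnd parameters on each node can be dumped through the standard module-parameter sysfs interface (a generic sketch; the exact parameter set depends on the build):

for p in /sys/module/ko2iblnd/parameters/*; do
    echo "$p = $(cat "$p")"
done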
| Comment by Doug Oucharek (Inactive) [ 16/May/17 ] |
|
Dmitry, when you get the file system mounted, can you issue the following sequence on both nodes to ensure we are creating 4 connections on each:
# lctl
lctl > network o2ib
lctl > conn_list

You should see 4 connections to the peer if the initiator (usually the client) has the MultiQP patch, and 1 connection to the peer if it doesn't. |
| Comment by Dmitry Eremin (Inactive) [ 16/May/17 ] |
# lctl
lctl > network o2ib
lctl > conn_list
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.231@o2ib mtu -1
192.168.213.231@o2ib mtu -1
192.168.213.231@o2ib mtu -1
192.168.213.231@o2ib mtu -1
192.168.213.232@o2ib mtu -1
192.168.213.232@o2ib mtu -1
192.168.213.232@o2ib mtu -1
192.168.213.232@o2ib mtu -1
192.168.213.233@o2ib mtu -1
192.168.213.233@o2ib mtu -1
192.168.213.233@o2ib mtu -1
192.168.213.233@o2ib mtu -1
192.168.213.234@o2ib mtu -1
192.168.213.234@o2ib mtu -1
192.168.213.234@o2ib mtu -1
192.168.213.234@o2ib mtu -1
192.168.213.235@o2ib mtu -1
192.168.213.235@o2ib mtu -1
192.168.213.235@o2ib mtu -1
192.168.213.235@o2ib mtu -1
192.168.213.236@o2ib mtu -1
192.168.213.236@o2ib mtu -1
192.168.213.236@o2ib mtu -1
192.168.213.236@o2ib mtu -1
# lnetctl lnet unconfigure --all
unconfigure:
- lnet:
errno: -16
descr: "LNet unconfigure error: Device or resource busy"
Client: 2.9.57_48_g0386263 |
| Comment by Dmitry Eremin (Inactive) [ 16/May/17 ] |
|
192.168.213.125@o2ib is the client. |
| Comment by Dmitry Eremin (Inactive) [ 16/May/17 ] |
|
From the server:

# lctl
lctl > network o2ib
lctl > conn_list
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
192.168.213.125@o2ib mtu -1
... |
| Comment by Doug Oucharek (Inactive) [ 17/May/17 ] |
|
Cliff is seeing this same problem on the soak cluster, but there is no OPA there, only MLX IB. I'm beginning to wonder if this is a problem with the Multi-Rail drop rather than with this change. |
| Comment by Peter Jones [ 17/May/17 ] |
|
Would it be a good idea to track all this under a new ticket instead of tacking onto an already closed one? |
| Comment by Doug Oucharek (Inactive) [ 17/May/17 ] |
|
Cliff created a ticket for this already.

Summary: this appears to have been introduced by patch https://review.whamcloud.com/#/c/26959/ and not by the change under this ticket. ptlrpc is no longer being unloaded by lustre_rmmod, so lnet won't unload. |