[LU-7650] ko2iblnd map_on_demand can't negotiate when page sizes are different between nodes. Created: 11/Jan/16 Updated: 14/Jun/19 Resolved: 09/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | ko2iblnd, lnet, ppc | ||
| Environment: |
Power8 client nodes running Ubuntu 14.04 and Lustre servers running RHEL 6.6, all running the latest pre-2.8 Lustre stack. |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
With the mlx5 driver support mostly complete, I have been testing the latest 2.8 stack, plus some ko2iblnd patches to make it work, on our Power8 nodes. With the work from |
| Comments |
| Comment by James A Simmons [ 11/Jan/16 ] |
|
Jeremy, I have been looking at this code and I figure there are two ways to resolve this, but both involve changing WIRE_ATTR kib_connparams_t, which is why this needs to land for 2.8 to avoid compatibility issues if we wait until 2.9. One solution involves changing ibcp_max_frags to a u32 field and having it represent the total bytes consumed by the fragments. This limits the total memory usage to 4GB for the fragments. The second solution is to add a new field, u16 ibcp_page_size, to WIRE_ATTR kib_connparams_t which reports the page size used by the remote connection. In this case we can continue to support the 64K frags * 4K page = 256 MB of memory available for RDMA. In the case of Power8 it would be 64K * 64K = 4 GB of memory. What do you think is the best approach? |
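For illustration, a rough sketch of the two layouts under discussion. The first struct follows the current o2iblnd.h fields; the second is a hypothetical shape of the ibcp_page_size option, with the struct name, padding field, and units purely illustrative rather than from any landed patch:

```c
/* Current on-the-wire connection parameters (per o2iblnd.h, for reference). */
typedef struct {
	__u16	ibcp_queue_depth;
	__u16	ibcp_max_frags;		/* today: counted in native pages */
	__u32	ibcp_max_msg_size;
} WIRE_ATTR kib_connparams_t;

/* Hypothetical shape of the second option: advertise the sender's page
 * size so the peer can interpret ibcp_max_frags correctly. */
typedef struct {
	__u16	ibcp_queue_depth;
	__u16	ibcp_max_frags;		/* still in sender's native pages */
	__u16	ibcp_page_size;		/* sender's PAGE_SIZE, e.g. in KB */
	__u16	ibcp_reserved;		/* keep 32-bit alignment */
	__u32	ibcp_max_msg_size;
} WIRE_ATTR kib_connparams_v2_t;
```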
| Comment by Doug Oucharek (Inactive) [ 11/Jan/16 ] |
|
If you configure for 4K pages on Power8, is this still a problem? |
| Comment by Andreas Dilger [ 12/Jan/16 ] |
|
Why not just report the number of pages in 4KB units for the PPC nodes? I don't think there are any PPC servers in use anywhere, so this shouldn't affect any compatibility, and it avoids the need to change the protocol. |
| Comment by James A Simmons [ 12/Jan/16 ] |
|
Actually this affects PowerPC client nodes, which are in use. Reporting in 4KB units seems hackish to me. Also, even on x86 you can have 1MB pages with hugetlb support. Sounds like reporting back total bytes is the best solution. |
| Comment by Andreas Dilger [ 13/Jan/16 ] |
|
James, I don't understand how this could affect compatibility for PPC nodes which are in use, when it doesn't currently work? The only concern would be if there are PPC clients and servers using a large PAGE_SIZE that would suddenly get interop issues between different versions of Lustre. You can't just change the size of kib_connparams_t nor the meaning of the fields significantly since this affects the IB wire protocol and would break interoperability for IB networks. I don't see that reporting the number of 4KB fragments is too hackish, since this is what all the current x86 clients do, and indeed the peer shouldn't know the details of the PAGE_SIZE for the node. If this was an x86 host it would have the reported maximum number of fragments, while the PPC host will just not have this many fragments in any RDMA. This also reminds me of |
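As a minimal sketch of reporting fragments in fixed 4KB wire units regardless of the local PAGE_SIZE, assuming the conversion happens at the connparams boundary; the IBLND_FRAG_SHIFT constant and the helper names are illustrative, not from a landed patch:

```c
#define IBLND_FRAG_SHIFT	12	/* wire fragment unit: 4KB */

/* Convert a local fragment count (in PAGE_SIZE units) to 4KB wire units. */
static inline __u16 kiblnd_frags_to_wire(int nfrags)
{
	return (__u16)(nfrags << (PAGE_SHIFT - IBLND_FRAG_SHIFT));
}

/* Convert a peer's advertised 4KB count back into local PAGE_SIZE units,
 * rounding down so we never promise more fragments than we can map. */
static inline int kiblnd_frags_from_wire(__u16 wire_frags)
{
	return wire_frags >> (PAGE_SHIFT - IBLND_FRAG_SHIFT);
}
```

On a 64K-page Power8 node this would advertise 16 local fragments as 256 wire fragments, which is what a 4K-page x86 peer already expects.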
| Comment by James A Simmons [ 13/Jan/16 ] |
|
The reason is that most PowerPC client nodes are using mlx4 based hardware. When you move to mlx5 based hardware you need to turn on map_on_demand to make it work with other external mlx4 systems. Since kib_connparams_t was introduced in 2.8 we don't have to worry about interoperability issues yet. That is why I'm proposing this approach. If you have to adjust to 4KB units, the peers still have to know the default page size, since the peers are the ones telling the connections what number of 4K pages to use. You can't avoid having to probe the page size on the native platform. My worry also is that ibcp_max_frags is a u16, which with 4K pages means a limit of 256MB. That might seem like enough today, but a few years down the road this will be a problem. |
| Comment by Jeremy Filizetti [ 13/Jan/16 ] |
|
My opinion is that the lingering o2iblnd issues need to become a priority and be resolved, regardless of which release the fixes are introduced in. I'm OK with breaking protocol compatibility, but I think we should do it by incrementing IBLND_MSG_VERSION and allowing all the new functionality to be leveraged only by the new messaging version. That probably should be tied to a major release and well advertised so people know. There are enough significant changes to ko2iblnd in Lustre that we should stop trying to hack in fixes and come up with a solid plan to restructure things so they make sense (different o2ib parameters, memory registration, channel bonding, driver bugs/features, etc.). I noticed that in the patch for |
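As a sketch of the version-gating being described here: IBLND_MSG_VERSION_3, the 4K-unit interpretation, and the helper function are hypothetical, while the existing version constants follow o2iblnd.h:

```c
#define IBLND_MSG_VERSION_1	0x11
#define IBLND_MSG_VERSION_2	0x12
#define IBLND_MSG_VERSION_3	0x13	/* hypothetical: page-size aware peers */
#define IBLND_MSG_VERSION	IBLND_MSG_VERSION_3

/* During connection setup, only apply the new interpretation of
 * ibcp_max_frags when the peer also negotiated the new version. */
static int kiblnd_peer_max_frags(int peer_version, __u16 wire_frags)
{
	if (peer_version >= IBLND_MSG_VERSION_3)
		return wire_frags >> (PAGE_SHIFT - 12);	/* 4KB wire units */
	return wire_frags;				/* legacy semantics */
}
```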
| Comment by James A Simmons [ 16/Feb/16 ] |
|
Jeremy, a separate ticket exists for those fragment issues, |
| Comment by James A Simmons [ 06/Jul/16 ] |
|
Now that I'm back to testing on the Power8 platform, I'm looking at this issue again. I worked out a few of the kinks, but I have found problems in kiblnd_init_rdma(). For the Power8 node, which has a page size of 64K, the number of fragments will be 16. On the x86 nodes it will be 256. This means conn->ibc_max_frags will be 16. When looping in kiblnd_init_rdma(), the minimum amount of data processed per iteration is the smallest fragment size, which is the destination node's 4K page. So each ib_sge, which covers 64K, is consumed while processing only 4K. We only have 16 ib_sge to work with, but the loop expects 256 pages to be processed. This of course fails. Looking at it, the solution seems to be to allocate 256 ib_sge on the Power8 nodes, like the x86 nodes, and just use the first 4K of each page to send to the x86 node. This seems wasteful. Does anyone see another way to handle this? |
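A small standalone illustration of the mismatch described above, using the numbers from this comment; this is not the actual kiblnd_init_rdma() code:

```c
#include <stdio.h>

int main(void)
{
	const long lnet_mtu = 1 << 20;		/* 1MB LNet message */
	const long x86_page = 4 << 10;		/* 4K pages  -> 256 fragments */
	const long ppc_page = 64 << 10;		/* 64K pages ->  16 fragments */

	long x86_frags = lnet_mtu / x86_page;	/* 256 */
	long ppc_frags = lnet_mtu / ppc_page;	/*  16 */

	/* The negotiated conn->ibc_max_frags ends up as the smaller count. */
	long sge_available = ppc_frags;

	/* But the RDMA is built at the granularity of the destination's 4K
	 * fragments, so one 64K-capable ib_sge is burned per 4K of data. */
	long sge_needed = lnet_mtu / x86_page;	/* 256 */

	printf("x86 frags %ld, ppc frags %ld: sges available %ld, needed %ld\n",
	       x86_frags, ppc_frags, sge_available, sge_needed);
	return 0;	/* 16 available vs 256 needed -> the transfer fails */
}
```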
| Comment by Gerrit Updater [ 14/Jul/16 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: http://review.whamcloud.com/21304 |
| Comment by Olaf Weber [ 22/Jul/16 ] |
|
Looking over the code, properly supporting LNET_MAX_PAYLOAD != LNET_MTU requires adding code that splits a request to move more than LNET_MTU of data into multiple messages. There is some code related to this in lnet_parse(), where clearly the idea is that a router never sees messages larger than LNET_MTU. But code that does the actual splitting and recombining appears to be absent. Looking at the patch, the bulk of the work appears to be done by defining LNET_MAX_IOV to a smaller value, ensuring that no messages bigger than LNET_MTU will be generated. But this seems mostly to be an implication of the code (# iov entries * page size) rather than something explicitly checked and enforced. I'd be inclined to change the LNET_MAX_PAYLOAD checks to be < LNET_MTU and > LNET_MTU, to explicitly indicate that only == LNET_MTU works until message fragments are implemented. LNET_MAX_IOV I'd be inclined to define as (LNET_MTU >> PAGE_SHIFT) for the same reason. |
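A sketch of the explicit checks being proposed, assuming the existing LNET_MTU definition from the LNet headers; the placement of the preprocessor guard is illustrative:

```c
#define LNET_MTU_BITS	20
#define LNET_MTU	(1 << LNET_MTU_BITS)	/* 1MB */

/* Until message splitting/recombining exists, only an exact match works. */
#if defined(LNET_MAX_PAYLOAD) && LNET_MAX_PAYLOAD != LNET_MTU
#error "LNET_MAX_PAYLOAD != LNET_MTU requires LNet message fragmentation"
#endif

/* Derive the iov limit from the MTU and the native page size instead of
 * hard-coding 256 entries. */
#define LNET_MAX_IOV	(LNET_MTU >> PAGE_SHIFT)
```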
| Comment by James A Simmons [ 26/Jul/16 ] |
|
I scaled back the patch to just handle the fragment count issues of ko2iblnd when nodes with different page sizes communicate. With the scaled back patch the LNet layer will be transmitting 16MB LNet packets instead of the normal 1MB packets, but that doesn't appear to break anything. On our Power8 nodes it is consuming nearly an extra GB of memory for no reason though. I found fixing up LNET_MAX_IOV/LNET_MTU to be far more challenging, since various Lustre kernel and utilities code assumes these hard coded values. |
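For reference, the arithmetic behind the 16MB figure (a worked example, not code from the patch):

```c
/* LNET_MAX_IOV stays hard-coded at 256 entries, but each entry now maps a
 * full native page:
 *   x86 (4K pages):     256 * 4KB  =  1MB  (the intended LNET_MTU)
 *   Power8 (64K pages): 256 * 64KB = 16MB  (16x larger LNet messages)
 */
```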
| Comment by Gerrit Updater [ 22/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21304/ |
| Comment by James A Simmons [ 22/Aug/16 ] |
|
ko2iblnd now works for x86 <-> Power8 |
| Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ] |
|
I tried reproducing the original problem reported in

I'm not suggesting changing this patch now that it has landed, but rather to increase the pressure to fix |
| Comment by James A Simmons [ 01/Sep/16 ] |
|
Excellent that we have a reproducer. Strange that we have never encountered any of these problems. Mind you, an error is not so productive either. |
| Comment by Gerrit Updater [ 02/Sep/16 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/22281 |
| Comment by James A Simmons [ 02/Sep/16 ] |
|
Since a check that breaks support for Power8 is being added back, this ticket needs to be reopened until Power8 is supported again. |
| Comment by Gerrit Updater [ 10/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22281/ |
| Comment by Peter Jones [ 10/Sep/16 ] |
|
Landed for 2.9 |
| Comment by James A Simmons [ 29/Oct/16 ] |
|
Appears support for Power8 broke things. |
| Comment by Peter Jones [ 29/Oct/16 ] |
|
Yep. We'll need to take another run at this. |
| Comment by James A Simmons [ 29/Oct/16 ] |
|
With the new RDMA RW API available, and with map_on_demand being broken not only on Power8 but in general, it looks like a big rewrite of the ko2iblnd driver is needed. |
| Comment by Doug Oucharek (Inactive) [ 31/Oct/16 ] |
|
James: how far back does the new RDMA RW API go (kernel version)? My understanding is the new API starts around Linux 4.6, which is very far out for current Lustre users. We need to band-aid ko2iblnd to keep it going for the next few years. I agree that ko2iblnd is going to need a rewrite to stay relevant for the upstream kernel. |
| Comment by James A Simmons [ 07/Nov/16 ] |
|
Actually, I just got a copy of the RHEL 7.3 source and the new RDMA generic API is there. So the answer is that the solution is ready for today's customers. |
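For anyone unfamiliar with it, a rough sketch of how the generic RDMA R/W API (include/rdma/rw.h, merged around Linux 4.6 and carried in RHEL 7.3) is typically used; queue-pair setup and error handling are omitted, and the exact call sequence should be verified against the kernel sources:

```c
#include <rdma/rw.h>

static int do_rdma_write(struct ib_qp *qp, u8 port, struct scatterlist *sg,
			 u32 sg_cnt, u64 remote_addr, u32 rkey,
			 struct ib_cqe *cqe)
{
	struct rdma_rw_ctx ctx;
	int rc;

	/* Map the local scatterlist and build the work requests; the API
	 * hides whether MRs or plain sges are used underneath. */
	rc = rdma_rw_ctx_init(&ctx, qp, port, sg, sg_cnt, 0,
			      remote_addr, rkey, DMA_TO_DEVICE);
	if (rc < 0)
		return rc;

	/* Post the chained work requests; completion arrives via cqe. */
	rc = rdma_rw_ctx_post(&ctx, qp, port, cqe, NULL);

	/* In real code the context is torn down from the completion handler:
	 * rdma_rw_ctx_destroy(&ctx, qp, port, sg, sg_cnt, DMA_TO_DEVICE); */
	return rc;
}
```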
| Comment by Doug Oucharek (Inactive) [ 07/Nov/16 ] |
|
Is there any good "architecture" style document about this API? I don't find header files very useful in understanding "how" the API should be used. |
| Comment by James A Simmons [ 26/Apr/17 ] |
|
Doug, do you have time to take another shot at this? |
| Comment by Peter Jones [ 26/Apr/17 ] |
|
James, Doug is focused on some of the upstream changes in the coming weeks, so he will not get to this in the 2.10 timeframe. Peter |
| Comment by James A Simmons [ 09/Apr/18 ] |
|
This is now a duplicate of |