[LU-7650] ko2iblnd map_on_demand can't negotiate when page sizes are different between nodes. Created: 11/Jan/16  Updated: 14/Jun/19  Resolved: 09/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: ko2iblnd, lnet, ppc
Environment:

Power8 client nodes running Ubuntu 14.04 and Lustre servers running RHEL 6.6, all running the latest pre-2.8 Lustre stack.


Issue Links:
Related
is related to LU-3322 ko2iblnd support for different map_on... Resolved
is related to LU-8573 IOR: niobuf.c:319:ptlrpc_register_bul... Resolved
is related to LU-10157 LNET_MAX_IOV hard coded to 256 Resolved
is related to LU-5718 RDMA too fragmented with router Resolved
is related to LU-10129 map-on-demand set to 32 doesn't work ... Resolved
is related to LU-12419 ppc64le: "LNetError: RDMA has too man... Closed
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With the mlx5 driver support mostly complete, I have been testing the latest 2.8 stack plus some ko2iblnd patches on our Power8 nodes to make it work. With the work from LU-5783 I can now set peer_credits to 63, and using the patch from LU-7101 I can set map_on_demand to 16 pages. The problem is that the Power8 clients can't connect to our servers, which have map_on_demand set to 256 pages. The reason is that the test validating the connection compares the page counts on each side. On Power8, 16 pages (64K each) cover the same RDMA size as 256 pages (4K each) on x86 platforms, so comparing page counts is an invalid test.



 Comments   
Comment by James A Simmons [ 11/Jan/16 ]

Jeremy, I have been looking at this code and I figure there are two ways to resolve this, but both involve changing WIRE_ATTR kib_connparams_t, which is why this needs to land for 2.8 to avoid compatibility issues if we wait until 2.9. One solution involves changing ibcp_max_frags to a u32 field and having it represent the total bytes consumed by the fragments. This limits the total memory usage for the fragments to 4 GB. The second solution is to add a new field, u16 ibcp_page_size, to WIRE_ATTR kib_connparams_t, which reports the page size used by the remote connection. In this case we can continue to support the 64K fragments * 4K pages = 256 MB of memory available for RDMA. In the case of Power8 it would be 64K * 64K = 4 GB of memory. What do you think is the best approach?

Comment by Doug Oucharek (Inactive) [ 11/Jan/16 ]

If you configure for 4K pages on Power8, is this still a problem?

Comment by Andreas Dilger [ 12/Jan/16 ]

Why not just report the number of pages in 4KB units for the PPC nodes? I don't think there are any PPC servers in use anywhere, so this shouldn't affect any compatibility, and it avoids the need to change the protocol.

Comment by James A Simmons [ 12/Jan/16 ]

Actually this affects PowerPC client nodes, which are in use. Reporting in 4KB units seems hackish to me. Also, even on x86 you can have 1 MB pages with hugetlb support. Sounds like reporting back total bytes is the best solution.

Comment by Andreas Dilger [ 13/Jan/16 ]

James, I don't understand how this could affect compatibility for PPC nodes which are in use, when it doesn't currently work? The only concern would be if there are PPC clients and servers using a large PAGE_SIZE that would suddenly get interop issues between different versions of Lustre.

You can't just change the size of kib_connparams_t nor the meaning of the fields significantly since this affects the IB wire protocol and would break interoperability for IB networks.

I don't see that reporting the number of 4KB fragments is too hackish, since this is what all the current x86 clients do, and indeed the peer shouldn't know the details of the PAGE_SIZE for the node. If this was an x86 host it would have the reported maximum number of fragments, while the PPC host will just not have this many fragments in any RDMA. This also reminds me of LU-7385, which is going to have to deal with a similar issue if it is using higher-order allocations to reduce the maximum number of fragments in a single transfer.

Comment by James A Simmons [ 13/Jan/16 ]

The reason is that most PowerPC client nodes are using mlx4-based hardware. When you move to mlx5-based hardware you need to turn on map_on_demand to make it work with other external mlx4 systems. Since kib_connparams_t has been introduced in 2.8 we don't have to worry about interoperability issues yet. That is why I'm proposing this approach. If you have to adjust to 4KB units, the peers still have to know the default page size, since the peers are the ones telling the connections what number of 4K pages to use. You can't avoid having to probe the page size on the native platform. My worry also is that ibcp_max_frags is u16, which with 4K pages means a limit of 256 MB. Now that might seem like enough today, but a few years down the road this will be a problem.

Comment by Jeremy Filizetti [ 13/Jan/16 ]

My opinion is that the lingering o2iblnd issues need to become a priority to be resolved regardless of which release they are introduced in. I'm ok with breaking protocol compatibility but I think we should do it by incrementing IBLND_MSG_VERSION and allow all the new functionality to only be leveraged by the new messaging version. That probably should be tied to a major release and well advertised so people know. There are enough significant changes to ko2iblnd in Lustre that we should stop trying to hack in fixes and come up with a solid plan to restructure things so they make sense (different o2ib parameters, memory registration, channel bonding, driver bugs/features, etc).

I noticed that in the patch for LU-3322 I never accounted for the fact that in the IBLND_MSG_PUT_ACK case of kiblnd_reply(), we do not map the tx descriptor to reduce the number of fragments. After patching this with a call to kiblnd_map_tx() first I'm still finding issues with RDMA too fragmented errors in some cases. I think for map_on_demand to really work as desired some more work is going to have to be done because alignment issues can still create problems with the RDMAs.

Comment by James A Simmons [ 16/Feb/16 ]

Jeremy, a separate ticket exists for those fragment issues, LU-5718. I took a deeper look at the code and noticed we already have all the info we need from kib_connparams_t. It has max_msg_size, which is what we are concerned about. I noticed a broken relationship between LNET_MAX_IOV and LNET_MAX_PAYLOAD. The logic is to make LNET_MAX_IOV a set value, i.e. 256, and LNET_MAX_PAYLOAD a variable. The truth is LNET_MAX_PAYLOAD is not really changeable; see lib-types.h to see what I mean. It really should be the reverse, with LNET_MAX_IOV being determined by the LNET_MAX_PAYLOAD value.

Comment by James A Simmons [ 06/Jul/16 ]

Now that I'm back to testing on the Power8 platform I'm again looking at this issue. I worked out a few of the kinks, but I have found problems in kiblnd_init_rdma(). For the Power8 node, which has a page size of 64K, the number of fragments will be 16; on the x86 nodes it will be 256. This means conn->ibc_max_frags will be 16. When looping in kiblnd_init_rdma(), the minimum number of bytes processed will be the smallest fragment size, which is the destination node's 4K page. Each ib_sge, which is 64K in size, will be consumed while processing only 4K. We only have 16 ib_sge to work with, but the loop expects 256 pages to be processed. This of course fails. Looking at it, the solution appears to be to allocate 256 ib_sge on the Power8 nodes, like the x86 nodes, and just use the first 4K of each page to send to the x86 node. This seems wasteful. Does anyone see another way to handle this?

Comment by Gerrit Updater [ 14/Jul/16 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: http://review.whamcloud.com/21304
Subject: LU-7650 lnet: handle mixed page size configurations.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5c42aa7233a28cd07c01ecb1fcebbcf4b3ff0b0

Comment by Olaf Weber [ 22/Jul/16 ]

Looking over the code, properly supporting LNET_MAX_PAYLOAD != LNET_MTU requires adding code that splits a request to move more than LNET_MTU of data into multiple messages. There is some code related to this in lnet_parse(), where clearly the idea is that a router never sees messages larger than LNET_MTU. But the code that does the actual splitting and recombining appears to be absent.

Looking at the patch, the bulk of the work appears to be done by defining LNET_MAX_IOV to a smaller value, ensuring that no messages bigger than LNET_MTU will be generated. But this seems mostly to be an implication of the code (# iov entries * page size) rather than something explicitly checked and enforced.

I'd be inclined to change the LNET_MAX_PAYLOAD checks to be < LNET_MTU and > LNET_MTU, to explicitly indicate that only == LNET_MTU works until message fragments are implemented. LNET_MAX_IOV I'd be inclined to define as (LNET_MTU >> PAGE_SHIFT) for the same reason.

Comment by James A Simmons [ 26/Jul/16 ]

I scaled back the patch to just handle the fragment-count issues of ko2iblnd when nodes with different page sizes communicate. With the scaled-back patch the LNet layer will be transmitting 16 MB LNet packets instead of the normal 1 MB packets, but that doesn't appear to break anything. On our Power8 nodes it is consuming nearly an extra GB of memory for no reason, though. I found fixing up LNET_MAX_IOV/LNET_MTU to be far more challenging, since various Lustre kernel and utilities code assumes these hard-coded values.

Comment by Gerrit Updater [ 22/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21304/
Subject: LU-7650 o2iblnd: handle mixed page size configurations.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 399a5ac1fc73343c69e0fd737032adf5329df1b2

Comment by James A Simmons [ 22/Aug/16 ]

ko2iblnd now works for x86 <-> Power8

Comment by Doug Oucharek (Inactive) [ 01/Sep/16 ]

I tried reproducing the original problem reported in LU-5718 (using modified lnet-selftest) with this patch in place. Before this patch, we would get an error message and a bulk write operation with an offset would fail. With this patch, we crash. The reason is this patch removes the check which protects us from using too many work queue items.

I'm not suggesting changing this patch now that it has landed, but rather increase the pressure to fix LU-5718 properly as soon as possible.

Comment by James A Simmons [ 01/Sep/16 ]

Excellent that we have a reproducer. Strange that we have never encountered any of these problems. Mind you, an error is not so productive either.

Comment by Gerrit Updater [ 02/Sep/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/22281
Subject: LU-7650 o2iblnd: Put back work queue check previously removed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cecd3abd6f387926374130814a177b24c28c747e

Comment by James A Simmons [ 02/Sep/16 ]

Since a check that breaks Power8 support is being added back, this ticket needs to be reopened until Power8 is supported again.

Comment by Gerrit Updater [ 10/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22281/
Subject: LU-7650 o2iblnd: Put back work queue check previously removed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bde1da1ec098450f40887587b0a46c9eb86a4f6c

Comment by Peter Jones [ 10/Sep/16 ]

Landed for 2.9

Comment by James A Simmons [ 29/Oct/16 ]

It appears the support for Power8 broke things.

Comment by Peter Jones [ 29/Oct/16 ]

Yep. We'll need to take another run at this.

Comment by James A Simmons [ 29/Oct/16 ]

With the new RDMA RW API, and map_on_demand being broken not only on Power8 but in general, it looks like a big rewrite of the ko2iblnd driver is needed.

Comment by Doug Oucharek (Inactive) [ 31/Oct/16 ]

James: how far back does the new RDMA RW API go (kernel version)? My understanding is the new API starts around Linux 4.6, which is very far out for current Lustre users. We need to band-aid ko2iblnd to keep going for the next few years.

I agree that ko2iblnd is going to need a rewrite to stay relevant for the upstream kernel.

Comment by James A Simmons [ 07/Nov/16 ]

Actually, I just got a copy of the RHEL 7.3 source and the new generic RDMA API is there. So the solution is ready for today's customers.

Comment by Doug Oucharek (Inactive) [ 07/Nov/16 ]

Is there any good "architecture" style document about this API? I don't find header files very useful in understanding "how" the API should be used.

Comment by James A Simmons [ 26/Apr/17 ]

Doug do you have time to take another shot at this?

Comment by Peter Jones [ 26/Apr/17 ]

James

Doug is focused on some of the upstream changes in the coming weeks so will not get to this in the 2.10 timeframe

Peter

Comment by James A Simmons [ 09/Apr/18 ]

This is now a duplicate of LU-10157

Generated at Sat Feb 10 02:10:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.